Rod Waldhoff's Weblog  

 Monday, 30 June 2003 
Like Clover? Check out JCoverage #

Like Clover, JCoverage is a code coverage analyzer for Java. "Instrument" your code with either of these tools, run your unit test suite (really, execute the code in any way), and you can generate a report on what was executed (lines, methods, branches, etc.), and more importantly, what wasn't.

Unlike Clover:

  • JCoverage is GPL'ed
  • JCoverage instruments the byte code (via BCEL) rather than the source, which seems substantially faster, at least under casual observation.
  • JCoverage is clever enough to not instrument select lines--lines that invoke log4j for example--which means that logging calls don't pollute your coverage metrics, whether or not you run the test suite with logging on.
  • JCoverage can generate a complete, parsable coverage report in XML from which you can render custom reports or derived statistics.
  • JCoverage includes custom Ant tags for instrumentation, reporting and even asserting levels of coverage at a fine-grained level.
  • JCoverage makes it easy to merge coverage databases across several runs--for unit and functional tests, for example, or to create a single report for independently built components. (Of course, given the XML report, it may be tedious but shouldn't be difficult to do this sort of merge "manually" either.)
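As a rough illustration of deriving a statistic from an XML coverage report, here is a minimal sketch using only the JDK's DOM parser. The element and attribute names (coverage, class, lines, covered) are invented for this example and are not JCoverage's actual schema, so they would need adjusting to the real report.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Assumed (hypothetical) report shape, NOT the real JCoverage schema:
//   <coverage><class name="..." lines="..." covered="..."/></coverage>
class CoverageReportStats {

    /** Returns overall line coverage (0.0 to 1.0) summed over all <class> elements. */
    static double overallLineCoverage(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList classes = doc.getElementsByTagName("class");
            long lines = 0;
            long covered = 0;
            for (int i = 0; i < classes.getLength(); i++) {
                Element cls = (Element) classes.item(i);
                lines += Long.parseLong(cls.getAttribute("lines"));
                covered += Long.parseLong(cls.getAttribute("covered"));
            }
            // An empty report is treated as 0% coverage rather than dividing by zero.
            return lines == 0 ? 0.0 : (double) covered / lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read coverage report", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<coverage>"
                + "<class name=\"a.Foo\" lines=\"80\" covered=\"60\"/>"
                + "<class name=\"a.Bar\" lines=\"20\" covered=\"20\"/>"
                + "</coverage>";
        System.out.println(overallLineCoverage(xml));
    }
}
```

Summing the raw counts rather than averaging per-class percentages keeps large classes from being under-weighted. The same approach would extend to merging reports by adding up counts from several files.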

With JCoverage, I think I'll need to reconsider my position that coverage analysis may be too slow to execute with every continuous integration build.

 Thursday, 26 June 2003 
Re: Why Java is not Open Source #

The CTO for Sun's Desktop Division Hans Muller writes:

I think that one of the primary reasons that Java is not an open source project is that given the size of the developer community, forks are unacceptable. In other words the millions of developers who build software on top of Java value its stability more than they value the right to get under the hood and fix it.

Hrrrm. That seems like a moderately controversial statement to me, for several reasons:

  1. The community's experience with Python (source), Ruby (source), Perl (source) (and others) might be actively proving otherwise.
  2. The metaphor about cowboys and power plants, like many physical metaphors for software development, simply doesn't work. A version control system alone would alleviate this problem, as would a "gatekeeper" as Muller himself describes in the proceeding paragraphs.
  3. Why not just define a specification and a TCK, hang on to the Java trademark and make folks pay to describe their JRE as such? It's precisely the same strategy used for the J2EE brand, isn't it? Oh, now I get it.
  4. Simply adding a patch mechanism to the Bug Parade and actually accepting them once in a while would seem to be a dramatic step forward, giving Sun many of the benefits of open source (and the development community a few) with little or no risk.

I didn't and still don't expect Sun to open source Java, although not for any of the reasons Muller describes. Yet not having followed the java.net phenomenon very closely, I had taken it to be a sign that Sun is getting more clueful about the role of the developer community, and of the open source developer community in particular, in the success of Java, especially relative to the Java Community Process. Perhaps I was wrong.

 Wednesday, 25 June 2003 
Experimenting with Jester #

In a previous post I alluded to the use of mutation testing to evaluate the completeness of a test suite, rather than relying upon pure test coverage metrics. In a comment to that post Adewale Oshineye suggested that I check out Ivan Moore's Jester, a Java/JUnit based mutation testing tool. I'd seen Jester before, but I'd never used it nor looked at it in much detail. The anecdote about what Jester uncovered in Bob Martin's and Robert Koss's test driven bowling score calculator example was certainly interesting, so this morning I bit the bullet and downloaded a copy.

Getting Jester up and running was a minor hassle (and it seems like much of that hassle could be alleviated), but I've written my share of open source projects with quirky configuration and installation, so I'll leave that alone. Once my files were placed in the proper position and after tweaking the python scripts to get them to run on whatever version of python I've got on my RedHat box, Jester ran slowly (waiting on javac) but well. It ran much faster once I figured out how to tell Jester to stop mutating my comments (set shouldRemoveComments=true in jester.cfg). The result was a basic HTML report like this one.

(On an unrelated note, is there something more or less equivalent to python -version?)

As an experiment, I took a small component (196 non-blank, non-comment lines of code spread over 34 methods) I knew to be well tested (100% coverage of statements and conditional expressions) and ran it through Jester. It found 21 mutations total, 2 of which didn't lead to a test failure. Those were (with the modified code in bold):

if(TRUE || MESSAGE_LOG.isDebugEnabled()) {
   MESSAGE_LOG.debug("Broadcasting " + msg);
}

and

List list = new ArrayList(_listeners.size() + 12);
list.addAll(_listeners);
list.add(...);

In the first case, believe it or not, I actually had unit tests that confirm that MESSAGE_LOG generates log events when set to the DEBUG priority, and no log events when set to higher than DEBUG priority. (I wouldn't normally do that, except this is the single log message in that component, and I really wanted 100% coverage. Besides, it wasn't that hard, I just added a mock Appender to that Category, and checked to see if a message was added or not.) Of course, both of these tests still pass, even without the isDebugEnabled call, since the debug method won't generate a log event when using a higher priority. Adding a test that fails as a result of this mutation isn't particularly useful, but it isn't difficult either--I just pass in a mock instance of the msg object and check whether or not the toString method is invoked. Not invoking toString is indeed the reason for this if(isDebugEnabled()) block, so maybe that's not such an odd test to have after all.
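The "was toString invoked?" test described above might look roughly like the following. This is only a sketch: DebugLog is a tiny stand-in interface rather than the log4j API, and Broadcaster, SpyMessage and GuardedLoggingCheck are hypothetical names, not the actual component.

```java
// Minimal stand-in for a log4j-style logger; hypothetical, not the log4j API.
interface DebugLog {
    boolean isDebugEnabled();
    void debug(String message);
}

// Hypothetical component mirroring the guarded logging call from the post.
class Broadcaster {
    private final DebugLog log;

    Broadcaster(DebugLog log) {
        this.log = log;
    }

    void broadcast(Object msg) {
        // The guard exists so that msg.toString() is not evaluated needlessly.
        if (log.isDebugEnabled()) {
            log.debug("Broadcasting " + msg);
        }
    }
}

// Mock message that records whether toString() was ever invoked.
class SpyMessage {
    boolean toStringCalled = false;

    @Override
    public String toString() {
        toStringCalled = true;
        return "spy";
    }
}

class GuardedLoggingCheck {
    /** Returns true if broadcasting with the given debug setting called toString(). */
    static boolean toStringInvoked(final boolean debugEnabled) {
        DebugLog log = new DebugLog() {
            public boolean isDebugEnabled() { return debugEnabled; }
            public void debug(String message) { /* discard; only the string building matters here */ }
        };
        SpyMessage msg = new SpyMessage();
        new Broadcaster(log).broadcast(msg);
        return msg.toStringCalled;
    }

    public static void main(String[] args) {
        System.out.println("debug off, toString invoked: " + toStringInvoked(false));
        System.out.println("debug on, toString invoked: " + toStringInvoked(true));
    }
}
```

With the "TRUE ||" mutation applied to the guard, the debug-off case would report that toString was invoked, so a test asserting the opposite would fail as desired.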

The second case is the kind of thing Ivan Moore describes as a "false positive" in his writeup on Jester. This code initializes that ArrayList to the precise size it knows it will need. Allocating it a little bit bigger or a little bit smaller doesn't alter the functional behavior of the method (although it will be slightly less efficient). Maybe that indicates a premature optimization on my part, but it seems like a pretty small one. In any event I don't see any way to confirm that that List was allocated to precisely the right size without breaking encapsulation profoundly, so I think I'll let that one go.

I had hoped to run Jester on some larger, more complicated but less well tested code (3,287 nc,nb lines, roughly 77% coverage) to get a feel for how it works in a more useful scenario, but I've been unable to get it to complete a run on this larger component. I may poke around with something in-between, but 4,000 lines is on the smallish side for the kinds of modules I'd want to run this on. I may have better luck mutating a single class at a time.

In short, I think Jester meets Sam Ruby's criteria for a successful open source project--it's a good idea with a bad implementation. I have some thoughts on how to improve that implementation (mainly obvious ones--e.g., use a ClassLoader and an actual parser, or consider mutating the byte code rather than the source) that maybe I'll cover in a later post. All in all, Jester is an interesting project, and like a lot of things, I wish it worked a bit better.

 Tuesday, 24 June 2003 
Testing Testing #

My fear, and perhaps it's not a well founded one, is that the pursuit of pure test coverage--the percentage of lines, statements, methods, etc. tested--will provide a false sense of completeness. If "percent executed by test code" is your sole metric, it's easy to write superficial tests that execute statements, but actually "test" very little.

For example, a couple of weeks ago I added the following test to one of our suites:

public void testStartStop() {
  AppMain app = new AppMain();
  assertFalse(app.isStarted());
  app.start();
  assertTrue(app.isStarted());
  app.stop();
  assertFalse(app.isStarted());
}

This single test--directly invoking just four distinct methods and comprising only three assertions--executed an additional 1,500 lines or so, roughly 10% of the code for this module. You can be sure that the bulk of the functionality provided by those 1,500 lines is not actually tested here. Indeed, the following (test first) implementation would suffice:

class AppMain {
  void start() {
    started = true;
  }
  void stop() {
    started = false;
  }
  boolean isStarted() {
    return started;
  }
  boolean started = false;
}

Is this test useless? At this point I'll argue it isn't. Superficial testing of these 1,500 lines is better than no testing at all. This test at least confirms that we've got the classpath right (including any resources loaded on application startup) and that there aren't any uncaught exceptions being thrown on startup. (When better tests are in place, this superficial test may indeed become useless.) But the resulting test coverage metric is certainly misleading.

One solution, probably the right one, is to simply develop the code test first. If every change to production code was made to address a failing test (or to refactor under an unchanging test suite) then one certainly wouldn't find himself in this situation. But this solution doesn't address my (our) current situation, in which we have 100,000 lines of "legacy" code and perhaps 60% of that code was developed "test last" if tested at all. The lack of tests has become a drag on our ability to refactor or simply work with that code. If it was sufficient to simply tell folks to develop test first, I wouldn't be as concerned with measuring test coverage in the first place.

Mutation testing--in which we randomly change some piece of production code and check to see if our test suite detects the change--is another approach I've seen proposed before, but I'm having trouble buying into that. If we start with poorly factored code, it seems likely that a random change is going to break something, probably profoundly (like leading to an exception being thrown). Even superficial testing will detect those sorts of problems.
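To make the idea concrete, here is a minimal hand-rolled sketch of what a mutation testing tool checks. Everything in it is hypothetical (the age predicate, the single hand-written mutant, and the two "suites" expressed as predicates); a real tool would generate the mutants and run an actual test suite.

```java
import java.util.function.IntPredicate;
import java.util.function.Predicate;

class MutationSketch {
    // Hypothetical production code and one hand-written mutant of it.
    static final IntPredicate ORIGINAL = age -> age >= 18;
    static final IntPredicate MUTANT = age -> age > 18; // ">=" mutated to ">"

    // A superficial suite: executes the code but never probes the boundary.
    static final Predicate<IntPredicate> SUPERFICIAL_SUITE =
            impl -> impl.test(30) && !impl.test(5);

    // A stronger suite: additionally pins down the boundary value.
    static final Predicate<IntPredicate> BOUNDARY_SUITE =
            impl -> impl.test(30) && !impl.test(5) && impl.test(18);

    /** A mutant is "killed" when a suite passes on the original but fails on the mutant. */
    static boolean kills(Predicate<IntPredicate> suite) {
        return suite.test(ORIGINAL) && !suite.test(MUTANT);
    }

    public static void main(String[] args) {
        System.out.println("superficial suite kills mutant: " + kills(SUPERFICIAL_SUITE));
        System.out.println("boundary suite kills mutant: " + kills(BOUNDARY_SUITE));
    }
}
```

Both suites execute every line of the predicate, so a coverage metric cannot tell them apart; only the boundary suite notices the mutation. The concern above still stands for mutations that simply blow up, since any suite would catch those.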

Alternative but indirect metrics may be a better approach. Tracking the number of distinct tests or facts being asserted might give a rough idea of how robust the test suite is. To do this right I think I'd need some numbers from well tested portions of the code base that correlate size or complexity metrics with the number of tests or assertions in the test suite. These might be interesting numbers to collect.

Re: Liskov's Substitution Principle and JUnit Testing #
The author of Manageability has discovered the technique of inheriting test cases. This testing strategy is extremely useful when testing multiple implementations of some interface, and is actually fairly common. In fact, there's not one but two (if not more) general purpose libraries supporting this approach. I distinctly remember seeing this strategy written up in pattern form. You can find similar write-ups here and here, but neither of those is the one I'm thinking of.
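For readers who haven't seen the technique, a minimal sketch follows. It is written in plain Java so it stands alone; with JUnit the abstract class would typically extend TestCase and the check helper would be replaced by the framework's assertions. The class names and the choice of java.util.List as the interface under test are just for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Contract checks written once against the interface.
abstract class AbstractListContract {
    /** Each concrete subclass supplies the implementation under test. */
    protected abstract List<String> makeEmptyList();

    void testAddIncreasesSize() {
        List<String> list = makeEmptyList();
        list.add("a");
        check(list.size() == 1, "size after one add");
    }

    void testGetReturnsAddedElement() {
        List<String> list = makeEmptyList();
        list.add("a");
        list.add("b");
        check("b".equals(list.get(1)), "get(1) after two adds");
    }

    /** Runs every inherited check against this subclass's implementation. */
    void runAll() {
        testAddIncreasesSize();
        testGetReturnsAddedElement();
    }

    static void check(boolean condition, String what) {
        if (!condition) {
            throw new AssertionError("failed: " + what);
        }
    }
}

// One small subclass per implementation; all checks are inherited.
class ArrayListContract extends AbstractListContract {
    protected List<String> makeEmptyList() { return new ArrayList<String>(); }
}

class LinkedListContract extends AbstractListContract {
    protected List<String> makeEmptyList() { return new LinkedList<String>(); }
}

class InheritedTestsDemo {
    public static void main(String[] args) {
        new ArrayListContract().runAll();
        new LinkedListContract().runAll();
        System.out.println("all contract checks passed");
    }
}
```

Adding a new implementation then costs one factory method, and any implementation-specific tests can live in the subclass alongside the inherited ones.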

 Monday, 23 June 2003 
A Frog Boiling Approach to Increasing Test Coverage #

Some thinking out loud about concrete goals for increasing test coverage.

The production (i.e., non test) Java code base at my day job consists of roughly 103,000 non-comment, non-blank lines of code, split over 132 "modules". Our automated unit test suite exercises roughly 29% of those lines.

(That 29% figure sounds a little bit worse than it feels to me. Some areas of the code base are well tested, several are even at 100% coverage. Others have few if any tests, but only because they have remained essentially untouched since before a formal unit testing initiative was launched 30 months or so ago. Yet many modules are woefully undertested, and probably not coincidentally several of those have substantial bloat--the number of lines in those modules is way out of proportion with the functionality they provide.)

Whether that 29% figure is indicative or not, it's clearly much lower than desirable. (Personally I've been striving for and generally achieving 100% coverage for new development.) I've been thinking a bit about laying out concrete goals for increasing this coverage.

We've talked a bit about simply targeting some figure, say 80% coverage, and perhaps some intermediate goals (for example, 40%, then 60%, then 80%).

While the "frog boiling" approach--slowly raising the temperature on test coverage--appeals to me, something about the arbitrary "percent coverage" goals doesn't seem right. I think I'd prefer goals that call for complete (100%) coverage of something, perhaps with different values of "something". I'm not entirely sure why.

Specifically, I'm thinking of the following stages:

  1. All modules have tests. Conveniently, this can be easily and quickly tested at build time. We can make the absence of tests a build failure. It is possible to programmatically evaluate the remaining goals, but not as quickly. We'd have to execute the test coverage check on every continuous integration build, something that may take too much time.
  2. All packages have tests. This should be relatively easy to achieve once the first goal is reached.
  3. All classes have tests.
  4. All methods have tests.

I'm not sure where to go from that point. Complete coverage of conditionals (every boolean expression is evaluated at least once to true and at least once to false) may be a good next step, but isn't as pithy as the other goals. It may be that once we've reached 100% method coverage, 100% line/statement coverage is within easy reach. I suspect that once we've reached that fourth goal, the next step will be pretty clear.
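The first stage (every module has tests) is the one that can be checked cheaply at build time. A minimal sketch of such a check follows, under some loudly stated assumptions: each module is a directory directly under a common root, test sources live under src/test within the module, and test classes are named *Test.java. None of that layout comes from our actual build; it is only there to make the example concrete.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

class ModuleTestCheck {
    /**
     * Returns the sorted names of module directories under root that contain no
     * *Test.java file below their (assumed) src/test directory.
     */
    static List<String> modulesWithoutTests(Path root) {
        List<String> missing = new ArrayList<>();
        try (DirectoryStream<Path> modules =
                     Files.newDirectoryStream(root, p -> Files.isDirectory(p))) {
            for (Path module : modules) {
                if (!hasTests(module.resolve("src").resolve("test"))) {
                    missing.add(module.getFileName().toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(missing);
        return missing;
    }

    private static boolean hasTests(Path testDir) throws IOException {
        if (!Files.isDirectory(testDir)) {
            return false;
        }
        // Walk the test tree and stop at the first file that looks like a test class.
        try (Stream<Path> files = Files.walk(testDir)) {
            return files.anyMatch(p -> p.getFileName().toString().endsWith("Test.java"));
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = modulesWithoutTests(root);
        if (!missing.isEmpty()) {
            System.err.println("Modules without tests: " + missing);
            System.exit(1); // non-zero exit so the build can treat this as a failure
        }
    }
}
```

Because this only inspects the file system, it runs in a fraction of the time a full coverage analysis would take, which is what makes it practical on every continuous integration build.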

I wonder if any reader has some experience with similar strategies for increasing test coverage through a series of concrete goals. What goals did you select, and why? Did a given step turn out to be too large or too small?

 Tuesday, 17 June 2003 
the fixmeister role, or ''you, help me'' #

In some previous posts, I discussed some of the challenges we've experienced maintaining a continuous integration discipline at my day job. In my last post on this topic, I alluded to "additional measures" which we adopted. Time to describe what I was talking about.

Although I hadn't seen it at the time of the events described below, I recently ran across a comment on Ward's Wiki that accurately captures the problem we encountered and its solution:

There was an interesting psych study along these lines. There is one "victim" and several "bystanders". Each of the bystanders individually would be perfectly capable of helping the victim, but since they NeverVolunteer, none of them do. This happens in real life situations. If you are the victim and want help, what you need to do is to pick one person from the bystanders and say to them specifically "you help me". If you just do "somebody help me" it won't work. -- AndyPierce [from Never Volunteer]

Toward the end of April, in the midst of one of the longer low points in our development cycle, I sent the following (slightly edited) email to my peers on the technology management team. At the time it felt a bit like a failure.

Subject: build failures, and what to do about them
Sent: Wed 4/23/2003 10:29 AM

There was a time, not so long ago, when we would regularly see a dozen or more good integration builds per day. In fact the vast majority of builds were clean ones. (See [internal link to an historical report on continuous integration builds].) Recently (meaning the past few months) we're lucky to see a dozen good builds in a week.

As we've seen rather directly the past couple of weeks, our inability to regularly integrate changes across the entire code base slows our development process, delays production releases and may threaten our ability to deliver products according to schedule. Moreover, integration problems compound themselves. When we go several hours (let alone several days) without a clean build, it's no longer just one problem we need to fix, but several problems that are hidden behind the first one, masked because a problem in some dependent library stops the build before it discovers the later problems.

These integration problems are real problems. A "broken build" means that there is some code that either doesn't work at all, doesn't work in relation to other libraries, or doesn't work outside a particular developer's (or developers') sandbox. These problems may not impact every developer at all times, but it is quite likely they impact some developer. It is guaranteed that they will impact the entire technology team and for that matter the entire enterprise when it comes time to release a product (and as I understand it, we have a few of those queued up in the near future). By and large, the way in which developers avoid this problem is to simply not update their local repositories. That's a false sense of progress if there ever was one.

Some developers have complained about this problem. Some managers have as well. We've made increasingly obnoxious efforts ([internal link to our nag servlet, as described in a previous entry]) to increase the visibility of this problem, but it's not getting better. And due to the compounding behavior described above, when it's not getting better, it's getting worse.

I propose we do the following: For each series of build failures (that is, for every continuous sequence of failures, no matter how many underlying causes they might have) we name one arbitrarily selected developer to be "on point" for achieving a clean build. Achieving a clean build becomes this developer's top priority (if they need to weasel out of it in order to do something that is "more important", they do that by finding someone else willing to trade slots with them). In exchange for this responsibility, we give this developer the authority to grab whomever they need to do whatever they need to accomplish this goal. If Alice, being on point, needs Jose's help to diagnose a problem, or needs Jose to make changes to some library in order to bring said library into sync, then this becomes Jose's top priority as well (see the "weaseling out of it" strategy above). Similarly, if Alice, being on point, needs Jose to take over some task Alice would otherwise be working on, she is empowered to do that as well.

Why?

1) Having a code base that is not integrated is costing us a substantial amount of developer time, which means it's costing the company a substantial amount of money and threatening release schedules.

2) Things that are everyone's responsibility become no one's responsibility. Actually, this isn't really it. More accurately, when we fail to send a message that things that are everyone's responsibility are important, those things become the responsibility of a handful of truly responsible people. There are developers who take build failures seriously, and make a well above average effort to address them. These efforts are undermined by those who don't take build failures seriously (or insufficiently seriously), and eventually, the patience of those good Samaritans wears thin. I know mine did several weeks ago and I've noticed a degradation in the integration quality since that time. It was my hope that others would step up to take greater responsibility for integration builds; some did, but seemingly not enough.

[It may seem that I'm overestimating my contribution there, but I literally, personally, addressed perhaps 30% of the build failures for several months. I looked at each and every build failure, and when it seemed no one was making progress in addressing it, I'd fix it myself. Frankly, I did this because (a) when our CI process was first initiated, there were a number of detractors that claimed this simply wouldn't work, and (b) it seems like this is simply the right approach to a CI process--I'd expect everyone to do more or less the same thing.]

3) Making fixing the build a top priority sends the message that we take continuous integration seriously, as a management team, as a development team, and as an enterprise.

4) Selecting an arbitrary developer to be on point offers a number of advantages:

a) It increases the visibility of build issues, the kinds of problems that cause them, and an effective strategy for resolving them, across the whole team. Folks who may have never before thought about how to diagnose an arbitrary problem reported by the CI server are forced to do so. Folks who may have never reflected on the kinds of changes that are problematic, or how to make changes in an incremental and backwards compatible way are exposed to those sorts of issues and those sorts of solutions.

b) It applies stronger peer pressure. For quite some time I would personally "nag" folks to fix build issues, even sit down with them to explain the nature of the problem and the needed fix. After a while, this nagging decreases in effectiveness, since "Rod's always complaining about the build". When several different folks in a week are nagging you to fix problems you've caused, the pressure not to cause such problems (or to fix them on your own initiative) is increased. When several different folks find themselves nagging the same person over and over again, the pressure on that person to improve their personal process is increased.

c) It causes greater knowledge sharing. When put on point to fix a problem in some otherwise unknown library, a developer is forced to learn something about it, probably by pairing with someone more knowledgeable about it. When a developer finds that they are consistently being "bothered" to update a library because they are the only ones that understand it, the pressure to make it more understandable or to share some of that understanding is increased.

d) Indirectly, it sends the message that fixing the build really is everyone's responsibility. The arbitrary developer "on point" is simply putting a face on the team's concerns.

I'd very much like to start putting people "on point" for fixing the build today, probably by calling an all developer meeting to explain this protocol. I'd also very much like to not have this protocol undermined by some direct or indirect message of "well, that's important, but this is more important" (where "this" is some more urgent, but not necessarily more important task). We have more than enough developers to manage both the urgent and important tasks on our plate, and I'll argue that an approach that doesn't make integration a top if not the top priority is going to cost us in the long run. If anyone has objections or an alternative strategy, I'd love to hear about it.

This strategy was well received by the management team, and was implemented almost immediately. This strategy was less well received by some members of the development team, but there was at least grudging acceptance from everyone.

The team has come to call this role the "fixmeister" (a name which I'm personally not very fond of). The first fixmeister had the excellent suggestion that each subsequent fixmeister be selected by the current one, which has led to an interesting variety of selection algorithms, some whimsical, some instructional, some a bit malicious.

In the eight weeks or so since this process was initiated, we've been around the team slightly more than once. It's been effective in achieving the primary goal--increasing the ratio of build successes to build failures. I'd like to think it's been effective in achieving some of the harder-to-measure goals as well. (It's certainly increased awareness of the CI process among the less process oriented members of the team, and perhaps taken away some of the mystery of the process.) Maybe time will tell on the other points.

Yakkity yak, blog comes back. #
Sigh. It's been over a month since my last blog entry. I've been buried with work at my day job, where we've shipped 3 products to various manufacturers in the past month, with another 6 or so pending. I've got a mountain of apache email to slog through, but I think I can see the light at the end of the tunnel. I hope to be making more regular postings once again.