Rod Waldhoff's Weblog

Monday, 30 June 2003
Like Clover? Check out JCoverage #
Like Clover, JCoverage is a code coverage analyzer for Java. "Instrument" your code with either of these tools, run your unit test suite (really, execute the code in any way), and you can generate a report on what was executed (lines, methods, branches, etc.) and, more importantly, what wasn't. Unlike Clover:
With JCoverage, I think I'll need to reconsider my position that coverage analysis may be too slow to execute with every continuous integration build.

Thursday, 26 June 2003
Re: Why Java is not Open Source #
The CTO for Sun's Desktop Division, Hans Muller, writes:

"I think that one of the primary reasons that Java is not an open source project is that given the size of the developer community, forks are unacceptable. In other words the millions of developers who build software on top of Java value its stability more than they value the right to get under the hood and fix it."

Hrrrm. That seems like a moderately controversial statement to me, for several reasons:
I didn't and still don't expect Sun to open source Java, although not for any of the reasons Muller describes. Yet, not having followed the java.net phenomenon very closely, I had taken it to be a sign that Sun is getting more clueful about the role of the developer community, and of the open source developer community in particular, in the success of Java, especially relative to the Java Community Process. Perhaps I was wrong.

Wednesday, 25 June 2003
Experimenting with Jester #
In a previous post I alluded to the use of mutation testing to evaluate the completeness of a test suite, rather than relying upon pure test coverage metrics. In a comment to that post Adewale Oshineye suggested that I check out Ivan Moore's Jester, a Java/JUnit based mutation testing tool. I'd seen Jester before, but I'd never used it nor looked at it in much detail. The anecdote about what Jester uncovered in Bob Martin's and Robert Koss's test driven bowling score calculator example was certainly interesting, so this morning I bit the bullet and downloaded a copy.
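To make the idea of mutation testing concrete, here is a minimal sketch in modern Java. It is an illustration only, not how Jester works (Jester rewrites Java source and re-runs a JUnit suite); the class name MutationSketch, the hand-written mutants, and the two "suites" are all hypothetical. Each mutant is a slightly altered copy of the original function, and a suite "kills" a mutant if it fails when run against it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

// Hypothetical illustration of mutation testing; not Jester's implementation.
final class MutationSketch {

    // The "production code" under test: returns the larger of two ints.
    static final IntBinaryOperator ORIGINAL = (a, b) -> a > b ? a : b;

    // Hand-written mutants, each a small change to ORIGINAL.
    static Map<String, IntBinaryOperator> mutants() {
        Map<String, IntBinaryOperator> m = new LinkedHashMap<>();
        m.put("flip > to <", (a, b) -> a < b ? a : b);
        m.put("replace > with >=", (a, b) -> a >= b ? a : b);
        m.put("always return a", (a, b) -> a);
        return m;
    }

    // A superficial suite: executes the code but checks a single case.
    static final Predicate<IntBinaryOperator> WEAK_SUITE =
        f -> f.applyAsInt(2, 1) == 2;

    // A more thorough suite: checks both orderings and the tie.
    static final Predicate<IntBinaryOperator> STRONGER_SUITE =
        f -> f.applyAsInt(2, 1) == 2
          && f.applyAsInt(1, 2) == 2
          && f.applyAsInt(3, 3) == 3;

    // Counts the mutants a suite fails to detect ("survivors").
    static int survivors(Predicate<IntBinaryOperator> suite) {
        int count = 0;
        for (IntBinaryOperator mutant : mutants().values()) {
            if (suite.test(mutant)) {
                count++; // the suite still passes, so the mutant went unnoticed
            }
        }
        return count;
    }
}
```

Both suites pass against ORIGINAL and execute every line of it, yet they differ in how many mutants survive. The ">=" mutant survives even the stronger suite because it behaves identically to the original for every input, so no test could ever detect it.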
Getting Jester up and running was a minor hassle (and it seems like much of that hassle could be alleviated), but I've written my share of open source projects with quirky configuration and installation, so I'll leave that alone. Once my files were placed in the proper position and after tweaking the python scripts to get them to run on whatever version of python I've got on my RedHat box, Jester ran slowly (waiting on javac) but well. It ran much faster once I figured out how to tell Jester to stop mutating my comments (set
(On an unrelated note, is there something more or less equivalent to

As an experiment, I took a small component (196 non-blank, non-comment lines of code spread over 34 methods) I knew to be well tested (100% coverage of statements and conditional expressions) and ran it through Jester. It found 21 mutations total, 2 of which didn't lead to a test failure. Those were (with the modified code in bold):

    if(TRUE || MESSAGE_LOG.isDebugEnabled()) {
        MESSAGE_LOG.debug("Broadcasting " + msg);
    }

and

    List list = new ArrayList(_listeners.size() +

In the first case, believe it or not, I actually had unit tests that confirm that MESSAGE_LOG generates log events when set to the DEBUG priority, and no log events when set to higher than DEBUG priority. (I wouldn't normally do that, except this is the single log message in that component, and I really wanted 100% coverage. Besides, it wasn't that hard: I just added a mock Appender to that Category, and checked to see if a message was added or not.) Of course, both of these tests still pass, even without the isDebugEnabled call, since the debug method won't generate a log event when using a higher priority. Adding a test that fails as a result of this mutation isn't particularly useful, but it isn't difficult either--I just pass in a mock instance of the msg object and check whether or not the toString method is invoked. Not invoking toString is indeed the reason for this if(isDebugEnabled()) block, so maybe that's not such an odd test to have after all.

The second case is the kind of thing Ivan Moore describes as a "false positive" in his writeup on Jester. This code initializes that ArrayList to the precise size it knows it will need. Allocating it a little bit bigger or a little bit smaller doesn't alter the functional behavior of the method (although it will be slightly less efficient). Maybe that indicates a premature optimization on my part, but it seems like a pretty small one. In any event I don't see any way to confirm that that List was allocated to precisely the right size without breaking encapsulation profoundly, so I think I'll let that one go.

I had hoped to run Jester on some larger, more complicated but less well tested code (3,287 non-comment, non-blank lines, roughly 77% coverage) to get a feel for how it works in a more useful scenario, but I've been unable to get it to complete a run on this larger component.
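The toString check described above for the first surviving mutation might look roughly like the following sketch. It is written in modern Java and is self-contained, so it uses a hypothetical SketchLog class standing in for the real logging Category (it is not the log4j API). Broadcaster, CountingMessage, and broadcastMutated are likewise illustrative names, not the component's actual code.

```java
// Hypothetical stand-in for a logger; NOT the log4j API.
final class SketchLog {
    private final boolean debugEnabled;
    private final StringBuilder output = new StringBuilder();

    SketchLog(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    boolean isDebugEnabled() {
        return debugEnabled;
    }

    // Like a real logger, silently drops the message when debug is off.
    void debug(String message) {
        if (debugEnabled) {
            output.append(message).append('\n');
        }
    }

    String output() {
        return output.toString();
    }
}

// Mock message that counts how often toString() is invoked.
final class CountingMessage {
    int toStringCalls = 0;

    @Override
    public String toString() {
        toStringCalls++;
        return "msg";
    }
}

// Illustrative component with the guarded log call and its mutant.
final class Broadcaster {
    private final SketchLog log;

    Broadcaster(SketchLog log) {
        this.log = log;
    }

    // Guarded version: the message string is only built when debug is on.
    void broadcast(Object msg) {
        if (log.isDebugEnabled()) {
            log.debug("Broadcasting " + msg);
        }
    }

    // The surviving mutant: the guard is short-circuited to true.
    void broadcastMutated(Object msg) {
        if (true || log.isDebugEnabled()) {
            log.debug("Broadcasting " + msg);
        }
    }
}
```

With debug disabled, both versions produce no log output, which is why a test that only inspects the log cannot tell them apart. Counting toString calls can: the guarded version leaves the count at zero, while the mutant builds the string anyway and bumps it to one.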
I may poke around with something in-between, but 4,000 lines is on the smallish side for the kinds of modules I'd want to run this on. I may have better luck mutating a single class at a time.

In short, I think Jester meets Sam Ruby's criteria for a successful open source project--it's a good idea with a bad implementation. I have some thoughts on how to improve that implementation (mainly obvious ones--e.g., use a ClassLoader and an actual parser, or consider mutating the byte code rather than the source) that maybe I'll cover in a later post. All in all, Jester is an interesting project, and like a lot of things, I wish it worked a bit better.

Tuesday, 24 June 2003
Testing Testing #
My fear, and perhaps it's not a well founded one, is that the pursuit of pure test coverage--the percentage of lines, statements, methods, etc. tested--will provide a false sense of completeness. If "percent executed by test code" is your sole metric, it's easy to write superficial tests that execute statements, but actually "test" very little. For example, a couple of weeks ago I added the following test to one of our suites:

    public void testStartStop() {
        AppMain app = new AppMain();
        assertFalse(app.isStarted());
        app.start();
        assertTrue(app.isStarted());
        app.stop();
        assertFalse(app.isStarted());
    }

This single test--directly invoking just four distinct methods and comprising only three assertions--executed an additional 1,500 lines or so, roughly 10% of the code for this module. You can be sure that the bulk of the functionality provided by those 1,500 lines is not actually tested here. Indeed, the following (test first) implementation would suffice:

    class AppMain {
        void start() {
            started = true;
        }
        void stop() {
            started = false;
        }
        boolean isStarted() {
            return started;
        }
        boolean started = false;
    }
Is this test useless? At this point I'll argue it isn't. Superficial testing of these 1,500 lines is better than no testing at all. This test at least confirms that we've got the classpath right (including any resources loaded on application startup) and that there aren't any uncaught exceptions being thrown on startup. (When better tests are in place, this superficial test may indeed become useless.) But the resulting test coverage metric is certainly misleading.

One solution, probably the right one, is to simply develop the code test first. If every change to production code were made to address a failing test (or to refactor under an unchanging test suite), then one certainly wouldn't find oneself in this situation. But this solution doesn't address my (our) current situation, in which we have 100,000 lines of "legacy" code and perhaps 60% of that code was developed "test last", if tested at all. The lack of tests has become a drag on our ability to refactor or simply work with that code. If it were sufficient to simply tell folks to develop test first, I wouldn't be as concerned with measuring test coverage in the first place.

Mutation testing--in which we randomly change some piece of production code and check to see if our test suite detects the change--is another approach I've seen proposed before, but I'm having trouble buying into that. If we start with poorly factored code, it seems likely that a random change is going to break something, probably profoundly (like leading to an exception being thrown). Even superficial testing will detect those sorts of problems.

Alternative but indirect metrics may be a better approach. Tracking the number of distinct tests or facts being asserted might give a rough idea of how robust the test suite is. To do this right I think I'd need some numbers from well tested portions of the code base that correlate size or complexity metrics with the number of tests or assertions in the test suite. These might be interesting numbers to collect.

Re: Liskov's Substitution Principle and JUnit Testing #
The author of Manageability has discovered the technique of inheriting test cases. This testing strategy is extremely useful when testing multiple implementations of some interface, and is actually fairly common.
In fact, there's not one but two (if not more) general purpose libraries supporting this approach. I distinctly remember seeing this strategy written up in pattern form. You can find similar write-ups here and here, but neither of those is the one I'm thinking of.
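As a rough sketch of what inheriting test cases looks like, the following modern Java example writes a few contract checks once against java.util.Collection and reuses them for several implementations. To stay self-contained it does not use JUnit; the names AbstractCollectionContract, makeEmpty, runAll, and check are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedList;

// Contract tests written once against the Collection interface.
abstract class AbstractCollectionContract {

    // Each concrete subclass supplies the implementation under test.
    abstract Collection<String> makeEmpty();

    void testNewCollectionIsEmpty() {
        check(makeEmpty().isEmpty(), "new collection should be empty");
    }

    void testAddThenContains() {
        Collection<String> c = makeEmpty();
        c.add("x");
        check(c.contains("x"), "added element should be contained");
        check(c.size() == 1, "size should be 1 after one add");
    }

    void testRemoveMakesEmptyAgain() {
        Collection<String> c = makeEmpty();
        c.add("x");
        check(c.remove("x"), "remove should report a change");
        check(c.isEmpty(), "collection should be empty after remove");
    }

    // Runs every inherited test and returns how many were run.
    int runAll() {
        testNewCollectionIsEmpty();
        testAddThenContains();
        testRemoveMakesEmptyAgain();
        return 3;
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }
}

// One tiny subclass per implementation; all tests are inherited.
final class ArrayListContract extends AbstractCollectionContract {
    @Override
    Collection<String> makeEmpty() {
        return new ArrayList<>();
    }
}

final class LinkedListContract extends AbstractCollectionContract {
    @Override
    Collection<String> makeEmpty() {
        return new LinkedList<>();
    }
}

final class HashSetContract extends AbstractCollectionContract {
    @Override
    Collection<String> makeEmpty() {
        return new HashSet<>();
    }
}
```

Calling new ArrayListContract().runAll() exercises the whole contract against ArrayList. In a JUnit setting the abstract class would extend TestCase and the runner would pick up the inherited public test methods on each concrete subclass, so runAll would not be needed; adding a new implementation then costs only one factory method.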
Monday, 23 June 2003
A Frog Boiling Approach to Increasing Test Coverage #
Some thinking out loud about concrete goals for increasing test coverage.

The production (i.e., non test) Java code base at my day job consists of roughly 103,000 non-comment, non-blank lines of code, split over 132 "modules". Our automated unit test suite exercises roughly 29% of those lines. (That 29% figure sounds a little bit worse than it feels to me. Some areas of the code base are well tested; several are even at 100% coverage. Others have few if any tests, but only because they have remained essentially untouched since before a formal unit testing initiative was launched 30 months or so ago. Yet many modules are woefully undertested, and probably not coincidentally several of those have substantial bloat--the number of lines in those modules is way out of proportion with the functionality they provide.) Whether that 29% figure is indicative or not, it's clearly much lower than desirable. (Personally I've been striving for and generally achieving 100% coverage for new development.)

I've been thinking a bit about laying out concrete goals for increasing this coverage. We've talked a bit about simply targeting some figure, say 80% coverage, and perhaps some intermediate goals (for example, 40%, then 60%, then 80%). While the "frog boiling" approach--slowly raising the temperature on test coverage--appeals to me, something about the arbitrary "percent coverage" goals doesn't seem right. I think I'd prefer goals that call for complete (100%) coverage of something, perhaps with different values of "something". I'm not entirely sure why. Specifically, I'm thinking of the following stages:
I'm not sure where to go from that point. Complete coverage of conditionals (every boolean expression is evaluated at least once to true and at least once to false) may be a good next step, but isn't as pithy as the other goals. It may be that once we've reached 100% method coverage, 100% line/statement coverage is within easy reach. I suspect that once we've reached that fourth goal, the next step will be pretty clear.

I wonder if any reader has some experience with similar strategies for increasing test coverage through a series of concrete goals. What goals did you select, and why? Did a given step turn out to be too large or too small?

Tuesday, 17 June 2003
the fixmeister role, or ''you, help me'' #
In some previous posts, I discussed some of the challenges we've experienced maintaining a continuous integration discipline at my day job. In my last post on this topic, I alluded to "additional measures" which we adopted. Time to describe what I was talking about. Although I hadn't seen it at the time of the events described below, I recently ran across a comment on Ward's Wiki that accurately captures the problem we encountered and its solution:
Toward the end of April, in the midst of one of the longer low points in our development cycle, I sent the following (slightly edited) email to my peers on the technology management team. At the time it felt a bit like a failure.
This strategy was well received by the management team, and was implemented almost immediately. This strategy was less well received by some members of the development team, but there was at least grudging acceptance from everyone. The team has come to call this role the "fixmeister" (a name which I'm personally not very fond of). The first fixmeister had the excellent suggestion that each subsequent fixmeister be selected by the current one, which has led to an interesting variety of selection algorithms, some whimsical, some instructional, some a bit malicious.

In the eight weeks or so since this process was initiated, we've been around the team slightly more than once. It's been effective in achieving the primary goal--increasing the ratio of build successes to build failures. I'd like to think it's been effective in achieving some of the harder-to-measure goals as well. (It's certainly increased awareness of the CI process among the less process oriented members of the team, and perhaps taken away some of the mystery of the process.) Maybe time will tell on the other points.

Yakkity yak, blog comes back. #
Sigh. It's been over a month since my last blog entry. I've been buried with work at my day job, where we've shipped 3 products to various manufacturers in the past month, with another 6 or so pending. I've got a mountain of apache email to slog through, but I think I can see the light at the end of the tunnel. I hope to be making more regular postings once again.
© Copyright 2003 Rodney Waldhoff.
Last updated: 12/8/2003; 10:41:20 AM.