Join me in Chicago in September to learn whether test-driven development will work for you. In this course, you will learn the secrets of modular design from one of test-driven development’s master practitioners. Bring your laptop and be prepared to change the way you write software.
It surprises me, from time to time, how much I still need to justify test-driven development to prospects and would-be course attendees. Many feel that TDD has crossed the chasm, while others still see TDD as a cultish practice worth marginalizing. I take some blame for those who find TDD cultish, because until now I haven’t had a strong, sensible, theoretical basis to justify TDD as an idea. I could do no better than “it works for me” or “my friends like it”. That has changed since I’ve started giving my talk “Introduction to Agile with the Theory of Constraints” in which I use concepts from Theory of Constraints to motivate the practices of agile software development, notably those of extreme programming. If you buy in to ideas from Theory of Constraints or Lean Manufacturing, then I think I now have a stronger argument to justify the core programming practices in extreme programming in particular and agile software development in general. I don’t even need all of the Theory of Constraints but rather a simple appeal to fundamental concepts in Queuing Theory.
Queuing Theory?
Yes, Queuing Theory. (And I don’t plan to capitalize that any longer.) I don’t claim to have any particular expertise in this area, but I have already seen how to use queuing theory ideas in optimizing network-based systems, and I see no reason we couldn’t extend that to software delivery systems. Better, I only need to appeal to a single idea from queuing theory to make my point.
Given a process B, which follows a process A, sometimes in performing B we need to perform some of A again. We can remove the need to rework by taking some portion of process B and performing it before process A.[1]
This merits a diagram. If we have this problem
then we can solve it by doing this
and the resulting system will work more efficiently by removing wasteful rework. I assume here that we derive no significant benefit from the rework itself, which I suppose I must justify, but let’s not ruin a good story with the truth. Here I’ve described the general problem, and by applying it to software development, I can… well, I find it more effective if I save the punchline for the end.
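As a rough illustration of that claim, here is a minimal Python sketch of the expected cost of an A-then-B process with a rework loop. The cost model, the function name `expected_cost` and every number are assumptions made up for illustration, not measurements and not a standard queuing theory formula. The sketch also assumes that doing part of B early lowers the chance of rework without changing the other costs.

```python
def expected_cost(cost_a, cost_b, rework_cost, rework_probability):
    """Expected total effort for process A followed by process B.

    Assumed model: each pass through B sends work back to A with probability
    `rework_probability`; every rework costs `rework_cost` plus another pass
    through B. The number of reworks then has mean p / (1 - p).
    """
    if not 0 <= rework_probability < 1:
        raise ValueError("rework_probability must be in [0, 1)")
    expected_reworks = rework_probability / (1 - rework_probability)
    return cost_a + cost_b + expected_reworks * (rework_cost + cost_b)


# Hypothetical numbers: A costs 10 units, B costs 4, each rework of A costs 5.
b_after_a = expected_cost(10, 4, 5, rework_probability=0.5)
# Assumption: doing part of B before A lowers the chance of rework to 10%.
some_b_first = expected_cost(10, 4, 5, rework_probability=0.1)

print(f"B entirely after A: {b_after_a:.1f} units")
print(f"Some of B before A: {some_b_first:.1f} units")
```

Under these made-up numbers the expected effort drops from 23.0 to 15.0 units, and the whole difference comes from the rework term.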
Winston Royce, 1970, revisited
I imagine you know this diagram
and appreciate that Royce wrote in his now infamous paper that this single-phase waterfall is risky and invites failure. If you don’t appreciate that, then I cannot recommend strongly enough that you read the original paper in its entirety, rather than stopping after page 2 as most people have done.[2]
We can apply the queuing theory result I’ve just cited to this diagram and generate some interesting conclusions. I’ll start by focusing in on this portion of the system
We write code, then we test it. Sadly, we occasionally find a bug[3], which makes us change the code we wrote after we thought we’d finished it. That makes a loop of the type we can unravel with our queuing theory result.
Since “coding” is process A and “testing” is process B, we need to do some testing before we start coding.
It doesn’t take long for this to become a virtuous loop where we write only the code we need to write in order to pass the tests we write.
I use the term test-first programming to describe this cycle[4]. When we practise test-first programming, we design as much detail as we can before writing the first test, then use the tests to help us type in our implementation correctly. Most teams most of the time can use test-first programming to reduce their mistake count to near zero, which increases their productivity and improves their ability to deliver, by helping them waste less time agonizing over whether to fix mistakes late in a release. I started this way in 2000 when I first discovered JUnit and stopped making silly mistakes in the code I wrote, which I found significantly beneficial in helping me code more confidently. I still designed most of what I built mostly up front.
After a while, though, I recognized a new process loop: I found some parts of my design difficult to test, or I found some parts of my design didn’t fit together when I tried to type them in.
Returning to our queuing theory result, since “designing” is process A and “doing test-first programming” is process B, we need to do some test-first programming before we start designing.
It doesn’t take long for this to become a virtuous loop where we check our design ideas as we think of them and implement only the parts of the design we can justify needing. When we include refactoring in our practice, we can confidently “under-design” compared to the level of design we expect to need by the end of a task, which I believe amounts to designing appropriately for the code we need to implement right now. This virtuous loop combines test-first programming and evolutionary design, including guiding principles like “you aren’t gonna need it” and the four elements of simple design, into test-driven development, where we check our implementation by running tests and we check our design ideas by writing tests.
Where test-first programming helps most teams most of the time reduce their mistake count to near zero, test-driven development helps them reduce their design inventory—mostly code that gets in our way because it doesn’t actively help us deliver a feature—to near zero. This further increases productivity and improves their ability to deliver by helping them waste less time agonizing over design problems they find costly to fix. I waited until I’d spent an entire release practising test-first programming before doing more test-driven development. My transition consisted of trying to do less and less up-front design for each task, letting myself feel comfortable with each new step. Within two years I estimate I designed about 5% as much up front as I did before I started practising test-first programming. I can’t measure the corresponding improvement in my design, but I look back at projects that took 3 months before I practised test-driven development that I now feel confident I could complete—truly complete—in one week. Of course, we can’t stop here!
Enter our friend analysis. To simplify the discussion, I will treat analysis as “discovering the features we want in our software” without forcing myself to state too precisely how that happens[5]. Once again, we have our familiar situation.
Once again, we face the situation where in the process of implementing features we discover new features we need, current features we don’t need, and learn new things about features we know we need to build. This adds to our analysis, meaning that we should try test-driving some features before we try to implement others.
It doesn’t take long for this to become a virtuous loop in which our desire to implement (and deliver!) features drives them ever smaller, as we extract more concentrated value out of each one[6]. When we implement feature 12 we learn something about features 23, 30 and 52. We might decide not to deliver feature 30 any more. We might decide to expand feature 23 to encompass a few more key cases. We might decide to rush feature 52 to the top of the pile. Most teams most of the time find that this cycle helps them reduce the number of rarely- or infrequently-used features in their system[7]. This yet again increases productivity and improves their ability to deliver meaningful software to their stakeholders by eliminating the time wasted on delivering too much of a feature too soon, the time wasted on entire features we thought we needed but realized we don’t, and the time wasted arguing about what a feature means, rather than writing examples together: business-oriented tests that describe how a feature works in enough detail for the business and technical project community to agree on the conditions of satisfaction for delivering the feature.
I call this behavior-driven development, and refuse to spell it with the u that provides as much value to the word as your appendix does to your body[8].
Once again, I didn’t coin the phrase, and some might argue against the way I use it, but I find it apt. This cycle includes practices like business and technical people writing examples together, feature injection, feature splitting, and value-based (rather than cost-based) planning.
At this point, I think I’ve done my job. I believe I’ve justified not only test-first programming or test-driven development, but full-on behavior-driven development, using only a single result from fundamental queuing theory. I’ve made only a single assumption—that we agree on the appropriateness of applying queuing theory to a software development system. I’ve tried to add as little as possible to my reasoning in order to keep it as context-free as possible. As a result I claim that most teams most of the time will benefit from moving along the path from code-and-fix to test-first programming to test-driven development to behavior-driven development.
Now, for homework, what happens when we consider these processes?
Surely at least once you’ve needed to deliver more features for software you’d already deployed. How well does that work? What problems do you encounter? What if you applied our new favorite queuing theory result to that rework loop?
[1] I really need a citation for this, and when I find it, I will place it here.
[2] I digress, but I really can’t help myself on that one.
[3] Also known as defect or, for the truly congruent, mistake.
[4] Clearly I didn’t coin the phrase, but I know many people who treat “test-driven development” as a simple renaming of “test-first programming”, and I believe making a stronger distinction adds real value to the conversation.
[5] I don’t think “gathering requirements”, as though we could pick them like berries, fits as a metaphor. I like “trawling for requirements”, which I believe I first read in Mike Cohn’s User Stories Applied.
[6] We can easily apply the “Pareto Distribution” here in that we can deliver 80% of the value from implementing 20% of the feature.
[7] You recall that Jim Johnson of the Standish Group reported in 1994 that 45% of developed features are “never used”. As I recall, only 7% of features were used very frequently.
[8] My Canadian and British brethren and sistren be damned. I assert my right as a Canadian to choose the British spelling when I prefer it and the American spelling when it saves me time.
A Mars rover mission failed because of a lack of integration tests. The parachute system was successfully tested. The system that detaches the parachute after the landing was successfully – but independently – tested. On Mars when the parachute successfully opened the deceleration “jerked” the lander, then the detachment system interpreted the jerking as a landing and successfully detached the parachute. Oops. Integration tests may be costly but they are absolutely necessary.
I don’t doubt the necessity of integration tests. I depend on them to solve difficult system-level problems. By contrast, I routinely see teams using them to detect unexpected consequences, and I don’t think we need them for that purpose. I prefer to use them to confirm an uneasy feeling that an unintended consequence lurks.
Let’s consider a clean implementation of the situation my commenter describes. I see this design, comprising the lander, the parachute, the detachment system, an accelerometer and an altimeter. A controller connects all these things together. Let’s look at the “code”, which I’ve written in a fantasy language that looks a little like Java/C# and a little like Ruby.
Ashley Moran has posted a working Ruby version of this example. If you speak Ruby, then I highly recommend looking at that example after you’ve read this.
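Here is a minimal sketch of that design in Python rather than the fantasy language. Only `DetachmentSystem`, `AccelerationObserver` and `parachute.detach()` are named in the text; every other class, method and constructor argument here, including `build_lander_system` and the numeric threshold, is an assumption made for illustration.

```python
class Accelerometer:
    """Forwards acceleration reports to its observers."""

    def __init__(self):
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def report_acceleration(self, acceleration):
        for observer in self.observers:
            observer.handle_acceleration_report(acceleration)


class Altimeter:
    """Reports altitude in meters (hypothetical interface)."""

    def __init__(self, meters):
        self.meters = meters

    def altitude(self):
        return self.meters


class Lander:
    def __init__(self, accelerometer, altimeter):
        self.accelerometer = accelerometer
        self.altimeter = altimeter

    def has_landed(self):
        return self.altimeter.altitude() <= 0

    def decelerate(self):
        # Assumed magnitude of the jerk when the parachute opens, in m/s^2.
        self.accelerometer.report_acceleration(-50)


class Parachute:
    def __init__(self, lander):
        self.lander = lander
        self.detached = False

    def open(self):
        self.lander.decelerate()

    def detach(self):
        # Assumption: detaching only works once the lander has landed.
        if not self.lander.has_landed():
            raise RuntimeError("cannot detach before landing")
        self.detached = True


class DetachmentSystem:
    """Plays the AccelerationObserver role."""

    def __init__(self, parachute):
        self.parachute = parachute

    def handle_acceleration_report(self, acceleration):
        # Treats sudden deceleration as a landing.
        if acceleration <= -50:
            self.parachute.detach()


def build_lander_system(altitude_m):
    """The controller's job: connect all the parts together."""
    accelerometer = Accelerometer()
    lander = Lander(accelerometer, Altimeter(altitude_m))
    parachute = Parachute(lander)
    accelerometer.add_observer(DetachmentSystem(parachute))
    return lander, parachute
```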
Now the test for DetachmentSystem, which acts as an AccelerationObserver. What should it do if it detects such sudden deceleration? It should detach the parachute.
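A minimal sketch of such a collaboration test in Python, using `unittest.mock` from the standard library. The method name `handle_acceleration_report`, the -50 threshold and the second test (gentle deceleration) are assumptions for illustration; the `DetachmentSystem` shown is the smallest hypothetical implementation that passes.

```python
import unittest
from unittest.mock import Mock


class DetachmentSystem:
    """Hypothetical implementation playing the AccelerationObserver role."""

    def __init__(self, parachute):
        self.parachute = parachute

    def handle_acceleration_report(self, acceleration):
        if acceleration <= -50:
            self.parachute.detach()


class DetachmentSystemTest(unittest.TestCase):
    def test_detaches_parachute_on_sudden_deceleration(self):
        # Mock: we expect the detachment system to ask the parachute to detach.
        parachute = Mock(spec=["detach"])
        DetachmentSystem(parachute).handle_acceleration_report(-50)
        parachute.detach.assert_called_once_with()

    def test_leaves_parachute_attached_on_gentle_deceleration(self):
        parachute = Mock(spec=["detach"])
        DetachmentSystem(parachute).handle_acceleration_report(-5)
        parachute.detach.assert_not_called()
```

Saved as a module, these tests run with `python -m unittest`.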
Since this test expects the parachute to be able to detach, I have to test that. Now, detaching only works if we’ve landed. (I’ve simplified on purpose. Suppose the parachute can’t survive a drop from any height. It’s easy to add that detail in later.)
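A minimal sketch of those parachute tests in Python. The `has_landed()` query on the lander, the `detached` flag and the choice of `RuntimeError` are assumptions; the text only says that detaching works once we have landed.

```python
import unittest
from unittest.mock import Mock


class Parachute:
    """Hypothetical parachute that refuses to detach before landing."""

    def __init__(self, lander):
        self.lander = lander
        self.detached = False

    def detach(self):
        if not self.lander.has_landed():
            raise RuntimeError("cannot detach before landing")
        self.detached = True


class ParachuteTest(unittest.TestCase):
    def lander_that_has_landed(self, landed):
        # Stub: the lander simply answers the question we configure.
        lander = Mock(spec=["has_landed"])
        lander.has_landed.return_value = landed
        return lander

    def test_detaches_after_landing(self):
        parachute = Parachute(self.lander_that_has_landed(True))
        parachute.detach()
        self.assertTrue(parachute.detached)

    def test_fails_to_detach_before_landing(self):
        parachute = Parachute(self.lander_that_has_landed(False))
        with self.assertRaises(RuntimeError):
            parachute.detach()
        self.assertFalse(parachute.detached)
```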
Hm. I notice that parachute.detach() might fail. But I just wrote a test that uses parachute.detach() and doesn’t yet show how it handles that method failing. I have to test that.
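A minimal sketch of that missing test in Python. It assumes the detachment system makes no attempt to handle the failure, so the error travels back to whoever sent the acceleration report; the class body and the exception type are hypothetical.

```python
import unittest
from unittest.mock import Mock


class DetachmentSystem:
    """Hypothetical implementation playing the AccelerationObserver role."""

    def __init__(self, parachute):
        self.parachute = parachute

    def handle_acceleration_report(self, acceleration):
        if acceleration <= -50:
            self.parachute.detach()


class DetachmentSystemFailureTest(unittest.TestCase):
    def test_detach_failure_propagates_to_the_reporter(self):
        # Stub detach() to fail, as it would before landing.
        parachute = Mock(spec=["detach"])
        parachute.detach.side_effect = RuntimeError("cannot detach before landing")
        detachment_system = DetachmentSystem(parachute)
        with self.assertRaises(RuntimeError):
            detachment_system.handle_acceleration_report(-50)
```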
Hm. So handling an acceleration report of -50 m/s² can fail. Who might issue such a report? The accelerometer. Since the detach system doesn’t handle this failure, I have to test what the accelerometer does when issuing an acceleration report might fail.
It turns out that the accelerometer might fail when reporting acceleration of -50 m/s². When might it do that? When the lander decelerates. What happens then?
So the parachute opening could cause it to detach because the lander hasn’t landed yet. I don’t know about you, but I think the parachute provides the most value when it helps the lander land, and not once it has landed. That tells me that someone, somewhere needs to handle the exception that detach() would raise, or at least prevent detach() from happening while the altimeter reads above a few meters off the ground.
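A minimal sketch of the accelerometer tests in Python. The observer list, `add_observer` and `report_acceleration` are assumed names; the sketch also assumes the accelerometer does nothing special when an observer fails.

```python
import unittest
from unittest.mock import Mock


class Accelerometer:
    """Hypothetical accelerometer that notifies AccelerationObservers."""

    def __init__(self):
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def report_acceleration(self, acceleration):
        for observer in self.observers:
            observer.handle_acceleration_report(acceleration)


class AccelerometerTest(unittest.TestCase):
    def setUp(self):
        self.observer = Mock(spec=["handle_acceleration_report"])
        self.accelerometer = Accelerometer()
        self.accelerometer.add_observer(self.observer)

    def test_forwards_report_to_observer(self):
        self.accelerometer.report_acceleration(-50)
        self.observer.handle_acceleration_report.assert_called_once_with(-50)

    def test_observer_failure_propagates(self):
        # Stub the observer to fail the way the detachment system can.
        self.observer.handle_acceleration_report.side_effect = RuntimeError(
            "cannot detach before landing"
        )
        with self.assertRaises(RuntimeError):
            self.accelerometer.report_acceleration(-50)
```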
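A minimal sketch of the next two tests in Python: what the lander reports when it decelerates, and who makes it decelerate. The `decelerate()` and `open()` methods and the constructor arguments are assumptions for illustration.

```python
import unittest
from unittest.mock import Mock


class Lander:
    """Hypothetical lander that reports its own deceleration."""

    def __init__(self, accelerometer, altimeter):
        self.accelerometer = accelerometer
        self.altimeter = altimeter

    def decelerate(self):
        self.accelerometer.report_acceleration(-50)


class Parachute:
    def __init__(self, lander):
        self.lander = lander

    def open(self):
        self.lander.decelerate()


class DecelerationTest(unittest.TestCase):
    def test_decelerating_lander_reports_sudden_deceleration(self):
        accelerometer = Mock(spec=["report_acceleration"])
        lander = Lander(accelerometer, Mock(spec=["altitude"]))
        lander.decelerate()
        accelerometer.report_acceleration.assert_called_once_with(-50)

    def test_opening_parachute_decelerates_lander(self):
        lander = Mock(spec=["decelerate"])
        Parachute(lander).open()
        lander.decelerate.assert_called_once_with()
```

Reading the expectations as a chain, opening the parachute leads to a -50 report, which is exactly the report that can fail.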
In writing this test, I see that in order to stop the detachment system from telling the parachute to detach, it needs access to the altimeter.
Integration problem detected. When I wire the detachment system up to the altimeter, even the collaboration test shows how to ensure that the parachute doesn’t detach in this kind of dangerous situation.
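A minimal sketch of that collaboration test in Python, with the detachment system wired up to the altimeter. The `altitude()` query and the 3-meter cutoff (`MAX_DETACH_ALTITUDE_M`) are hypothetical; the text only says "a few meters off the ground".

```python
import unittest
from unittest.mock import Mock


class DetachmentSystem:
    """Hypothetical detachment system that also consults the altimeter."""

    MAX_DETACH_ALTITUDE_M = 3  # assumed value for "a few meters"

    def __init__(self, parachute, altimeter):
        self.parachute = parachute
        self.altimeter = altimeter

    def handle_acceleration_report(self, acceleration):
        near_ground = self.altimeter.altitude() <= self.MAX_DETACH_ALTITUDE_M
        if acceleration <= -50 and near_ground:
            self.parachute.detach()


class DetachmentSystemWithAltimeterTest(unittest.TestCase):
    def detachment_system_at(self, altitude_m):
        self.parachute = Mock(spec=["detach"])
        altimeter = Mock(spec=["altitude"])
        altimeter.altitude.return_value = altitude_m
        return DetachmentSystem(self.parachute, altimeter)

    def test_keeps_parachute_when_jerked_high_above_ground(self):
        self.detachment_system_at(1000).handle_acceleration_report(-50)
        self.parachute.detach.assert_not_called()

    def test_detaches_parachute_when_jerked_at_ground_level(self):
        self.detachment_system_at(0).handle_acceleration_report(-50)
        self.parachute.detach.assert_called_once_with()
```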
Integration problem solved with no integration tests. Instead, I have a bunch of collaboration tests, one important contract test, and, where I first wrote “the ability to notice things”, a systematic approach to choosing the next test, which I describe in the comments below. Any questions?
Dan Fabulich rightly jumped on me for using the phrase “an ability to notice things” just a little earlier in this article. I chose that phrase lazily because I didn’t want to patronize you by writing, “an ability to perform basic reasoning”. Oops. I thought about how I choose the next test, and I decided to take the time to include that here. Enjoy.
In this example, I used no magic to choose the next test, but rather some fundamental reasoning.
Every time I say “I need a thing to do X” I introduce an interface. In my current test, I end up stubbing or mocking one of those interfaces.
Every time I stub a method, I make an assumption about what values that method can return. To check that assumption, I have to write a test that expects the return value I’ve just stubbed. I use only basic logic there: if A depends on B returning x, then I have to know that B can return x, so I have to write a test for that.
Every time I mock a method, I make an assumption about a service the interface provides. To check that assumption, I have to write a test that tries to invoke that method with the parameters I just expected. Again, I use only basic logic there: if A causes B to invoke c(d, e, f) then I have to know that I’ve tested what happens when B invokes c(d, e, f), so I have to write a test for that.
Every time I introduce a method on an interface, I make a decision about its behavior, which forms the contract of that method. To justify that decision, I have to write tests that help me implement that behavior correctly whenever I implement that interface. I write contract tests for that. Once again, I use only basic logic there: if A claims to be able to do c(d, e, f) with outcomes x, y, and z, then when B implements A, it must be able to do c(d, e, f) with outcomes x, y, and z (and possibly other non-destructive outcomes).
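A minimal sketch of two of these rules in Python: a stub followed by the test that checks its assumption, and a contract test shared by every implementation of an interface. All names here (`has_landed`, `altitude`, `FixedAltimeter`, `AltimeterContract`) and the contract itself are assumptions made for illustration.

```python
import unittest
from unittest.mock import Mock


class Lander:
    def __init__(self, accelerometer, altimeter):
        self.accelerometer = accelerometer
        self.altimeter = altimeter

    def has_landed(self):
        return self.altimeter.altitude() <= 0


class Parachute:
    def __init__(self, lander):
        self.lander = lander
        self.detached = False

    def detach(self):
        if not self.lander.has_landed():
            raise RuntimeError("cannot detach before landing")
        self.detached = True


class ParachuteTest(unittest.TestCase):
    def test_detaches_once_landed(self):
        # Stub: this assumes has_landed() can return True.
        lander = Mock(spec=["has_landed"])
        lander.has_landed.return_value = True
        parachute = Parachute(lander)
        parachute.detach()
        self.assertTrue(parachute.detached)


class LanderTest(unittest.TestCase):
    def test_has_landed_at_ground_level(self):
        # Checks the stub's assumption: a real Lander can return True.
        # This stub in turn assumes altitude() can return 0.
        altimeter = Mock(spec=["altitude"])
        altimeter.altitude.return_value = 0
        lander = Lander(Mock(spec=["report_acceleration"]), altimeter)
        self.assertTrue(lander.has_landed())


class AltimeterContract:
    """Assumed contract: altitude() returns a non-negative number of meters."""

    def create_altimeter(self):
        raise NotImplementedError

    def test_altitude_is_a_non_negative_number(self):
        altitude = self.create_altimeter().altitude()
        self.assertIsInstance(altitude, (int, float))
        self.assertGreaterEqual(altitude, 0)


class FixedAltimeter:
    """Hypothetical implementation, useful on the ground."""

    def __init__(self, meters):
        self.meters = meters

    def altitude(self):
        return self.meters


class FixedAltimeterTest(AltimeterContract, unittest.TestCase):
    # Every implementation of the interface inherits the same contract tests.
    def create_altimeter(self):
        return FixedAltimeter(0)
```

Each stub leaves behind a question ("can it really return that?"), and the chain ends only when a contract test or a concrete implementation answers it.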
I simply kept applying these points over and over again until I stopped needing tests. Along the way, I found a problem and fixed it before it left my hands.
If I can describe the steps well enough for others to follow – and I posit I’ve just done that here – then I don’t agree to labeling it “magic”.
I take great pleasure in announcing the first ever Code Retreat outside the US, scheduled in Reykjavík, Iceland on May 9, 2009. I had the pleasure of co-pre-announcing this event at my March 2009 course there and feel great that the guys at Sprettur have fulfilled their commitment to organize this event. I hope this represents the first step towards a vibrant Software Craftsmanship community in Iceland.
Please support your local Code Retreat event. I will present one in association with PlateSpin in Toronto on August 8, 2009.
I fear that this first article in the series may be attacking a view that very few people hold: the idea that one should test all code paths by integration testing alone. — Dan Fabulich
When I tell TDD practitioners my opinion about integration tests, some treat my position as a straw man. They point out that “no one” seriously tries to test entire systems exclusively with integration tests. While I understand their reaction, I need to point out that I never made that claim. I see far more damaging behavior in teams that practise TDD: they duplicate a sizable amount of their effort by designing their objects with thorough focused tests, then adding a suite of integration tests that verify a substantial amount of the same behavior. I understand why they do it. I used to do it. And I want them to stop.
Every integration test costs… well, I don’t know how to say accurately how much it costs. After computing the superficial cost of writing and maintaining the test, I quickly lose track of the varying effects of writing integration tests in place of, or even in addition to, focused object tests. I can compute the raw execution time tax on integration tests: an average focused test executes in 4 ms, while an average integration test takes closer to 100 ms. I feel comfortable estimating the difference at a more conservative single order of magnitude (a factor of 10). Beyond that, I find myself too lost in the implications of writing integration tests to form a clear picture of the cost. Let me give you an idea of what I mean.
A Tale of Two Test Suites
Consider two test suites. One executes in 6 seconds, and the other in 1 minute. Pretend they cover the same code equally well. I mean that they have the same power to uncover mistakes in the system. Now imagine yourself writing code and executing the 6-second suite. You make a handful of edits, then you run the tests. What do you do for 6 seconds? You predict the outcome of the test run: they will all pass, or the new test will fail because you’ve just written it, or the new test might pass because you think you wrote too much code to pass a test 10 minutes ago. In that span of time, you have your result: the tests all pass, so now you refactor. You probably needed about 6 seconds to read up to here.
Now imagine you run the 1-minute test suite. Once again, you predict the result, during which time 6 seconds pass. If you work alone, then after 8 seconds you’ve started drumming your fingers on the desk or letting your eyes dart around the room. You notice the long list of tasks on the team task board. You start to feel your stomach rumble, noticing the time: 11:42. Time for lunch soon. You wonder what the cantina has for lunch, so you point your browser at their intranet site. Tilapia sounds good. You wonder whether Lisa will join you for lunch, so you switch to your email client. Before you write her, you notice a notification to pay your credit card bill. You can do that in 30 seconds, so you switch back to your browser to log in to online banking and quickly make a payment. It turns out Lisa has a lunch meeting, and you reconsider your choice of fish. Today, you decide, feels like a burger day. In the time you imagined yourself doing that, assuming you guessed how long it took to actually do what you imagined, over 1 minute passed. The computer has spent valuable computing time waiting for you.
Pairing doesn’t seem to solve this problem. If you ran this test suite during a pair-programming session, then you probably spent time chatting. At first, you discussed the recent test. After a while, you discussed the task. That killed about 40 seconds, so you started drifting to other topics: the weekend, the kids, XBox, Battlestar Galactica, baseball, management… then you turned around to notice the test run finished while you were arguing whether Cliff Lee deserved the Cy Young award. I don’t mind injecting plenty of relaxed conversation into my work, but when waiting repeatedly for a 1-minute test suite it doesn’t take long to run out of things to talk about.
I need to point out the dual cost here. The first, we can easily see and measure: the time we spend waiting for the tests plus the time the computer waits for us, because we find it hard to stare at the test runner for 60 seconds and react to it immediately after it finishes. I don’t care much about that cost. I care about the visible but highly unquantifiable cost of losing focus.
TDD works well for me in large part because it helps me focus. When I write a test, I clarify my immediate goal, focus on making it pass, then focus on integrating that work more appropriately into the design. I get to do this in short cycles that demand sustained focus and allow brief recovery[1]. This cycle of focus and recovery builds rhythm and this rhythm builds momentum. This helps lead to the commonly-cited and powerful state of flow. A 6-second test run provides a moment to recover from exertion, whereas a 1-minute test run disrupts flow. It acts like an annoying short interruption every few minutes. We can try to measure the cumulative effect of these interruptions, but I guess you can imagine a day, possibly a recent one, when periodic short interruptions made it nearly impossible for you to concentrate. How productive did you feel that day? How much did you achieve? How much pressure did you feel to catch up the next day? How relaxed did you feel that evening at home? Did you enjoy dinner? Did you feel present for your spouse or kids or pets? How well did you sleep? How refreshed did you feel the next morning?
Among the early TDD literature I distinctly remember reading that practising TDD would help me focus, relax, achieve more and feel better at the end of a task. I remember agonizing over integration tests. Teams call me expressly to learn how to tame big, slow, brittle test suites. They don’t call me when they feel focused, relaxed and productive. I tell you: integration tests will slowly kill you.
So What Now?
But you have integration tests now, and you haven’t yet learned about the alternatives. How can you cope with your reality? You could regain your focus by running the most important 10% of those tests. That would take 6 seconds and fit into your flow. It also runs a substantial risk of failure. You’ve experienced this. Remember the last time you changed a line of code in one part of the system and it broke something way over there in another module? How did you feel when that happened? How long did you spend tracking down a mistake in some arcane part of the system that perhaps no one understands? How did you deal with having to branch your code changes to deal with the bigger problem? How many times have you told your wild goose chase story to your fellow programmers? How long did you need to recover before returning to a decent state of flow while working on your original task?
So it appears you have a choice between frequent annoying disruptions and less frequent but comparatively catastrophic disruptions. A Morton’s Fork you can blame squarely on integration tests. Stop writing them.