
How test-driven development works (and more!)

It surprises me, from time to time, how much I still need to justify test-driven development to prospects and would-be course attendees. Many feel that TDD has crossed the chasm, while others still see TDD as a cultish practice worth marginalizing. I take some blame for those who find TDD cultish, because until now I haven’t had a strong, sensible, theoretical basis to justify TDD as an idea. I could do no better than “it works for me” or “my friends like it”. That has changed since I’ve started giving my talk “Introduction to Agile with the Theory of Constraints” in which I use concepts from Theory of Constraints to motivate the practices of agile software development, notably those of extreme programming. If you buy in to ideas from Theory of Constraints or Lean Manufacturing, then I think I now have a stronger argument to justify the core programming practices in extreme programming in particular and agile software development in general. I don’t even need all of the Theory of Constraints but rather a simple appeal to fundamental concepts in Queuing Theory.

Queuing Theory?

Yes, Queuing Theory. (And I don’t plan to capitalize that any longer.) I don’t claim to have any particular expertise in this area, but I have already seen how to use queuing theory ideas in optimizing network-based systems, and I see no reason we couldn’t extend that to software delivery systems. Better, I only need to appeal to a single idea from queuing theory to make my point.

Given a process B, which follows a process A, sometimes in performing B we need to perform some of A again. We can remove the need to rework by taking some portion of process B and performing it before process A.[1]

This merits a diagram. If we have this problem

then we can solve it by doing this

and the resulting system will work more efficiently by removing wasteful rework. I assume here that we derive no significant benefit from the rework itself, which I suppose I must justify, but let’s not ruin a good story with the truth. Here I’ve described the general problem, and by applying it to software development, I can… well, I find it more effective if I save the punchline for the end.

Winston Royce, 1970, revisited

I imagine you know this diagram

and appreciate that Royce wrote in his now infamous paper that this single-phase waterfall is risky and invites failure. If you don’t appreciate that, then I cannot recommend strongly enough that you read the original paper in its entirety, rather than stopping after page 2 as most people have done.[2]

We can apply the queuing theory result I’ve just cited to this diagram and generate some interesting conclusions. I’ll start by focusing in on this portion of the system

We write code, then we test it. Sadly, we occasionally find a bug[3] which makes us change the code we wrote after we thought we’d finished it. That makes a loop of the type we can unravel with our queuing theory result.

Since “coding” is process A and “testing” is process B, we need to do some testing before we start coding.

It doesn’t take long for this to become a virtuous loop where we write only the code we need to write in order to pass the tests we write.

I use the term test-first programming to describe this cycle.[4] When we practise test-first programming, we design as much detail as we can before writing the first test, then use the tests to help us type in our implementation correctly. Most teams most of the time can use test-first programming to reduce their mistake count to near zero, which increases their productivity and improves their ability to deliver, by helping them waste less time agonizing over whether to fix mistakes late in a release. I started this way in 2000 when I first discovered JUnit and stopped making silly mistakes in the code I wrote, which I found significantly beneficial in helping me code more confidently. I still designed most of what I built mostly up front.
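To make the cycle concrete, here is a minimal sketch in Python rather than JUnit. The function `parse_price_in_cents` and its three tests are hypothetical, invented only for illustration: I write the tests first, then type in just enough code to make them pass.

```python
import unittest


# Written first: tests that state what the (hypothetical) function should do.
class ParsePriceTest(unittest.TestCase):
    def test_whole_dollars(self):
        self.assertEqual(parse_price_in_cents("12"), 1200)

    def test_dollars_and_cents(self):
        self.assertEqual(parse_price_in_cents("12.50"), 1250)

    def test_rejects_empty_text(self):
        with self.assertRaises(ValueError):
            parse_price_in_cents("")


# Written second: only the code needed to pass the tests above.
def parse_price_in_cents(text):
    if not text:
        raise ValueError("price text must not be empty")
    dollars, _, cents = text.partition(".")
    return int(dollars) * 100 + int(cents or "0")


def run_tests():
    """Run the tests above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePriceTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Notice what the sketch leaves out: it does nothing about input like "12.5", because no test asks for that yet. If I need that behavior, I add a test for it first.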

After a while, though, I recognized a new process loop: I found some parts of my design difficult to test, or I found some parts of my design didn’t fit together when I tried to type them in.

Returning to our queuing theory result, since “designing” is process A and “doing test-first programming” is process B, we need to do some test-first programming before we start designing.

It doesn’t take long for this to become a virtuous loop where we check our design ideas as we think of them and implement only the parts of the design we can justify needing. When we include refactoring in our practice, we can confidently “under-design” compared to the level of design we expect to need by the end of a task, which I believe amounts to designing appropriately for the code we need to implement right now. This virtuous loop combines test-first programming and evolutionary design, including guiding principles like “you aren’t gonna need it” and the four elements of simple design, into test-driven development, where we check our implementation by running tests and we check our design ideas by writing tests.

Where test-first programming helps most teams most of the time reduce their mistake count to near zero, test-driven development helps them reduce their design inventory—mostly code that gets in our way because it doesn’t actively help us deliver a feature—to near zero. This further increases productivity and improves their ability to deliver by helping them waste less time agonizing over design problems they find costly to fix. I waited until I’d spent an entire release practising test-first programming before doing more test-driven development. My transition consisted of trying to do less and less up-front design for each task, letting myself feel comfortable with each new step. Within two years I estimate I designed about 5% as much up front as I did before I started practising test-first programming. I can’t measure the corresponding improvement in my design, but I look back at projects that took 3 months before I practised test-driven development that I now feel confident I could complete—truly complete—in one week. Of course, we can’t stop here!

Enter our friend analysis. To simplify the discussion, I will treat analysis as “discovering the features we want in our software” without forcing myself to state too precisely how that happens.[5] Once again, we have our familiar situation.

Once again, we face the situation where in the process of implementing features we discover new features we need, current features we don’t need, and learn new things about features we know we need to build. This adds to our analysis, meaning that we should try test-driving some features before we try to implement others.

It doesn’t take long for this to become a virtuous loop in which our desire to implement (and deliver!) features drives them ever smaller, as we extract more concentrated value out of each one.[6] When we implement feature 12 we learn something about features 23, 30 and 52. We might decide not to deliver feature 30 any more. We might decide to expand feature 23 to encompass a few more key cases. We might decide to rush feature 52 to the top of the pile. Most teams most of the time find that this cycle helps them reduce the number of rarely- or infrequently-used features in their system.[7] This yet again increases productivity and improves their ability to deliver meaningful software to their stakeholders by eliminating the time wasted on delivering too much of a feature too soon, the time wasted on entire features we thought we needed but realized we don’t, and the time wasted arguing about what a feature means, rather than writing examples together: business-oriented tests that describe how a feature works in enough detail for the business and technical project community to agree on the conditions of satisfaction for delivering the feature.

I call this behavior-driven development, and refuse to spell it with the u that provides as much value to the word as your appendix does to your body.[8]

Once again, I didn’t coin the phrase, and some might argue against the way I use it, but I find it apt. This cycle includes practices like business and technical people writing examples together, feature injection, feature splitting, and value-based (rather than cost-based) planning.

At this point, I think I’ve done my job. I believe I’ve justified not only test-first programming or test-driven development, but full-on behavior-driven development, using only a single result from fundamental queuing theory. I’ve made only a single assumption—that we agree on the appropriateness of applying queuing theory to a software development system. I’ve tried to add as little as possible to my reasoning in order to keep it as context-free as possible. As a result I claim that most teams most of the time will benefit from moving along the path from code-and-fix to test-first programming to test-driven development to behavior-driven development.

Now, for homework, what happens when we consider these processes?

Surely at least once you’ve needed to deliver more features for software you’d already deployed. How well does that work? What problems do you encounter? What if you applied our new favorite queuing theory result to that rework loop?


[1] I really need a citation for this, and when I find it, I will place it here.

[2] I digress, but I really can’t help myself on that one.

[3] Also known as defect or, for the truly congruent, mistake.

[4] Clearly I didn’t coin the phrase, but I know many people who treat “test-driven development” as a simple renaming of “test-first programming”, and I believe making a stronger distinction adds real value to the conversation.

[5] I don’t think “gathering requirements”, as though we could pick them like berries, fits as a metaphor. I like “trawling for requirements”, which I believe I first read in Mike Cohn’s User Stories Applied.

[6] We can easily apply the “Pareto Distribution” here in that we can deliver 80% of the value from implementing 20% of the feature.

[7] You recall that Jim Johnson of the Standish Group reported in 1994 that 45% of developed features are “never used”. As I recall, only 7% of features were used very frequently.

[8] My Canadian and British brethren and sistren be damned. I assert my right as a Canadian to choose the British spelling when I prefer it and the American spelling when it saves me time.

Interpreting inaccurate estimates

I ran across this today and thought I’d comment briefly about it. How can you interpret inaccurate estimates? What might it mean when you estimate incorrectly?

I know individuals that beat themselves up over inaccurate estimates. I know teams that beat each other up. I know managers that beat up their teams. I even know a company that awards bonus compensation based on the accuracy of their programmers’ estimates. In those environments, inaccurate estimates hurt, so although I don’t like cost estimates in general for most teams most of the time, I recognize that some people must play the game to keep their job. That said, I return to something Ward Cunningham said in the early days about Fit: when we look at test results in a spreadsheet, we can look at patterns of red cells to learn something about our patterns of failure. He drew our attention to two kinds of patterns: systematic failure and sporadic failure.

In the early XP literature, books like Planning XP told us that by measuring velocity we gracefully handle systematic estimate inaccuracy as the “value” of our story points fluctuates over time. Think of it like a floating currency: its value in hours changes slowly over time as our ability to complete points improves. If you prefer to estimate in hours, then after several months, you might notice a bunch of stories whose actual costs were close to some constant multiplier of the estimate. In that case, you would consider multiplying all remaining estimates by that constant until you internalized it and began estimating in the “new scale” out of habit. Even when I cared deeply about cost estimates, I worried little about systematic inaccuracies.
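The rescaling I describe amounts to simple arithmetic. Here is a minimal sketch, assuming a made-up history of (estimated hours, actual hours) pairs and using the median ratio as the "constant multiplier"; the function names and the numbers are hypothetical, and you might prefer some other summary than the median.

```python
from statistics import median

# Hypothetical history: (estimated hours, actual hours) for finished stories.
history = [(4, 6), (8, 12), (2, 3), (10, 15), (6, 9)]


def systematic_multiplier(history):
    """Median ratio of actual cost to estimated cost."""
    return median(actual / estimate for estimate, actual in history)


def rescale(estimates, multiplier):
    """Multiply each remaining estimate by the observed multiplier."""
    return [estimate * multiplier for estimate in estimates]


multiplier = systematic_multiplier(history)
remaining_estimates = [5, 3, 8]

print(multiplier)                                # 1.5
print(rescale(remaining_estimates, multiplier))  # [7.5, 4.5, 12.0]
```

When every story misses by roughly the same factor, as in this invented history, the correction costs almost nothing, which is why systematic inaccuracy worried me so little.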

Sporadic inaccuracy hurts more. Since we provide cost estimates so that others might plan, we owe it to them to lower the variance of the actual cost of our work, in order to make planning more useful. I have developed a conjecture over the last few years that changes the question of sporadic estimate inaccuracy:

The actual cost of a story depends on (among other things) the complexity of the story and the current state of the design. The higher our technical debt, the more the cost of that debt dominates the cost of the story.

I have noticed various people floating the definition of “technical debt” to suit their needs, so I want to clarify what I mean by “technical debt”:

By technical debt I refer to the latent cost of the amount of rot in the design. I think of technical debt as the cost of the design improvements we need to make in order to feel comfortable adding features or fixing defects in that part of the system.

You can think of technical debt as financial debt: interest-bearing principal owed to another party. In this case, the other party is the system or the project, the principal is the current design and the interest is the extra cost associated with either rescuing or working around the current design. With these definitions established, I can state my conjecture this way:

In systems with high technical debt, the cost of repaying that technical debt dominates the cost of a story.

Since not all stories affect all parts of the design uniformly, we can do a little better:

The cost of a story depends on the complexity of the story and the amount of technical debt in the areas of the design we need to change or extend to deliver the story. Working in areas with high technical debt causes the cost of repaying that debt to dominate the cost of the story.

I think you get the point. From this it follows that in systems with generally high technical debt, the distribution of technical debt effectively determines the variance in the cost of the stories. Sporadic estimate inaccuracy, then, likely has a clear root cause: high technical debt distributed decidedly non-uniformly. This follows logically, because if the system distributed technical debt uniformly, then our estimates would show systematic inaccuracy.

This allows me to make two broad claims:

  1. In general, if a system has high and sporadic technical debt, we’ll tend to estimate with sporadic inaccuracy even if we estimate the relative complexity of those stories with perfect accuracy.
  2. In general, we should estimate with at most systematic inaccuracy when delivering stories for a greenfield system or component.
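A toy model can illustrate these two claims. The sketch below assumes, purely for illustration, that technical debt inflates a story's cost in proportion to its complexity; that proportional model, the area names and every number are my assumptions, not measurements. Estimates equal the true complexity, so any error comes only from the debt.

```python
from statistics import pstdev


def actual_cost(complexity, debt_factor):
    # Assumed toy model: debt inflates cost in proportion to complexity.
    return complexity * (1 + debt_factor)


def error_ratios(stories, debt_by_area):
    """Actual cost divided by estimate, where the estimate is the true complexity."""
    return [
        actual_cost(complexity, debt_by_area[area]) / complexity
        for complexity, area in stories
    ]


# Hypothetical stories: (complexity, area of the design they touch).
stories = [(2, "billing"), (4, "reports"), (8, "billing"), (4, "search")]

uniform_debt = {"billing": 1.0, "reports": 1.0, "search": 1.0}
patchy_debt = {"billing": 3.0, "reports": 0.0, "search": 1.0}

uniform = error_ratios(stories, uniform_debt)  # [2.0, 2.0, 2.0, 2.0]
patchy = error_ratios(stories, patchy_debt)    # [4.0, 1.0, 4.0, 2.0]

print(pstdev(uniform))  # 0.0: every story is off by the same factor (systematic)
print(pstdev(patchy))   # greater than zero: the error depends on the area (sporadic)
```

With uniform debt, a single multiplier repairs every estimate. With patchy debt, no single multiplier can, even though the complexity estimates themselves are perfect.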

I can interpret these claims more pithily:

  1. If your estimates suck when adding features to a legacy system, blame the shittiness of the codebase.
  2. If your estimates suck when adding features to mostly clean code, blame the shittiness of the programmers.

In particular, if you feel bad because your estimates suck when adding features to a legacy system, you can relax. As you attempt to build those features with high discipline, the resulting volatility in your actual costs will come from the current state of the design. The design will actually try to convince you to cut corners. Cutting corners can only improve the accuracy of your estimate for the current story at the expense of the remaining stories. Cutting corners can only get your project manager off your back for a day or two. You will eventually need to stop cutting corners.

Part 3: The risks associated with lengthy tests

I just read a tweet from Dale Emery that turned my attention back to the topic of integration tests and their scamminess.

Since practitioners tend to write acceptance tests as end-to-end (or integration) tests, I think I can safely substitute the phrase “integration tests” here for “acceptance tests” and retain the essence of Dale’s meaning. I do this because I don’t want you to conclude from what I plan to write that I treat acceptance tests with the same disdain as I treat integration tests. I already went through that when Eric Lefevre-Ardant introduced us to David, Agile Developer, one of the personas that the Agile 200x conference has developed to help people choose sessions at the conference. While I felt flattered that he chose my session as one to attend, he accidentally misnamed it “Acceptance Tests Are A Scam”, which set off a miniature firestorm in Twitterland. In short: I like acceptance tests when we write them to confirm the presence of a feature; and I dislike them when programmers write integration tests, checking the design and behavior of large parts of the system, and call them “acceptance tests” to justify their existence.

Back to Dale’s question, which I paraphrase: how often do we write faulty integration tests, meaning that the test failure points to an error in the test, rather than in the production code? Rather than attempt to answer that question, I prefer to write about a strongly related idea: integration tests necessarily fail more frequently and in a more costly manner than isolated object tests, even when the underlying production code behaves as expected. To simplify the discourse a bit, let me introduce the term unjustifiable test failure to mean a test failure without a corresponding defect in the production code. When an incorrect test fails, I will call that failure unjustifiable.

The cost of unjustifiable test failures

An unjustifiable failure has both a clear cost and a hidden cost. We know the immediate, clear cost: an unjustifiable failure causes me to do root cause analysis on a nonexistent failure, which costs me something and gains me nothing. More insidious, though, persistent false failures erode my confidence in the tests. I tend to value the tests less. I run them less frequently, reducing the actual value I get from the resulting feedback. With less feedback comes less confidence in the code, and more conservative behavior. I change the code less frequently; I avoid extensive changes, even when they seem appropriate; I entertain fewer ideas because I can’t as easily predict the cost of the corresponding changes. I start designing not to lose, rather than designing to win. I can’t quantify that cost on a given project, but I know it in my heart and we could measure it over time. I think one should eliminate unjustifiable test failures where possible, or at least where easy, and integration tests simply cause an avoidably large number of unjustifiable failures.

Integration tests fail unjustifiably more frequently

Let me support this conjecture with two key arguments.

First, integration tests tend to require more lines of code than isolated object tests. Perhaps more formally, as we write more integration tests and more isolated object tests in a system, the average length of the integration tests becomes considerably larger—at least double—than the average length of the corresponding isolated object test. If we accept this premise, then combine it with the well-accepted premise that more code means more defects in general, then it follows directly that integration tests tend to have more defects than isolated object tests. This means that integration tests fail unjustifiably more frequently than isolated object tests.

Next, because integration tests rely on the correctness of more than one object, it follows directly that a defect in an object results in more integration test failures as compared to the number of failures in corresponding isolated object tests. That production defect, then, results in two classes of test failures: justifiable ones in tests designed to verify the defective behavior, and unjustifiable ones in tests designed to verify another behavior, but that happen to execute the defective code.

You can envision an example of the latter case by thinking of an integration test that verifies a specific alternate path in step 4 of a 5-step process. This test must execute steps 1 through 3 of the process in order to execute step 4, so if we have a defect in step 2 of the process, then this test fails unjustifiably, because it does not actively try to verify step 2. While the test failure can be justified by a defect in step 2, I call the failure unjustifiable with respect to the behavior under test, because this test does not deliberately attempt to test step 2. Presumably, we have tests that intend to test step 2, which justifiably fail.
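Here is a minimal sketch of that situation. The order-processing steps, their names and the numbers are all hypothetical, and I show only the first four steps of the process, since step 5 plays no part in the example. Step 2 contains a deliberate defect; both tests aim at the member discount in step 4.

```python
def step1_parse(order):
    return {**order, "quantity": int(order["quantity"])}


def step2_subtotal(order):
    # Deliberate defect: adds where it should multiply.
    return {**order, "subtotal": order["quantity"] + order["unit_price"]}


def step3_tax(order):
    return {**order, "tax": order["subtotal"] // 10}


def step4_discount(order):
    # Alternate path under test: members get 5 off the total.
    discount = 5 if order["member"] else 0
    return {**order, "total": order["subtotal"] + order["tax"] - discount}


def integration_test_member_discount():
    """Runs steps 1 through 4; intends to verify only step 4."""
    order = {"quantity": "3", "unit_price": 20, "member": True}
    for step in (step1_parse, step2_subtotal, step3_tax, step4_discount):
        order = step(order)
    return order["total"] == 61  # expects 60 + 6 - 5


def isolated_test_member_discount():
    """Hands step 4 a known-good input; never executes step 2."""
    order = {"subtotal": 60, "tax": 6, "member": True}
    return step4_discount(order)["total"] == 61


print(integration_test_member_discount())  # False: fails because of step 2
print(isolated_test_member_discount())     # True: step 4 behaves as designed
```

The integration test fails even though the discount logic behaves correctly, so its failure tells me nothing about the behavior it was written to check.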

Integration tests, then, result in unjustifiable failures by executing some potentially defective behavior without intending to verify it. While I wouldn’t call this a defect in the test, the test nevertheless fails unjustifiably.

I have tried here to describe the problem of unjustifiable test failures and to explain how integration tests necessarily result in more unjustifiable test failures than isolated object tests. I admit that I have not compared the cost of these unjustifiable test failures to the corresponding costs of writing isolated object tests. I cannot hope to complete a thorough quantitative study on the matter. Instead, I simply want to raise the issues, make some conjectures, reason well about them, then let the reader decide. I have decided to write more isolated object tests and fewer integration tests unless I find myself in a drastically different context than the ones I’ve seen over the past decade or so.

Story Test-Driven Development: don't start here

I don’t want to claim that story test-driven development doesn’t work, because some of my most respected colleagues teach the practice with success; however, I do want to warn people who might find themselves seduced by STDD, especially if they think of it as an easy replacement for TDD.

Allow me to clarify the two terms, TDD and STDD. To practice TDD, the programmer begins with a small, well-defined behavior they’d like to implement. Typically, they design that behavior as a method on a class, although they could get away with doing even less, then brainstorm a list of tests they might write. With such a list in hand, they run through the TDD cycle, illustrated beautifully by Bill Wake’s stoplight analogy. When the design behaves adequately and correctly, the programmer stops.

To practice STDD, the programmer begins with a story and several story tests, which I tend to call “examples”. The programmer then selects a story test, watches it fail, then test-drives enough code to make it pass. One by one, the programmer makes each story test pass until they complete the entire story.

I have been teaching people about TDD and stories for years, and have practiced STDD most of that time, in one form or another. I find the technique helpful; however, when I have pushed STDD to its limit, I have found it to guide me in directions I don’t like, which TDD has generally never done. When I watch others attempt to practice STDD, especially novices and advanced beginners, I see how they misapply STDD and lead themselves towards a Big Ball of Mud, despite what the agile community’s marketing machine says about TDD and stories. I believe the intersection of the two creates problems for those not accustomed to the different goals of TDD and user stories.

I use examples, the term I use for story tests, to show progress towards delivering a story, or feature. Broadly, I add examples to reflect increasing levels of understanding of the system to design, and as examples pass, that reflects progress towards delivering an ever more powerful system. I use programmer tests, the term I use in place of unit tests, to test my design ideas as they come to me and to help me type code in correctly. Any time all the programmer tests pass, the system works as designed, even if it does not yet do everything the business needs. Any time all the programmer tests pass, I can freely commit changes to the main line of the project’s design repository.

More succinctly, examples help us design the right system and programmer tests help us design the system right. (I prefer “correctly” there, but then I lose the symmetry.)

I often see programmers try to use passing examples as an absolute criterion to stop designing. They underestimate, in my opinion, the role of programmer tests to put positive pressure on their design. Examples, especially when written as end-to-end or integration tests (a test whose failure does not isolate the mistake to a single method), simply do not put positive pressure on a design: their high-level nature can’t constrain a design enough to support careful refactoring. For this reason, I recommend novices and advanced beginners not practice STDD until they first see or feel for themselves the impact focused, small programmer tests have on their design.

I want to leave no room for doubt: I do not mean to say that novices should avoid STDD as an “advanced practice”; but rather that a combination of novice tendencies makes STDD harder than TDD to practice well. Specifically, the novice tends to write examples as end-to-end tests, which provide too much design freedom and exert too little positive pressure on the design to guide refactoring and prevent defects. Instead, I would counsel novices and advanced beginners to focus on TDD and run the examples every hour or so to measure their progress towards delivering the story.

Read more about how to practice STDD well.

Forget velocity

What does velocity measure?

First, let’s be clear: by velocity, I mean story points delivered per iteration over time. Of course, that means we need more definitions, so let me get those out of the way now.

“A story point” is a unit that programmers invent solely to make it easier to estimate the work it takes to deliver a story. A story point is however big the programmers decide it is, no more, and no less. Story points are useful to the extent that they help programmers compare the effort required to deliver different stories, so that the team can decide how many stories to commit to delivering in a given iteration. Although story points necessarily fluctuate in value, one hopes that over a sufficiently short span of time – say the length of a release – the fluctuation is small enough that a constant approximation of velocity is helpful enough to plan the release.

“Delivered” means realizing (in the accounting sense) the value the customer intended to realize by asking the team to work on the story. A story is not delivered until it either reduces cost or generates revenue.

“Iteration” means equal-length time boxes the goal of which is to provide natural breaks in daily work to reconsider the plan and stop work from expanding indefinitely. On teams that have less trouble stopping when they’ve done enough, iterations are less important.

“Over time” refers to the trend of spot measurements over at least the length of a release.

I hope that makes it crystal clear what I mean by velocity. Now that I’ve clarified the meaning of velocity, what does it measure? It simply aggregates past estimates of effort of a set of stories the team selected to complete in the last several iterations. That doesn’t sound like much, does it? There’s more. Here are some things that velocity does not measure:

  • productivity
  • efficiency
  • value
  • suitability to re-hire or retain

…and, of course, past performance does not guarantee future results. Read your prospectus before investing. If velocity is so meaningless, then why measure it? Well, it turns out that:

  • It is a good approximation (accurate about 2 times in 3, we think) of what the team can commit to in the next iteration
  • It is a good approximation (we think) of what the team can commit to over the next handful of iterations (say 4-6)
  • It is a good approximation of when the current stack of work we’ve identified will be done, as long as the stack of work isn’t too large (up to about 4-6 iterations’ worth)
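As a rough sketch of that arithmetic, the snippet below averages the points delivered in the last few iterations and divides the remaining stack of work by that average. The three-iteration window, the function names and the numbers are assumptions for illustration only.

```python
from math import ceil
from statistics import mean


def average_velocity(points_per_iteration, window=3):
    """Average points delivered over the most recent iterations."""
    return mean(points_per_iteration[-window:])


def iterations_remaining(backlog_points, velocity):
    """Whole iterations needed to finish the backlog at the given velocity."""
    return ceil(backlog_points / velocity)


# Hypothetical points delivered in each of the last five iterations.
delivered = [18, 22, 20, 19, 21]

velocity = average_velocity(delivered)      # mean of 20, 19, 21
print(velocity)                             # 20
print(iterations_remaining(90, velocity))   # 5
```

The forecast of five iterations sits inside the 4-6 iteration range where I would trust it at all; I would not stretch the same division across a much larger stack of work.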

I have just witnessed a conversation on the extremeprogramming Yahoo! group that illustrates the knots in which we tie ourselves when we allow ourselves to be too imprecise about what a measurement measures. The whole thing puts me in mind of personal financing and when budgets don’t work.

Think about budgeting your expenses at home: you know how much you actually pay for some expenses, like car payments, mortgage or rent, insurance… the amounts you owe are usually spread out over such a long period that you can anticipate what you’ll owe next month with certainty. Moreover, you know that it takes concerted effort on someone’s part, or human error, to change the amount you owe on a mortgage, car loan or your home insurance. Changes in those amounts, at least correct ones, generally don’t sneak up on you.

There are other expenses that vary from month to month: food, clothing, entertainment, communications… the amounts you owe usually depend on how much of a given resource you consume, such as how much you eat, how much you talk on the phone and how stressful your work is that month. You can budget these amounts fairly accurately, because the variation is low enough and mostly under your control: you can choose to eat less, take better care of your clothes, read more library books and rent fewer movies. Not only is it easier to guess what you’ll spend in a given month, but if you start spending too much, you can easily correct course to spend less.

There are still other expenses that you just can’t foresee. You are injured in an accident and rack up $20k in hospital bills; this is the year out of the last ten that you finally need new shingles on the roof; you find out your son needs braces. You might buy insurance as a way of smoothing out the costs over time, but they are generally big expenses that you know are coming in the larger sense, but which arrive without warning and need to be paid immediately.

These are the expenses that make budgeting not work, because they prove that there’s no such thing as a typical month. This is also why you cannot rely on velocity to plan far ahead: there is no such thing as a typical project or even a typical iteration. If velocity doesn’t bring a project under control, then what might? For an answer, I invite you to consider personal finance again.

I read Your Money or Your Life to help me bring my personal finances under control and become financially free. One of the key exercises the book asked me to perform was to track all money coming in and out of our household for several months. This exercise is designed to call attention to our spending habits, so we can decide what to do about them. In particular, we identified those expenses we valued and those we didn’t value, so we could stop spending money on things we don’t value. This tactic, spending money only on things we value, doesn’t necessarily make one rich, but it ensures that one is not wasting money on things one doesn’t actually value. You’d be surprised how much money we were throwing away on things that didn’t ultimately make our lives any better, and you’d be amazed at the results when we stopped: we used that money as capital to generate passive income, and within five years, we’ve become financially free. But I digress. The point is what we measured, then how we analyzed it.

We measured actual spending, rather than estimated future spending or even estimated past spending. When we had the numbers, we didn’t start slashing expenses we cared about, and we didn’t start looking for cheaper alternatives to all our expenditures. Instead, we simply looked for those expenses we did not value, then cut them off. In some cases, this required investment of money: we bought a coffee grinder, an AeroPress, then started brewing coffee at home, rather than spending $2 per cup at our local coffee shop. In some cases, this required investment of time: we reviewed our work commitments and changed the way we work so that we wouldn’t be so tired so often, which meant we no longer “needed” to order dinner in as often as we did. We measured our actual spending, we looked at those numbers to identify waste, then we eliminated it. We did not waste effort staring at our budget, wondering how accurate it was, wondering how realistic it was, desperately trying to spread not enough money over too many expenses. We didn’t have to: we had actual expense numbers and could decide which items we valued and which we didn’t. More to the point, this exercise was far more instructive and effective than budgeting ever was.

Does your company value the areas where your software team spends its money? Do you know how to measure the value of what your team produces? How much money (time, energy, or actual cash) does your team spend on things no-one values? Could you pick three such wastes and eliminate them? What would you do with the recovered time, energy, or cash? How would you know that things got any better?

Forget velocity for the time being. Measure it and report it to whoever still wants it, but just between you and me, forget velocity and focus on these key questions. Try it for a few months. Share with us what happens. I’d love to hear from you.
