me@jbrains.ca

Interpreting inaccurate estimates

I ran across this today and thought I’d comment briefly about it. How can you interpret inaccurate estimates? What might it mean when you estimate incorrectly?

I know individuals that beat themselves up over inaccurate estimates. I know teams that beat each other up. I know managers that beat up their teams. I even know a company that awards bonus compensation based on the accuracy of their programmers’ estimates. In those environments, inaccurate estimates hurt, so although I don’t like cost estimates in general for most teams most of the time, I recognize that some people must play the game to keep their job. That said, I return to something Ward Cunningham said in the early days about Fit: when we look at test results in a spreadsheet, we can look at patterns of red cells to learn something about our patterns of failure. He drew our attention to two kinds of patterns: systematic failure and sporadic failure.

In the early XP literature, books like Planning XP told us that by measuring velocity we gracefully handle systematic estimate inaccuracy as the “value” of our story points fluctuates over time. Think of it like a floating currency: its value in hours changes slowly over time as our ability to complete points improves. If you prefer to estimate in hours, then after several months, you might notice a bunch of stories whose actual costs were close to some constant multiplier of the estimate. In that case, you would consider multiplying all remaining estimates by that constant until you internalized it and began estimating in the “new scale” out of habit. Even when I cared deeply about cost estimates, I worried little about systematic inaccuracies.
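As a minimal sketch of that rescaling, assume you have recorded the estimated and actual hours for a handful of finished stories. The numbers and function names below are invented for illustration, and taking the median ratio is only one reasonable way to pick the constant.

```python
from statistics import median

def systematic_multiplier(history):
    """Return the median actual/estimate ratio over (estimate, actual) pairs in hours."""
    return median(actual / estimate for estimate, actual in history)

def rescale(remaining_estimates, multiplier):
    """Multiply every remaining estimate by the observed multiplier."""
    return [round(e * multiplier, 1) for e in remaining_estimates]

# Hypothetical history: every story cost roughly 1.5 times its estimate.
history = [(4, 6), (8, 12.5), (2, 3), (10, 14.5), (6, 9)]
multiplier = systematic_multiplier(history)
adjusted = rescale([5, 8, 3], multiplier)
print(multiplier, adjusted)  # 1.5 [7.5, 12.0, 4.5]
```

The median ignores the occasional outlier, which matters because an outlier is a sign of sporadic rather than systematic inaccuracy.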

Sporadic inaccuracy hurts more. Since we provide cost estimates so that others might plan, we owe it to them to lower the variance of the actual cost of our work, in order to make planning more useful. I have developed a conjecture over the last few years that changes the question of sporadic estimate inaccuracy:

The actual cost of a story depends on (among other things) the complexity of the story and the current state of the design. The higher our technical debt, the more that cost dominates the cost of the story.

I have noticed various people floating the definition of “technical debt” to suit their needs, so I want to clarify what I mean by “technical debt”:

By technical debt I refer to the latent cost of the amount of rot in the design. I think of technical debt as the cost of the design improvements we need to make in order to feel comfortable adding features or fixing defects in that part of the system.

You can think of technical debt as financial debt: interest-bearing principal owed to another party. In this case, the other party is the system or the project, the principal is the current design and the interest is the extra cost associated with either rescuing or working around the current design. With these definitions established, I can state my conjecture this way:

In systems with high technical debt, the cost of repaying that technical debt dominates the cost of a story.

Since not all stories affect all parts of the design uniformly, we can do a little better:

The cost of a story depends on the complexity of the story and the amount of technical debt in the areas of the design we need to change or extend to deliver the story. Working in areas with high technical debt causes the cost of repaying that debt to dominate the cost of the story.

I think you get the point. From this it follows that in systems with generally high technical debt, the distribution of technical debt effectively determines the variance in the cost of the stories. Sporadic estimate inaccuracy, then, likely has a clear root cause: high technical debt distributed decidedly non-uniformly. This follows logically, because if the system distributed technical debt uniformly, then our estimates would show systematic inaccuracy.
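A toy model can make this argument concrete. Everything in it is an assumption for illustration: the area names, the debt figures, and especially the choice to treat debt as a multiplier on a story's complexity. We estimate complexity perfectly in both cases and change only how the debt is distributed.

```python
from statistics import mean, pstdev

def actual_cost(complexity, areas, debt):
    """Toy model: the average debt of the touched areas inflates the story's cost."""
    return complexity * (1 + sum(debt[a] for a in areas) / len(areas))

def ratio_spread(stories, debt):
    """Return (mean, std dev) of actual/estimate when each estimate equals the complexity."""
    ratios = [actual_cost(c, areas, debt) / c for c, areas in stories]
    return mean(ratios), pstdev(ratios)

# Hypothetical stories: (complexity in points, areas of the design they touch).
stories = [(3, ["billing"]), (5, ["search"]),
           (2, ["billing", "reports"]), (8, ["reports"])]

uniform_debt = {"billing": 1.0, "search": 1.0, "reports": 1.0}
lumpy_debt = {"billing": 0.1, "search": 3.0, "reports": 0.5}

uniform = ratio_spread(stories, uniform_debt)
lumpy = ratio_spread(stories, lumpy_debt)
print(uniform)  # every story costs exactly double: systematic inaccuracy
print(lumpy)    # the ratios scatter: sporadic inaccuracy
```

With uniform debt every ratio comes out the same, so a single multiplier repairs the estimates. With lumpy debt no single multiplier can.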

This allows me to make two broad claims:

  1. In general, if a system has high and sporadic technical debt, we’ll tend to estimate with sporadic inaccuracy even if we estimate the relative complexity of those stories with perfect accuracy.
  2. In general, we should estimate with at most systematic inaccuracy when delivering stories for a greenfield system or component.

I can interpret these claims more pithily:

  1. If your estimates suck when adding features to a legacy system, blame the shittiness of the codebase.
  2. If your estimates suck when adding features to mostly clean code, blame the shittiness of the programmers.

In particular, if you feel bad because your estimates suck when adding features to a legacy system, you can relax. As you attempt to build those features with high discipline, the resulting volatility in your actual costs will come from the current state of the design. The design will actually try to convince you to cut corners. Cutting corners can only improve the accuracy of your estimate for the current story at the expense of the remaining stories. Cutting corners can only get your project manager off your back for a day or two. You will eventually need to stop cutting corners.

Part 2: Some Hidden Costs of Integration Tests

Read more in this series

I fear that this first article in the series may be attacking a view that very few people hold: the idea that one should test all code paths by integration testing alone. — Dan Fabulich

When I tell TDD practitioners my opinion about integration tests, some treat my position as a straw man. They point out that “no one” seriously tries to test entire systems exclusively with integration tests. While I understand their reaction, I need to point out that I never made that claim. I see far more damaging behavior in teams that practise TDD: they duplicate a sizable amount of their effort by designing their objects with thorough focused tests, then adding a suite of integration tests that verify a substantial amount of the same behavior. I understand why they do it. I used to do it. And I want them to stop.

Every integration test costs… well, I don’t know how to accurately say how much it costs. After computing the superficial cost of writing and maintaining the test, I quickly lose track of the varying effects of writing integration tests in place of, or even in addition to, focused object tests. I can compute the raw execution time tax on integration tests: an average focused test executes in 4 ms, while an average integration test takes closer to 100 ms. I feel comfortable estimating the difference at a more conservative order of magnitude base 10. Beyond that, I find myself lost in the implications of writing integration tests to form a clear picture of the cost. Let me give you an idea of what I mean.

A Tale of Two Test Suites

Consider two test suites. One executes in 6 seconds, and the other in 1 minute. Pretend they cover the same code equally well. I mean that they have the same power to uncover mistakes in the system. Now imagine yourself writing code and executing the 6-second suite. You make a handful of edits, then you run the tests. What do you do for 6 seconds? You predict the outcome of the test run: they will all pass, or the new test will fail because you’ve just written it, or the new test might pass because you think you wrote too much code to pass a test 10 minutes ago. In that span of time, you have your result: the tests all pass, so now you refactor. You probably needed about 6 seconds to read up to here.

Now imagine you run the 1-minute test suite. Once again, you predict the result, during which time 6 seconds pass. If you work alone, then after 8 seconds you’ve started drumming your fingers on the desk or letting your eyes dart around the room. You notice the long list of tasks on the team task board. You start to feel your stomach rumble, noticing the time: 11:42. Time for lunch soon. You wonder what the cantina has for lunch, so you point your browser at their intranet site. Tilapia sounds good. You wonder whether Lisa will join you for lunch, so you switch to your email client. Before you write her, you notice a notification to pay your credit card bill. You can do that in 30 seconds, so you switch back to your browser to log in to online banking and quickly make a payment. It turns out Lisa has a lunch meeting, and you reconsider your choice of fish. Today, you decide, feels like a burger day. In the time you imagined yourself doing that, assuming you guessed how long it took to actually do what you imagined, over 1 minute passed. The computer has spent valuable computing time waiting for you.

Pairing doesn’t seem to solve this problem. If you ran this test suite during a pair-programming session, then you probably spent time chatting. At first, you discussed the recent test. After a while, you discussed the task. That killed about 40 seconds, so you started drifting to other topics: the weekend, the kids, XBox, Battlestar Galactica, baseball, management… then you turned around to notice the test run finished while you were arguing whether Cliff Lee deserved the Cy Young award. I don’t mind injecting plenty of relaxed conversation into my work, but when waiting repeatedly for a 1-minute test suite it doesn’t take long to run out of things to talk about.

I need to point out the dual cost here. The first, we can easily see and measure: the time we spend waiting for the tests plus the time the computer waits for us, because we find it hard to stare at the test runner for 60 seconds and react to it immediately after it finishes. I don’t care much about that cost. I care about the visible but highly unquantifiable cost of losing focus.

TDD works well for me in large part because it helps me focus. When I write a test, I clarify my immediate goal, focus on making it pass, then focus on integrating that work more appropriately into the design. I get to do this in short cycles that demand sustained focus and allow brief recovery¹. This cycle of focus and recovery builds rhythm and this rhythm builds momentum. This helps lead to the commonly-cited and powerful state of flow. A 6-second test run provides a moment to recover from exertion; whereas a 1-minute test run disrupts flow. It acts like an annoying short interruption every few minutes. We can try to measure the cumulative effect of these interruptions, but I guess you can imagine a day, possibly a recent one, when periodic short interruptions made it nearly impossible for you to concentrate. How productive did you feel that day? How much did you achieve? How much pressure did you feel to catch up the next day? How relaxed did you feel that evening at home? Did you enjoy dinner? Did you feel present for your spouse or kids or pets? How well did you sleep? How refreshed did you feel the next morning?

Among the early TDD literature I distinctly remember reading that practising TDD would help me focus, relax, achieve more and feel better at the end of a task. I remember agonizing over integration tests. Teams call me expressly to learn how to tame big, slow, brittle test suites. They don’t call me when they feel focused, relaxed and productive. I tell you: integration tests will slowly kill you.

So What Now?

But you have integration tests now, and you haven’t yet learned about the alternatives. How can you cope with your reality? You could regain your focus by running the most important 10% of those tests. That would take 6 seconds and fit into your flow. It also runs a substantial risk of failure. You’ve experienced this. Remember the last time you changed a line of code in one part of the system and it broke something way over there in another module? How did you feel when that happened? How long did you spend tracking down a mistake in some arcane part of the system that perhaps no one understands? How did you deal with having to branch your code changes to deal with the bigger problem? How many times have you told your wild goose chase story to your fellow programmers? How long did you need to recover before returning to a decent state of flow while working on your original task?

So it appears you have a choice between frequent annoying disruptions and less frequent but comparatively catastrophic disruptions. A Morton’s Fork you can blame squarely on integration tests. Stop writing them.

1 For more about the focus/recovery cycle, I highly recommend The Power of Full Engagement


Toddlers, novelty, and planning

I visited a dear friend this week and had the chance to meet her young son for the first time since before he turned one year of age. I believe he turned four years old this year, and I got to watch him ready himself to leave his pre-school program. Since Toronto still has winter, Kieran had to negotiate a coat, snow pants, boots, mittens, and all the while he was greeting and saying goodbye to his friends, running around getting some additional exercise, and just generally living in the moment. I don’t remember how long it took to leave the building, but it took some time. Melissa turned to me and remarked on how slowly toddlers move. I disagree.

A toddler looks slow, but doesn’t move slowly at all. On the contrary, they move quickly from idea to idea, input to input, person to person, thread to thread, processing a ton of information. Novelty dominates the toddler’s life, robbing them of the opportunity to focus, as they dart around to every new piece of input. Their apparent slowness, then, comes not from moving slowly, but from trying to do everything at once, responding to novelty and generally having no real plan.

I simply found that interesting. No connection to software. (Wink)

Integrated Tests are a Scam: Part 1

Read more in this series

On March 1, 2010 I changed the phrase “integration tests” to “integrated tests” in this article.

Integrated tests are a scam—a self-replicating virus that threatens to infect your code base, your project, and your team with endless pain and suffering.

Wait… what?

I mean it. I hate integrated tests. I hate them, and with a passion. Of course, I should clarify what I mean by integrated tests, because, like any term in software, we probably don’t agree on a meaning for it.

I use the term integrated test to mean any test whose result (pass or fail) depends on the correctness of the implementation of more than one piece of non-trivial behavior.

I, too, would prefer a more rigorous definition, but this one works well for most code bases most of the time. I have a simple point: I generally don’t want to rely on tests that might fail for a variety of reasons. Those tests create more problems than they solve.
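To illustrate the definition, here is a small sketch with two invented classes, `TaxTable` and `Checkout`, plus a stub called `FlatTax`. None of these names come from a real code base. The focused test can fail only when `Checkout` is wrong, while the integrated test can fail when either class is wrong.

```python
class TaxTable:
    """Non-trivial behavior #1: look up a tax rate by region."""
    RATES = {"ON": 0.13, "PE": 0.15}

    def rate_for(self, region):
        return self.RATES[region]

class Checkout:
    """Non-trivial behavior #2: compute a total in cents using some tax table."""
    def __init__(self, tax_table):
        self.tax_table = tax_table

    def total(self, subtotal_cents, region):
        return round(subtotal_cents * (1 + self.tax_table.rate_for(region)))

class FlatTax:
    """Stub too simple to break, so only Checkout can make the focused test fail."""
    def rate_for(self, region):
        return 0.10

def test_focused():
    assert Checkout(FlatTax()).total(1000, "anywhere") == 1100

def test_integrated():
    # Depends on both Checkout and TaxTable being correct.
    assert Checkout(TaxTable()).total(1000, "ON") == 1130

test_focused()
test_integrated()
```

When `test_integrated` fails, you have to find out which of the two classes broke. That ambiguity is exactly what the definition points at.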

You write integrated tests because you can’t write perfect unit tests. You know this problem: all your unit tests pass, but someone finds a defect anyway. Sometimes you can explain this by finding an obvious unit test you simply missed, but sometimes you can’t. In those cases, you decide you need to write an integrated test to make sure that all the production implementations you use in the broken code path now work correctly together.

So far, no big deal, but you’ll meet the monster as soon as you think this:

If we can find defects even when our tests pass 100%, and if I can only plug the hole with an integrated test, then we’d better write integrated tests everywhere.

Bad idea. Really bad.

Why so bad? A little bit of simple arithmetic should help explain.

You have a medium-sized web application with around 20 pages, maybe 10 of which have forms. Each form has an average of 5 fields and the average field needs 3 tests to verify thoroughly. Your architecture has about 10 layers, including web presentation widgets, web presentation pages, abstract presentation, an HTTP bridge to your service API, controllers, transaction scripts, abstract data repositories, data repository implementations, SQL statement mapping, SQL execution, and application configuration. A typical request/response cycle creates a stack trace 30 frames deep, some of which you wrote, and some of which you’ve taken off the shelf from a wide variety of open source and commercial packages. How many tests do you need to test this application thoroughly?

At least 10,000. Maybe a million. One million.

Wie ist es möglich?! Consider 10 layers with 3 potential branch points at each layer. Number of code paths: $3^{10} > 59{,}000$. How about 4 branch points per layer? $4^{10} > 1{,}000{,}000$. How about 3 branch points and 12 layers? $3^{12} > 530{,}000$.

Even if one of your 12 layers has a single code path, $3^{11} > 177{,}000$.

Even if your 10-layer application has only an average of 3.5 code paths per layer, $3.5^{10} > 275{,}000$.¹
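If you want to check these figures or try your own layer counts, a few lines of Python will do. The helper name `code_paths` is made up for this sketch, and it assumes every branch in one layer can combine freely with every branch in the others.

```python
from math import prod

def code_paths(branch_points_per_layer):
    """Number of end-to-end paths when every layer's branches combine freely."""
    return prod(branch_points_per_layer)

print(code_paths([3] * 10))            # 59049
print(code_paths([4] * 10))            # 1048576
print(code_paths([3] * 12))            # 531441
print(code_paths([3] * 11 + [1]))      # 177147 (one layer has a single path)
print(code_paths([3.5] * 10))          # roughly 275,855
print(code_paths([2] * 6 + [5] * 6))   # 1000000 (the uneven case from the footnote)
print(code_paths([2] * 6 + [4] * 6))   # 262144
```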

To simplify the arithmetic, suppose you need only 100,000 integrated tests to cover your application. Integrated tests typically touch the file system or a network connection, meaning that they run on average at a rate of no more than 50 tests per second. Your 100,000-test integrated test suite executes in 2,000 seconds, or about 33 minutes. That means that you execute your entire test suite only when you feel ready to check in. Some teams let their continuous build execute those tests, and hope for the best, wasting valuable time when the build fails and they need to backtrack an hour.

How long do you need to write 100,000 tests? If it takes 10 minutes to write each test—that includes thinking time, time futzing around with the test to make it pass the first time, and time maintaining your test database, test web server, test application server, and so on—then you need 2,778 six-hour human-days (or pair-days if you program in pairs). That works out to 556 five-day human-weeks (or pair-weeks).

Even if I overestimate by a factor of five, you still need two full-time integrated test writers for a one-year project and a steady enough flow of work to keep them busy six hours per day, and you can’t get any of it wrong, because you have no time to rewrite those tests.

No. You’ll have those integrated test writers writing production code by week eight.
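The run-time and staffing arithmetic above can be reproduced in a few lines. The inputs come straight from the text. The one assumption I add is a 52-week project year, used to turn human-weeks into a headcount.

```python
TESTS = 100_000

# Run time at 50 integrated tests per second.
run_seconds = TESTS / 50
run_minutes = run_seconds / 60

# Writing effort at 10 minutes per test, six-hour days, five-day weeks.
writing_hours = TESTS * 10 / 60
human_days = writing_hours / 6
human_weeks = human_days / 5

# Suppose the estimate is five times too high, on a one-year (52-week) project.
writers_needed = human_weeks / 5 / 52

print(round(run_minutes, 1))     # 33.3
print(round(human_days))         # 2778
print(round(human_weeks))        # 556
print(round(writers_needed, 1))  # 2.1
```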

Since you won’t write all those tests, you’ll write the tests you can. You’ll write the happy path tests and a few error cases. You won’t check all ten fields in a form. You won’t check what happens on February 29. You’ll jam in a database change rather than copy and paste the 70 tests you need to check it thoroughly. You’ll write around 50 tests per week, which translates to 2,500 tests in a one-year project. Not 100,000.

2.5% of the number you need to test your application thoroughly.

Even if you wrote the most important 2.5%, recognizing the nearly endless duplication in the full complement of tests, you’d cover somewhere between 10% and 80% of your code paths, and you’ll have no idea whether you got closer to 10% or 80% until your customers start pounding the first release.

Do you feel lucky? Well, do you?²

So you write your 2,500 integrated tests. Perhaps you even write 5,000 of them. When your customer finds a defect, how will you fix it? Yes: with another handful of integrated tests. The more integrated tests you write, the more of a false sense of security you feel. (Remember, you just increased your code path coverage from 5% to 5.01% with those ten integrated tests.) This false sense of security helps you feel good about releasing more undertested code to your customers, which means they find more defects, which you fix with yet more integrated tests. Over time your code path coverage decreases because the complexity of your code base grows more quickly than your capacity to write enough integrated tests to cover it.
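Here is a rough sketch of why the coverage number slides backwards. It assumes, as the parenthetical above does, that each integrated test covers exactly one code path. The growth scenario (one extra layer with 3 branch points while you double your test count) is invented for illustration.

```python
def coverage_percent(tests_written, paths):
    """Share of code paths covered if each test exercises exactly one path."""
    return 100 * tests_written / paths

print(coverage_percent(2_500, 100_000))  # 2.5
print(coverage_percent(5_000, 100_000))  # 5.0
print(coverage_percent(5_010, 100_000))  # 5.01

# Hypothetical growth: the application gains an 11th layer with 3 branch
# points while the team doubles its integrated tests.
before = coverage_percent(2_500, 3 ** 10)
after = coverage_percent(5_000, 3 ** 11)
print(before > after)  # True: twice the tests, lower coverage
```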

…and you wonder why you spend 70% of your time with support calls?

Integrated tests are a scam. Unreliable, self-replicating time-wasters. They have to go.


1 True: few code bases distribute their complexity to their layers uniformly. Suppose half your 12 layers have only two branch points (one normal path and one error path) while the others have 5 branch points. $2^6 \cdot 5^6 = 1{,}000{,}000$ and for 4 branch points $2^6 \cdot 4^6 > 262{,}000$. You can’t win this game.

2 Aslak Hellesøy points to a way to take luck mostly out of the equation. His technique for choosing high-value tests will certainly help, but it stops short of testing your code thoroughly. I believe you can achieve truly thorough focused tests with similar cost to writing and maintaining integrated tests even using the pairwise test selection technique. (Thanks, Aslak, for your comment on April 12, 2009.)


On 100% unit test coverage and other nonsensical ideas

I will simply riff for a while on things I read Joel Spolsky say (I read a transcript of a podcast from here) about test-driven development.

Nobody should strive for 100% test coverage, let alone microtest coverage, for obvious reasons. Among those obvious reasons, I find two glaring ones: we generally don’t agree on what the term means; and trying to do it leads to writing tests for their own sake, rather than as a means to write sufficiently correct software. I have learned two important things through practice and observation: no single optimal number for test coverage can exist for all projects; and if you insist on an optimal number for test coverage, choose 85%, meaning that the average team ought to test all but the most straightforward 15% of the average system. When I practise TDD, I end up with around 85% test coverage because of the way I apply the principle of Too Simple to Break. Among the 15% untested you will find dead simple get/set methods and dead simple delegation. When Joel concludes something based on the hypothesis that otherwise thoughtful people have 100% test coverage as a goal, he runs well off the track. I don’t doubt that some people seek 100% test coverage, because those people help keep me in business. Stop it, or I’ll bury you alive in a box.

Some innocuous-looking changes cause an unusual number of tests to fail. While I don’t like this situation, I draw a different conclusion from it than Joel does. I conclude that this points to a design flaw worth exploring. I don’t have a “proof” for this, but I have observed good results when I have treated my own designs this way. In this vein, I follow the maxim I learned from the Pragmatic Programmers: abstractions in code and details in data. Joel uses this example:

Because you’ve changed the design of something… you’ve moved a menu, and now everything that relied on that menu being there… the menu is now elsewhere. And so all those tests now break. And you have to be able to go in and recreate those tests to reflect the new reality of the code.

I don’t understand why we would have microtests that check that a specific menu shows up in a specific location. Remember: abstractions in code and details in data. Also remember: three strikes and you refactor. Putting these two principles together, once I have a few menu items, I’ve extracted the details about individual menus to some List of Menu objects and an engine that operates on them. Also, I have probably separated that List from the code that presents the menus. If I don’t like how my code presents the menus, I can fix that with test data that has no bearing on the actual menus. If I don’t like the order of my menus, I can change the Menu objects in the List—just data—and do a quick manual inspection without having to change the tests for my menu-presenting engine. Moving a specific menu somewhere should not require any code changes; and if it does, then you have a design flaw: you have details in code. Stop it, or I’ll bury you alive in a box.
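A minimal sketch of that separation might look like the following. The `Menu` class, the `render_menu_bar` engine and the text-only rendering are all invented for illustration. A real user interface would present menus differently, but the shape of the design stays the same.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Menu:
    title: str
    items: tuple

def render_menu_bar(menus):
    """The engine: presents whatever menus it receives, in the order given."""
    return " | ".join(f"{m.title} ({', '.join(m.items)})" for m in menus)

# Details in data: moving or reordering a menu changes only this list.
APPLICATION_MENUS = [
    Menu("File", ("Open", "Save")),
    Menu("Edit", ("Undo", "Redo")),
]

def test_render_menu_bar():
    # Test data with no bearing on the real menus.
    menus = [Menu("A", ("x",)), Menu("B", ("y", "z"))]
    assert render_menu_bar(menus) == "A (x) | B (y, z)"

test_render_menu_bar()
print(render_menu_bar(APPLICATION_MENUS))  # File (Open, Save) | Edit (Undo, Redo)
```

Swapping the two entries in `APPLICATION_MENUS` changes what the user sees without touching `render_menu_bar` or its test.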

In general, if a single change causes an unusually high number of tests to fail, then your tests have a duplication problem. Stop letting duplication flourish in your tests, or I’ll bury you alive in a box.

Thanks to Bob Newhart for the line “Stop it, or I’ll bury you alive in a box”, which I heard for the first time on Mad TV.

(Edit 2009-02-23: Corrected Saturday Night Live to Mad TV.)
