As long as I’ve been in the profession, “the software is always late.” Do you ever wonder why we can’t be on time? It’s not because we can’t; it’s because we don’t want to be. By the way, the same is true of public works projects, kitchen remodeling, Ph.D. theses, sports stadiums, and almost every other large effort. You hear about movies or buildings that are on time and under budget and assume that’s always attainable, when in fact it’s a best-case. A unicorn, if you will.
The Wisdom of Crowds
Here’s a video by James Surowiecki, author of The Wisdom of Crowds.
Software project estimation has been the subject of innumerable articles and books, going back to the dawn of computing. Let’s just query Amazon Books for “scheduling software projects” and we get pages of results. If we google “why is the software always late” we get “About 391,000,000 results” (which doesn’t mean they’d actually deliver 391M hits to you if you kept scrolling).
“The software is always late,” say the electrical engineers. They think it’s a moral failing on the part of the software guys.
Well, it is, but not the way they think. It’s a moral failing because people (and by “people” I mean everyone) refuse to be realistic about it. The same is true about other large projects.
A Small Experiment That Failed
After reading The Wisdom of Crowds I had the idea that many people together possessed the wisdom about when a chunk of software would be finished, as long as their judgments really were independent of each other. On or about August 1 one year, we had a particular milestone whose “official” date was September 1. I convinced our manager to let each team member write on a piece of paper, without their name, the date they thought it would actually be done. Then I collected the papers and averaged the dates.
Fail! All of the guesses were around September 1. That’s when they were told to have it done, and thus, that’s when they expected to have it done.
What does this prove? Nothing. To really do this, you would get people from all over the company, hundreds of them, and have them estimate the project before it even starts. This is more or less what the fair-goers in Surowiecki’s county fair story did: many of them estimated an animal’s weight, and the average was almost exactly right. They weren’t influencing each other, so their guesses really were independent. Each of them had some bit of insight, and the average was correct.
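For concreteness, here is what that averaging step looks like in code, a minimal sketch with made-up dates. It averages each guess’s offset from the earliest guess:

```python
from datetime import date, timedelta

def average_date(guesses: list[date]) -> date:
    """Average independent date estimates by averaging their
    offsets (in days) from the earliest guess."""
    base = min(guesses)
    mean_offset = sum((g - base).days for g in guesses) / len(guesses)
    return base + timedelta(days=round(mean_offset))

# Hypothetical, genuinely independent guesses (unlike my experiment,
# where everyone anchored on the official September 1 date).
guesses = [date(2023, 9, 1), date(2023, 10, 15),
           date(2023, 11, 20), date(2023, 8, 25)]
avg = average_date(guesses)  # 2023-09-30
```

The point of the anonymity is to keep the guesses independent; if everyone anchors on the official date, as my team did, the average just returns the anchor.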
Why will that never happen in real life? Because senior management doesn’t want to hear that their estimates are foolish and no one believes them. You get accused of being a Debbie Downer if you say that.
The Planning Fallacy
Now I’ll tell you a method that would work. Very few managers will ever want to do this one, either. Why? Because people tend to feel that the more you know about your project, the better you can predict it. This is an obvious belief. It’s wrong, though.
In Thinking, Fast and Slow, Nobel laureate Daniel Kahneman tells of an effort he was involved with in Israel to develop a new curriculum and write a textbook for it. He started like I did:
I asked everyone to submit an estimate of how long it would take us to submit a finished draft of the textbook to the Ministry of Education. I was following a procedure that we already planned to incorporate into our curriculum: the proper way to elicit information from a group is not by starting with a public discussion but by confidentially collecting each person’s judgment. This procedure makes better use of the knowledge available to the group than the common practice of open discussion. I collected the estimates and jotted the results on the blackboard. They were narrowly centered around two years; the low end was one and a half, the high end two and a half years.
But then, Dr. Kahneman went one step further and asked Seymour, their curriculum expert, if he could think of other teams that had done something similar, namely, designed a curriculum from scratch. Seymour was familiar with quite a few. Kahneman asked him how long those teams had taken from this point in the project.
Seymour took a long time to answer, and blushed when he did. He realized that a substantial number of similar teams had failed to finish at all, almost 40%. The team found that disturbing, since they’d never even considered the possibility of failing.
Kahneman asked Seymour, “Of the teams that succeeded, how long did it take them?” Seymour could not think of any team that had finished in less than seven years, or more than ten.
He got desperate; maybe these groups were less skilled: “When you compare our skills and resources to those of the other teams, how good are we? How would you rank us in comparison with these teams?”
Seymour answered more quickly this time. “We’re below average, but not by much.” Gloom was universal! A 40% chance of failing, and seven years if you succeed.
Kahneman said,
The statistics that Seymour provided were treated as base rates normally are — noted and promptly set aside.
We should have quit that day. None of us was willing to invest six more years of work in a project with a 40% chance of failure.
… After a few minutes of desultory debate, we gathered ourselves together and carried on as if nothing had happened.
The project was eventually completed eight years later, and by that time, the Ministry of Education had lost its enthusiasm and the textbook was never used.
The baseline prediction
There is much more in this chapter of Thinking, Fast and Slow than I can possibly do justice to. Briefly, a baseline prediction is “the prediction you make about a case if you know nothing except the category to which it belongs.” If you’re asked to judge the height of a woman who lives in New York, and you have no other information, you would pick the average height of New York women. If you’re then told that her son is the center for the Knicks, you might revise your baseline estimate upwards.
His curriculum group disregarded the baseline because they had specific information on their case. Results on their project had been good so far, so why not just extrapolate them to all the work? He goes on to note that doctors and lawyers get irritated when asked for statistics. “Every case is unique,” they insist. For nearly every software project I was ever on, everyone would have said it was unique and nothing like it had ever been done before.
Kahneman and his longtime collaborator Amos Tversky coined the term “planning fallacy” for plans and forecasts that:
Are unrealistically close to best-case scenarios.
Could be improved by consulting the statistics of similar cases.
Bent Flyvbjerg is a Danish “economic geographer” who outlined a planning method, known as reference class forecasting, as follows:
Identify an appropriate reference class (kitchen renovations, large railway projects, etc.)
Obtain the statistics of the reference class (in terms of cost per mile of railway, or of the percentage by which expenditures exceeded budget). Use the statistics to generate a baseline prediction.
Use specific information about the case to adjust the baseline prediction, if there are particular reasons to expect the optimistic bias to be more or less pronounced in this project than in others of the same type.
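The three steps above are almost mechanical once you have the data. Here is a minimal Python sketch; the reference-class durations are invented for illustration:

```python
import statistics

# Step 2: the baseline prediction is just a statistic of the reference class.
def baseline_prediction(reference_durations_weeks):
    return statistics.median(reference_durations_weeks)

# Step 3: adjust only if there is a concrete reason to believe this project
# is more or less prone to optimism bias than the rest of its class.
def adjusted_prediction(baseline, adjustment_factor=1.0):
    return baseline * adjustment_factor

# Hypothetical reference class: durations, in weeks, of completed projects
# of the same type (step 1 is deciding which projects belong here).
reference_class = [12, 18, 26, 9, 40, 22, 15]
baseline = baseline_prediction(reference_class)   # 18 weeks
estimate = adjusted_prediction(baseline)          # no reason to adjust
```

Note that the hard parts, choosing the reference class and justifying any adjustment, are judgment calls; the arithmetic is trivial.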
Reference class estimates go Hollywood
(Why are they ecstatic that they got a 5, on a scale of 1-10?)
In this scene of Silver Linings Playbook (starting at 5:50), Pat Sr. and his friend Randy are mulling over a bet that Randy won, and Randy has an idea for a bet to make it a parlay. He knows that Pat and Tiffany (Bradley Cooper and Jennifer Lawrence) are going to compete in a dance contest.
Randy asks Pat about “this dancing competition” and how they score it. Randy knows nothing about dancing.
The key here, paradoxically, is not to know more about the subject; it’s to know less. Randy finds out what he really needs to know:
Randy: How do they run this dance competition? I mean, how do they score it and everything?
Pat: I don’t know. I don’t know how they fuckin’ score it. We’re participating, we’re not, we’re not a part of it, there’s people that are high end, it’s a high end dance contest, I don’t know. Do not put it as part of the parlay, Randy!
Tiffany: By the Philadelphia rules, each dancer is scored on a scale of 1 to 10, 10 being the highest. You have to average the four judges’ scores.
Randy: OK. The score is 1 to 10, right? And you guys are how good?
Pat: We suck.
Tiffany: We don’t suck. Pat’s a beginner, I’m OK, we’re happy just to be going there.
Randy: And how are the people you’re competing against?
Tiffany: They’re good. Some of them are professionals.
Randy: Better than you…
Pat: A lot better.
Randy: A lot better. So, if I were to say you only have to score 5, I would be really very generous, right?
Pat: No, no, that would be amazing if we got 5.
Tiffany: Oh, come on! We can get a 5 out of 10. Give me a break! Give me a break!
All of them yell over each other.
Pat Sr. and Randy end up shaking hands on the bet.
Was that bet foolish on Randy’s part? No, he determined the key facts of the bet, and indeed, Pat and Tiffany do barely manage to score a 5 in the competition.
This is Hollywood, but just like Moneyball and The Big Short, it’s funny because it illustrates something profound: you don’t have to know everything about something to bet on it intelligently. And we could probably find ten options traders who can do a better job of estimating your software project than the average project leader can.
How would this work in software?
You’re probably asking, “OK, that sounds great for building a subway or renovating a kitchen, but those projects are all pretty similar. How do you pick a reference class for a software project? And how do you know what the relevant statistics are?” And you’re right to ask.
What class is this?
Let’s say you’re an IT manager in a consumer goods company, and your project is a new customer-relations package for your marketing, sales, and support people to use. It’s also to be used by the website. You’ve picked Salesforce. Your CEO is asking you, as they do, “How long is this going to take?”
Well, there are thousands of similar projects. You’re hardly the first company to do this. Let’s go to Google and ask, “how long do salesforce projects take.” Results.
Surprisingly, the first result IS a “base case estimate”:
On average, it takes a business three to four weeks to implement Salesforce. However, this can vary significantly based on an organization's needs and goals. If you require a complex setup with lots of customization, your Salesforce implementation could take months to complete.
However, most of the search results are of the flavor of “it depends. Every case is unique.”
Blondie, one of the top results, is trying to sell you Salesforce consulting.
Salesforce implementation could take somewhere from three weeks for an out-of-the-box setup. And could take months for a more complex setup. Overall Salesforce implementation timeline depends on several factors:
[then it lists the variables that affect it]
On Quora, the first answer is:
Probably, I may disappoint you but your Salesforce implementation may take from a month to three years. Or may never end at all.
Another answer is a best-case:
With minimum requirements and full customer engagement on C-level and down to user, it can be as short as two weeks.
Are these helpful to you as a manager? No. What you want is a distribution of the times, and you should then take the median, or the mean. Or maybe you want several distributions, one for each class of project. You don’t want to know how long it should take if everything goes right; you want to know how long it will take with a “normal” amount of things not going right. Because there’s always something that goes wrong.
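To see why the median of a distribution beats a best-case number, here is a toy simulation; the delay distributions are entirely invented, not Salesforce data. Every project starts from the same best case and accumulates a random number of independent delays:

```python
import random
import statistics

random.seed(42)

BEST_CASE_WEEKS = 3  # "if everything goes right"

def simulated_duration():
    """One project run: the best case plus a few independent delays
    (sick plumber, slipped dependency, re-scoped requirement...)."""
    delays = [random.expovariate(1 / 2)        # mean delay of 2 weeks each
              for _ in range(random.randint(1, 4))]
    return BEST_CASE_WEEKS + sum(delays)

durations = [simulated_duration() for _ in range(10_000)]
typical = statistics.median(durations)  # well above the best case
```

The best case is the left edge of the distribution, not its center; quoting it as the estimate guarantees you’ll usually be “late.”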
Salesforce does have this information, in one form or another. Are they going to give it to you? You already know their answer: “it depends.”
Getting the reference class statistics
Do you want to know what companies have this information, at least on their own projects? Almost all modern companies. I’ll use Google as an example since I’m most familiar with it, but some variant of this would apply to almost any large company.
Every project has an electronic “footprint” that can be tracked to the minute (these steps may not be complete or in order, and there are many others):
Some kind of project directory is created. At Google, it would be the top-level directory under google3/. Possibly cost centers are created, although that doesn’t always take place.
Personnel are assigned to the project
Email groups are created, for developers and users
Source code files are checked in. Lots of source code.
Email is exchanged. Lots of email.
Production directories are checked in
A “version” is declared
The SRE (site reliability engineers) review and pass the code
The system is put into production, whether that’s as a product, or internally to the production team
The system is announced to the public, or the internal users
The email traffic among developers tails off
Source code checkins tail off
Personnel leave the project
Many more
On open source projects, e.g. on GitHub, there are similar milestones that can be tracked.
The definition of when a project is deemed to “start” and “be completed” will naturally be subjective, but it needs to be consistent, so that someday the projects from different organizations can be compared.
It would be a good idea to discretize the project length to weeks, to smooth out random variations.
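A sketch of that discretization, assuming we track projects by calendar dates:

```python
import math
from datetime import date

def project_length_weeks(start: date, end: date) -> int:
    """Bucket a project's elapsed time into whole weeks, rounding up,
    so that day-level noise doesn't create spurious distinctions."""
    return math.ceil((end - start).days / 7)

# 45 calendar days rounds up to 7 weeks.
length = project_length_weeks(date(2024, 1, 8), date(2024, 2, 22))
```

Rounding up rather than to nearest is a judgment call; what matters is applying the same rule to every project in the database.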
Finding the classes
It’s fashionable to say “AI can solve this,” and if that makes you happy, go ahead. I think regular old statistics is perfectly adequate for it, though.
Once you have this massive database of how long projects take, you’d graph it, where the x-axis is project length, and the y-axis is how many projects took that number of weeks (if “weeks” is our unit). Would it look like this (ignore the numbers)?
Or this?
In the second graph, a substantial number of projects never finish, so we might indeed have a peak at the right end.
My guess is that there would be peaks in the graph, and those would suggest the classes of project. On the other hand, maybe some projects that look very simple at the beginning end up taking forever. We’d find that in the next step.
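Finding those peaks doesn’t require AI; a histogram and a local-maximum scan will do. A sketch with invented durations:

```python
from collections import Counter

def week_histogram(durations_weeks):
    return Counter(durations_weeks)

def peaks(hist):
    """Local maxima in the week histogram: candidate project classes."""
    found = []
    for x in sorted(hist):
        if hist[x] > hist.get(x - 1, 0) and hist[x] >= hist.get(x + 1, 0):
            found.append(x)
    return found

# Hypothetical durations with two clusters: quick projects around
# 4 weeks and big ones around 26 weeks.
durations = [4, 3, 4, 5, 4, 4, 25, 26, 26, 27, 26, 4, 5, 26]
class_candidates = peaks(week_histogram(durations))  # [4, 26]
```

With real data you would smooth the histogram first (the week-bucketing above is a crude form of smoothing), but the idea is the same: the peaks nominate the classes.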
Defining the classes
Once you have the proposed classes, you go deeper and see what defines them. Which features of the project were visible at the start, and distinguish each class from the others: the number of people assigned, the number of projects it depended on, the number of projects depending on it, the estimated lines of code, the size of the code base it had to fit into, the budget assigned to it, or some other variable?
The end goal is to get a distribution of actual project lengths for each class of a given type.
I don’t actually know if choosing a class for a new project could be automated, or if you’ll ever be able to input those numbers and out pops your project’s class. I think determining the class of a new project might be a human process.
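If it could be automated, one simple approach would be nearest-profile matching on the start-time features. Everything below, the feature names and the profiles, is hypothetical, and a real version would normalize the features so no single variable dominates the distance:

```python
def nearest_class(project, class_profiles):
    """Propose a class by nearest-profile match on start-time features.
    A human still sanity-checks the answer; this only suggests one."""
    def dist(profile):
        return sum((project[k] - profile[k]) ** 2 for k in profile)
    return min(class_profiles, key=lambda name: dist(class_profiles[name]))

# Hypothetical profiles, as if averaged from the project database.
profiles = {
    "small":  {"people": 2,  "dependencies": 1,  "kloc_estimate": 5},
    "medium": {"people": 6,  "dependencies": 4,  "kloc_estimate": 40},
    "large":  {"people": 20, "dependencies": 12, "kloc_estimate": 200},
}
new_project = {"people": 5, "dependencies": 3, "kloc_estimate": 30}
proposed = nearest_class(new_project, profiles)  # "medium"
```

Even if the final call is human, a suggestion like this gives the human something concrete to accept or reject.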
If we were getting the classes for kitchen remodeling jobs, the variables would be things like:
walls to move
cabinets to reconfigure
type of wood for the cabinets
countertops
appliances
plumbing changes
electrical changes
flooring changes
square feet of new kitchen
is there an island or not
For a kitchen job, the goal is not to use every single variable with the illusion that we can then predict cost to the penny. The key would be to classify the job into small / medium / large (or more classes) and determine how long that class actually takes, and what it ends up costing. Thus, that “unexpected” delay when the plumber got sick is factored in, because something like that always happens.
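A sketch of that small/medium/large bucketing; the weights and thresholds are invented for illustration, and a real contractor would calibrate them against completed jobs:

```python
def kitchen_class(job):
    """Coarse small/medium/large bucket from the variables visible
    at the start of the job. Weights and cutoffs are made up."""
    score = (
        2 * job.get("walls_to_move", 0)
        + job.get("cabinets_to_reconfigure", 0)
        + (2 if job.get("plumbing_changes") else 0)
        + (1 if job.get("electrical_changes") else 0)
        + (1 if job.get("island") else 0)
        + job.get("square_feet", 0) // 100
    )
    if score <= 3:
        return "small"
    if score <= 8:
        return "medium"
    return "large"

big_job = {"walls_to_move": 1, "cabinets_to_reconfigure": 4,
           "plumbing_changes": True, "square_feet": 250}
```

Once a job is bucketed, the estimate comes from what that bucket has historically taken, sick plumber included, not from summing optimistic line items.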
And finally…
Now that we have a class of project and a distribution of the times that those projects have taken, the mean or the median becomes our estimate. We can make adjustments based on our unique characteristics, but we don’t get to say, “Hey, we’re better, we’re not going to make changes once we start, and it should only take four weeks.”
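Putting it together, a sketch of the final step: report the class median along with its interquartile range, so the spread is never hidden behind a single number:

```python
import statistics

def class_estimate(class_durations_weeks, adjustment=1.0):
    """The estimate is the class median; the interquartile range
    travels with it so the number isn't mistaken for a promise."""
    q1, q2, q3 = statistics.quantiles(class_durations_weeks, n=4)
    return {
        "estimate_weeks": q2 * adjustment,
        "likely_range_weeks": (q1, q3),
    }

# Hypothetical durations for one class of project.
estimate = class_estimate([10, 12, 14, 16, 20, 24, 30])
# {'estimate_weeks': 16, 'likely_range_weeks': (12, 24)}
```

The `adjustment` factor is where Flyvbjerg’s step 3 lives: it defaults to 1.0, and moving it off 1.0 requires an argument, not a feeling.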
Management probably won’t like that answer. But as Bent Flyvbjerg tells us, that’s our answer.