Why Big Software Projects Become Disasters, Part One

The fault is in our stars

Jul 10, 2023

This is the first of three articles, based on the famous Shakespeare quote:

"The fault, dear Brutus, is not in our stars,
But in ourselves, that we are underlings."

This article is about the impersonal forces that create the disasters. Part Two will be on the more interesting parts: what in our psychology causes them? Part Three will be the fix, based heavily on Bent Flyvbjerg’s work.

“The software is always late,” they say. But the same is true of bullet trains, tunnels, kitchen remodeling, Olympic stadiums, and almost every other large effort. Some highways or buildings are finished on time and under budget and we assume that’s normal, when in fact it’s the best case. A unicorn, if you will. Big software projects are often really late.

I underscore “big.” Anyone can point to their 5-person software project that got done in 9 months on schedule. Not many can do that for a 5-year project with 2000 people. We’re talking about big projects here. The kind that ruin careers and put companies into bankruptcy. The kind that get Cabinet secretaries fired and satirized on Saturday Night Live.

Yes, public works projects do go bad, but IT projects go really bad. It’s not just a “media narrative.”

It’s easy for “management consultants” to pontificate on how to make it go right. Despite decades of good advice, it keeps going wrong. We’re going to look at why that is. There actually is a body of scholarly literature on this, but relatively little of it has reached the popular press. I’m going to do my part to redress that.

First: How Bad Is It?

We have an embarrassment of riches when we look for “IT project disasters.” I pick only a few examples to avoid boring you, and you probably already know things like this anyway.

State of Maine HR System

A Governing.com page tells us:

The effort, which has sought to replace a nearly 40-year-old system programmed in an obsolete language only one state employee knows how to use, has spanned two administrations and lead vendors, has missed its deadlines three times, and has resulted in legislative oversight hearings, calls for an investigation by the state's watchdog agency, and a multimillion-dollar dispute between Workday and the state that's likely to go to court.
…
The effort to replace the state's HR management system goes back to 2016, when then-Gov. Paul LePage contracted Infor, a New York firm, to replace it for $13.5 million. That company proved incapable of delivering the promised system, and in November 2018 the LePage administration signed contracts with Workday to provide both the software and an implementation team to work alongside state contractors and staff to get the new system launched by Jan. 1, 2020. The price was $15 million.

A 2021 newspaper story says:

The Mills administration plans to cancel a contract and potentially seek repayment of up to $22 million from a multibillion-dollar software company over delays in upgrading the state’s outdated human resource systems.

So: the initial price was $15 million, but the state is demanding a refund of $22 million? And it’s still not usable (as of 2021)?

What Went Wrong?

Two of the former contractors asked that their names not be published because they fear professional repercussions for speaking out. The exception was Ahmadah Afif, a Maryland-based global training consultant who worked on the Workday Maine project from February 2019 to May 2020.

Afif, who has helped states, universities and local governments across the country implement Workday and other complex software systems, said the underlying problem became clear soon after she relocated to Augusta and began working on the project.

"Maine didn't have the right people in the right places, and the rates they were offering were not going to bring the right expertise to Maine," Afif said. "In order to save money, they gave jobs to people with little or no experience. It was like: 'Hey, I'm going to drop you in the fire. Dance for me!'"

Another former state contractor said: "I don't think any of us were qualified for the jobs we had. They were quick to hire whoever they could."

A government agency putting unqualified people on the job? Who could have predicted that? I don’t think that qualifies as a Black Swan.

HealthCare.gov

We’d be remiss not to include the rollout of ObamaCare, since you may have heard of it.

Kathleen Sebelius resigned from her post as Secretary of Health and Human Services under Obama over it, saving him the trouble of firing her. To quote Wikipedia,

The design of the website was overseen by the Centers for Medicare and Medicaid Services and built by a number of federal contractors, most prominently CGI Group of Canada. The original budget for CGI was $93.7 million, but this grew to $292 million prior to launch of the website. While estimates that the overall cost for building the website had reached over $500 million prior to launch and in early 2014 HHS Secretary Sylvia Mathews Burwell said there would be "approximately $834 million on Marketplace-related IT contracts and interagency agreements," the Office of Inspector General released a report in August 2014 finding that the total cost of the HealthCare.gov website had reached $1.7 billion and a month later, including costs beyond "computer systems," Bloomberg News estimated it at $2.1 billion.

Estimated cost: $93.7 million. Final cost: anywhere from $834 million to $2.1 billion.

What Went Wrong?

There are many, many analyses of this disaster, from the Washington Post, PC Magazine, Harvard Business School, and the government’s own Office of the Inspector General. If you read them all, you come away realizing it was an immensely complicated problem that the government tried to solve in its usual way: by adding more layers of management, consultancies, and review meetings while ignoring basic management techniques.

There might have been a Black Swan in there, but I don’t think they needed one. The “known known” problems were more than enough; it didn’t require any “unknown unknowns.”

Israel Chemical Limited

Just so it’s not all government projects, here’s one from the private sector.

In 2012, ICL began a journey to put in place a new operating model which would require a new system platform. The company appointed a new CEO to drive synergies and shift the company from being product-focused to market-focused.
ICL chose IBM to implement SAP because of IBM’s proposed approach with a high-level cost of $120M over a 5-year plan. The only other consideration made was for SAP’s proposed approach that would cost roughly $200M and take twice as long.
Within a year, ICL’s Board approved the project despite expected implementation costs growing to a $290M price tag, following the completion of a Phase 0 and Blueprint effort with the same 5-year plan.

So before they even started, the cost had grown from $120M (IBM’s winning bid; the $200M figure was SAP’s rejected proposal) to $290M.

Testing for critical portions of the design failed badly and significant gaps in functional capabilities were identified. In mid-2016, ICL appointed a new program manager who had significant operational experience. Soon after, ICL decided to delay the go-live even more – a full 15 months after the original target date.
Shortly after the decision to defer the go-live, the CEO resigned from the company. When the Board requested a review of the project, it was told the cost projection had increased to $500M – more than 4 times the cost of the original estimate.
This is when the Board decided to kill the project, terminate the contract with IBM, pull out of the pilot implementation, and write off more than $280M in costs.

Total failure, in other words.

What Went Wrong?

Stuff happened, i.e. Black Swans.

It wasn’t long for commodity prices in the market to fall and put severe profit pressure on ICL. As a move to push forward and capture synergies, the company decided to proceed with two operational changes that likely caused a distraction from the SAP implementation project:

Installing global regional service centers to support HR, Accounting, Finance, Procurement, IT, and Legal.
Changing its European operating model to conduct customer business through a single business entity.

This was followed by a worker strike in Israel at one of the manufacturing plants that was planned to be in the first deployment. The strike was resolved after a few months, but it likely inhibited the project team’s ability to access necessary internal resources. All of these external factors put a strain on the proposed timeline of the project.

Commodity prices changed? A strike at one plant? Who could have predicted that? Maybe those were Black Swans.

As Bad as “Normal” Projects Are, IT Projects Are Worse

Are these horror stories the outliers? Not really. Here’s Dr. Flyvbjerg, who’s collected probably the largest database of big projects in the world. They’re all bad, but IT is worse.

“Project estimates between 1910 and 1998 were short of the final costs an average of 28 percent,” according to The New York Times, summarizing our findings. “The biggest errors were in rail projects, which ran, on average, 45 percent over estimated costs [in inflation-adjusted dollars]. Bridges and tunnels were 34 percent over; roads, 20 percent. Nine of 10 estimates were low, the study said.” The results for time and benefits were similarly bad. And these are conservative readings of the data. Measured differently—from an earlier date and including inflation—the numbers are much worse.
The global consultancy McKinsey got in touch with me and proposed that we do joint research. Its researchers had started investigating major information technology projects—the biggest of which cost more than $10 billion—and their preliminary numbers were so dismal that they said it would take a big improvement for IT projects to rise to the level of awfulness of transportation projects. [italics mine] I laughed. It seemed impossible that IT could be that bad. But I worked with McKinsey, and indeed we found that IT disasters were even worse than transportation disasters. But otherwise it was a broadly similar story of cost and schedule overruns and benefit shortfalls.
That was startling. Think of a bridge or a tunnel. Now picture the US government’s HealthCare.gov website, which was a mess when it first opened as the “Obamacare” enrollment portal. Or imagine the information system used by the National Health Service in the United Kingdom. These IT projects are made of code, not steel and concrete. They would seem to be different from transportation infrastructure in every possible way. So why would their outcomes be statistically so similar, with consistent cost and schedule overruns and benefit shortfalls?

How Would You Keep This From Happening?

Scenario #1: You’re On The Board

Close your eyes: you’re in a boardroom full of sober people in suits, and today’s business is a massive IT project incorporating an outside vendor’s applications package, costing millions, employing outside consultants, and touching nearly every aspect of the business. The CIO is there to advocate for the proposal, having spent months assembling it. He or she has detailed estimates on the cost and schedule, which is projected to be two years. It’s been discussed several times before, and the Board seems eager to rubber stamp it and move on.

What questions should you ask before voting Yes or No? Remember, you have a fiduciary duty to act in the company’s best interests. They do pay for “Directors and Officers Insurance” to protect you from lawsuits, but still, those are unpleasant.

You have a sinking feeling that this will be a disaster, and two years from now you’ll fire the CIO for it. They haven’t begun to plan out all the contingencies.

Lying down on the railroad tracks

You could ask some dignified questions:

Q: What alternatives have you considered to this project?
Q: What are the risks and how have you protected against them?

So far, you haven’t lain down on the tracks. This is just being a responsible Board member. The CIO is all ready for those; you just threw what baseball hitters call a “cookie,” or a pitch they can hit out of the park:

A: We looked carefully at all the alternatives and worked with our users and other stakeholders. We presented the results to the Board last year and it was generally agreed that this is the best alternative.
A: If you look in the Appendix, we’ve carefully considered all the known risks and included plans to avoid or mitigate them.

Or you could ask an extremely rude question:

Q: These projects always seem to run into “unforeseen difficulties” (you make air quotes with your fingers). Why do you think those won’t happen here?

That’s a question that might make this term on the Board your last. You are being a Debbie Downer. No one likes those. The Board doesn’t want to tell the CIO not to do what he’s been planning for a year. It’s too late for that.

Scenario #2: You’re the CIO

Close your eyes again: Now you’re CIO, and it’s your chief deputy who’s been planning this project for a year. He brings it to you so you can take it to the Board.

You know that if this turns into a disaster, you’ll be the one fired. He probably won’t be, although your successor might fire him, or he might quit when you leave. What should you ask? First, you tell him a story from Nobel laureate Daniel Kahneman:

The 8-Year Textbook Design

In Thinking, Fast and Slow, Kahneman tells of an effort in Israel he was involved with, to develop a new curriculum and write a textbook for it:

I asked everyone to submit an estimate of how long it would take us to submit a finished draft of the textbook to the Ministry of Education. I was following a procedure that we already planned to incorporate into our curriculum: the proper way to elicit information from a group is not by starting with a public discussion but by confidentially collecting each person’s judgment. This procedure makes better use of the knowledge available to the group than the common practice of open discussion. I collected the estimates and jotted the results on the blackboard. They were narrowly centered around two years; the low end was one and a half, the high end two and a half years.

But then, Dr. Kahneman went one step further and asked Seymour, their curriculum expert, if he could think of other teams that had done something similar, namely, designed a curriculum from scratch. Seymour was familiar with quite a few. Kahneman asked him how long those teams had taken from this point in the project.

Seymour took a long time to answer, and blushed when he did. He realized that a substantial share of similar teams, almost 40%, had failed to finish at all. The team found that disturbing, since they’d never even considered the possibility of failing.

Kahneman asked Seymour, “Of the teams that succeeded, how long did it take them?” Seymour could not think of any team that had finished in less than seven years, or more than ten.

He got desperate; maybe these groups were less skilled: “When you compare our skills and resources to those of the other teams, how good are we? How would you rank us in comparison with these teams?”

Seymour answered more quickly this time. “We’re below average, but not by much.” Gloom was universal! A 40% chance of failing, and seven years if you succeed.

Kahneman said,

The statistics that Seymour provided were treated as base rates normally are — noted and promptly set aside.
We should have quit that day. None of us was willing to invest six more years of work in a project with a 40% chance of failure.
… After a few minutes of desultory debate, we gathered ourselves together and carried on as if nothing had happened.

The project was eventually completed eight years later, and by that time, the Ministry of Education had lost its enthusiasm and the textbook was never used.
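To put numbers on the gap between the committee’s inside view and Seymour’s base rates, here is a minimal Python sketch. The figures are the ones from the story above; the class and function names are my own, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceClass:
    """Outcomes of comparable past projects (the 'outside view')."""
    failure_rate: float  # fraction of teams that never finished
    min_years: float     # fastest team that did finish
    max_years: float     # slowest team that did finish


def underestimate_range(inside_low: float, inside_high: float,
                        ref: ReferenceClass) -> tuple[float, float]:
    """Return (smallest, largest) ratio of reference-class duration to inside estimate."""
    smallest = ref.min_years / inside_high  # most pessimistic estimate vs. fastest team
    largest = ref.max_years / inside_low    # most optimistic estimate vs. slowest team
    return smallest, largest


# Figures from Kahneman's story: estimates of 1.5 to 2.5 years,
# versus 7 to 10 years for comparable teams, 40% of which never finished.
seymour = ReferenceClass(failure_rate=0.40, min_years=7.0, max_years=10.0)
smallest, largest = underestimate_range(1.5, 2.5, seymour)

print(f"Even the most pessimistic estimate was short by a factor of {smallest:.1f}")
print(f"The most optimistic estimate was short by a factor of {largest:.1f}")
print(f"Chance of finishing at all: {1 - seymour.failure_rate:.0%}")
```

Even the gloomiest person in the room was off by almost a factor of three, and that is before counting the 40% chance of never finishing.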

Chances are your chief deputy will be as unfazed by this story as Kahneman’s committee was. He’ll say, “Yeah, that’s one weird anecdote. Five sigmas out, or something in the tail of the distribution. Let’s talk about this project.” Everyone thinks a disaster won’t happen to them.

An aside: “five sigmas out?”

People like to throw around terms like that. The normal distribution, or bell curve, is something everyone learns in Statistics class, and it’s found all over: in the heights of people, SAT scores, rainfall, you name it.

Khan Academy explains it here. Unfortunately, its ubiquity is exaggerated. People assume everything is normal. It isn’t. Karl Pearson, one of the founders of modern statistics, said,

I can only recognize the occurrence of the normal curve – the Laplacian curve of errors – as a very abnormal phenomenon. It is roughly approximated to in certain distributions; for this reason, and on account of its beautiful simplicity, we may, perhaps, use it as a first approximation, particularly in theoretical investigations.

So Pearson himself said it was only convenient to start there. He didn’t say to finish there.

“Tail Risk”

Our hypothetical IT manager also talked about “the tail of the distribution” to minimize the chances of a disaster. He’s implying it must be a bell curve. On a plot of the normal distribution, the “tail” is the part of the curve out beyond +3 or -3 standard deviations, and it’s skinny.

Along with the “everything’s a bell curve” delusion, people assume events “out in the tail” are vanishingly unlikely. Under a true bell curve they would be. But many real-world distributions have much fatter tails, and there “tail risk” is not negligible.
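Here is a rough Python sketch of how different the tails can be. It compares the chance of landing beyond a given number of standard deviations on a bell curve with the chance of exceeding a multiple of the minimum value under a Pareto (power-law) distribution. The exponent of 1.5 is my assumption, picked only for illustration; it is not fitted to any real project data.

```python
import math


def normal_tail(k: float) -> float:
    """P(Z > k) for a standard normal variable Z (k is in standard deviations)."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def pareto_tail(x: float, alpha: float, x_min: float = 1.0) -> float:
    """P(X > x) for a Pareto (power-law) variable with minimum value x_min."""
    if x <= x_min:
        return 1.0
    return (x_min / x) ** alpha


for k in (3, 5):
    print(f"bell curve: P(more than {k} sigmas out) = {normal_tail(k):.2e}")

# alpha = 1.5 is an illustrative assumption, not an estimate from data.
for multiple in (10, 100):
    p = pareto_tail(multiple, alpha=1.5)
    print(f"power law:  P(more than {multiple}x the minimum) = {p:.2e}")
```

On the bell curve, a five-sigma event has a probability of roughly 3 in 10 million. Under this power law, an outcome ten times the minimum happens about 3% of the time. Those are different worlds.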

Distributions other than “normal”

Plenty of real-world phenomena do not follow normal distributions. The Power Law is one of the best known. Personal net worth follows a power law, where Warren Buffett, Elon Musk, and Bill Gates have much more than the bulk of the population; music sales do, too (Taylor Swift’s sales are huge, while most musicians make next to nothing).

As Wikipedia says:

most identified power laws in nature have exponents such that the mean is well-defined but the variance is not, implying they are capable of black swan behavior.

The Black Swan — now there’s a term we’ll have to come back to!

How are IT project outcomes distributed, anyway?

Bent Flyvbjerg wrote a scholarly paper where he examined this question in rigorous statistical detail. His team analyzed 5,392 IT projects completed between 2002 and 2014 for cost overruns, and the data (which comes from all over the world) is all converted to 2015 US dollars. It looks like this

You’ll notice that this is not a bell curve, and the right tail (cost overrun) is vastly larger than the left tail (under budget). A significant number of projects were more than 10 times their original budget. This is why the average is so bad: those bad outcomes more than outweigh the projects that came in on budget.

Is this fair? Yes, it is; in fact, it includes information for the Black Swan events, where something unforeseeable happened. You might say, “No one could have predicted X”, but the average does include the times that X occurred.
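A toy example shows how a single blowout drags the average away from the typical project. The ten overrun figures below are invented for illustration; they are not taken from Flyvbjerg’s dataset.

```python
import statistics

# Hypothetical cost overruns for ten projects, in percent of the original budget.
# These numbers are invented for illustration; they are not from Flyvbjerg's data.
overruns_pct = [0, 0, 5, 10, 10, 15, 20, 25, 30, 900]

median_overrun = statistics.median(overruns_pct)
mean_overrun = statistics.mean(overruns_pct)
# Drop the single worst project and average the remaining nine.
mean_without_worst = statistics.mean(sorted(overruns_pct)[:-1])

print(f"median overrun: {median_overrun}%")
print(f"mean overrun: {mean_overrun}%")
print(f"mean without the worst project: {mean_without_worst:.1f}%")
```

The median project here is 12.5% over budget, which sounds survivable. The mean is 101.5%, because one project came in at ten times its budget. You don’t get to drop that project from the average just because nobody saw it coming.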

Flyvbjerg says in the paper:

A study by McKinsey and the BT Centre for Major Programme Management at the University of Oxford reports that on average, large IT projects run 45 percent over budget. What is worth noting in these reports and studies is that some IT projects have very large cost overruns of around 200% or even 400%. From a statistical point of view, based on these numbers it is reasonable to ask whether the probability distribution of IT project overruns follows a Gaussian or near-Gaussian distribution (i.e., normal or near normal distribution), as often assumed, or if it instead follows a power-law distribution.

So when your chief deputy starts getting all statistical on you, make a note for his next performance review: he has only a Statistics 101 view of the world.

Why is IT So Bad?

What makes big IT projects obey a power law and not a bell curve? There’s been scholarly research into that, too. First of all, the projects themselves are structurally different: their parts depend on each other more. But more importantly, there are psychological and political reasons why the estimation and management are so bad.

In Julius Caesar, there’s a famous quote:

Cassius:
"The fault, dear Brutus, is not in our stars,
But in ourselves, that we are underlings."

The Fault is in our Stars (This Article)

IT projects just have more moving parts. A delay in one part has cascading effects, similar to the way your kitchen renovation gets delayed because the plumber’s order for pipe fittings got delayed. He can’t do his part of the work, so everyone else is waiting around.

Flyvbjerg says:

In sum, building IT systems in which underlying technological components are interdependent requires substantial effort to make these components work together as a whole and a failure in a single component can have impact on other interconnected components. When components are unconnected, a problem in a single component is isolated from the rest of the system. However, in a system consisting of interconnected components that may include software, sensors, and communication devices (like in the IoT), a problem in a single component can lead to chain reactions affecting other connected components. In this research, we focus on interdependencies among technological components in an IT system in explaining substantial overruns in IT projects which can shape the probability distribution of IT project cost overruns.
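The chain-reaction idea in that passage can be sketched as a walk over a dependency graph. This is my own minimal illustration in Python, not the model from the paper, and the component names are hypothetical.

```python
from collections import deque


def affected_components(dependents: dict[str, list[str]], failed: str) -> set[str]:
    """Return the failed component plus everything downstream of it.

    `dependents` maps each component to the components that depend on it.
    """
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# Hypothetical systems, invented for illustration.
unconnected = {"database": [], "payroll": [], "benefits": [], "reporting": [], "portal": []}
connected = {
    "database": ["payroll", "benefits"],
    "payroll": ["reporting"],
    "benefits": ["portal"],
    "reporting": ["portal"],
    "portal": [],
}

for name, system in (("unconnected", unconnected), ("connected", connected)):
    hit = affected_components(system, "database")
    print(f"{name}: a database problem touches {len(hit)} of {len(system)} components")
```

In the unconnected system, a database problem stays a database problem. In the connected one, it reaches every component, which is the plumber’s late pipe fittings all over again.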

No, the Fault is in Ourselves (Next Article)

Flyvbjerg and Daniel Kahneman had a spirited debate about whether the main problem is psychology or politics. In the end, they came to agree that both came into play: mainly psychology for smaller projects and mainly politics for larger ones.

In the next article, we’ll look at what is, to me, the more interesting part: why do we keep doing this? It can’t be that there isn’t a book yet on scheduling IT projects, or Gantt chart software, or courses available at the great business schools. There are. There must be underlying psychological and political reasons — in other words, as people and as organizations, we don’t want to do it right. Next time.

E.Z. Prine

Jul 11, 2023

This is fascinating. I knew two people who worked on the NHS IT project to make medical records available across the system, and it sounded like an incredible mess. I'm looking forward to reading the next installment about the political and psychological factors.


Life Since the Baby Boom
