Some of software's darkest failures from recent history

Abhimanyu Grover
July 13, 2016

No matter how big or small a development team is, no software is without bugs. Bugs can be minimized but never eliminated. There will always be a case nobody on the team ever thought of, one that presents itself at the worst possible time.

You don’t see it coming. You’re shocked that something is wrong but can’t say what or where. It doesn’t make sense.

We’ve all been in such scenarios before: they stay mysterious until we find the rationale behind them.

Sometimes these bugs are the result of rare scenarios that would have been nearly impossible to anticipate. Such rare events are also called ‘Black Swan events’.

A Black Swan event is a highly improbable, high-impact occurrence that is impossible to predict. Its shock value causes astonishment and disbelief because nobody imagines such an event could ever occur. The concept was introduced by the writer Nassim Nicholas Taleb in his book ‘The Black Swan: The Impact of the Highly Improbable’.

Taleb mostly focuses on the ideas of fragility and antifragility in financial, medical, economic, and political systems. The Black Swan metaphor itself dates back several centuries, to the once-unquestioned belief that all swans are white. In the 17th century, however, black swans were discovered in Australia, exposing the limits of our knowledge and imagination.

Taleb’s book doesn’t discuss applying the Black Swan concept to software systems, but there are valuable lessons the testing community can draw from it when testing the performance of software systems.

Software failures result from a variety of causes: mistakes are made during coding, and undetected bugs can lie dormant for a long time before causing failures.

Here are some catastrophic failures that resulted from software bugs nobody could have thought of:

1. Ariane 5, 1996

On June 4, 1996, the Ariane 5 rocket exploded shortly after lift-off on its maiden flight, destroying itself and its payload of four scientific satellites. The European Space Agency had spent a decade and some $7 billion developing the rocket and preparing it for its first voyage; the rocket and cargo lost were valued at $500 million. The inquiry board’s report attributed the failure to a software error in the inertial reference system: a 64-bit floating-point value was converted to a 16-bit signed integer, the value was too large to fit, and the resulting unhandled exception brought the guidance system down.
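
The failure class here is easy to illustrate. Below is a minimal sketch in C (the actual flight software was written in Ada, and the variable name is hypothetical) of what an unchecked narrowing conversion does to an out-of-range value, next to a checked alternative that refuses to convert:

```c
/* Unchecked vs. checked narrowing of a wide value into 16 bits.
 * Illustrative only; not the Ariane flight code. */
#include <stdio.h>
#include <stdint.h>

/* Unchecked: 50000 silently wraps to a garbage value on typical targets.
 * (Going through int32_t keeps this implementation-defined rather than
 * undefined behavior in C.) */
static int16_t to_int16_unchecked(double value) {
    return (int16_t)(int32_t)value;
}

/* Checked: refuse values that do not fit, so the caller must handle it. */
static int to_int16_checked(double value, int16_t *out) {
    if (value < INT16_MIN || value > INT16_MAX)
        return 0;
    *out = (int16_t)value;
    return 1;
}

int main(void) {
    double horizontal_bias = 50000.0; /* hypothetical out-of-range reading */
    int16_t v;

    printf("unchecked: %d\n", to_int16_unchecked(horizontal_bias));
    if (!to_int16_checked(horizontal_bias, &v))
        printf("checked:   rejected out-of-range value\n");
    return 0;
}
```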

2. NASA’s Mars Climate Orbiter, 1999

On September 23, 1999, NASA’s Mars Climate Orbiter was lost because engineers failed to make a simple conversion from English units to the NASA-mandated metric units: ground software reported thruster impulse in pound-force seconds, while the navigation software consuming those figures expected newton seconds. As a result, the spacecraft dived too low into the Martian atmosphere, where it could not withstand the resulting stresses; it was most likely destroyed, or else hurtled on through space into an orbit around the sun.
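
A common defense against this class of bug is to stop passing unit-less raw numbers around: wrap unit-bearing quantities in distinct types so that mixing them up becomes a compile-time error rather than a silent navigation error. A minimal C sketch (types, names, and values are hypothetical):

```c
#include <stdio.h>

/* Distinct wrapper types: the compiler now rejects a pounds-for-newtons mix-up. */
typedef struct { double value; } newton_seconds;     /* SI impulse */
typedef struct { double value; } poundforce_seconds; /* imperial impulse */

#define LBF_S_TO_N_S 4.448222  /* 1 lbf·s ≈ 4.448222 N·s */

static newton_seconds from_lbf_s(poundforce_seconds imp) {
    newton_seconds r = { imp.value * LBF_S_TO_N_S };
    return r;
}

/* Trajectory code accepts SI impulse only. */
static void apply_impulse(newton_seconds imp) {
    printf("applying %.3f N*s\n", imp.value);
}

int main(void) {
    poundforce_seconds reported = { 10.0 }; /* hypothetical thruster report */

    /* apply_impulse(reported);   <-- would not compile: wrong type */
    apply_impulse(from_lbf_s(reported));    /* explicit, visible conversion */
    return 0;
}
```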

3. American Power Grid collapse, 2003

On August 14, 2003, areas of the Northeastern United States and Southeastern Canada experienced widespread power failures that shut down nuclear power plants and disrupted air traffic in the affected areas. More than 50 million people were affected, and the disruption was estimated to cost $13 billion. Investigators found that a maintenance engineer had forgotten to re-enable the trigger that launched the state estimator at regular intervals. The software that should have raised an alarm about the deteriorating conditions also failed, silently. To make matters worse, its backup server failed too, so the cascading line failures went unnoticed until the entire transmission system collapsed.
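
A failure like this is doubly dangerous because the monitoring itself died silently: operators assumed silence meant all was well. One standard countermeasure is a heartbeat watchdog, where an independent supervisor treats prolonged silence from a monitoring process as a failure in its own right. A minimal sketch (names and timeout are hypothetical):

```c
#include <stdio.h>
#include <time.h>

#define HEARTBEAT_TIMEOUT_S 30  /* tolerated silence before we assume failure */

typedef struct {
    time_t last_heartbeat;
} alarm_monitor;

/* Called by the alarm process on every healthy cycle. */
static void record_heartbeat(alarm_monitor *m) {
    m->last_heartbeat = time(NULL);
}

/* Called periodically by an independent supervisor process. */
static void check_alarm_process(const alarm_monitor *m) {
    if (time(NULL) - m->last_heartbeat > HEARTBEAT_TIMEOUT_S)
        printf("WARNING: alarm system silent - failing over, alerting operators\n");
}

int main(void) {
    alarm_monitor m;
    record_heartbeat(&m);
    m.last_heartbeat -= 60;   /* simulate a stalled alarm process */
    check_alarm_process(&m);
    return 0;
}
```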

4. AT&T Telephone System, 1990

On January 15, 1990, AT&T’s long-distance telephone switching system collapsed. Sixty thousand people lost their telephone service, and around seventy million calls went uncompleted. The crash started in a single switching station in Manhattan, but it spread station after station across America until half of AT&T’s network had gone haywire. On investigation, AT&T’s software engineers found that the crash was caused by a bug in AT&T’s own software, the code that enables switches to sync up with one another. The bug caused miscommunication between the switches, and as a result the entire network collapsed.
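
Widely reported accounts of the outage trace the bug to a misplaced break statement in the switches’ C code. The sketch below (illustrative, not AT&T’s actual 4ESS code) shows the failure class: a break inside an if that is nested in a switch case exits the entire switch, silently skipping the bookkeeping that follows:

```c
#include <stdio.h>

enum msg { MSG_RECOVERY, MSG_OTHER };

static void handle_message(enum msg type, int busy) {
    switch (type) {
    case MSG_RECOVERY:
        if (busy) {
            printf("message queued\n");
            break;  /* BUG: meant to leave only the if, but break exits the
                       whole switch, skipping the state update below */
        }
        printf("message processed\n");
        /* bookkeeping that must run for every recovery message: */
        printf("switch state updated\n");
        break;
    case MSG_OTHER:
        break;
    }
}

int main(void) {
    handle_message(MSG_RECOVERY, 1);  /* busy path: state silently left stale */
    return 0;
}
```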

5. Iran Nuclear Plant, 2010

Undetected vulnerabilities in software have paved the way for scathing cyber-attacks as well. A USB pen drive carried into an Iranian nuclear facility, reportedly by an insider, introduced the Stuxnet worm. The worm replicated itself from one machine to another and ultimately subverted the control systems in the nuclear plant, damaging centrifuges and sabotaging the Iranian nuclear program.

These case studies suggest that no matter how much we estimate and plan, unexpected events can throw those plans into chaos. Merely hunting for black swans will not suffice, as we live in an ever-changing world full of uncertainty and human fallibility.

There are also teams whose job is to minimize the probability of such events, and those jobs are growing in number as we come to rely ever more on technology. Here’s what they do:

Risk assessment

The journey towards a robust and resilient application starts with understanding how software can be fragile, and how to strengthen it to overcome that fragility.

Failure Brainstorming

Some large companies also have separate teams for Failure Mode and Effects Analysis (FMEA), which is used to identify risks and points of failure both within and outside the boundaries of the present environment. FMEA requires the right people from different domains to use brainstorming and lateral thinking to identify all the components, modules, dependencies, and limitations that could fail in the production environment and eventually lead to system collapse; each identified failure mode is then scored and ranked, as sketched below.
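
To make such a session actionable, FMEA worksheets conventionally score each failure mode for severity, occurrence, and detectability (each on a 1-10 scale) and rank the mitigation work by the product of the three, the Risk Priority Number (RPN). A minimal sketch with hypothetical data:

```c
#include <stdio.h>

typedef struct {
    const char *failure_mode;
    int severity;    /* 1 = negligible   ... 10 = catastrophic */
    int occurrence;  /* 1 = rare         ... 10 = near-certain */
    int detection;   /* 1 = caught early ... 10 = undetectable */
} fmea_row;

/* Risk Priority Number: higher means mitigate first. */
static int rpn(const fmea_row *r) {
    return r->severity * r->occurrence * r->detection;
}

int main(void) {
    fmea_row rows[] = {
        { "unchecked unit conversion",  9, 3, 8 },
        { "silent alarm-process crash", 8, 2, 9 },
        { "disk full on log server",    4, 6, 3 },
    };
    size_t n = sizeof rows / sizeof rows[0];

    for (size_t i = 0; i < n; i++)
        printf("%-28s RPN = %d\n", rows[i].failure_mode, rpn(&rows[i]));
    return 0;
}
```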

Survival Strategy

Knowing that technology will fail is the first motivation for preparing survival strategies. This means being prepared for failure and having alternative methods in place to contain the damage.

Fault-Tolerant System Design

A lot of research is going into designing fault-tolerant systems. Depending on the domain, these systems can either shut themselves off until the issue has been rectified or switch to an alternate ‘safe system’ until they are fixed by manual intervention; a minimal sketch of the latter follows.
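
As a minimal illustration of the ‘switch to a safe system’ idea (the handlers and health flag here are hypothetical), the dispatcher below routes work to a conservative fallback whenever the primary subsystem reports unhealthy:

```c
#include <stdio.h>

typedef int (*handler)(int input);

static int primary_ok = 0;  /* pretend a health check flipped this to 0 */

static int primary_handler(int input)   { return input * 2; }      /* full behavior */
static int safe_mode_handler(int input) { (void)input; return 0; } /* conservative default */

/* Route to the primary when healthy, otherwise fail over to safe mode. */
static handler select_handler(void) {
    if (primary_ok)
        return primary_handler;
    printf("primary unhealthy: switching to safe mode\n");
    return safe_mode_handler;
}

int main(void) {
    handler h = select_handler();
    printf("result = %d\n", h(21));
    return 0;
}
```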

I suspect these kinds of tasks cannot be done within typical office hours under a deadline. The level of intuition and intelligence required for such brainstorming is off the charts, and may be beyond many of us. Thinking about all of it reminds me of a quote often attributed to Einstein:

“We cannot solve our problems with the same level of thinking that created them”