We’ve all encountered unreliable software in products. Your smartphone crashes, or your GPS doesn’t work the way it should. This can be very annoying, but often all it takes is a reboot or another try. A source of irritation, but not something we worry about too much. Now imagine that you have to make an emergency stop in your car and the brakes don’t respond when you hit the pedal. Suddenly, unreliable software becomes life-threatening. Infamous past examples of unreliable software include the Ariane 5 rocket and the Therac-25 radiation therapy machine. A more recent example was the New Horizons Pluto probe, whose software stopped responding just before the spacecraft arrived at Pluto. Unfortunately, out there, rebooting by hand is not an option.

Bryan Bakker

How do you measure software reliability?

We can quite easily imagine unreliable software. But what exactly is software reliability? Take an airbag as an example. Fortunately, I’ve never needed to use one, even though I’m in the car almost daily and cover huge distances every year. But does that make the airbag reliable, just because it has worked fine up to now? Plenty of different definitions of software reliability are in use; the most common one is:

Software Reliability is the probability of failure-free software operation in a specified environment for a specified period of time.

This definition should always be applied within the context of the product: for an airbag (and its software), the ‘specified period of time’ is not that relevant, for instance. It’s more meaningful to consider the number of times the airbag is actually deployed, although that too can be expressed in time. Similarly, the reliability of car engines (and other components) is often expressed in mileage covered.
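To make the definition concrete, a common textbook simplification (an assumption on my part here, not something the approach prescribes) is a constant failure rate, under which the probability of failure-free operation over a mission of length t is R(t) = exp(−t / MTBF). A minimal sketch:

```python
import math

def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of failure-free operation for mission_hours,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-mission_hours / mtbf_hours)

# Example: a system with an MTBF of 5,000 hours on a 24-hour mission.
print(round(reliability(5000, 24), 4))  # -> 0.9952
```

Note how the same MTBF yields a very different reliability for a 24-hour mission than for a 5,000-hour one, which is exactly why the ‘specified period of time’ in the definition matters.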

Once it’s clear what software reliability means in the context of the product, the required actions must be defined (and executed). This means clearly specifying the expected reliability (often expressed as MTBF, Mean Time Between Failures), deciding which measures are needed in the software architecture, and defining which tests should be performed. Perhaps certain parts of the software are not developed in the traditional way, but generated through MDSD, Model Driven Software Development (described in a previous blog by Michaël van de Ven). Software reliability can also be measured during development, and so-called Reliability Growth Plots can be created to report progress clearly. Extrapolation then makes it possible to estimate the product’s software reliability at release, which prevents the actual reliability from coming as a surprise.
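The extrapolation behind a Reliability Growth Plot can be sketched with the classic Duane growth model (one common choice; the blog doesn’t prescribe a specific model): cumulative MTBF observed during testing tends to follow a power law in cumulative test time, so a straight-line fit on log-log axes can be extended to future test time. The failure data below is purely illustrative:

```python
import math

# Illustrative (hypothetical) data: cumulative test hours at which each
# successive failure was observed. Real data comes from the test programme.
failure_times = [30, 95, 190, 340, 570, 900, 1400]

# Duane model: cumulative MTBF = t / n(t) follows a power law in t,
# so log(cumulative MTBF) is linear in log(t). Fit by least squares.
xs = [math.log(t) for t in failure_times]
ys = [math.log(t / (i + 1)) for i, t in enumerate(failure_times)]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def projected_cumulative_mtbf(total_hours: float) -> float:
    """Extrapolate cumulative MTBF to a future amount of test time."""
    return math.exp(intercept + slope * math.log(total_hours))

print(f"growth slope: {slope:.2f}")
print(f"projected cumulative MTBF at 5000 h: {projected_cumulative_mtbf(5000):.0f} h")
```

A positive slope indicates reliability growth; extrapolating to the planned release point gives the kind of estimate the blog refers to, so the achieved reliability doesn’t come as a surprise.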

Towards higher software reliability in four steps

There’s literature available on software reliability, based mostly on John Musa’s fundamental ideas. However, this theory is not easy to apply in practice. Especially when an organisation wants to demonstrate its software’s reliability for the first time, there will be more questions than answers. There has to be another way...

Based on our own practical experience, I joined Peter Wijnhoven, Rob de Bie and René van den Eertwegh in writing a book with a four-step approach to higher software reliability.

Using four steps (and of course the required sub-steps), it becomes clear what actions are needed to get a grip on the software reliability of complex products. Here’s a quick foretaste:

  1. In the first step, the end-user’s domain is analysed, leading to describing the reliability requirements for the various functions the system offers.
  2. The software’s reliability requirements are then determined, and so-called operational profiles are derived from them for the software’s various components.
  3. In the third step the engineering process is fine-tuned to achieve the desired reliability; after all, the reliability shouldn’t be too low, but neither should it be (much) too high.
  4. The final step describes how software reliability can be measured and reported. These steps are undertaken continuously in an agile environment.

Software reliability is a choice

Organisations that really want to get software reliability under control first have to understand exactly what it means in their own context, and must then determine and perform the required actions. Naturally this takes time and money, but just like the functionality of your product... it’s a choice! If you decide to focus more on new features and less (or not at all) on reliability, you could ultimately face unfortunate surprises.

What a pity it would be for end-users to be the first to encounter disappointing software reliability. Because after all, you only get one chance to make a good first impression.

Read more:

Bryan Bakker (Sioux), Peter Wijnhoven (Sioux), Rob de Bie (Key Consult), and René van den Eertwegh (Altran) are the joint authors of a book describing the four-step approach: Finally... Reliable Software!: A practical approach to design for reliability. The book is available from Amazon and other retailers.

The first half of the book describes the theory; the second half contains an extensive case study, based on the authors’ experience, in which the steps are applied.