10/15/2013

Striking a balance between reliability and availability

Can you achieve one without
sacrificing the other?
Maybe it's just me, but a lot of people seem to use reliability and availability interchangeably. I often hear people say 99.999% reliability when, in fact, they are referring to availability.

So what is the difference between the two? And why is that difference important? I'm glad you asked. :-)

In a software-based system, availability refers to how often the system responds to events or stimuli in a timely manner; reliability, on the other hand, refers to how often the responses are correct. The distinction can be a matter of life or death. For instance, in some medical devices, it is preferable to have no response (where little or nothing happens to the patient) than a wrong response (where the device harms the patient irreparably). Whereas in other systems, any response of sufficient accuracy or quality may be preferable to no response at all.

But here's the thing. Regardless of whether a system is more sensitive to availability or reliability, it should still take pre-defined (and carefully considered) actions when a dangerous condition arises. For instance, if the control system for a high-speed train fails, it will move to its design safe state, which will probably involve applying the brakes.

So far, so good. The problem is, many systems are components of larger systems. So even when a component is avoiding a genuinely dangerous situation, its behavior may put stress on the larger system and lower that system's availability.

Moreover, the behavior of an overall system when an unanticipated condition occurs can be very difficult to predict, for the simple reason that the system depends on multiple, largely independent, components moving to their design safe states. None of those components, and their safe states, can be considered in isolation. For instance, in 1993, Lufthansa Flight 2904 overran a runway because the reverse thrust deployment system operated exactly to specification. Unfortunately, the system designers hadn't anticipated conditions during a cross-wind landing.

Enough from me. I invite you read the ECN article "Balancing reliability and availability", written by my colleague and senior software developer Chris Hobbs. Chris discusses how it's possible to strike a balance between reliability and availability — and why designing safe software can require the ability and willingness to think from the outside in.

No comments: