Fault Tolerance Basics

Fault tolerance is the ability of a system to keep operating when elements within the system fail. It is sometimes also called a fail-safe design.

A fault tolerant system may continue to operate normally after one of its power supplies fails, for example. Or it may operate in a reduced or degraded state.

Other systems may have a ‘limp home’ condition, allowing the system to save critical data or allowing you to drive to a safe place to change a flat tire.

There are conditions where an outright system failure is not acceptable.

Communication, banking, air traffic control, transportation, and many other fields have systems where a failure to operate may lead to catastrophic results. A system that can experience component, subsystem, or software failures and still continue to operate in some capacity is often highly desirable.

Fault Tolerant System Basic Characteristics

A fault tolerant system may have one or more of the following characteristics:

No Single Point of Failure

This means that if a capacitor, a block of software code, a motor, or any other single item fails, the system does not fail. As an example, many hospitals have backup power systems in case grid power fails, keeping critical systems within the hospital operational.

Critical systems may have multiple redundancy schemes to maintain a high level of fault tolerance and resilience.
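The backup power idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (first_available, grid_power, backup_generator) and the voltage value are hypothetical, and each source is modeled as a callable that either returns a value or fails.

```python
from typing import Callable, Optional, Sequence


def first_available(sources: Sequence[Callable[[], Optional[float]]]) -> Optional[float]:
    """Return the output of the first working source, or None if all fail.

    Each source is a hypothetical callable that returns a value when healthy
    and either returns None or raises an exception when it has failed.
    """
    for source in sources:
        try:
            value = source()
        except Exception:
            continue  # this source failed; try the next redundant one
        if value is not None:
            return value
    return None  # every redundant source failed


def grid_power() -> Optional[float]:
    raise RuntimeError("grid outage")  # simulated grid failure


def backup_generator() -> Optional[float]:
    return 480.0  # illustrative voltage, not a real specification


# The grid is down, so the backup generator supplies power.
supply = first_available([grid_power, backup_generator])
```

With only grid_power in the list, the same call returns None, which is the single point of failure case the redundancy is meant to remove.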

No Single Point Repair Takes the System Down

Extending the single point of failure idea, repairing a failed component should not require powering down the system, for example.

It also means the system remains online and operational during repair. This may pose challenges for both the design and the maintenance of a system. Hot swappable power supplies are an example: a faulty power supply can be replaced while the system keeps operating.

Fault Isolation or Identification

The system is able to identify when a fault occurs and does not permit the faulty element to adversely influence functional capability (e.g., losing data or making logic errors in a banking system). The faulty elements are identified and isolated.

Portions of the system may have the sole purpose of detecting faults; built-in self-test (BIST) is an example.
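As a rough software analogy for detection and isolation, the sketch below runs a self-test on each element and takes any element that fails out of service. The Element class, the run_self_tests function, and the element names are hypothetical and exist only for this illustration; a real BIST is implemented in the hardware or firmware of the system in question.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    name: str
    self_test: Callable[[], bool]  # hypothetical self-test; True means healthy
    isolated: bool = False


def run_self_tests(elements: List[Element]) -> List[str]:
    """Run each element's self-test and isolate the ones that fail.

    Returns the names of the elements isolated during this pass.
    """
    newly_isolated = []
    for element in elements:
        if element.isolated:
            continue  # already out of service; do not test it again
        try:
            healthy = element.self_test()
        except Exception:
            healthy = False  # a crashing self-test is treated as a fault
        if not healthy:
            element.isolated = True
            newly_isolated.append(element.name)
    return newly_isolated


def active_names(elements: List[Element]) -> List[str]:
    """Names of the elements still allowed to take part in operation."""
    return [e.name for e in elements if not e.isolated]


elements = [
    Element("memory_bank_0", lambda: True),
    Element("memory_bank_1", lambda: False),  # simulated faulty element
]
faulty = run_self_tests(elements)
```

After the pass, only the healthy element remains in active_names(elements), so the faulty one can no longer influence results.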

Fault Containment

When a failure occurs, it may damage other elements within the system, creating a second or third fault and ultimately a system failure.

For example, if an analog circuit fails, it may increase the current through the system, damaging logic circuits that cannot withstand high current conditions. The idea of fault containment is to avoid or minimize collateral damage caused by a single point failure.

Robustness or Variability Control

When a system experiences a single point failure, the system changes.

The change may be transient or permanent, and it affects how the working elements of the system respond and function. Variation is always present, and a failure often increases it.

For example, when one of two power supplies fails, the remaining power supply takes on the full load of the power demand. This transition should occur without impacting the performance of the system. The ability to design and manufacture a robust system may involve Design for Six Sigma, design of experiments optimization, and other tools to create a system able to operate when a failure occurs.
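The two power supply example comes down to simple arithmetic: after any one supply fails, the remaining rated capacity must still cover the load. The sketch below checks that condition. The function names and the wattage figures are assumptions made for illustration, and real designs also need to consider transients, derating, and temperature.

```python
from typing import Sequence


def survives_single_failure(load_w: float, ratings_w: Sequence[float]) -> bool:
    """True if the load is still covered after any one supply fails.

    ratings_w holds the rated output of each supply in watts.
    """
    if len(ratings_w) < 2:
        return False  # with fewer than two supplies there is no redundancy
    total = sum(ratings_w)
    # Remove each supply in turn and check the capacity that remains.
    return all(total - rating >= load_w for rating in ratings_w)


def load_per_supply(load_w: float, working_supplies: int) -> float:
    """Load carried by each supply, assuming the load is shared equally."""
    return load_w / working_supplies


# Illustrative numbers: a 400 W load on two 500 W supplies.
ok = survives_single_failure(400.0, [500.0, 500.0])
before = load_per_supply(400.0, 2)  # each supply carries half the load
after = load_per_supply(400.0, 1)   # the survivor carries all of it
```

With a 600 W load on the same two supplies, the check fails because a single 500 W supply cannot carry the full demand.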

Reversion State Operation (Fall-Back or Limp-Along)

There are many ways a system may alter its performance when a failure occurs, enabling the system to continue to function in some fashion.

For example, if part of a computer’s cooling system fails, the central processing unit (CPU) may reduce its speed or command execution rate, effectively reducing the heat the CPU generates. The failure causes a loss of cooling capacity, and the CPU adjusts to compensate, avoiding overheating and failure. Other reversion schemes may include a roll back to a prior working state, or a switch to a prior or safe mode software set.

In some cases, the system may be able to operate with no or only minimal loss of functional capability. In others, the reversion operation restricts the system to a critical few functions.
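A minimal sketch of the CPU example follows. It is not how any particular processor implements thermal throttling; the function name next_clock_mhz and all of the thresholds and step sizes are hypothetical values chosen for illustration.

```python
def next_clock_mhz(
    temp_c: float,
    current_mhz: int,
    limit_c: float = 90.0,       # assumed temperature limit
    hysteresis_c: float = 10.0,  # assumed margin before speeding up again
    step_mhz: int = 200,         # assumed adjustment per control step
    min_mhz: int = 800,          # assumed degraded-mode floor
    max_mhz: int = 3000,         # assumed normal operating speed
) -> int:
    """Pick the next clock speed from the current temperature.

    Above the limit the clock steps down toward min_mhz. Well below the
    limit it steps back up toward max_mhz. In between it holds steady.
    """
    if temp_c > limit_c:
        return max(min_mhz, current_mhz - step_mhz)
    if temp_c < limit_c - hysteresis_c:
        return min(max_mhz, current_mhz + step_mhz)
    return current_mhz


# Cooling is degraded and the CPU runs hot, so the clock steps down.
throttled = next_clock_mhz(temp_c=95.0, current_mhz=3000)
```

The hysteresis band is a design choice in this sketch: it keeps the clock from oscillating when the temperature hovers near the limit.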

Summary

The ability of a system to continue operation despite a failure of any single element within the system implies the system is not in a series configuration, in which the failure of any one element causes the system to fail.

There is some form of redundancy or some alternative means to continue operation. The system may use multiple redundant elements, or be resilient to changes in the system’s configuration.
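The difference between a series and a redundant (parallel) arrangement can be shown with the standard reliability formulas, assuming the elements fail independently. The function names and the 0.9 reliability value below are illustrative assumptions.

```python
from math import prod
from typing import Sequence


def series_reliability(reliabilities: Sequence[float]) -> float:
    """All elements must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities: Sequence[float]) -> float:
    """At least one element must work: one minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Two elements, each with an assumed reliability of 0.9.
in_series = series_reliability([0.9, 0.9])
in_parallel = parallel_reliability([0.9, 0.9])
```

With these inputs the series arrangement comes out at roughly 0.81 and the parallel arrangement at roughly 0.99, which is why redundancy sits at the center of most fault tolerant designs.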

Creating a fault tolerant system often requires careful planning and an understanding of how elements fail and how each failure affects the surrounding elements.
