Accendo Reliability

Your Reliability Engineering Professional Development Site


by Fred Schenkelberg

Fault Tolerance Basics

Fault tolerance is the ability of a system to remain resilient to the failure of elements within the system. It may also be called a fail-safe design.

A fault tolerant system may continue to operate just fine after one of its power supplies fails, for example. Or it may operate in a reduced or degraded state.

Other systems may have a ‘limp home’ condition, allowing the system to save critical data or allowing you to drive to a safe place to change a flat tire.

There are conditions where an outright system failure is not acceptable.

Communication, banking, air traffic control, transportation, and many other fields have systems where a failure to operate may lead to catastrophic results. A system that can experience component, subsystem, or software failures and still continue to operate in some capacity is often highly desired.

Fault Tolerant System Basic Characteristics

A fault tolerant system may have one or more of the following characteristics:

No Single Point of Failure

This means that if a capacitor, a block of software code, a motor, or any other single item fails, the system does not fail. As an example, many hospitals have backup power systems in case the grid power fails, thus keeping critical systems within the hospital operational.

Critical systems may have multiple redundant schemes to maintain a high level of fault tolerance and resilience.
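The hospital backup power example can be sketched in a few lines of code. This is only an illustration: the PowerSource class, the select_source function, and the two source names are hypothetical and chosen for this sketch, and a real transfer scheme would also handle switchover timing and health detection.

```python
from dataclasses import dataclass


@dataclass
class PowerSource:
    """Hypothetical model of one power source feeding a critical load."""
    name: str
    healthy: bool = True


def select_source(sources):
    """Return the first healthy source in priority order, or None if all have failed."""
    for source in sources:
        if source.healthy:
            return source
    return None


grid = PowerSource("grid")
generator = PowerSource("backup generator")
sources = [grid, generator]  # priority order: grid first, then backup

grid.healthy = False  # the grid fails
active = select_source(sources)
print(active.name)  # backup generator
```

With only the grid in the list, the grid itself would be a single point of failure. Adding the generator means that no one source failure leaves the load unpowered.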

No Single Point Repair Takes the System Down

This extends the single point of failure idea to repairs: repairing a failed component does not require powering down the system, for example.

It also means the system remains online and operational during the repair. This may pose challenges for both the design and the maintenance of a system. Hot-swappable power supplies are an example of a design that keeps the system operating while a faulty power supply is replaced.

Fault Isolation or Identification

The system is able to identify when a fault occurs and does not permit the faulty element to adversely influence the system's functional capability (e.g., losing data or making logic errors in a banking system). The faulty elements are identified and isolated.

Portions of the system may have the sole purpose of detecting faults; built-in self-test (BIST) is an example.
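One common way to both identify and isolate a faulty element is majority voting across redundant channels. The article does not name this technique, so treat the following as an assumed illustration. The function name vote and the sample readings are hypothetical.

```python
from collections import Counter


def vote(readings):
    """Majority-vote across redundant channels.

    Returns (agreed_value, indexes_of_disagreeing_channels).
    Raises ValueError when no value holds a strict majority.
    """
    value, count = Counter(readings).most_common(1)[0]
    if count <= len(readings) // 2:
        raise ValueError("no majority; the fault cannot be isolated")
    faulty = [i for i, reading in enumerate(readings) if reading != value]
    return value, faulty


# Three redundant channels report an account balance; channel 2 is corrupted.
value, faulty = vote([1050, 1050, 9999])
print(value, faulty)  # 1050 [2]
```

The voter passes the agreed value downstream and reports which channel disagreed, so the faulty channel can be flagged and excluded rather than allowed to corrupt the result.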

Fault Containment

When a failure occurs, it may damage other elements within the system, thus creating a second or third fault and a system failure.

For example, if an analog circuit fails, it may increase the current through the system, damaging logic circuits that cannot withstand high current conditions. The idea of fault containment is to avoid or minimize collateral damage caused by a single point failure.

Robustness or Variability Control

When a system experiences a single point failure, the system changes.

The change may be transient or permanent, and it affects how the working elements of the system respond and function. Variation is always present, and when a failure occurs there is often an increase in variability.

For example, when one of two power supplies fails, the remaining power supply takes on the full load of the power demand. This transition should occur without impacting the performance of the system. Designing and manufacturing a robust system may involve Design for Six Sigma, design of experiments optimization, and other tools to create a system able to operate when a failure occurs.
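The power supply example implies a simple sizing check: after any one supply fails, the remaining supplies must still be rated for the full load. The sketch below assumes that check. The function name and the wattage values are hypothetical, and the check ignores transient behavior during the handover.

```python
def survives_single_failure(load_w, ratings_w):
    """Return True if the load can still be carried after any one supply fails."""
    total_w = sum(ratings_w)
    return all(total_w - rating >= load_w for rating in ratings_w)


# Two 500 W supplies sharing a 400 W load: either one alone can carry it.
print(survives_single_failure(400, [500, 500]))  # True

# The same supplies with a 600 W load: losing either one overloads the other.
print(survives_single_failure(600, [500, 500]))  # False
```

The second case works while both supplies are healthy, yet it is not fault tolerant, because the surviving supply cannot take on the full demand.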

Reversion State Operation (Fall-Back or Limp-Along)

There are many ways a system may alter its performance when a failure occurs, enabling the system to continue to function in some fashion.

For example, if part of a computer’s cooling system fails, the central processing unit (CPU) may reduce its speed or command execution rate, effectively reducing the heat the CPU generates. The cooling failure causes a loss of cooling capacity, and the CPU adjusts to accommodate it, avoiding overheating and failure. Other reversion schemes may include a rollback to a prior working state, or a switch to a prior or safe mode software set.

In some cases, the system may be able to operate with no or only minimal loss of functional capability. In other cases, the reversion operation significantly restricts the system to a critical few functions.
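The CPU cooling example can be sketched as a small decision function. Everything here is an assumption made for illustration: the function name, the clock speeds, and the temperature thresholds are hypothetical and do not describe any particular processor.

```python
def select_clock_mhz(temp_c, full_mhz=3000, reduced_mhz=1200,
                     throttle_at_c=85.0, shutdown_at_c=100.0):
    """Pick a clock speed from the measured temperature (hypothetical thresholds).

    Below the throttle threshold the CPU runs at full speed. Between the two
    thresholds it falls back to a reduced speed. At or above the shutdown
    threshold it halts (0 MHz) to avoid damage.
    """
    if temp_c >= shutdown_at_c:
        return 0
    if temp_c >= throttle_at_c:
        return reduced_mhz
    return full_mhz


print(select_clock_mhz(60.0))   # 3000
print(select_clock_mhz(90.0))   # 1200
print(select_clock_mhz(105.0))  # 0
```

The middle branch is the reversion state: the system keeps working at reduced capability instead of failing outright.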

Summary

The ability of a system to continue operation despite a failure of any single element within the system implies the system is not in a series configuration.

There is some form of redundancy or some set of alternative means to continue operation. The system may use multiple redundant elements, or be resilient to changes in the system’s configuration.
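The difference between a series configuration and a redundant one shows up directly in the standard reliability block diagram formulas. The sketch below assumes independent element failures and uses a hypothetical element reliability of 0.9. The function names are chosen for this example.

```python
from math import prod


def series_reliability(reliabilities):
    """All elements must work: R = R1 * R2 * ... * Rn (independent failures assumed)."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """At least one element must work: R = 1 - (1 - R1) * ... * (1 - Rn)."""
    return 1 - prod(1 - r for r in reliabilities)


elements = [0.9, 0.9]  # hypothetical reliabilities of two elements
print(round(series_reliability(elements), 4))    # 0.81
print(round(parallel_reliability(elements), 4))  # 0.99
```

In series, either element failing brings the system down, so the system is less reliable than its elements. In parallel, the system tolerates the loss of one element and is more reliable than either alone.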

Creating a fault tolerant system often requires careful planning and an understanding of how elements fail and how each failure affects the surrounding elements.


Related:

Deciding What Should Have Fault Tolerance (article)

The Downside of a Fault Tolerant System (article)

Fault Tree Analysis 8 Step Process (article)

 


About Fred Schenkelberg

I am the reliability expert at FMS Reliability, a reliability engineering and management consulting firm I founded in 2004. I left Hewlett Packard (HP)’s Reliability Team, where I helped create a culture of reliability across the corporation, to assist other organizations.


© 2025 FMS Reliability · Privacy Policy · Terms of Service · Cookies Policy