Accendo Reliability

Your Reliability Engineering Professional Development Site

by Mike Sondalini

Beware of the Mean Time Between Failure Calculation Trap

An MTBF calculation is often done to generate an indicator of plant and equipment reliability. An MTBF value is the average time between failures. There are serious dangers with the use of MTBF that need to be addressed when you do an MTBF calculation.

Take a look at the diagram below representing a period in the life of an imaginary production line. What is the MTBF formula to use for the period of interest to represent the production line’s reliability over that time?

If MTBF is the ‘mean time between failure’ (MTBF applies to repairable systems; MTTF, Mean Time To Failure, applies to unrepairable systems) the MTBF formula would need to have time units in the top line and a count of failures on the bottom line.

In the diagram you will see the MTBF formula that I finally settled on: Mean Time Between Failure (MTBF) = Sum of Actual Operating Times 1, 2, 3, 4 divided by the Number of Breakdowns during the Period of Interest.
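As a minimal sketch, that formula can be written as a short Python function. The function name `mtbf` and the four operating durations are assumptions made up for illustration; they are not read from the diagram.

```python
from typing import Sequence


def mtbf(operating_hours: Sequence[float], breakdowns: int) -> float:
    """Sum of actual operating times divided by the number of breakdowns."""
    if breakdowns <= 0:
        # With no counted breakdowns the ratio is undefined, so refuse to guess.
        raise ValueError("MTBF is undefined when no breakdowns are counted")
    return sum(operating_hours) / breakdowns


# Hypothetical durations (hours) for Operate 1, 2, 3 and 4.
operate_hours = [400.0, 350.0, 500.0, 250.0]

# Two breakdowns in the period of interest, as in the article's timeline.
value = mtbf(operate_hours, breakdowns=2)
print(value)  # 750.0
```

Every input to this function is a choice: which hours count as operating time, which events count as breakdowns, and where the period starts and ends.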

But the MTBF value you get from that MTBF calculation changes depending on the choices you make.

To arrive at an MTBF equation there are assumptions and options to consider. For example, what event is, and is not, a ‘failure’? What power-on time do you consider to be equipment operating ‘time’? When do you start and end the period of interest for which you are doing the MTBF calculation?

Definition of Failure

To measure MTBF you need to count the failures. But some failures are out of your control and you cannot influence them: lightning strikes that fry equipment electronics, floods that cause short circuits, or a utility provider turning off the power or water supply. Do you include Acts of God in your MTBF calculation?

In reliability engineering a ‘failure’ is considered to be any unwanted or disappointing performance of the item/system being investigated. That definition leaves ‘failure’ wide-open to interpretation.

Is a ‘failure’ only ever a breakdown? Is a ‘failure’ any time the production line stops, no matter the cause? Is a power blackout caused by the utility provider a ‘failure’ you should count? Is an operator error that stops production but does no other harm a ‘failure’? Should you include all types of failures in your MTBF calculation, which will give you a shorter MTBF value? Or do you remove certain categories of stoppages when using an MTBF formula, which will give you a longer MTBF value? But which categories do you count, and which don’t you count?

If in the imaginary production plant timeline modelled above you included ‘Forced Outage 1’ along with the two breakdowns in the MTBF calculation, you would divide the same operating time by three failures instead of two and get an MTBF one-third lower. That is a substantial impact on the MTBF value.
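The one-third drop can be checked with a rough sketch like the one below. The total operating hours, the event labels, and the helper `mtbf_for` are hypothetical, and the sketch assumes the operating time stays the same whichever events are counted.

```python
# Hypothetical event log for the period of interest. The order is illustrative.
events = ["forced_outage", "breakdown", "breakdown"]

# Assumed total operating time in hours; not taken from the diagram.
operating_hours = 1500.0


def mtbf_for(counted_kinds: set) -> float:
    """MTBF when only the event kinds in `counted_kinds` count as failures."""
    failures = sum(1 for kind in events if kind in counted_kinds)
    if failures == 0:
        raise ValueError("no counted failures, so MTBF is undefined")
    return operating_hours / failures


breakdowns_only = mtbf_for({"breakdown"})
with_forced_outage = mtbf_for({"breakdown", "forced_outage"})

# Fractional reduction caused purely by the change in the failure definition.
drop = 1 - with_forced_outage / breakdowns_only

print(breakdowns_only)     # 750.0
print(with_forced_outage)  # 500.0
print(round(drop, 3))      # 0.333
```

Nothing about the plant changed between the two results. Only the definition of ‘failure’ did.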

To make sense of an MTBF calculation you need to know which ‘failures’ are included and which are not. You also need to understand why those choices were made.

Definition of Operating Time

When is an item of plant or equipment operating?

Equipment parts are degraded by the stresses applied to their atomic structure. The greater the stress suffered, the greater the resulting impact on the item’s operating life. When a vehicle is stopped at red traffic lights the engine is running under the least working load. The gearbox and the rest of the drive train are not in use. When parts are under no stress their atomic structure suffers no harm. When equipment working assemblies are under the least stress their parts last longer. For the MTBF calculation of the vehicle do you include its idle times, or just the times when it carries a working load high enough to stress the parts?

Would you consider the ‘equipment operating time’ for the MTBF calculation as any time it was turned on, or only when it was suffering under working loads? If in the MTBF formula you included all operating time from when the vehicle started, and not only when the parts were under working stress, your MTBF value would be higher. But that Mean Time Between Failure value would not be representative of those vehicles that are continually working and hardly ever idling.
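Here is a small sketch of how the choice of operating time changes the result. The `UsageRecord` class, the two vehicles, and all of the hours and failure counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Hypothetical usage summary for one vehicle over the same period."""
    powered_on_hours: float  # all time with the engine running
    idle_hours: float        # engine running but no working load
    failures: int

    @property
    def loaded_hours(self) -> float:
        # Time spent under working load only.
        return self.powered_on_hours - self.idle_hours


def mtbf_powered_on(record: UsageRecord) -> float:
    return record.powered_on_hours / record.failures


def mtbf_loaded(record: UsageRecord) -> float:
    return record.loaded_hours / record.failures


# Two vehicles of the same model with very different duty (made-up numbers).
city_taxi = UsageRecord(powered_on_hours=2000.0, idle_hours=800.0, failures=4)
haul_truck = UsageRecord(powered_on_hours=2000.0, idle_hours=100.0, failures=4)

print(mtbf_powered_on(city_taxi), mtbf_powered_on(haul_truck))  # 500.0 500.0
print(mtbf_loaded(city_taxi), mtbf_loaded(haul_truck))          # 300.0 475.0
```

On powered-on time the two vehicles look identical. On loaded time they do not, which is why the operating scenario has to be stated alongside the number.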

You cannot use MTBF as an indicator to compare the same equipment model, assembly number or part number if they are suffering under different working situations.

To make sense of an MTBF calculation you need to know the specific situation and operating scenarios being measured.

Selecting the Time Period

Because you count ‘failure’ events during a period of time in your MTBF calculation the period of interest selected affects the resulting MTBF value.

In the above timeline the period used in the MTBF reliability analysis runs through to the end of the second breakdown. If I had chosen to end the time period at the end of ‘Operate 4’, the second breakdown would not be counted in the MTBF calculation and I would double the MTBF value. By altering the date to exclude one failure event I doubled MTBF. See what magic you can do with MTBF.
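The doubling can be reproduced with a rough sketch. The layout of the timeline below, the ‘Planned Stop’ segment, and every duration are assumptions for illustration; only the segment names ‘Operate’, ‘Forced Outage 1’ and the two breakdowns come from the article.

```python
# Hypothetical timeline: (label, kind, duration in hours).
timeline = [
    ("Operate 1", "operate", 400.0),
    ("Forced Outage 1", "forced_outage", 20.0),
    ("Operate 2", "operate", 350.0),
    ("Planned Stop", "planned", 30.0),   # invented segment, not in the article
    ("Operate 3", "operate", 500.0),
    ("Breakdown 1", "breakdown", 40.0),
    ("Operate 4", "operate", 250.0),
    ("Breakdown 2", "breakdown", 60.0),
]


def mtbf_through(last_label: str) -> float:
    """MTBF when the period of interest ends at the end of `last_label`."""
    hours = 0.0
    breakdowns = 0
    for label, kind, duration in timeline:
        if kind == "operate":
            hours += duration
        elif kind == "breakdown":
            breakdowns += 1
        if label == last_label:
            break
    if breakdowns == 0:
        raise ValueError("no breakdowns in the period, so MTBF is undefined")
    return hours / breakdowns


print(mtbf_through("Breakdown 2"))  # 750.0
print(mtbf_through("Operate 4"))    # 1500.0
```

The operating hours are the same in both cases. Moving the end date back past one breakdown is all it takes to double the figure.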

Notice how the two equipment breakdowns are well to the right-hand end of the timeline. Even though the first breakdown happened long into the period of interest, the MTBF for the period does not recognise the dates of those failures. The MTBF was outstanding up until the first breakdown, then it dropped, and it dropped again with the second breakdown. An MTBF calculation presumes ‘failures’ are distributed evenly across the period, even though that is not the real historic truth. An MTBF value is hardly ever honest about what actually happened.
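One way to see what the single average hides is to recompute MTBF after each breakdown. This is only a sketch, and the operating hours between breakdowns are made-up values.

```python
# Hypothetical operating hours accumulated before each successive breakdown:
# a long trouble-free run, then a second breakdown soon after the first.
hours_between_breakdowns = [1250.0, 250.0]

running_mtbf = []
total_hours = 0.0
for count, gap in enumerate(hours_between_breakdowns, start=1):
    total_hours += gap
    # MTBF as it would have been reported right after breakdown number `count`.
    running_mtbf.append(total_hours / count)

print(running_mtbf)  # [1250.0, 750.0]
```

The final 750 hours matches neither of the actual gaps (1250 and 250 hours). The average says nothing about the fact that the breakdowns came close together late in the period.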

To make sense of an MTBF calculation you need to know the time period selected. You also need to know why that duration was used and not some other period.

Selecting the Equipment to Monitor

One more issue to consider with regard to MTBF is whether you measure a whole process or measure individual equipment within a process. A complete process suffers MTBF loss every time one of its critical items ‘fails’. If you have a problem piece of plant that brings down the MTBF performance of the whole process, the ‘bad actor’ needs to be flagged as the performance-destroying cause.

Companies that take the whole line/process into MTBF calculations often struggle to get a high MTBF due to ‘bad actors’ failing within the system being monitored. Those companies also need to measure individual equipment MTBF to identify the problem plant so its failure causes can be addressed and the ‘bad actor’ made more reliable.
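A minimal sketch of finding a ‘bad actor’ is shown below. The equipment names, failure counts, and operating hours are hypothetical, and it assumes every item runs for the same hours as the line and that any item failure stops the line.

```python
# Assumed operating hours for the whole line over the period of interest.
line_operating_hours = 1200.0

# Hypothetical failure counts per critical item on the line.
failures_by_item = {"filler": 1, "capper": 8, "conveyor": 1}

# Whole-line MTBF: every item failure counts against the line.
line_mtbf = line_operating_hours / sum(failures_by_item.values())

# Individual MTBF per item, skipping items with no failures (undefined).
item_mtbf = {
    item: line_operating_hours / count
    for item, count in failures_by_item.items()
    if count > 0
}

# The item with the lowest individual MTBF is the candidate bad actor.
bad_actor = min(item_mtbf, key=item_mtbf.get)

print(line_mtbf)   # 120.0
print(item_mtbf)   # {'filler': 1200.0, 'capper': 150.0, 'conveyor': 1200.0}
print(bad_actor)   # capper
```

The line figure of 120 hours says the process is unreliable. Only the per-item breakdown says where to look.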

How to Protect Yourself from the MTBF Calculation Trap

MTBF calculations are a statistical trap easily fallen into. An MTBF value can be a total fabrication. Some managers remove or add all kinds of MTBF parameters to make their department look good (like not counting certain ‘failures’, changing period lengths, and the like). But that is a falsehood. I once came across a company that did not count stoppages of less than 8 hours’ duration in their MTBF calculation. The stoppages weren’t classed as ‘breakdowns’, but they were forced outages over which the company had full control. What a joke, I thought. What an absolute rubbish way to run a business. You can never improve a company if people tell lies about its performance and hide the truth of where the troubles lie.

You need to get agreement across the company as to what can be called a ‘failure’, what can be called ‘operating time’ and what are the end points of the time period being analysed before you can use MTBF values as a believable Production Reliability KPI (Key Performance Indicator).

Maybe it is more sensible to have MTBF by categories, e.g. 1) mean time between machinery/equipment breakdowns caused by internal events, 2) mean time between operator-induced stoppages, 3) mean time between externally caused outages that you cannot control, like power or water loss, and so on.
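Reporting by category could be sketched as follows. The category names, the stoppage log, and the operating hours are assumptions for illustration only.

```python
from collections import Counter

# Assumed operating hours for the period of interest.
operating_hours = 1800.0

# Hypothetical stoppage log, one category label per stoppage.
stoppage_log = [
    "equipment", "operator", "external",
    "operator", "equipment", "operator",
]

counts = Counter(stoppage_log)

# Mean time between stoppages of each category.
mean_time_between = {
    category: operating_hours / count for category, count in counts.items()
}

# The single combined figure, for comparison.
combined = operating_hours / len(stoppage_log)

print(mean_time_between)
# {'equipment': 900.0, 'operator': 600.0, 'external': 1800.0}
print(combined)  # 300.0
```

Split this way, each figure points at a different owner and a different fix, where the combined 300 hours points at nothing in particular.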

Your second-best protection against misinterpreting and misunderstanding MTBF is to have honest, rigid rules covering the choices and options that arise when doing an MTBF calculation.

The very best protection is to also get the timeline of the period being analysed showing all the events (and their explanations) that happened, and then ask a lot of questions about the assumptions and decisions that were made, and not made, to arrive at those MTBF values.

All the best to you,

Mike Sondalini
Managing Director
Lifetime Reliability Solutions HQ

About Mike Sondalini

In engineering and maintenance since 1974, Mike’s career extends across original equipment manufacturing, beverage processing and packaging, steel fabrication, chemical processing and manufacturing, quality management, project management, enterprise asset management, plant and equipment maintenance, and maintenance training. His specialty is helping companies build highly effective operational risk management processes, develop enterprise asset management systems for ultra-high reliable assets, and instil the precision maintenance skills needed for world class equipment reliability.

Comments

  1. Abdulrahman Alkhowaiter says

    March 26, 2024 at 10:16 AM

    Great work going into deep details of how MTBF is best measured, and explaining the false statistics we have seen applied before. At least you did not put down the concept of using MTBF as a reliability indicator; it is not perfect but it is the best available reliability measurement criterion regarding Machinery or Instrumentation, Electronics, and Electrical devices.

    • Fred Schenkelberg says

      March 26, 2024 at 2:34 PM

      Hi Abdulrahman, please consider the reliability metric very suitable. A probability of failure over a specific duration with associated function and environment. We can estimate or calculate the reliability metric using a wide range of parametric and non-parametric methods.

      I agree with Mike in this article – all too often the MTxx values provided by vendors or even shared within an organization are less than useful even if calculated correctly and consistently. We are not interested in an average value that inherently makes major assumptions related to a constant hazard rate for our equipment. MTxx values rarely, very very rarely reflect the actual pattern of failures.

      We should use the best available data, that may take a little more work and better understanding compared to using MTxx value, yet provides a means to make better decisions.

      cheers,

      Fred

© 2025 FMS Reliability