Accendo Reliability

Your Reliability Engineering Professional Development Site


by Larry George

Multiple-Failure-Mode Reliability Estimation


“It is the policy of my Administration to respond to the coronavirus disease 2019 (COVID-19) pandemic through effective approaches guided by the best available science and data” [Biden Executive order, 2021]. That epidemic inspired the simultaneous nonparametric estimation of survival functions from case to recovery and case to death, without lifetime data (figure 1)!

Why not do the same for multiple-failure-mode data? This article shows nonparametric, multiple-failure-mode, maximum likelihood reliability estimation in a spreadsheet. The data are system first-failure times and the failure modes that caused those first system failures (table 1). However, those data are dependent. I will explain the log-likelihood function, lnL, and how to find the maximum likelihood reliability estimates for all failure modes simultaneously.

[Plot: survival function over time, conditional on death or cure]
Figure 1. US coronavirus maximum likelihood “survival” function estimates conditional on recovery (Cure) or Death

Some articles and statistical software use Kaplan-Meier estimators for each failure mode [Dignam et al., ReliaSoft-ReliaWiki, Minitab, Others?]. Unfortunately, the Kaplan-Meier estimator is for independent, identically distributed, perhaps censored data [Mailman School of Public Health, Prentice et al., Yizeng He et al., Therneau et al. 2024, Lin et al.]. Using Kaplan-Meier for each mode results in reliability estimates that are biased low.

Simultaneous estimates could be produced by constrained least squares; e.g., COVID-19 “survival” function estimation in two modes: recovery and death, from periodic case, recovery, and death counts (figure 1) [George 2021]. Why not maximize the multiple-mode likelihood function constrained by the percentages of failures in each mode [Prentice et al., George 2025]? Table 1 shows some multiple-mode, first-failure time data for example.

Table 1. Multiple-Mode, first-failure-time data (first three columns) [Rao, Meeker et al.] and likelihoods 

| Kilometers | Fail Mode | Censored? | L=Likelihood | lnL |
|---|---|---|---|---|
| 6700 | Mode1 | Failed | 0.01681 | -4.0857 |
| 6950 | Censored | Censored | 0.98319 | -0.017 |
| 7820 | Censored | Censored | 0.98319 | -0.017 |
| 8790 | Censored | Censored | 0.98319 | -0.017 |
| 9120 | Mode2 | Failed | 0.01836 | -3.9978 |
| 9660 | Censored | Censored | 0.96483 | -0.0358 |
| 9820 | Censored | Censored | 0.96483 | -0.0358 |
| 11310 | Censored | Censored | 0.96483 | -0.0358 |
| 11690 | Censored | Censored | 0.96483 | -0.0358 |
| 11850 | Censored | Censored | 0.96483 | -0.0358 |
| 11880 | Censored | Censored | 0.96483 | -0.0358 |
| 12140 | Censored | Censored | 0.96483 | -0.0358 |
| 12200 | Mode1 | Failed | 0.02032 | -3.8961 |
| 12870 | Censored | Censored | 0.94451 | -0.0571 |
| 13150 | Mode2 | Failed | 0.02168 | -3.8312 |
| 13330 | Censored | Censored | 0.92283 | -0.0803 |
| 13470 | Censored | Censored | 0.92283 | -0.0803 |
| 14040 | Censored | Censored | 0.92283 | -0.0803 |
| 14300 | Mode1 | Failed | 0.02225 | -3.8056 |
| 17520 | Mode1 | Failed | 0.02225 | -3.8056 |
| 17540 | Censored | Censored | 0.87833 | -0.1297 |
| 17890 | Censored | Censored | 0.87833 | -0.1297 |
| 18450 | Censored | Censored | 0.87833 | -0.1297 |
| 18960 | Censored | Censored | 0.87833 | -0.1297 |
| 18980 | Censored | Censored | 0.87833 | -0.1297 |
| 19410 | Censored | Censored | 0.87833 | -0.1297 |
| 20100 | Mode2 | Failed | 0.02783 | -3.5817 |
| 20100 | Censored | Censored | 0.85051 | -0.1619 |
| 20150 | Censored | Censored | 0.85051 | -0.1619 |
| 20320 | Censored | Censored | 0.85051 | -0.1619 |
| 20900 | Mode2 | Failed | 0.03086 | -3.4783 |
| 22700 | Mode1 | Failed | 0.02878 | -3.5482 |
| 23490 | Censored | Censored | 0.79087 | -0.2346 |
| 26510 | Mode1 | Failed | 0.02987 | -3.5107 |
| 27410 | Censored | Censored | 0.761 | -0.2731 |
| 27490 | Mode1 | Failed | 0.03108 | -3.4712 |
| 27890 | Censored | Censored | 0.72992 | -0.3148 |
| | | | lnL-> | -44.013 |

Spreadsheet Implementation

You could copy this article’s tables into your own spreadsheet, or you could ask for my spreadsheet with cell formulas. Some work is required to reproduce the formulas that generate the numbers in tables 1-4. The following explanation shows how to do it.

The dependence among failure modes is represented by the proportions of the 38 independent random samples in which each failure mode is the first failure (table 2). Kaplan-Meier estimates by failure mode do not yield these proportions. Some people have tried to use the failure mode as a factor in Kaplan-Meier, proportional hazards, reliability estimates [Agrawal et al.], using statistical software [e.g., Therneau “survival” R package].

Table 2. Failure proportions and counts out of the 38 units in table 1; the other 27 units were censored (no failure in either mode)

| Mode | Proportion | Failure count |
|---|---|---|
| Mode1 | 0.18421 | 7 |
| Mode2 | 0.10526 | 4 |
| Total units | | 38 |

For each observation time t, the likelihoods in table 1 consist of terms such as p(t;mode)*(1-F(t;other mode)) if there was a failure at age t, or (1-F(t;mode))*(1-F(t;other mode)) if the observation was censored. The p(t;.) is the discrete probability of failure at age t, and F(t;.) is the discrete cumulative distribution function of time or kilometers to failure. Reliability is 1-F(t;mode). Likelihood L is the product of all of the terms. The product could cause underflow, so maximize the sum of the logarithms of the likelihoods. For a more detailed explanation of inappropriate and appropriate likelihood functions for multiple-mode failure data, please refer to https://fred-schenkelberg-project.prev01.rmkr.net/statistical-software-problem/.
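The likelihood terms described above can be sketched in Python. This is only an illustration, not the author's spreadsheet: the function and variable names are hypothetical, the data restate table 1, and the p(t;mode) values are the rounded estimates from table 3. One assumption is flagged in the code: table 1 lists 37 rows, whose lnL values sum to about -43.7, while the text refers to 38 units with 27 censored and table 4 ends at 28100 km. Treating 28100 km as a 38th, censored unit brings the total close to the reported -44.013.

```python
import math

# First-failure data restated from table 1 (kilometers).
MODE1_FAILS = [6700, 12200, 14300, 17520, 22700, 26510, 27490]
MODE2_FAILS = [9120, 13150, 20100, 20900]
CENSORED = [6950, 7820, 8790, 9660, 9820, 11310, 11690, 11850, 11880, 12140,
            12870, 13330, 13470, 14040, 17540, 17890, 18450, 18960, 18980,
            19410, 20100, 20150, 20320, 23490, 27410, 27890,
            28100]  # ASSUMPTION: 38th unit, censored at 28100 km (last row of table 4)

# Rounded p(t;mode) estimates restated from table 3.
P1 = dict(zip(MODE1_FAILS,
              [0.01681, 0.02071, 0.0232, 0.0232, 0.03216, 0.03339, 0.03474]))
P2 = dict(zip(MODE2_FAILS, [0.01867, 0.02253, 0.03038, 0.03369]))


def reliability(p, t):
    """R(t;mode) = 1 - sum of p(s;mode) over failure ages s <= t."""
    return 1.0 - sum(prob for s, prob in p.items() if s <= t)


def log_likelihood(p1, p2):
    """Sum of log-likelihood terms over all units."""
    lnl = 0.0
    for t in MODE1_FAILS:   # mode 1 failed first: p(t;mode1) * R(t;mode2)
        lnl += math.log(p1[t] * reliability(p2, t))
    for t in MODE2_FAILS:   # mode 2 failed first: p(t;mode2) * R(t;mode1)
        lnl += math.log(p2[t] * reliability(p1, t))
    for t in CENSORED:      # no failure yet: R(t;mode1) * R(t;mode2)
        lnl += math.log(reliability(p1, t) * reliability(p2, t))
    return lnl


lnL = log_likelihood(P1, P2)
print(round(lnL, 3))
```

Because the p values are rounded to five decimals, the printed total can differ from the spreadsheet value in the last digits.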

Table 3 contains estimates of p(t;mode), and table 4 contains the corresponding reliability estimates R(t;mode)=1-F(t;mode) computed from table 3. Tables 3 and 4 are used in table 1 to construct the likelihoods row by row, manually, using either p(t;mode)*(1-F(t;other mode)) for a failure at age t or (1-F(t;mode1))*(1-F(t;mode2)) for a censored observation. Nested IF(IF()) statements could do the same, less clearly. Table 3 could be copied anywhere convenient in a spreadsheet. Table 4 should be copied so its rows line up with table 1 for convenience.

Table 3. Discrete probabilities of failures at ages (km) t: p(t;mode). The “KM p-h” p(t;mode) estimates are derived from the Kaplan-Meier proportional hazards (modes) estimates

| Km | p(t;mode1) | p(t;mode2) | KM p-h 1 | KM p-h 2 |
|---|---|---|---|---|
| 6700 | 0.01681 | 0 | 0.0263 | |
| 9120 | 0 | 0.01867 | | 0.0286 |
| 12200 | 0.02071 | 0 | 0.0364 | |
| 13150 | 0 | 0.02253 | | 0.0379 |
| 14300 | 0.0232 | 0 | 0.0435 | |
| 17520 | 0.0232 | 0 | 0.0435 | |
| 20100 | 0 | 0.03038 | | 0.0653 |
| 20900 | 0 | 0.03369 | | 0.0898 |
| 22700 | 0.03216 | 0 | 0.0899 | |
| 26510 | 0.03339 | 0 | 0.1077 | |
| 27490 | 0.03474 | 0 | 0.1437 | |
| Sums | 0.18421 | 0.10526 | 0.491 | 0.2216 |
| Proportions | 0.18421 | 0.10526 | | |

Table 4 is built from table 3 p(t;mode) as R(t;mode)=1-∑p(s;mode), s=1,2,…,t, at failure times t in each mode. F(t;mode) is constant between modal failure times. KM1 and KM2 are the Kaplan-Meier reliability estimates for each failure mode. Notice that the reliabilities R(28100 km;mode1)=0.81579=1-0.18421 and R(28100 km;mode2)=0.89474=1-0.10526 agree with the failure proportions in table 2. The Kaplan-Meier estimates KM1 and KM2 at 28100 km disagree with those proportions.
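The running-sum construction of the reliability columns can be sketched in a few lines of Python. The list names are illustrative, and the inputs are the rounded p(t;mode) values from table 3, so the last decimal can differ slightly from the spreadsheet.

```python
from itertools import accumulate

# Failure ages (km) and rounded p(t;mode) values restated from table 3.
km = [6700, 9120, 12200, 13150, 14300, 17520, 20100, 20900, 22700, 26510, 27490]
p_mode1 = [0.01681, 0, 0.02071, 0, 0.0232, 0.0232, 0, 0, 0.03216, 0.03339, 0.03474]
p_mode2 = [0, 0.01867, 0, 0.02253, 0, 0, 0.03038, 0.03369, 0, 0, 0]

# R(t;mode) = 1 - F(t;mode), where F is the running sum of p(s;mode) for s <= t.
r_mode1 = [1 - f for f in accumulate(p_mode1)]
r_mode2 = [1 - f for f in accumulate(p_mode2)]

for t, r1, r2 in zip(km, r_mode1, r_mode2):
    print(t, round(r1, 5), round(r2, 5))
```

The final values approximate 1 minus the table 2 proportions, which is the check described in the text.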

Table 4. R(t;mode)=1-F(t;mode) estimates, from the sums, F(t;mode)=∑p(s;mode) s=1,2,…,t. 

KmR(t;Mode1)R(t;Mode2)KM1KM2
01111
67000.9831910.973681
69500.9831910.973681
78200.9831910.973681
87900.9831910.973681
91200.983190.981330.973680.97059
96600.983190.981330.973680.97059
98200.983190.981330.973680.97059
113100.983190.981330.973680.97059
116900.983190.981330.973680.97059
118500.983190.981330.973680.97059
118800.983190.981330.973680.97059
121400.983190.981330.973680.97059
122000.962480.981330.936230.97059
128700.962480.981330.936230.97059
131500.962480.95880.936230.93015
133300.962480.95880.936230.93015
134700.962480.95880.936230.93015
140400.962480.95880.936230.93015
143000.939280.95880.889420.93015
175200.916080.95880.842610.93015
175400.916080.95880.842610.93015
178900.916080.95880.842610.93015
184500.916080.95880.842610.93015
189600.916080.95880.842610.93015
189800.916080.95880.842610.93015
194100.916080.95880.842610.93015
201000.916080.928420.842610.85263
201000.916080.928420.842610.85263
201500.916080.928420.842610.85263
203200.916080.928420.842610.85263
209000.916080.894740.842610.74606
227000.883910.894740.722240.74606
234900.883910.894740.722240.74606
265100.850530.894740.577790.74606
274100.850530.894740.577790.74606
274900.815790.894740.385190.74606
278900.815790.894740.385190.74606
281000.815790.894740.385190.74606

Figure 2 shows the bias in the Kaplan-Meier reliability proportional-hazards estimates (KMPH) compared with the maximum likelihood estimates. The lines are the maximum likelihood reliability estimates, and the dots are the Kaplan-Meier proportional hazards estimates.  

plot of reliability over kilometers with the two different failure modes and reliability estimates
Figure 2. Table 4 reliability function estimates by failure mode

Figure 3 shows the nonparametric maximum likelihood discrete failure rate function estimates, at times (kilometers) of failures by failure mode. The proportional hazards assumption seems reasonable, even if the resulting reliability estimates in figure 2 are biased. That bias arises because independent, Kaplan-Meier, proportional hazards estimates do not account for the dependence in proportions of modal failures.

plot of kilometers and actuarial failure rate with the two failure modes shown with different colors
Figure 3. Actuarial (conditional) discrete failure rate (hazard rate) maximum likelihood estimates

Maximum Likelihood Estimation

Use Excel Solver to maximize lnL (below table 1) as a function of the p(t;mode) values in table 3. Set a constraint that the sums of failure mode probabilities, ∑p(s;mode), equal the proportions in table 2. I tried doing the maximization on table 4 directly, but Excel complained about too many variables.
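For readers without Excel Solver, here is one possible Python sketch of the same constrained maximization. It is not the author's Solver setup; it swaps in a Lagrange-multiplier argument. The assumptions: the log-likelihood separates by mode (each term is a product of one mode-1 factor and one mode-2 factor), and each mode is rewritten in terms of discrete hazards h_j = p_j/R_{j-1}. The stationarity condition then gives h_j = d_j/(n_j + λ), with d_j failures in the mode and n_j units at risk at the j-th failure age, and λ is found by bisection so that the final reliability equals 1 minus the table 2 proportion. All names are hypothetical, and the unit censored at 28100 km is an assumption taken from the last row of table 4.

```python
# Data restated from table 1 (kilometers); names are illustrative.
MODE1_FAILS = [6700, 12200, 14300, 17520, 22700, 26510, 27490]
MODE2_FAILS = [9120, 13150, 20100, 20900]
CENSORED = [6950, 7820, 8790, 9660, 9820, 11310, 11690, 11850, 11880, 12140,
            12870, 13330, 13470, 14040, 17540, 17890, 18450, 18960, 18980,
            19410, 20100, 20150, 20320, 23490, 27410, 27890,
            28100]  # ASSUMPTION: 38th unit, censored at 28100 km (table 4)


def constrained_mle(fail_times, other_times, target_f):
    """Find lam so that prod(1 - d_j/(n_j + lam)) = 1 - target_f; return lam and p."""
    all_times = fail_times + other_times
    ages = sorted(set(fail_times))
    d = [fail_times.count(t) for t in ages]                  # failures at each age
    n = [sum(1 for s in all_times if s >= t) for t in ages]  # units at risk

    def final_reliability(lam):
        r = 1.0
        for dj, nj in zip(d, n):
            r *= 1.0 - dj / (nj + lam)
        return r

    target_r = 1.0 - target_f
    lo, hi = 0.0, 1.0
    if final_reliability(lo) > target_r:
        raise ValueError("sketch assumes the target is above the Kaplan-Meier value")
    while final_reliability(hi) < target_r:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):                      # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        if final_reliability(mid) < target_r:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    p, r = {}, 1.0
    for t, dj, nj in zip(ages, d, n):         # convert hazards back to p(t;mode)
        h = dj / (nj + lam)
        p[t] = h * r
        r *= 1.0 - h
    return lam, p


n_units = len(MODE1_FAILS) + len(MODE2_FAILS) + len(CENSORED)
lam1, p1 = constrained_mle(MODE1_FAILS, MODE2_FAILS + CENSORED,
                           len(MODE1_FAILS) / n_units)
lam2, p2 = constrained_mle(MODE2_FAILS, MODE1_FAILS + CENSORED,
                           len(MODE2_FAILS) / n_units)
for t in sorted(p1):
    print(t, round(p1[t], 5))
```

With λ = 0 the formula reduces to the ordinary Kaplan-Meier hazards d_j/n_j, so a positive multiplier shows directly how the proportion constraint shrinks every hazard. For the table 1 data the resulting p values should land close to those in table 3.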

Recommendations

Don’t use the Kaplan-Meier reliability function estimates for multiple-mode first-failure times, because they don’t account for the observed proportional dependence among the failure modes. They are biased low because, at the older failure times, each Kaplan-Meier reliability estimate is multiplied by (1-1/survivors)->0 as the number of survivors decreases.
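The shrinking (1-1/survivors) factors can be seen in a short Python sketch of the per-mode Kaplan-Meier (product-limit) calculation, which treats failures in the other mode as censored. The function name is hypothetical, the data restate table 1, and the unit censored at 28100 km is an assumption taken from the last row of table 4.

```python
# Data restated from table 1 (kilometers); names are illustrative.
MODE1_FAILS = [6700, 12200, 14300, 17520, 22700, 26510, 27490]
MODE2_FAILS = [9120, 13150, 20100, 20900]
CENSORED = [6950, 7820, 8790, 9660, 9820, 11310, 11690, 11850, 11880, 12140,
            12870, 13330, 13470, 14040, 17540, 17890, 18450, 18960, 18980,
            19410, 20100, 20150, 20320, 23490, 27410, 27890,
            28100]  # ASSUMPTION: 38th unit, censored at 28100 km (table 4)


def kaplan_meier(fail_times, other_times):
    """Product-limit estimate; every observation not in fail_times is censored."""
    all_times = fail_times + other_times
    r, curve = 1.0, []
    for t in sorted(set(fail_times)):
        at_risk = sum(1 for s in all_times if s >= t)   # survivors just before t
        deaths = fail_times.count(t)
        r *= 1.0 - deaths / at_risk                     # the (1 - 1/survivors) factor
        curve.append((t, at_risk, r))
    return curve


km1 = kaplan_meier(MODE1_FAILS, MODE2_FAILS + CENSORED)
km2 = kaplan_meier(MODE2_FAILS, MODE1_FAILS + CENSORED)
for t, at_risk, r in km1:
    print(t, at_risk, round(r, 5))
```

At the last mode-1 failure only three units remain at risk, so that single failure multiplies the estimate by 2/3.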

Doubt statistical software that allows specifying parametric distributions for each failure mode. It may use parameter estimates in each failure mode separately without accounting for dependence. I failed to make the maximum likelihood estimation work for Weibull reliability functions on another data set with the proportionality constraint.

One paper describes using a bivariate Weibull distribution estimate, analogous to the Marshall-Olkin bivariate exponential distribution. That accounts for simultaneous failures, if any [Agrawal et al.].

Why not do multiple-mode reliability estimation, without lifetime data? Joe Biden says so [Executive Order]! Even with multiple-mode data? Generally Accepted Accounting Principles require statistically sufficient data, and it’s population data! It is reasonable to count failures by failure mode; e.g., product sales, BoMs, and simultaneous spare parts’ sales or demands; coronavirus case, recovery, and death counts [George 2025].

References

Aakash Agrawal, Debanjan Mitra, and Ayon Ganguly, “A Model for Censored Reliability Data with Two Dependent Failure Modes and Prediction of Future Failures,” arXiv:2206.12892v1 [stat.ME], 26 Jun 2022 

Joseph Biden, “Executive Order on Ensuring a Data-Driven Response to COVID-19 and Future High-Consequence Public Health Threats,” Jan. 21, 2021

James J. Dignam, Qiang Zhang, and Maria N. Kocherginsky, “The Use and Interpretation of Competing Risks Regression Models,” Clin Cancer Res.;18(8):2301–2308. doi: 10.1158/1078-0432.CCR-11-2097, Jan. 2012

L. L. George, “COVID-19 Survival Analysis,” https://sites.google.com/site/fieldreliability/corona-virus-survival-analysis/ Jan. 2021

L. L. George, “Statistical Software Problem,” https://fred-schenkelberg-project.prev01.rmkr.net/statistical-software-problem/#more-582018/, Jan. 2025

Yizeng He, Kwang Woo Ahn , Ruta Brazauskas, “A Review of Competing Risks Data Analysis,” https://www.mcw.edu/-/media/MCW/Departments/Biostatistics/TR070.pdf/  

Guixian Lin, Ying So, Gordon Johnston, “Analyzing Survival Data with Competing Risks Using SAS® Software,” SAS Global Forum 2012

Meeker, William Q. and Luis A. Escobar, “Degradation Data, Models, and Data Analysis,” 2003, http://www.public.iastate.edu/~wqmeeker/stat533stuff/psnups/chapter13_psnup.pdf

E. L. Kaplan and P. Meier, “Nonparametric Estimation from Incomplete Observations,” J. Amer. Statist. Assn., Vol. 53, pp. 457-481, 1958

Mailman School of Public Health, “Competing Risk Analysis,” Columbia U. Irving Medical Center, https://www.publichealth.columbia.edu/research/population-health-methods/competing-risk-analysis#/

Ross L. Prentice, J. D. Kalbfleisch, A. V. Peterson, N. Flournoy, V. T. Farewell, and N. Breslow, “The Analysis of Failure Time Data in the Presence of Competing Risks,” Biometrics, Vol. 34, pp. 541-554, 1978

Ross L. Prentice and J. D. Kalbfleisch, “Hazard Rate Models with Covariates,” Biometrics, Vol. 35, No. 1, Perspectives in Biometry, pp. 25-39, March 1979

Reliasoft-Reliawiki, “Competing Failure Modes (CFM) Analysis,” https://www.reliawiki.com/index.php/Competing_Failure_Modes_Analysis/, Sept. 2023

Terry Therneau, “Package ‘survival,’” https://github.com/therneau/survival/, Dec. 2024

Terry Therneau, Cynthia Crowson, and  Elizabeth Atkinson, “Multi-state models and competing risks,” https://cran.r-project.org/web/packages/survival/vignettes/compete.pdf/, December 17, 2024


About Larry George

UCLA engineer and MBA, UC Berkeley Ph.D. in Industrial Engineering and Operations Research with minor in statistics. I taught for 11+ years, worked for Lawrence Livermore Lab for 11 years, and have worked in the real world solving problems ever since for anyone who asks. Employed by or contracted to Apple Computer, Applied Materials, Abbott Diagnostics, EPRI, Triad Systems (now http://www.epicor.com), and many others. Now working on actuarial forecasting, survival analysis, transient Markov, epidemiology, and their applications: epidemics, randomized clinical trials, availability, risk-based inspection, Statistical Reliability Control, and DoE for risk equity.


Comments

  1. Larry George says

    January 14, 2025 at 10:36 AM

    This is a link to the Google Sheet that is described in the Multiple-Failure-Mode article;
    https://docs.google.com/spreadsheets/d/1iH-u71STujVm-C7S2bp3xmoJBOxLOFC9xfBQn-10ndA/edit?usp=sharing/
    You can see the cell formulas in that sheet, but Google won’t do the optimization to find the max. likelihood nonparametric reliability estimates (already shown, for the input data in table 1).

  2. Shishir Rao says

    February 17, 2025 at 12:14 PM

    Copying and pasting my email response here.

    ====================================================

    Hello Dr. George,

    My apologies for the late reply. I was caught up with some work and couldn’t get the time to look into this.

    Thanks for writing the article! I do have some thoughts on it that I wanted to share with you.

    Suppose we have only 2 modes of failure – mode 1 and mode 2. I am assuming that you are trying to answer questions of the type: What is the probability of failure from mode 1 by time “t” AND this failure occurs before mode 2? If my assumption is right, then this is the exact definition of the cumulative incidence function, also called the sub-distribution function. Using the Kaplan-Meier estimator in a competing risk context is different from using the cumulative incidence function. The Kaplan-Meier estimator for mode 1 failures is calculated by treating mode 2 failures as “censored”. This would give us the probability of a mode 1 failure in a hypothetical world where failures can occur ONLY due to mode 1 and not due to mode 2. Unless mode 2 is eliminated by changing design, we are still interested in a world where modes 1 and 2 are both possible. I agree that we should not use the Kaplan-Meier approach in a real-world scenario where both modes are active. But I am a bit confused as to why you seem to equate the Kaplan-Meier approach with the cumulative incidence function, since both are different. The cumulative incidence of mode 1 takes into account the rate at which mode 2 failures are occurring. Also, the non-parametric version of the cumulative incidence function can be estimated without making any assumptions of the joint distribution of the failure times of the 2 modes of failures. So, we are not making any assumptions of independence of the time-to-failure of the two modes.

    Now, coming back to the original question of interest: What is the probability of failure from mode 1 by time “t” AND this failure occurs before mode 2? I see that you have used a likelihood-based approach and have constrained the maximization to the observed proportions of the two failure modes. I have to admit, I am not familiar with this approach to answering the above question. Doesn’t mean I don’t think it is correct! Just that I am not sure why we would have 2 different answers to the same probabilistic question. Unless you are trying to answer a different question altogether?

    Would love to hear your thoughts on this.

    Regards,
    Shishir.

    PS: Do you mind if I post this comment on your article? I think this is a good discussion and others might benefit from it and there is a possibility that others might chip in their 2 cents.

    ====================================================

    • Larry George says

      February 19, 2025 at 5:03 PM

      Thanks. I was unfamiliar with the cumulative incidence function and was trying to estimate the conditional reliability functions R(t|failure mode 1) and R(t|failure mode 2) and their joint distribution function to test for independence.
      I asked Google AI, “What’s wrong with Kaplan-Meier for multi-mode reliability estimation from censored lifetime data?”
      Google AI replied: “Using a Kaplan-Meier reliability estimator with multiple mode lifetime failure data is problematic because the Kaplan-Meier method is designed to analyze a single failure mode, meaning it cannot accurately capture the different failure mechanisms present in a dataset with multiple modes of failure, potentially leading to misleading results when trying to estimate overall reliability.
      Key reasons why Kaplan-Meier is not suitable for multiple mode failure data:
      Assumption of a single failure process:
      The Kaplan-Meier method assumes that all failures occur due to a single underlying process, while multiple mode failure data implies different failure mechanisms with distinct failure times, violating this assumption.
      Inability to differentiate failure modes:
      When analyzing multiple failure modes with Kaplan-Meier, you cannot identify which failure mode caused each event, leading to an inaccurate representation of the overall reliability.
      Misinterpretation of survival curve:
      The Kaplan-Meier curve generated from multiple mode data could be misinterpreted as representing a single failure mode, potentially leading to incorrect conclusions about system reliability.
      Alternative approaches for multiple mode failure data:
      Competing risks analysis:
      This method explicitly accounts for different competing failure modes by estimating the probability of each failure mode occurring, given that another mode has not already occurred.
      Failure mode specific analysis:
      Analyze each failure mode separately using Kaplan-Meier, but interpret the results carefully considering the potential interactions between different modes.
      Parametric models:
      If the underlying distributions of each failure mode are known, use parametric models that can incorporate multiple failure mechanisms.”

      • Shishir Rao says

        February 19, 2025 at 6:41 PM

        Hello Dr George, I would also suggest looking into the issue of “non-identifiability” of dependence among failure modes in competing risks (also called “identifiability dilemma”). It basically says that, given what we are able to observe in a competing risks framework (T, j), where “T” is the time to failure and “j” is the failure mode, it is not possible to say anything about the dependence structure between the risks. The observed data could have come from a joint distribution with strong, weak or no dependence at all – we cannot say.

        • Larry George says

          February 20, 2025 at 11:08 AM

          This Google Sheet shows the joint probability density max. likelihood estimates of {p(s|mode 1), p(t|mode 2)} in table 3 of
          https://docs.google.com/spreadsheets/d/1iH-u71STujVm-C7S2bp3xmoJBOxLOFC9xfBQn-10ndA/edit?usp=sharing/
          It’s for age s and t values up to the oldest failures in either failure mode.
