Statistical Software Problem?

When a system fails for the first time in one failure mode at time t, that observation is right-censored data for every other failure mode! How can we estimate reliability functions for all failure modes from first-failure data?

Google AI says, “’Competing risks’ refers to a statistical scenario where a subject can experience failure from multiple possible causes, but once one failure occurs, it prevents the observation of any other potential failures, essentially creating “multiple failure modes” that compete with each other to be the first event observed; this means analyzing the probability of a specific failure type needs to account for the possibility of other competing failures happening first.” “Use appropriate statistical methods: Employ statistical models specifically designed for competing risks analysis…”

Minitab uses the Kaplan-Meier nonparametric reliability estimator for each failure mode [Kaplan and Meier, Schenkelberg]. Using only first failure times for each mode ignores other failure modes’ survival times after being censored by the first part’s failure. The Kaplan-Meier reliability estimates for each failure mode are biased too low. Others have noticed this problem [Mailman School of Public Health].

History of Minitab?

Minitab started in 1972 with a NIST computer program OMNITAB. “The documentation for the latest version of OMNITAB, OMNITAB 80, was last published in 1986, and there has been no significant development since then,” [https://en.wikipedia.org/wiki/Minitab/]. Cofounder and mathematician Barbara Ryan is still CEO of Minitab. Minitab’s scope has broadened into non-statistical subjects such as ERP. Minitab has a nice user interface, which may mislead users in censored, multiple-mode reliability estimation. Minitab is advertising for a “Statistical/Analytics Consultant” in Coventry UK.

Jonathan, of Minitab Technical Support, wrote to me, “Minitab does provide a way to analyze multiple failure modes as part of Parametric Distribution Analysis. You can click the FMode button to enter the failure mode information. Also, you can click on the Results button and check the box for Display analyses for individual failure modes according to display of results to display full results for each failure mode.” I suspect Minitab has estimation problems with both its parametric and nonparametric multiple-mode distribution choices.

Here’s Some Data!

I estimated nonparametric reliability functions simultaneously from table 1 data for each failure mode. I accounted for censoring by using maximum likelihood, and I accounted for dependence by constraining the maximization to yield the proportions of first failures in table 2. I compared those maximum likelihood estimates with the independent Kaplan-Meier estimates for each failure mode from the first failure time and mode data.

Table 1. Data are ages at first failure mode detected at times of overhaul.

Unit	Overhaul Hours	Failure Mode
80	31995	Gears
79	31677	Camshaft
78	31597	Head
77	31205	PC
76	31087	Bearings
Etc.
6	10394	PC
5	10317	Head
4	8970	PC
3	8900	Camshaft
2	8553	Gears
1	8042	Gears

Table 2. Proportions of each failure mode as first failure out of 80 first failures

Bearings	Camshaft	Gears	Head	PC	Crankshaft
0.0500	0.2125	0.2250	0.1625	0.3125	0.0375
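A minimal sketch of how the table 2 proportions follow from first-failure modes. Table 1 is elided in this article, so the list of modes is rebuilt from the counts implied by table 2 (proportion times 80); the variable names are hypothetical and only for illustration.

```python
from collections import Counter

# Counts implied by table 2 (proportion x 80 first failures). The full
# table 1 is not reproduced in the article, so the mode list is rebuilt here.
implied_counts = {
    "Bearings": 4, "Camshaft": 17, "Gears": 18,
    "Head": 13, "PC": 25, "Crankshaft": 3,
}
first_failure_modes = [m for m, c in implied_counts.items() for _ in range(c)]

# Tally the first-failure modes and divide by the number of units.
tally = Counter(first_failure_modes)
n_units = len(first_failure_modes)
proportions = {mode: tally[mode] / n_units for mode in implied_counts}

for mode, prop in proportions.items():
    print(f"{mode}\t{prop:.4f}")
```

These proportions are the constraint targets used later in the constrained maximization.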

What’s Wrong?

The Kaplan-Meier estimator assumes the lifetime data are independent, identically distributed samples of failures and survivors’ ages. The Kaplan-Meier likelihood function is

$$\prod_{t} \binom{n(t)}{d(t)}\, a(t)^{d(t)}\,\bigl(1-a(t)\bigr)^{n(t)-d(t)} = \prod_{t} \frac{n(t)!}{d(t)!\,\bigl(n(t)-d(t)\bigr)!}\, a(t)^{d(t)}\,\bigl(1-a(t)\bigr)^{n(t)-d(t)},$$

where a(t) are actuarial failure rates conditional on survival to age t, d(t) counts the single-mode failures at age t, and n(t) counts the other-mode survivors to age t. The product is over all observed failure ages t=1,2,…,80. Each term of this likelihood function is the binomial probability of d(t) failures at age t out of n(t) survivors up to age t. It does not account for multiple failure modes and first failures.
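For readers who want to see the single-mode calculation, here is a minimal sketch of the Kaplan-Meier estimate for one failure mode, treating a first failure in any other mode as a censored observation. It assumes the standard result that the binomial likelihood is maximized by a(t) = d(t)/n(t), so reliability is the running product of 1 − d(t)/n(t). The function name and the five-unit data set are hypothetical; they are not the table 1 data, and this is not Minitab's code.

```python
from itertools import groupby

def kaplan_meier(ages, modes, mode_of_interest):
    """Return [(age, reliability)] for one failure mode.

    First failures in any other mode are treated as right-censored:
    they leave the risk set but do not reduce the reliability estimate.
    """
    records = sorted(zip(ages, modes))
    n_at_risk = len(records)          # n(t): units surviving to age t
    reliability = 1.0
    curve = []
    for age, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        d = sum(1 for _, m in group if m == mode_of_interest)  # d(t)
        if d > 0:
            reliability *= 1.0 - d / n_at_risk   # a(t) = d(t)/n(t)
            curve.append((age, reliability))
        n_at_risk -= len(group)       # failures and censored units both leave
    return curve

# Hypothetical five-unit example (not the table 1 data).
toy_ages = [100, 200, 300, 400, 500]
toy_modes = ["Gears", "PC", "Gears", "Head", "Gears"]
gears_curve = kaplan_meier(toy_ages, toy_modes, "Gears")
for age, r in gears_curve:
    print(age, round(r, 4))
```

Note that the estimate falls to zero when the last unit at risk fails in the mode of interest, which is one way the single-mode estimate can look pessimistic.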

With multiple failure modes and first-failure times, the data are statistically dependent. The likelihood function for multiple-mode first failures at age t (table 1, Bearings for example) is the product of terms such as

$$p(t;\ \text{Bearings}) \prod_{\text{mode} \ne \text{Bearings}} \bigl(1 - F(t;\ \text{mode})\bigr),$$

where p(t; Bearings) denotes the probability density function of bearings’ ages at failure t, 1-F(t; mode) denotes the survival of each other part, and the product ∏(1-F(t; mode)) is over all other failure modes (Gears, Camshaft, Head, PC, and Crankshaft). I.e., the likelihood function is

$$\prod_{t} \Bigl[\, p(t;\ \text{mode } k) \prod_{j \ne k} \bigl(1 - F(t;\ \text{mode } j)\bigr) \Bigr], \quad \text{for first-failure times } t = 1, 2, \ldots, 80.$$

Each multiplicand of the likelihood function is the probability of the first part failing at age t AND all other parts surviving past age t. The difference from the Kaplan-Meier likelihood is that this multiple-mode likelihood explicitly represents that all other parts’ failures, or failure modes, occur at ages older than t.

Excel Solver Maximizes Log Likelihood

I made an Excel spreadsheet with columns for each failure mode (Gears, Bearings, Camshaft, Head, PC, and Crankshaft) and rows for each of the 80 failure times in table 1. Each cell entry contains, for example

p(t; Bearings)*(1-F(t; Gears))*(1-F(t; Camshaft))*(1-F(t; Head))*(1-F(t; PC))*(1-F(t; Crankshaft)).

Then I multiplied row cells together in another column. The likelihood is the product of each row’s products. Its value was too small for Excel computation, so I summed log likelihoods and maximized that sum by varying the cell contents p(t; mode) (and consequently F(t; mode)=∑p(s; mode), s=1,2,…,t). I constrained the Excel Solver maximization to make the sum of the p(t; mode) estimates for each mode equal to that mode’s observed proportion in table 2, to represent the dependence among failure modes. Figure 1 shows the maximum likelihood reliability function estimates for each failure mode.
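The objective that Solver maximizes can be sketched outside a spreadsheet. The following is a minimal sketch, not the spreadsheet itself: it evaluates the summed log likelihood for a candidate table of p(t; mode) values and checks the proportion constraint. It assumes F(t; mode) is the running sum of p(s; mode) over ages s up to and including t. The function names and the three-unit, two-mode data are hypothetical, and the candidate shown is only a feasible starting point, not the maximum.

```python
import math
from itertools import accumulate

def log_likelihood(first_failures, p):
    """Sum over units of ln p(t; failed mode) + sum of ln(1 - F(t; other mode)).

    first_failures: list of (age_index, mode), one pair per unit.
    p: dict mode -> list of candidate probabilities p(t; mode) by age_index.
    """
    # F(t; mode) = running sum of p(s; mode) for s <= t (assumption).
    F = {mode: list(accumulate(probs)) for mode, probs in p.items()}
    total = 0.0
    for t, failed in first_failures:
        total += math.log(p[failed][t])            # the part that failed first
        for mode in p:
            if mode != failed:
                total += math.log(1.0 - F[mode][t])  # all others survive past t
    return total

def satisfies_proportions(p, targets, tol=1e-9):
    """Check that each mode's p(t; mode) values sum to its observed proportion."""
    return all(abs(sum(p[m]) - targets[m]) < tol for m in targets)

# Hypothetical example: three units, modes "A" and "B" (not the table 1 data).
toy_failures = [(0, "A"), (1, "B"), (2, "A")]
toy_targets = {"A": 2 / 3, "B": 1 / 3}      # observed first-failure proportions
toy_p = {"A": [1 / 3, 0.0, 1 / 3], "B": [0.0, 1 / 3, 0.0]}

print(satisfies_proportions(toy_p, toy_targets))
print(log_likelihood(toy_failures, toy_p))
```

Working with the sum of logarithms avoids forming the product of 80 small probabilities. A constrained optimizer would then vary the nonzero entries of p, subject to the proportion check, to increase the returned value.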

Plot of maximum likelihood reliability curves for each of the six components individually

Figure 1. Maximum likelihood reliability estimates account for multiple failure modes.

Notice the approximately annual changes in the reliability estimates at about 11,000, 22,000, and 33,000 hours? (An average year is 8766 hours, and a year plus one quarter is 10,957.5 hours.) These changes may be failures detected on overhauls. (I confess these estimators in figure 1 should show step functions changing at observed first failure times. It’s awkward to make Excel show step functions instead of sloping lines between data points.)

Compare with Kaplan-Meier?

Kaplan-Meier failure mode reliability estimates do not show approximately annual behavior and underestimate reliability, because they presume independence and ignore survivals after first failure. The 80 failure times in the original data are for 80 units’ first-failure ages and all the other parts survive. Figures 2-6 show that the Kaplan-Meier estimates are pessimistic compared with the maximum likelihood estimates that include terms for first failure times and survival of all other failure modes.

Minitab’s use of the Kaplan-Meier estimator ignores the dependence among the failure modes represented by their proportions of the data, table 2, even though it counts survivors out of 80 in other failure modes. The distributions of ages at first failures are dependent, and ignoring that dependence biases reliability estimates, parametric as well as nonparametric [Reliawiki]. Others have used failure mode as a factor in proportional hazards models using the Kaplan-Meier proportional hazards estimator in the R package “survival” [Rao].

plot of reliability for gears using maximum likelihood and Kaplan-Meier approaches, and the KM approach shows lower reliability values than using MLE

Figure 2. Gears maximum likelihood and GearsKM Kaplan-Meier reliability estimates

reliability plot of Camshaft using MLE and KM - again showing KM with lower reliability for the same data.

Figure 3. Camshaft maximum likelihood and CamshaftKM Kaplan-Meier reliability estimates. Vertical axis reliability scale is from 0 to 1.

reliability plot of Bearings using MLE and KM, again KM has lower reliability

Figure 4. Bearings maximum likelihood and BearingsKM Kaplan-Meier reliability estimates

reliability plot of PC using MLE and KM, again with KM showing less reliability

Figure 5. PC maximum likelihood and PCKM Kaplan-Meier reliability estimates. Vertical axis reliability scale is now 0 to 1.0.

reliability plot of head data using MLE and KM, and KM results in lower reliability values

Figure 6. Head maximum likelihood and HeadKM Kaplan-Meier reliability estimates. Vertical axis reliability scale is now 0 to 1.0.

Recommendations

Don’t use the Kaplan-Meier estimator from cohorts of installed base and grouped failure counts when cohort sizes vary over time. Don’t use Greenwood’s variance estimate [Greenwood, George 2022-2024].

Don’t use Minitab’s Kaplan-Meier estimator for nonparametric multiple-mode, censored reliability estimation. The Kaplan-Meier estimator is pessimistic for first-failure times in multiple-mode reliability estimation, because it does not account for dependence and the subsequent survival in other failure modes.

Prevent innocent misuse of Minitab. Techsupport@minitab.com says, “We appreciate your feedback and are always interested in ideas for improving the software!” Maybe I can get that job in Coventry.

References by George

“Fred’s Bicycles and Kaplan-Meier Error,” Nov. 2024

“Kaplan-Meier Ignores Cohort Variability,” Oct. 2024

“What Price Kaplan-Meier?” Jan. 2022

“Variance of the Kaplan-Meier Estimator?” March 2023

“Covariance of Kaplan-Meier Estimators?” March 2023

References

M. Greenwood, “The Natural Duration of Cancer,” Reports on Public Health and Medical Subjects, Vol. 33, pp. 1–26, His Majesty’s Stationery Office, London, 1926

E. L. Kaplan and P. Meier, “Nonparametric Estimation from Incomplete Observations,” J. Amer. Statist. Assn., Vol. 53, pp. 457-481, 1958

Mailman School of Public Health, “Competing Risk Analysis,” Columbia U. Irving Medical Center, https://www.publichealth.columbia.edu/research/population-health-methods/competing-risk-analysis#/

Reliawiki, “Competing Failure Modes (CFM) Analysis,” https://www.reliawiki.com/index.php/Competing_Failure_Modes_Analysis/, Sept. 2023

Shishir Rao, “Competing Risks in Failure Time Data,” https://fred-schenkelberg-project.prev01.rmkr.net/competing-risks-in-failure-time-data/, Nov. 2024

Fred Schenkelberg, “Kaplan-Meier Reliability Estimator,” Accendo Weekly Update, https://fred-schenkelberg-project.prev01.rmkr.net/kaplan-meier-reliability-estimator/#more-25994/, 2019

Comments

André-Michel Ferra says
December 31, 2024 at 8:55 AM
Always a pleasure to read from you Larry. Thanks for your generosity and will to share your knowledge! Happy new year!
- Larry George says
  December 31, 2024 at 10:49 AM
  Thanks. I dislike being critical of statistical software, but… Jonathan, of Minitab Technical Support, today emailed me that he forwarded the article to developers for their consideration. Let me know if you would like the spreadsheet example that I will describe to readers, if Shishir Rao, Suncor, Calgary, allows.
Larry George says
January 10, 2025 at 1:57 PM
“Understanding underlying principles is critical to be able to interpret results and troubleshoot issues…If the only way you know how to deal with data science problems is to use a piece of software, then you’re going to be replaced with a piece of software.” Brian Conrad, Stanford Mathematician, in “Algebra Strikes Back,” California magazine, Fall-Winter 2024