The Worst Reliability Requirement

by Mark Powell

Most of us have seen reliability specified using a requirement like the following:

The Zeus 5000 SUV shall have an MTBF of 144,269.5 miles with a 90% confidence.

Some readers may not have seen reliability requirements specified in any other way. What they have always seen reads something like: The widget shall have an MTBF of X with a Y% confidence. This requirement structure is ubiquitous in military and aerospace specifications, which, along with Mil-HDBK-217, have been major influences on reliability specification practices for decades in many industries.

If this specification were to be found in the Zeus 5000 SUV product brochure, the average car buyer might be impressed. Some engineers might even be impressed.

But as a systems engineer, I can tell you that there are a number of major problems with that reliability requirement.

Problem Number One

What was the requirement writer thinking when they wrote that requirement? Were they in a Mil-HDBK-217 mindset? Were they thinking that the exponential distribution model would accurately model Zeus 5000 SUV failures?

The exponential distribution model has the following density function, which uses MTBF (written here as θ) as its only parameter.

$$ f\left(t\right)=\frac{1}{\theta}e^{-\frac{t}{\theta}} $$

If they were in a Mil-HDBK-217 mindset and thinking of an exponential failure model, this MTBF specification of 144,269.5 miles means that the Zeus 5000 SUV has just a 50% reliability at 100,000 miles. Is this what the requirement writer really wanted, a 50% probability of failure before 100,000 miles? That spec in the product brochure would probably not sell a lot of Zeus 5000 SUVs. Could you then blame the manufacturer for using an MTBF specification in their sales literature? It certainly sounds much, much better.
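The 50% figure can be checked from the exponential model's reliability function, R(t) = exp(-t/MTBF). The following is a minimal sketch in Python that assumes the exponential model applies; the function name is illustrative only.

```python
import math

def exponential_reliability(miles: float, mtbf: float) -> float:
    """Probability of surviving to `miles` under an exponential failure model."""
    return math.exp(-miles / mtbf)

mtbf = 144_269.5  # miles, from the example requirement
r_100k = exponential_reliability(100_000, mtbf)
print(f"Reliability at 100,000 miles: {r_100k:.3f}")
```

Seen the other way around, an MTBF of 144,269.5 miles is simply 100,000 miles divided by ln 2, which is the MTBF that puts the median life of an exponential model at 100,000 miles.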

The use of MTBF as a reliability specification can be very misleading to the customer, and perhaps to the engineers responsible for a design to meet it, which is usually not desirable.

Problem Number Two

Suppose this requirement is given to the Zeus 5000 SUV design team as is. The design team must develop a cost-effective vehicle design that will satisfy this requirement, exactly as stated. Because MTBF is the one and only parameter of the exponential distribution model, design engineers almost always automatically interpret such a requirement to mean that they should design the product to fail with an exponential model. Often they don’t even realize that other models could be used. Because of the memoryless property of the exponential model, anything in nature or made by the hand of mankind can at best only approach failure behavior consistent with the exponential model. No matter how hard the design team works at it, they will not be able to make the Zeus 5000 SUV fail exactly with an exponential failure distribution. That is not critically important, but it could cause some design effort to be wasted.

Suppose this design team is aware of other models besides the exponential. The design engineers have been asked (or required) to design the Zeus 5000 SUV such that it will fail with an MTBF of 144,269.5 miles. The MTBF calculation does not depend on any particular failure distribution. So, for a given MTBF, the resulting reliability at 100,000 miles can be quite different from the 50% (or whatever) value that the requirement writer may have actually had in mind. Figure 1 shows three different failure distributions that a design team might produce for the Zeus 5000 SUV, all with an MTBF of 144,269.5 miles (green vertical line).

Figure 1: Three different design failure distributions for the Zeus 5000 SUV (three plots of failure distributions: early failures, steady-state failures, and wear-out failures).

Design failure distribution B is the exponential model that the requirement writer might have had in mind if they were in a Mil-HDBK-217 mindset. The reliability at 100,000 miles is 50%. Design failure distribution A has a reliability at 100,000 miles of 30.8%, i.e., 69.2% of failures will occur before 100,000 miles (crosshatched area under the curve).

Now you might ask, why would the designers want less reliability in the design for the Zeus 5000 SUV? Well, they probably wouldn’t personally; engineers tend to take pride in their work. But recall that based on the requirement they were given, they were not asked to design for any particular reliability level, just to meet that MTBF specification with their design. The design engineers are almost always under pressure from management to minimize costs in the design. A design that fails more often earlier in its life is usually much cheaper to manufacture than one that fails later. Such an effort to please management with cost savings, producing a failure distribution like design failure distribution A in figure 1, comes with hidden costs for the company: the lower the reliability, the higher the warranty repair costs. These two factors, manufacturing cost and warranty repair cost, play a significant role in the cost trades that should be performed to come up with a reliability specification in the first place. If the warranty cost requirement is not provided simultaneously with reliability specified using an MTBF, the desire to drive down manufacturing costs may result in an even worse reliability than that in design failure distribution A. If the design and manufacture of the Zeus 5000 SUV is to be subcontracted out, and the Zeus Company is responsible for warranty repairs, a design and manufacturing subcontractor given an MTBF-based reliability requirement can submit a much lower bid with a product design that fails more often earlier rather than later, all while meeting that MTBF to the letter.

Design failure distribution C has a reliability of 95.3% at 100,000 miles; only 4.7% of failures will occur before 100,000 miles (crosshatched area under the curve). What if the requirement writer was not in a Mil-HDBK-217 mindset, and truly wanted at least 95% reliability at 100,000 miles to reduce warranty repair costs to something tolerable? The requirement writer did not ask for >95% reliability at 100,000 miles in the Zeus 5000 SUV, just an MTBF of 144,269.5 miles.
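To see how one MTBF can hide very different reliabilities, here is a rough sketch that holds the mean life fixed at 144,269.5 miles and varies only the shape of the failure distribution. The Weibull family and the shape values 0.5, 1, and 7 are assumptions chosen for illustration; the article does not state which distributions lie behind Figure 1, although these shapes happen to give reliabilities close to the values quoted for designs A, B, and C.

```python
import math

MTBF = 144_269.5     # miles, from the example requirement
MISSION = 100_000.0  # miles at which reliability is evaluated

def weibull_scale_for_mean(mean: float, shape: float) -> float:
    """Scale parameter that gives a Weibull distribution the requested mean."""
    return mean / math.gamma(1.0 + 1.0 / shape)

def weibull_reliability(miles: float, shape: float, scale: float) -> float:
    """Probability of surviving to `miles` under a Weibull failure model."""
    return math.exp(-((miles / scale) ** shape))

# Assumed shapes: below 1 means early failures dominate, 1 is the
# exponential model, above 1 means wear-out failures dominate.
designs = [("A (early failures)", 0.5), ("B (exponential)", 1.0), ("C (wear-out)", 7.0)]

reliability_by_design = {}
for label, shape in designs:
    scale = weibull_scale_for_mean(MTBF, shape)
    reliability_by_design[label] = weibull_reliability(MISSION, shape, scale)
    print(f"{label:20s} mean life = {MTBF:,.1f} mi   "
          f"R(100,000 mi) = {reliability_by_design[label]:.3f}")
```

All three rows report the same mean life, so all three would satisfy the MTBF requirement as written, yet the reliability at 100,000 miles differs widely between them.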

The use of MTBF in a reliability specification can easily mislead a design team or contractor into producing low reliability designs in the interest of cost savings.

Problem Number Three

The third problem has to do with verification or testing. Verification or testing for reliability, or for an MTBF specification, can be quite expensive. Two of the primary cost drivers are how many products must be tested and how long they must be tested to observe enough failures to calculate a reasonable estimate of MTBF. If the design that meets the MTBF specification has a distribution that will produce earlier failures, then the number of products tested and the duration of the test can be decreased. This means that the test should be less expensive. Designs that fail with design failure distribution A in figure 1 will be much less expensive to test than designs that fail with either failure distribution B or C, all because they will fail more often earlier.
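A back-of-the-envelope sketch of this effect follows. It assumes Weibull failure distributions that share the required MTBF (shapes 0.5, 1, and 7 standing in for designs A, B, and C) and a hypothetical test plan in which every unit is run to 50,000 miles with the aim of observing about 10 failures. It sizes the test only by the expected failure count and ignores statistical precision, so treat it as a rough comparison rather than a test-planning method.

```python
import math

MTBF = 144_269.5  # miles, from the example requirement

def weibull_fraction_failed(miles: float, shape: float, mean: float = MTBF) -> float:
    """Fraction of units expected to fail by `miles` for a Weibull with the given mean."""
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return 1.0 - math.exp(-((miles / scale) ** shape))

def units_needed(target_failures: int, test_miles: float, shape: float) -> int:
    """Units to put on test so the expected failure count reaches the target."""
    return math.ceil(target_failures / weibull_fraction_failed(test_miles, shape))

# Hypothetical plan: run each unit to 50,000 miles, aim to see 10 failures.
units_by_design = {
    label: units_needed(10, 50_000, shape)
    for label, shape in [("A", 0.5), ("B", 1.0), ("C", 7.0)]
}
for label, units in units_by_design.items():
    print(f"design {label}: about {units:,} units on test")
```

Under these assumptions the early-failure design needs the fewest test units and the wear-out design needs by far the most, even though all three have the same MTBF.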

The use of MTBF in a reliability specification can easily mislead a design team or contractor into producing low reliability designs in the interest of saving verification costs.

Problem Number Four

This is also a problem in verification and testing, but not so much a problem with MTBF. It has to do with the word confidence in the requirement. When we use the word confidence about a property of a product, we usually mean something along the lines of how sure we can be that the property is true. So our MTBF-based reliability requirement means to most of us that we should be 90% sure that the MTBF for the Zeus 5000 SUV is 144,269.5 miles. That presents a problem. For a continuous quantity like MTBF, the probability that it takes any one specific value is identically zero. If the requirement writer were aware of this, they might rewrite the requirement (adding the words at least) to read:

The Zeus 5000 SUV shall have at least an MTBF of 144,269.5 miles with a 90% confidence.

At first glance, this seems to solve the problem, or at least to address this verification issue. It is now possible to calculate a probability that the MTBF ≥ 144,269.5 miles. But does that help anything relative to verified reliability? If you revisit figure 1, a slight shift of each of the design failure distributions to the right will also move the MTBF (green lines) to the right, and you can still have a large variation of reliabilities possible at 100,000 miles in the design. So, even an MTBF-based reliability requirement that fixes that problem still does not help much with regard to the reliability achieved. This of course is due to the use of MTBF in the requirement.

But it gets worse as a result of the word confidence. When design and verification engineers see the word confidence in a requirement, they almost always go immediately to classical statistics tests using a confidence interval or a confidence level (a one-sided confidence interval, as needed for our improved requirement, is often referred to as a confidence level). One of the dirty little secrets often not emphasized in the Stats 101 classes engineers take in their undergraduate curricula is that an X% confidence interval does not contain X% probability. In fact, a confidence interval does not contain probability at all. Neyman, when he introduced the confidence interval, specifically used the word confidence so that no one would make the mistake of thinking it contained probability; that’s rather ironic today. That being said, we can still compute the probability that the MTBF is inside the limits of a 90% confidence interval, or rather how sure we can be that the true value of the MTBF is between the 90% confidence interval limits. The big surprise is that this probability will always be larger than 90% due to the inherent conservatism in classical stats recipes, quite often much, much larger. In such calculations for some real-world problems, I have observed the probability over the limits of a 90% confidence interval to be as high as 99.92%.

At first glance, that doesn’t sound bad. But how much more did it cost to actually design the Zeus 5000 SUV and verify that its MTBF is inside a 90% confidence interval, or that its MTBF is greater than 144,269.5 miles with a 90% confidence level? In my experience, the verification costs alone can double to provide that extra unrequired 9.92% of assurance.

Verifying a reliability using an MTBF specification with a specified confidence tells us very little about what the product’s actual reliability might be, and how sure we can be that the reliability is high enough. Using classical stats recipes to verify a reliability requirement will always provide more assurance than we really need, at significant additional verification costs.
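As a rough illustration of the gap between a confidence level and a probability, the sketch below computes a classical one-sided 90% lower confidence bound on MTBF and then asks how probable it is that the MTBF exceeds that bound. Everything specific here is an assumption made for the example: exponential failures, a time-truncated test with 1,000,000 total test miles and 3 failures, and a prior on the failure rate proportional to 1/rate for the probability calculation. A different prior would give a different number, so this is one possible reading of the comparison, not the calculation behind the 99.92% figure.

```python
import math

def poisson_cdf(r: int, mean: float) -> float:
    """P(N <= r) for a Poisson count N with the given mean."""
    return sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(r + 1))

def classical_lower_bound(total_miles: float, failures: int, confidence: float = 0.90) -> float:
    """One-sided lower confidence bound on MTBF for a time-truncated test,
    assuming exponential failures. Bisection finds the expected failure
    count at which P(N <= failures) equals 1 - confidence."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(failures, mid) > 1 - confidence:
            lo = mid  # failures still too likely: expected count must be larger
        else:
            hi = mid
    return total_miles / ((lo + hi) / 2)

def posterior_prob_mtbf_at_least(bound: float, total_miles: float, failures: int) -> float:
    """P(MTBF >= bound | data) under the assumed prior proportional to 1/rate.
    The posterior failure rate is Gamma(failures, total_miles); needs failures >= 1."""
    return 1.0 - poisson_cdf(failures - 1, total_miles / bound)

total_miles, failures = 1_000_000.0, 3  # hypothetical test results
bound = classical_lower_bound(total_miles, failures)
posterior = posterior_prob_mtbf_at_least(bound, total_miles, failures)
print(f"90% lower confidence bound on MTBF: {bound:,.0f} miles")
print(f"Probability that MTBF exceeds that bound: {posterior:.3f}")
```

Under these assumptions the probability attached to the 90% confidence bound comes out above 90%, which is the kind of unrequested extra assurance described above.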

Problem Number Five

This is a systems engineering problem. This requirement combines the performance requirement with the verification requirement. Requirements documents typically have a performance characteristics section, and a quality assurance or verification section. The performance characteristic in this requirement is the required value of reliability for the Zeus 5000 SUV, or in this case the required value of its MTBF. The verification specification is that the design will be satisfactory if we can have 90% confidence that this performance is satisfied with the test results.

Specifying performance and verification in the same requirement is not only a faux pas in systems engineering, it can lead design engineers to design the product specifically to pass a test, rather than to focus on potential failure modes and the physics of potential failures. This can be especially problematic if a subcontractor is developing the design.

Conclusion

One of the admonitions I use regularly in my Systems Engineering classes is to say what you mean, and mean what you say. Separating the performance requirement for reliability and its verification into two requirements is easy. Also easy is to specify the reliability that you really need in a simple requirement.

Suppose in our example the Zeus 5000 SUV requirement writer actually wanted at least 95% reliability at 100,000 miles. It is easy to say exactly that in a requirement:

The Zeus 5000 SUV shall have at least 95% reliability at 100,000 miles.

Suppose in our example that the requirement writer wanted to be 90% sure based on the test results that this requirement is satisfied before we go to manufacture; we can also say this very easily.

Verification of the Zeus 5000 SUV reliability requirement shall be successful if test results provide at least 90% probability that the reliability is at least 95% at 100,000 miles.

There is no way that anyone can misinterpret these requirements. There is no way that a design and manufacturing subcontractor can provide a lower reliability while meeting the contracted-for MTBF to reduce costs and increase profits.
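One way such a verification requirement could be evaluated is sketched below. It assumes a simple pass/fail test in which each unit either survives to 100,000 miles or fails earlier, and a uniform prior on the reliability, so that the posterior is a Beta distribution. Both the test design and the prior are assumptions made for this example; the requirement itself does not prescribe them.

```python
import math

def prob_reliability_at_least(target: float, survivors: int, failures: int) -> float:
    """Posterior P(reliability >= target) with a uniform prior on reliability.
    The posterior is Beta(survivors + 1, failures + 1); for integer
    parameters its CDF can be written as a binomial sum."""
    n = survivors + failures
    return sum(
        math.comb(n + 1, j) * target**j * (1 - target) ** (n + 1 - j)
        for j in range(survivors + 1)
    )

def min_units_with_no_failures(target: float, assurance: float) -> int:
    """Smallest zero-failure test that reaches the required probability."""
    units = 1
    while prob_reliability_at_least(target, units, 0) < assurance:
        units += 1
    return units

units = min_units_with_no_failures(target=0.95, assurance=0.90)
print(f"Units that must all survive 100,000 miles: {units}")

# The same function handles tests in which some units fail.
p = prob_reliability_at_least(0.95, survivors=99, failures=1)
print(f"P(R >= 95%) after 99 survivors and 1 failure: {p:.3f}")
```

The verification criterion then reduces to a single comparison: the test passes if the computed probability is at least 0.90.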

Say what you mean, and mean what you say when you specify reliability, and you just might get what you really want. It is easy to avoid using the worst reliability requirement, and a really good idea to do so.