How to Properly Calculate System Availability

I recently received a request for my opinion on calculating system availability using the classic formula

$$ \displaystyle \large A=\frac{MTBF}{MTBF+MTTR}$$

The work is to create a set of goals for various suppliers and contractors to achieve. The calculation values derive from vendor data sheets and available information concerning MTBF and MTTR. The project is in the design phase; thus, they do not have working systems available to measure actual availability.

How would you go about improving on this approach?

Context

From what I understand, the system is actually a collection of systems supporting something like a bus station within a transit system. There are sound, surveillance, ticketing, passenger information, and similar systems that all connect to a fleet management system. The desire is to have all of these systems operate at a specific station with at least 99.8% availability.
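To get a feel for what 99.8% means, here is a minimal sketch that converts the target into allowed downtime and into a per-subsystem target. It assumes round-the-clock operation and that the four named subsystems are independent and all must work for the station to count as available (a series arrangement). Neither assumption comes from the request, and the function names are mine.

```python
HOURS_PER_YEAR = 8760  # assumption: 24/7 service; use actual service hours if known


def allowed_downtime_hours(availability, period_hours=HOURS_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return (1.0 - availability) * period_hours


def equal_allocation(system_target, n_subsystems):
    """Availability each of n independent series subsystems must reach
    so that their product meets the system target."""
    return system_target ** (1.0 / n_subsystems)


station_target = 0.998
subsystems = ["sound", "surveillance", "ticketing", "passenger information"]

per_subsystem = equal_allocation(station_target, len(subsystems))
print(f"Station downtime budget: {allowed_downtime_hours(station_target):.1f} h/year")
print(f"Each of {len(subsystems)} subsystems needs about {per_subsystem:.5f}")
```

An equal split is only a starting point; in practice the allocation would lean on the risk assessment and cost data discussed later.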

As mentioned, this project is just setting specifications at this point and thus cannot measure actual performance.

To make the system availability problem a little more difficult, let’s say this is a new transit agency and this is the first installation of bus stations. Otherwise, the actual performance of similar systems, even those built on different technology, may provide a baseline set of measurements. Other useful data from existing systems may include customer or staff complaints, repair data, spares data, or work order tickets. Adding a new station to an existing system has the benefit of a rich set of data to draw from to create specifications.

Proposed Set of Calculations

As outlined above, the calculation of availability is just the ratio of uptime to total time. What matters is what is included in each of those terms. Component vendors rarely know the operating expectations or conditions and thus may report generic or compiled MTBF and MTTR values.

For a sound system, the amplifier vendor may report an MTBF value based on a MIL-HDBK-217 parts count prediction using all default settings. Or they may have used reported field failures and a few assumptions about operating time for all shipped units. The data sheet rarely specifies the source of the reported data.

Repair time, the MTTR value, is problematic for a vendor to report accurately. If they report a repair time, it often assumes perfect and immediate diagnostics and a technician already on site with tools and spare parts. That ideal-world MTTR just doesn’t happen, yet hands-on repair time is really the only thing the vendor can control and report.

Using vendor-reported MTTR will inflate the availability value because the MTTR value will be artificially low. Vendors do not know the expected maintenance and spare part policies and thus are unable to include them in the reported MTTR value.
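A quick sketch of how much this matters. All numbers below are hypothetical, chosen only to show the effect of adding diagnosis, travel, and spares delays to a data-sheet repair time.

```python
def availability(mtbf_hours, mttr_hours):
    """Classic steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical values for illustration only
mtbf = 10_000.0            # data-sheet MTBF, hours
vendor_mttr = 2.0          # hands-on repair time from the data sheet, hours
diagnosis_delay = 4.0      # time to notice and diagnose the fault, hours
logistics_delay = 28.0     # technician travel plus waiting for a spare, hours

field_mttr = vendor_mttr + diagnosis_delay + logistics_delay

vendor_value = availability(mtbf, vendor_mttr)
field_value = availability(mtbf, field_mttr)
print(f"Using vendor MTTR:        {vendor_value:.5f}")
print(f"Using field restore time: {field_value:.5f}")
```

With these made-up inputs the same component sits above a 99.8% target on paper and below it once realistic restore time is counted.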

Why Use MTBF and MTTR at all?

My first issue is the use of MTBF, of course. We’re not interested in the mean but rather in the onset of failures and how failure patterns may change over time. Should we plan on replacing all station sound system amplifiers every five years?

The question was also missing the duration of interest. Is the availability over an hour, a day of use, a week, a year, or 20 years? That matters, as availability over any specific duration will likely be different and entail a different set of risks and expectations, along with cost of ownership considerations.

If over a year, buy highly reliable components that have a very low chance of failure over the first year of operation. We can avoid much of the concern over maintenance time, as repairs will rarely, if ever, occur. We can safely defer any maintenance action until after the time period of interest.

If over 20 years, the uncertainty around any reliability or repair time predictions increases. It also becomes more likely that significant wear-out failure mechanisms will appear, even in electronic systems.

Using only MTBF to represent the reliability of a component or system smooths out and ignores the changing nature of failure rates over time. Some components will have early-life failures and thus a decreasing failure rate for some period of time, while others will wear out, sometimes relatively quickly in specific environments and use conditions. It is rare to have a stable situation where failure rates are well modeled using only MTBF across the board.
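To illustrate, here is a small sketch comparing three Weibull life distributions that all share the same mean (the same MTBF) but have different shape parameters. The 50,000-hour MTBF and the shape values are hypothetical; the point is only that the chance of surviving one year or five years differs widely even though the data-sheet number is identical.

```python
import math


def weibull_scale_for_mean(mean_hours, shape):
    """Scale (eta) that gives the requested mean for a given shape (beta)."""
    return mean_hours / math.gamma(1.0 + 1.0 / shape)


def weibull_reliability(t_hours, shape, scale):
    """Probability of surviving to t_hours without failure."""
    return math.exp(-((t_hours / scale) ** shape))


MTBF_HOURS = 50_000.0        # hypothetical data-sheet value
ONE_YEAR = 8_760.0           # hours, assuming continuous operation
FIVE_YEARS = 5 * ONE_YEAR

cases = [
    ("early-life (beta=0.5)", 0.5),
    ("constant rate (beta=1)", 1.0),
    ("wear-out (beta=3)", 3.0),
]

survival = {}
for label, shape in cases:
    scale = weibull_scale_for_mean(MTBF_HOURS, shape)
    survival[label] = (
        weibull_reliability(ONE_YEAR, shape, scale),
        weibull_reliability(FIVE_YEARS, shape, scale),
    )
    r1, r5 = survival[label]
    print(f"{label:24s} R(1 yr) = {r1:.3f}   R(5 yr) = {r5:.3f}")
```

The constant-rate case is what the MTBF-only calculation silently assumes.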

What to Do Instead

Here’s my approach to create a set of reliability and maintenance specifications that should provide meaningful guidelines for the vendors and contractors building the bus station.

Model the system using a reliability block diagram (RBD). Create enough detail to capture repairable and replaceable elements of each system that makes up a bus station.

Perform FMEA or some form of risk assessment to determine the system’s most important elements. Identify critical-to-operation elements, as that provides a focus for finding the best available reliability and maintenance data. Identify single-point failure elements that would shut down or seriously hamper operations. Identify the information you need to gather, or whose accuracy you need to improve, to populate the system RBD.

If vendors are the only source of data, ask for better data than MTBF and MTTR. Ask for the supporting data and distributions. Ask for the effects of operating time and environmental conditions. Ask for the expected failure mechanisms and any models relating use and stress to life distribution parameters. Ask for the Weibull, lognormal, or other appropriate distribution that describes how reliability or repair times change over time.

If vendors do not have sufficient data, talk to other transit agencies with similar systems. Ask for their data, or help them analyze it so that both of you benefit from the results. Check with professional organizations and check the literature for information on system performance and expected failure mechanisms.

Populate the system RBD with the best available data and look for areas that still need data or improved data (if the uncertainty and impact on results are large).
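As a rough sketch of what a populated RBD calculation looks like, the snippet below combines blocks in series and in parallel, assuming independent failures. The structure (two redundant recorders, and so on) and every availability value are hypothetical placeholders, not data from any vendor.

```python
from math import prod


def series(*blocks):
    """All blocks must work: availabilities multiply."""
    return prod(blocks)


def parallel(*blocks):
    """Any one block is enough: the group fails only if every block fails."""
    return 1.0 - prod(1.0 - a for a in blocks)


# Hypothetical block availabilities for illustration only
sound = series(0.9995, 0.9998)                         # amplifier, speakers
surveillance = series(0.9990, parallel(0.995, 0.995))  # cameras, two redundant recorders
ticketing = 0.9992
passenger_info = 0.9994

station = series(sound, surveillance, ticketing, passenger_info)

for name, value in [("sound", sound), ("surveillance", surveillance),
                    ("ticketing", ticketing), ("passenger info", passenger_info),
                    ("station", station)]:
    print(f"{name:15s} {value:.5f}")
```

Listing the subsystem values next to the station result makes it easy to see which blocks dominate the shortfall and therefore where better data is worth chasing.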

Estimate the cost or impact of downtime. Availability goals are high for systems that provide value and must do so regularly, often with little or no interruption. Knowing the value of an hour of operation helps you balance the costs of building and maintaining the system against the value of the system.

For absolutely essential data, conduct accelerated life tests to create time-to-failure distribution estimates. Run experiments and get the data you need to create a meaningful system availability estimate.
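Once failure times are in hand, one simple way to turn them into a distribution estimate is a Weibull probability plot fit. The sketch below uses median rank regression (Benard's approximation with a least-squares line) and assumes complete data: every unit ran to failure, with no censoring, and any acceleration factor has already been applied. The failure times are invented. Real test or field data usually includes unfailed units and calls for a method that handles censoring.

```python
import math


def fit_weibull_median_rank(failure_times):
    """Least-squares fit on a Weibull probability plot (complete data only).

    Returns (shape, scale), i.e. (beta, eta).
    """
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)   # Benard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - median_rank)))

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar

    shape = slope                           # y = beta*ln(t) - beta*ln(eta)
    scale = math.exp(-intercept / slope)
    return shape, scale


# Invented failure times in hours, for illustration only
hypothetical_failure_hours = [1_200, 2_900, 4_100, 5_600, 7_300, 9_800]
shape, scale = fit_weibull_median_rank(hypothetical_failure_hours)
print(f"shape (beta) ~ {shape:.2f}, scale (eta) ~ {scale:.0f} h")
```

A shape estimate well above 1 points to wear-out, and one below 1 to early-life failures, which is exactly the information a single MTBF value hides.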

Finally, estimate the cost of ownership for each subsystem. Just because it is repairable doesn’t mean your organization will have the funds to repair it. It may be worthwhile to spend more up front to avoid major recurring expenses for years to come.
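Here is a minimal sketch of that comparison. It takes the expected number of failures for each year (which would come from the life distributions, so it need not be constant), and adds repair cost and the value of lost operating hours to the purchase price. The two options, their prices, and the downtime cost per hour are all hypothetical, and the sketch ignores discounting and preventive maintenance.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    purchase_cost: float
    expected_failures_by_year: list   # one entry per year of the study period
    cost_per_repair: float
    downtime_hours_per_failure: float


def cost_of_ownership(option, downtime_cost_per_hour):
    """Purchase price plus expected repair and downtime costs (undiscounted)."""
    cost_per_failure = (option.cost_per_repair
                        + option.downtime_hours_per_failure * downtime_cost_per_hour)
    return option.purchase_cost + sum(option.expected_failures_by_year) * cost_per_failure


# Hypothetical amplifier options over a five-year period
budget = Option("budget amplifier", 2_000.0, [0.3, 0.2, 0.2, 0.3, 0.5], 400.0, 30.0)
premium = Option("premium amplifier", 3_500.0, [0.05, 0.05, 0.05, 0.1, 0.1], 400.0, 30.0)

DOWNTIME_COST_PER_HOUR = 50.0   # hypothetical value of one hour of operation

for option in (budget, premium):
    total = cost_of_ownership(option, DOWNTIME_COST_PER_HOUR)
    print(f"{option.name:18s} five-year cost ~ {total:,.0f}")
```

With these invented numbers the more expensive unit comes out cheaper over five years; with a different downtime value or repair policy the answer can flip, which is why the estimate is worth doing.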

There are an infinite number of ways to achieve a given system availability target. Constraining the options by cost of ownership and ease of maintenance may help you find the right solution.