Four Reasons to Rethink your Reliability Improvement Journey

The term “reliability improvement journey” is well established in the chemical process industry. The tortuous journey of one company, spanning more than a decade, is shown in Figure 1 in terms of operational availability (i.e., production) and relative maintenance cost.

**Figure 1 : The reliability improvement journey, adapted from [1].** (Chart: operational availability and relative maintenance cost of one organization, plotted over 16 years.)

The length of a company’s reliability journey reflects the maturity of the “reliability culture”. Here, the term “reliability culture” may be described technically as “the extent to which each decision vector aligns with the company’s target vector”.
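One possible way to make “alignment” concrete is to treat each decision and the target as vectors in performance space and to measure the cosine of the angle between them. The sketch below is a minimal illustration only: the choice of cosine similarity, the two axes and all numbers are assumptions made for this example and are not taken from the underlying study.

```python
import math

def alignment(decision, target):
    """Cosine of the angle between a decision vector and the target vector.

    +1.0 means the decision moves performance straight towards the target,
     0.0 means it moves performance sideways, -1.0 means directly away.
    """
    if len(decision) != len(target):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(d * t for d, t in zip(decision, target))
    norm = math.hypot(*decision) * math.hypot(*target)
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / norm

# Hypothetical axes (assumed for illustration):
# (gain in availability, reduction in relative maintenance cost), in %-points.
target = (4.0, 3.0)
decision_a = (2.0, 1.5)   # same direction as the target, half the magnitude
decision_b = (1.0, -4.0)  # small availability gain, but maintenance cost rises

score_a = alignment(decision_a, target)
score_b = alignment(decision_b, target)
print(f"decision A: {score_a:+.2f}, decision B: {score_b:+.2f}")
```

A score near +1 indicates a decision that pulls in the direction of the target, whereas a negative score indicates a decision that works against it, however well-intentioned.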

It logically follows that the reliability journey may be significantly shortened simply by improving the quality of each decision made by the reliability organization. But how?

Four reasons for a long and arduous reliability journey are presented below. It is intended that these reasons prompt you to critically rethink how you approach the reliability engineering problem in your plant.

Reason 1 – An important role in your reliability organization is vacant.

The production system is, as the term implies, a “system”: it comprises assets, process units, operational logic, storage tanks, supply chains, failure mechanisms, maintenance processes, etc.

This calls for a “systems engineering approach” to the reliability problem, which in turn requires the appointment of a “Systems Reliability Engineer” (SRE). This is an engineering discipline of its own, which – to my knowledge – is not explicitly taught for application in the unique context of a chemical production plant.

As depicted schematically in Figure 2, the role of the SRE is to align and direct the reliability improvement efforts of the reliability organization. That is, to ensure that its members are working on the right topics and that each decision vector aligns with the company’s target vector.

**Figure 2 : An (incomplete) example of a reliability organization.** (Diagram: members of the reliability organization, including the systems reliability engineer, plant manager, process engineer, production expert, logistics expert, reliability engineer, corrosion engineer, inspection engineer and maintenance engineer.)

Precisely how the SRE accomplishes these tasks of direction and alignment is largely outlined in Reasons 2 to 4 below.

Reason 2 – Your targets are poorly defined.

The performance of your production system is described in multiple dimensions (e.g., production volume and maintenance cost) and varies from year to year according to a probability distribution that you have probably not characterized. Further, the achieved performance in a given year may be largely determined by events that are outside of your control. It is likely that the reality of this situation is not adequately accounted for in your target-setting process or in your reliability improvement plan.

The adoption of a systems reliability engineering approach requires that the current stochastic performance of the production system be estimated as a basis for target-setting; see Figure 3. The “target vector” is defined as the gap between the current performance and the target performance and is the basis for aligning the efforts of the reliability organization.

**Figure 3 : Visualization of the current and target system performance in terms of a Probability Density Function (PDF).** (Plot: PDFs of the availability distributions, illustrating poorly defined targets.)

Figure 3 demonstrates that targets in stochastic systems are best specified in terms of two parameters, i.e.: FAIL and TARGET criteria. This practice enables the required performance improvement to be visualized and quantified.
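The two-parameter idea can be sketched numerically. The example below assumes that FAIL and TARGET are thresholds on the performance axis and estimates, from a set of annual availability values, how often the system falls below FAIL and how often it reaches TARGET. The function name, the thresholds and the sample values are hypothetical and serve only as an illustration.

```python
def target_assessment(samples, fail, target):
    """Estimate P(performance < fail) and P(performance >= target).

    samples: annual performance values, e.g. from plant history or simulation.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if fail > target:
        raise ValueError("fail must not exceed target")
    n = len(samples)
    p_fail = sum(1 for s in samples if s < fail) / n
    p_target = sum(1 for s in samples if s >= target) / n
    return p_fail, p_target

# Hypothetical annual availabilities (fractions of the year in production).
availability = [0.91, 0.93, 0.88, 0.95, 0.92, 0.90, 0.94, 0.89, 0.96, 0.92]

p_fail, p_target = target_assessment(availability, fail=0.90, target=0.95)
print(f"P(below FAIL) = {p_fail:.0%}, P(at or above TARGET) = {p_target:.0%}")
```

Expressed this way, an improvement plan can be judged by how far it reduces the first probability and raises the second, rather than by whether a single number was hit in a single year.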

Reason 3 – Your strategy to reduce “waste” is incomplete.

A typical reliability improvement plan consists almost solely of methods that focus on reducing “waste”, that is, hazards that may lead to a production loss. These methods can be characterized as either proactive or reactive in nature, as shown in Table 1.

Table 1 : Examples of proactive and reactive reliability improvement methods.

Proactive	Reactive
Failure Modes and Effects Analysis (FMEA)	Root Cause Analysis
Reliability-Centered Maintenance (RCM)	Defect Elimination
Risk-Based Inspection	Bad Actor Program

In the absence of an overarching systems reliability approach, a reliability improvement plan that focuses solely on reducing “waste” is likely to result in a long, arduous reliability journey, for the following reasons:

The proactive methods tend to be largely theoretical exercises with no strong coupling to the system performance vector(s). It is therefore not practically possible to reach an “optimum” solution. That is, it is not possible to align the decision vector with the target vector.
The reactive methods target a subset of the possible future system hazards which, once alleviated, are quickly replaced by newly recognized hazards. This is a characteristic of the complex stochastic production system. Hence, the extent to which the anticipated gains will be achieved in practice may be highly uncertain. Further, experience has shown that significant knowledge and experience may be required to develop robust and economically viable solutions. The extent to which organizations have access to the required resources (technical, financial and time) is highly variable.

A systems reliability engineering approach will additionally consider the application of capacity “growth” strategies, such as debottlenecking and expansion projects. These types of improvement measures can usually be tightly coupled to system performance targets and can certainly be planned with a higher degree of confidence.

The task of the SRE is to ensure that company resources are wisely invested. This may be done by quantifying the impact of each improvement measure in terms of stochastic system performance.
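As a rough illustration of such a comparison, the sketch below ranks a “waste” reduction measure against a “growth” measure by the production gained per unit of cost. It is deliberately simplified to expected values (capacity multiplied by availability) instead of full stochastic performance, and every name and number in it is hypothetical.

```python
def rank_measures(capacity, availability, measures):
    """Rank improvement measures by expected production gain per unit cost.

    measures maps a name to (new_capacity, new_availability, cost).
    Expected production is approximated as capacity * availability.
    """
    base = capacity * availability
    ranked = []
    for name, (new_capacity, new_availability, cost) in measures.items():
        if cost <= 0:
            raise ValueError(f"cost of '{name}' must be positive")
        gain = new_capacity * new_availability - base
        ranked.append((name, gain, gain / cost))
    # Best gain per unit cost first.
    return sorted(ranked, key=lambda row: row[2], reverse=True)

# Hypothetical plant: capacity 100 units/year at 92 % availability.
measures = {
    # Waste reduction: availability rises from 0.92 to 0.94, cost 2.0.
    "defect elimination": (100.0, 0.94, 2.0),
    # Growth: capacity rises from 100 to 108, cost 4.0.
    "debottlenecking": (108.0, 0.92, 4.0),
}

ranking = rank_measures(100.0, 0.92, measures)
for name, gain, gain_per_cost in ranking:
    print(f"{name}: gain {gain:.2f}, gain per unit cost {gain_per_cost:.2f}")
```

With other input values the ranking may well reverse; the point is only that both kinds of measure can be expressed on the same performance scale and compared.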

Reason 4 – You are using the wrong tool for the job.

Whilst most reliability literature is concerned with “product” reliability engineering, the described methods (e.g., Weibull analysis and FMEA) find relatively little application in a process plant environment. At first glance, the reason for this would seem to be the ratio of (many) Assets to (few) Engineers. However, the real reason is much more interesting. It is because the traditional methods were developed for application in “simple” and “complicated” systems, whereas a process plant is a “complex” system.

The response to this situation has been to trivialize the complex system behavior, for example in the form of a risk matrix. This approach, however, prohibits the realization of optimal outcomes. An alternative response would be to apply methods suited for application in complex systems. For example, simulation is absolutely necessary to make optimal decisions in complex systems.
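To give a feel for what such a simulation involves, here is a minimal Monte Carlo sketch. It is not the tool or model discussed in this article. It assumes a plant of units in series (the plant stops whenever any unit is down), exponentially distributed times to failure and repair, and units that do not age while the plant is stopped. All MTBF and MTTR values are hypothetical.

```python
import random
from statistics import mean

def simulate_year(units, rng, hours=8760.0):
    """Simulate one year of a series system and return its availability.

    units: list of (mtbf_hours, mttr_hours), one tuple per unit.
    """
    t = 0.0
    downtime = 0.0
    while True:
        # Exponential lifetimes are memoryless, so fresh failure times can
        # be drawn for every unit each time the plant restarts.
        time_to_failure, failed = min(
            (rng.expovariate(1.0 / mtbf), i) for i, (mtbf, _) in enumerate(units)
        )
        t += time_to_failure
        if t >= hours:
            break
        mttr = units[failed][1]
        # Repairs running past year end only count up to year end.
        repair = min(rng.expovariate(1.0 / mttr), hours - t)
        downtime += repair
        t += repair
    return 1.0 - downtime / hours

def simulate_availability(units, years=2000, seed=1):
    """Return a list of simulated annual availabilities."""
    rng = random.Random(seed)
    return [simulate_year(units, rng) for _ in range(years)]

# Hypothetical two-unit plant: (MTBF, MTTR) in hours.
baseline = [(2000.0, 48.0), (4000.0, 72.0)]
# Hypothetical measure: MTBF of the first unit is doubled.
improved = [(4000.0, 48.0), (4000.0, 72.0)]

base_runs = simulate_availability(baseline)
improved_runs = simulate_availability(improved)
print(f"mean availability: {mean(base_runs):.3f} -> {mean(improved_runs):.3f}")
```

The output is a distribution of annual availabilities, not a single number, which is exactly what is needed to compare the current performance with FAIL and TARGET criteria. A real plant model would add buffers, parallel trains, operational logic and so on.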

The results of a high-level simulation of a process plant, representing the current system performance, are presented in Figure 4.

**Figure 4 : Left: High-level Block Flow Diagram (BFD) of the production system; Right: Estimated stochastic performance of the production system in relation to the performance targets.**

The developed model also provides a basis for evaluating the merits of proposed measures for improving production system performance. You decide where you are headed: promotion, demotion or mediocrity!

Summary

A technical, systems engineering approach to the process plant reliability engineering problem is neither well described in the literature nor well supported by appropriate tools in practice.

RAMS Mentat GmbH has developed an innovative technical and systems engineering approach – and supporting tool – that enables the reliability and safety performance of an entire production system to be optimized with consideration of capital investment, operational and maintenance cost constraints.

One more good reason to rethink how you approach your reliability improvement journey!

References

[1] “Reliability – How Industry Leaders Take Advantage of this Often-Overlooked Improvement Opportunity,” Solomon Associates, 31 May 2021. [Online]. Available: https://www.solomoninsight.com/blog/reliability-how-industry-leaders-take-advantage-of-this-often-overlooked-improvement-opportunity

Comments

James Reyes-Picknell says
December 13, 2021 at 4:46 AM
I think there’s more than just a technical alignment that’s needed. The article does a great job explaining the technical challenge, but isn’t it a bigger hurdle getting senior management aligned? Changes involved in solving this complex problem require multi-disciplinary approaches (as shown) and those need the support of the multi-disciplinary managers who typically have competing priorities. At the level of engineer and manager where we often work, we are dealing with those, including middle-level managers, who can say “no” to what we are doing or to their participation. They don’t have the authority to say “yes” and have little or no motivation to stick their necks out. We rarely seem to deal with the senior levels who can actually say, “yes”, and sponsor it to happen.
- Andrew Kelleher says
  December 13, 2021 at 7:46 AM
  That is a very nice comment, which I agree reflects the current reality. In the absence of a tool capable of quantifying the complex system behaviour, decisions are often made using “intuition”. This almost inevitably leads to frustration since the decision basis is unclear. The right approach (and tool), however, can make the decision-making process much more transparent. Every “decision” can be formulated as a choice between two options, i.e.: Option A (Status Quo) or Option B (Alternative Future). It follows logically that a decision is always made. The expected outcome associated with each option can be quantified (via simulation, with agreed data from the cross-functional experts) in terms of multiple and/or competing performance criteria. In the case shown, the competing criteria are “Lost Production Cost” and “Maintenance Cost”. There is, however, no reason why other performance criteria cannot be simulated. Competing criteria, however, are no reason for not making a decision. The beauty is that the expected outcomes of Option A and Option B are quantified (based on data and without emotion) and documented; the best possible basis for a decision. Hence, the decision-maker does not need to rely so heavily on his “intuition” but rather on the combined knowledge of his cross-functional team, quantified in terms of the estimated impact on system performance. Don’t be afraid to challenge me with a concrete example!
John Bessman says
December 13, 2021 at 9:49 AM
Hi Andrew,
Terrific article and I found myself nodding in agreement many times. One question I had was how you describe the difference(s) between a “complicated” and a “complex” system. I think I intuitively understand it, but it’s a concept I’ve struggled to “explain upwards” to our decision makers.
Thanks!
- Andrew Kelleher says
  December 13, 2021 at 10:31 AM
  Hello John,
  it is a good question, which I have also researched. The behaviour of a simple system (e.g. a car key) is easily knowable and reproducible. The behaviour of a complicated system (e.g. a car) can be “known” with many structured expert steps, e.g. via defining and characterising the hierarchical structure of components. Solutions that work with complicated systems, however, do not work well with complex systems (e.g. car traffic), which involve too many unknowns and too many interrelated factors to be reduced to rules and processes.
  Complicated systems are, for example: Homogeneous, Linear, Deterministic, Static, Independent and Without Feedback. In contrast, complex systems are: Heterogeneous, Non-Linear, Stochastic, Dynamic, Interdependent and With Feedback.
  The “Stacey” matrix (https://drawingchange.com/project/simple-complicated-and-complex-decision-making-new-visual/) provides a nice visual depiction of types of decision-making strategies in different system types. Simulation is not listed, though it is suitable for complex systems. Thinking of the current supply chain problems, I think now is definitely the time to be “focusing on stability”; in my opinion this is pretty difficult to do well without simulation.
  Best regards, Andrew Kelleher.
Christos Christoglou says
December 22, 2021 at 4:14 AM
I don’t know what I liked more, the article itself or your answer to John Bessman and explanation about complicated and complex systems.
Thanks for both!