The Maintenance & Reliability Series

Short articles on maintenance and reliability engineering subjects.

James Kovacevic is the primary author writing articles for the series.

Never miss an article: sign up for the Maintenance & Reliability Series list to receive a weekly update highlighting the latest article.

Let us know your reactions and thoughts, plus any questions, in the comments section below each article.

by James Kovacevic

Failure Reporting, Analysis and Corrective Action System (FRACAS)

Using a System to Record, Report And Eliminate Defects

Why is it that some organizations seem to break the reactive cycle and others don’t? After all, most organizations have a PM program and some form of planning and scheduling program, right? The key difference is that those that do break the cycle use their failure data to systematically eliminate defects and issues from their processes and equipment. This doesn’t mean adding a new PM every time something fails, which just won’t work.

To eliminate the defects and issues, the organization needs to collect meaningful data to analyze and act on. This is where FRACAS comes in.

by James Kovacevic

Establishing the Frequency of Failure Finding Maintenance Inspections

Preventing The Consequences Of A Hidden Failure From Devastating Your Organization.

Ever wonder how some of the worst industrial disasters occur? It is usually the result of multiple failures: the failure of the primary system and the failure of its protective systems. Ensuring the protective system(s) are not in a failed state should be of utmost importance to any organization. But how often should we test the protective systems to ensure the required availability?

Establishing the correct frequencies of the inspection/testing activities for these protective system(s) is critical not only to the success but also to the safety and reputation of any organization. Too infrequently, and the organization is at risk of a major incident. Too frequently, and the organization is subjected to excess planned downtime, an increased probability of maintenance-induced failures and increased maintenance cost.

This article will continue the discussion on establishing the correct inspection frequency in a maintenance program. There are three different approaches to use, based on the type of maintenance being performed;

Time-Based Maintenance
On-Condition Maintenance
Failure Finding Maintenance

This article will focus on Failure Finding Maintenance.

What Are Protective Systems, Hidden Failures and Failure Finding Maintenance

A protective system or device is a system or device which is designed to protect and mitigate or reduce the consequences of failure. These consequences may be safety, environmental or operational in nature. These devices or systems are designed to;

Alert – warn of potential problem conditions (e.g. alarm)
Relieve – prevent failure conditions from causing greater problems (e.g. pressure relief valve)
Shutdown – stop a process to prevent greater problems from occurring (e.g. motor overload)
Mitigate – alleviate the consequences of a failure (e.g. fire suppression equipment)
Replace – continue to provide a function by an alternative means (e.g. backup pump)
Guard – prevent an accident from occurring (e.g. E-Stop)

Knowing what a protective device or system is, consider a pressure relief valve that has corroded and seized in the closed position: the failure would not be evident to the operators. This is a hidden failure. A hidden failure can be defined as a failure that, on its own, will not become evident to the operating crew under normal circumstances. Obviously, this could lead to significant consequences if the tank that the pressure relief valve is protecting becomes overpressurized. This is where failure finding maintenance comes in.

Failure-finding maintenance is a set of tasks designed to detect hidden failures in protective systems or devices, reducing the likelihood that the protective system and the protected equipment are in a failed state at the same time. So how do you determine how often the protective systems should be checked for failure? Establish the frequency using a formula.

Establishing Failure Finding Maintenance Frequencies Using Formulas

A single formula takes all of these variables into account to establish the failure finding interval (FFI);

$$\mathrm{FFI} = \frac{2 \times M_{TIVE} \times M_{TED}}{M_{MF}}$$

Where;

M_TIVE= MTBF of the protective device or system
M_TED= Mean Time Between Failure of the Protected Function
M_MF= Mean Time Between Multiple Failures

Consider an example from RCM2 to see how this works. The users of a duty pump and a standby pump want the following from the system;

The probability of a multiple failure to be less than 1 in 1000 in any one year (M_MF = 1000 years)
The rate of unanticipated failures of the duty pump is 1 in 10 years (M_TED = 10 years)
The rate of unanticipated failures of the standby pump is 1 in 8 years (M_TIVE = 8 years)

Therefore the correct failure finding interval would be;

FFI = (2 x 8 x 10) / 1000
FFI = (160)/1000
FFI = 0.16 years
0.16 years x 12 months/year = 1.92 months, or approximately 2 months

This indicates that the standby pump must be checked approximately every two months to verify it is fully operational. If this check is not performed, the likelihood of a multiple failure increases.
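
The worked example above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `failure_finding_interval` is illustrative, and it assumes all three inputs are expressed in the same time unit (years here), as in the RCM2 example.

```python
def failure_finding_interval(m_tive: float, m_ted: float, m_mf: float) -> float:
    """FFI = (2 x M_TIVE x M_TED) / M_MF.

    All three inputs must use the same time unit; the result is in that unit.
    """
    return (2 * m_tive * m_ted) / m_mf


# RCM2 pump example: standby pump M_TIVE = 8 years, duty pump M_TED = 10 years,
# tolerable multiple failure rate of 1 in 1000 years (M_MF = 1000 years)
ffi_years = failure_finding_interval(m_tive=8, m_ted=10, m_mf=1000)
ffi_months = ffi_years * 12
print(f"FFI = {ffi_years:.2f} years = {ffi_months:.2f} months")
# prints: FFI = 0.16 years = 1.92 months
```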

Lastly, if the failure finding task itself can cause the protective device to fail, a different approach must be used, which is beyond the scope of this article.

Do you have a program in place to check your protective systems? If not, are you aware of the risk that your organization is exposed to? Take the time to determine your protective systems and establish your failure finding tasks.

Remember, to find success; you must first solve the problem, then achieve the implementation of the solution, and finally sustain winning results.

I’m James Kovacevic
Eruditio, LLC
Where Education Meets Application
Follow @EruditioLLC


by James Kovacevic

Establishing the Frequency of On-Condition Maintenance Inspections

Ensuring The Inspections Will Catch the Defect Before A Functional Failure Occurs

Ever wonder how some organizations make their vibration or thermographic program work, and not only work but deliver huge results to their organization? They use a systematic approach to establishing the correct frequencies of inspection. Establishing the correct frequencies of maintenance activities is critical to the success of any maintenance program. Too infrequently and the organization is subjected to failures, resulting in poor operational performance. Too frequently, and the organization is subjected to excess planned downtime and an increased probability of maintenance induced failures.

This article will continue the discussion on establishing the correct frequency in a maintenance program. There are three different approaches to use, based on the type of maintenance being performed;

Time-Based Maintenance
On-Condition Maintenance
Failure Finding Maintenance

This article will focus on On-Condition Maintenance. While establishing the frequency for Fixed Time Maintenance activities is complex and more of a science, establishing the frequency for Condition Based Maintenance (or On-Condition) inspections is a mix of science and art.

Construct the P-F Curve & Establish the P-F Interval

The first step to determining the inspection frequency for on-condition tasks is to construct the P-F curve and establish the P-F interval. Constructing a P-F curve requires recording the result of each inspection and plotting it against elapsed time. If enough measurements are taken, a fairly consistent curve can be developed for each failure mode. Gathering the data carefully and consistently will improve the quality of the P-F curve. Let’s use an example from RCM2;

The tread depth on a tire is directly related to the linear distance traveled. Based on the data collected, it is safe to say that the tire wears 1 mm for every 3,000 miles. So for a tire with 12 mm of tread when new, a potential failure point of 3 mm and a functional failure point of 2 mm, the P-F interval is 3,000 miles (the 1 mm of wear between P and F).
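
For a linear P-F curve, the interval is simply the change in condition between the potential failure point (P) and the functional failure point (F), multiplied by the distance (or time) per unit of deterioration. A minimal Python sketch of the tire example follows; the function and parameter names are hypothetical, and it assumes a constant wear rate.

```python
def linear_pf_interval(p_point: float, f_point: float, miles_per_unit: float) -> float:
    """P-F interval for a condition that deteriorates at a constant rate.

    p_point: condition reading at the potential failure point (e.g. 3 mm of tread)
    f_point: condition reading at the functional failure point (e.g. 2 mm of tread)
    miles_per_unit: distance traveled per unit of deterioration (e.g. 3000 miles per mm)
    """
    return (p_point - f_point) * miles_per_unit


# Tire example: 1 mm of wear every 3000 miles, P at 3 mm, F at 2 mm
pf_interval = linear_pf_interval(p_point=3.0, f_point=2.0, miles_per_unit=3000)
print(pf_interval)  # 3000.0
```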

Now, this works quite well for linear P-F curves because the deterioration is predictable. So how do you construct a P-F curve for a non-linear failure mode? It is a bit more complex, and a bit more of an art. Let’s use another example;

A bearing will operate with minimal vibration under normal operations. As a defect materializes, the vibration will increase exponentially as the defect gets worse. While the P-F Interval will be the time (or operating cycles) from the point the defect can be detected (potential failure point) to the point it becomes a functional failure, its rate of deterioration will increase dramatically towards the end of its life. This can be quantified just as the tire in the above example, with the right data.
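
With the right data, the same idea extends to a non-linear curve. The sketch below assumes (as a modeling choice, not a rule from the source) that the vibration reading grows exponentially, v(t) = a·e^(b·t). It fits a and b by least squares on the logarithm of the readings, then computes the time to climb from the potential failure level P to the functional failure level F as ln(F/P)/b. The readings and both levels are hypothetical.

```python
import math


def fit_exponential(times, readings):
    """Least-squares fit of ln(reading) = ln(a) + b * t; returns (a, b)."""
    n = len(times)
    logs = [math.log(v) for v in readings]
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    b = sxy / sxx                      # growth rate per unit of time
    a = math.exp(y_mean - b * t_mean)  # fitted reading at t = 0
    return a, b


def exponential_pf_interval(b, p_level, f_level):
    """Time for a reading growing as a * exp(b * t) to rise from level P to level F.

    Solving a * exp(b * t) = level for t at both levels and subtracting
    gives ln(F / P) / b, which does not depend on a.
    """
    return math.log(f_level / p_level) / b


# Hypothetical weekly overall vibration readings (mm/s) for one bearing failure mode
weeks = [0, 1, 2, 3, 4]
vibration = [2.0, 2.4, 2.9, 3.5, 4.2]
a, b = fit_exponential(weeks, vibration)
# Hypothetical levels: defect detectable at 2.0 mm/s (P), functional failure at 11.0 mm/s (F)
pfi_weeks = exponential_pf_interval(b, p_level=2.0, f_level=11.0)
print(f"growth rate b = {b:.3f} per week, P-F interval = {pfi_weeks:.1f} weeks")
```

Because the rate of deterioration accelerates, most of the rise from P to F happens late in the interval, which is why the readings need to be trended rather than compared against a single alarm level.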

With the P-F curve and P-F Interval (PFI) established, the frequency can be determined.

Select the Right Frequency for Inspection

Once the P-F Interval (PFI) is established, the inspection frequency can be determined. Thankfully it is not as complicated as establishing Fixed Time Maintenance frequencies. To determine the inspection frequency, the formula is either PFI/3 or PFI/5.

Standard Inspection – the frequency of inspection for most equipment should be approximately 1/3 of the P-F interval (Formula = PFI/3). For example, a failure mode with a P-F interval of 3000 miles should be inspected every 1000 miles.
Critical Equipment Inspection – the frequency of inspection for critical equipment should be approximately 1/5 of the P-F Interval (Formula = PFI/5). For example, a failure mode on a critical piece of equipment with a P-F interval of 3000 miles should be inspected every 600 miles.
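
Expressed as a small Python helper (a sketch; the function name and the `critical` flag are illustrative), the rule of thumb above is:

```python
def inspection_interval(pf_interval: float, critical: bool = False) -> float:
    """Rule of thumb from the text: PFI/3 for most equipment, PFI/5 for critical equipment."""
    return pf_interval / (5 if critical else 3)


print(inspection_interval(3000))                 # 1000.0 (miles)
print(inspection_interval(3000, critical=True))  # 600.0 (miles)
```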

Now the above works well for linear P-F curves, so how do you establish the frequency for the non-linear curves? You use the same approach as above for the initial inspection frequency.

However, once a potential failure is detected, additional readings should be taken at progressively shorter intervals until the point is reached at which a repair action must be taken. For example, the initial inspection frequency is every four weeks. Once a defect is detected, the next inspection will be at three weeks, then two weeks, and then every week.
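
One way to generate the shortening intervals from the example is sketched below. The helper is hypothetical and assumes the interval shrinks by a fixed step (one week) after each reading, never dropping below a minimum interval, until the repair is carried out.

```python
def post_detection_intervals(initial_weeks: int, readings: int = 5,
                             step_weeks: int = 1, min_weeks: int = 1) -> list[int]:
    """Hypothetical helper: the interval (in weeks) before each of the next `readings`
    inspections once a potential failure has been detected.

    Each interval is `step_weeks` shorter than the previous one, but never shorter
    than `min_weeks`; inspection continues at that minimum until the repair is done.
    """
    intervals = []
    current = initial_weeks
    for _ in range(readings):
        current = max(min_weeks, current - step_weeks)
        intervals.append(current)
    return intervals


# Article example: routine inspection every four weeks, then 3, 2 and 1 week(s)
print(post_detection_intervals(initial_weeks=4))  # [3, 2, 1, 1, 1]
```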

These are only guidelines and should be adjusted based on the method used to track and trend data, the lead time of the repair parts (if not kept on site), and how quickly the data will be analyzed and the repair work planned. If your planning process is poor, inspections should be more frequent, to increase the chance of detecting the defect sooner.

How much thought was put into your Condition Based Maintenance inspection frequencies? Have you broken down each failure mode, trended the data, and established the frequency using a systematic approach? As with the Fixed Time Maintenance activities, you may be over- or under-inspecting, costing your organization reliability or money.


by James Kovacevic

Establishing Fixed Time Maintenance Intervals

How to Select The Optimum Fixed Time Maintenance Intervals

Think about your maintenance program. How often are your PMs scheduled? How were those frequencies established? If you are in the majority, the chances are that the frequencies were either established from the OEM manual, or by someone in the department without data.

Establishing the correct frequency of maintenance activities is critical to the success of any maintenance program. Too infrequently, and the organization is subjected to failures, resulting in poor operational performance. Too frequently, and the organization is subjected to excess planned downtime and an increased probability of maintenance-induced failures. So how do you establish the correct maintenance frequencies for your organization? There are three different approaches to use, based on the type of maintenance being performed;

Time-Based Maintenance
On-Condition Maintenance
Failure Finding Maintenance

This article will focus on Time Based Maintenance Tasks.

Time-Based Maintenance Tasks

“The frequency of a scheduled task is governed by the age at which the item or component shows a rapid increase in the conditional probability of failure” (RCM2). When establishing frequencies for Time Based Maintenance, the life of the component must be identified based on data.

With time-based failures, both a safe life and a useful life exist. The safe life is the age before which no failures occur. Unless the failure consequence is environmental or safety related, the safe life would not normally be used. The useful life (economic life limit) is the age at which the cost of the consequences of a failure starts to exceed the cost of the time-based maintenance activity. At this point, there is a trade-off between the potential lost production and the cost of planned downtime, labour, and materials.

So how is the safe life or useful life established? It is established using failure data and history. This history can be reviewed using a Weibull Analysis, Mean Cumulative Failure Analysis or even a Crow-AMSAA Analysis to statistically determine the life of the component. Once that life is determined using a statistical analysis, the optimum cost effective frequency must be established.
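
As an illustration of the Weibull step, the sketch below fits a two-parameter Weibull distribution by median rank regression, one common manual method, using Bernard's approximation for the median ranks. It assumes complete failure data (no suspensions) and no location parameter; the failure ages are hypothetical. Dedicated reliability software also handles suspensions and confidence bounds, which this sketch does not.

```python
import math


def weibull_mrr(failure_times):
    """Fit a two-parameter Weibull by median rank regression.

    Assumes complete data (every item failed, no suspensions).
    Returns (beta, eta): the shape and scale parameters.
    """
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        median_rank = (i - 0.3) / (n + 0.4)  # Bernard's approximation
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1 - median_rank)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                        # slope of the Weibull plot is the shape parameter
    eta = math.exp(x_mean - y_mean / beta)  # from ln(-ln(1 - F)) = beta * (ln t - ln eta)
    return beta, eta


def life_at_reliability(beta, eta, reliability):
    """Age by which a fraction (1 - reliability) of items is expected to have failed."""
    return eta * (-math.log(reliability)) ** (1 / beta)


# Hypothetical ages at failure (operating days) for one failure mode
ages = [95, 130, 152, 170, 198, 225, 260]
beta, eta = weibull_mrr(ages)
b10 = life_at_reliability(beta, eta, 0.90)  # B10 life: 10% expected to fail by this age
print(f"beta = {beta:.2f}, eta = {eta:.0f} days, B10 = {b10:.0f} days")
```

A shape parameter well above 1 indicates an age-related (wear-out) failure mode, for which a fixed time task can make sense.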

Establishing the Optimum Economic Frequency

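This formula is used to establish the economic life of the component, balancing the cost of the downtime vs. the cost of the replacement.

$$C_T(T) = \frac{C_P\,R(T) + C_f\,\left[1 - R(T)\right]}{\int_0^T R(t)\,dt}$$

Where;

C_T = The total cost per unit of time
C_f = The cost of a failure
C_P = The cost of the PM
T = The time between PM activities
R(t) = The reliability (probability of survival) of the item at time t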

The formula will provide the total cost based on the maintenance frequency. Since the calculation can be time-consuming, Dodson developed a table which can be used if;

The time to fail follows a Weibull Distribution
PM is performed on an item at time T, at the cost of C_P
If the item fails before time = T, a failure cost of C_f is incurred
Each time a PM is performed, the item is returned to its initial state “as good as new”

Therefore, when using the table, use the formula T = mθ + δ, where;

m is a function of the ratio of the failure cost to the PM cost and of the Weibull shape parameter β
θ is the scale parameter of the Weibull distribution
δ is the location parameter of the Weibull distribution

In the example below, you can see how the table can be used with the formula;

The cost of a PM activity is $60. The cost of a failure for the same item is $1800. Given Weibull parameters of β = 3.0, θ = 120 days and δ = 3 days, how often should the PM be performed?

C_f / C_P = 1800 / 60 = 30

The table value of m for a cost ratio of 30 and a shape parameter β of 3.0 is 0.258. Therefore;

T = mθ + δ
T = (0.258)(120) + 3 = 33.96
T ≈ 34 days between PMs
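
The table value can be checked numerically. The sketch below assumes the table is built from the age-replacement cost model, cost per unit time = [C_P·R(T) + C_f·(1 − R(T))] / ∫₀ᵀ R(t) dt, with a two-parameter Weibull (δ = 0) in normalized units (C_P = 1, θ = 1). It searches for the normalized interval m that minimizes the cost per unit time, and the article's T = mθ + δ is then applied. Simpson's rule for the integral and golden-section search for the minimum are implementation choices for this sketch, not part of the source.

```python
import math


def normalized_cost_rate(x: float, beta: float, cost_ratio: float, steps: int = 1000) -> float:
    """Cost per unit time of a PM at normalized age x, with C_P = 1, theta = 1 and delta = 0.

    R(u) = exp(-u**beta) is the Weibull reliability; cost_ratio = C_f / C_P.
    numerator:   C_P * R(x) + C_f * (1 - R(x))
    denominator: expected cycle length, the integral of R(u) from 0 to x
    """
    def reliability(u: float) -> float:
        return math.exp(-(u ** beta))

    r_x = reliability(x)
    h = x / steps  # Simpson's rule needs an even number of steps
    total = reliability(0.0) + r_x
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * reliability(i * h)
    expected_cycle_length = total * h / 3
    return (r_x + cost_ratio * (1 - r_x)) / expected_cycle_length


def optimal_m(beta: float, cost_ratio: float, lo: float = 0.01, hi: float = 3.0) -> float:
    """Golden-section search for the m = (T - delta) / theta that minimizes the cost rate.

    Assumes a single minimum inside [lo, hi], which holds for beta > 1 when a
    finite optimum exists; otherwise the search simply drifts to a bound.
    """
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(60):
        c = b - g * (b - a)
        d = a + g * (b - a)
        if normalized_cost_rate(c, beta, cost_ratio) < normalized_cost_rate(d, beta, cost_ratio):
            b = d
        else:
            a = c
    return (a + b) / 2


m = optimal_m(beta=3.0, cost_ratio=1800 / 60)  # close to the table value of 0.258
T = m * 120 + 3                                 # T = m * theta + delta, in days
print(f"m = {m:.4f}, T = {T:.1f} days")
```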

As you can see, determining the frequency of Fixed Time Maintenance tasks is not as simple as picking a number out of a manual or going by intuition. Armed with this information, a cost-effective, data-based PM frequency can be developed for your Fixed Time Maintenance tasks. This will ensure the right maintenance is done at the right time, driving your plant performance further.

Do your Fixed Time Maintenance tasks have this level of rigor behind them? If not, why not? After all, your plant performance (operational and financial) depends on it. Stay tuned for next week’s post on establishing frequencies for On-Condition tasks.


by James Kovacevic

Living With The 6 Failure Patterns

How To Manage Each Failure Pattern With An Effective Maintenance Strategy

Most maintenance and reliability professionals have seen the six failure patterns (or failure hazard plots) described by Nowlan and Heap. If you are unfamiliar with them, you can learn more in a previous article. Here is a quick summary to jog the memory, just in case.

A. Bathtub Curve – accounts for approximately 4% of failures
B. Wear Out – accounts for approximately 2% of failures
C. Fatigue – accounts for approximately 5% of failures
D. Initial Break-In – accounts for approximately 7% of failures
E. Random – accounts for approximately 14% of failures
F. Infant Mortality – accounts for approximately 68% of failures

From the above, you can see that the majority of failures experienced are not directly related to age, but are the result of random or induced failures. So how does this help when establishing a maintenance program? First, we must understand what the patterns tell us.

What Types of Failure Modes Do The Failure Patterns Relate to?

Looking at the different failure patterns, we can group the types of failures into three unique groups;

Age-Related failures – The term “life” is used to describe the point at which there is a rapid increase in the likelihood of failure. This is the point on the failure pattern before it curves up. Typically, these types of failures can be attributed to wear, erosion, or corrosion and involve simple components that are in contact with the product.
Random failures – The term “life” cannot be used to describe the point of rapid increase in the likelihood of failure, as there is no specific point. These are the flat parts of the failure curve. These types of failures occur due to some introduced defect.
Infant Mortality – The term “life” cannot be used here either. Instead, there is a distinct point at which the likelihood of failure drops dramatically and transitions to a random level.

Understanding these unique differences, an effective maintenance strategy can be developed.

What Maintenance Needs to Be Done for Each Failure Pattern?

The maintenance activity selected has to be right for the specific failure pattern. When looking at the failure patterns, there are three unique types of activities that can be put in place to address all points in the failure curve.

Age-Related – These types of failures can be addressed through fixed time maintenance. Fixed time maintenance includes replacements, overhauls, and basic cleaning and lubrication. While cleaning and lubrication will not prevent wear out or corrosion, they can extend the “life” of the equipment.
Random – These types of failures need to be detected, as they are not predictable, or based on a defined “life.” The equipment must be monitored for specific indicators. These indicators may be changes in vibration, temperature, flow rates, etc. These types of failures must be monitored using Predictive or Condition monitoring equipment. Cleaning and basic lubrication can prevent the defects from occurring in the first place if done properly.
Infant Mortality – These types of failures cannot necessarily be addressed through fixed time, predictive or condition-based maintenance programs. Instead, the failures must be prevented through proper design & installation, repeatable work procedures, proper specifications and quality assurance of parts.

Only when a maintenance program encompasses all of the above activities can plant performance improve.

Determining the Right Frequency of Maintenance Activities for Each Failure Pattern

So, with all of these activities taking place, how do you know when each fixed time activity or condition monitoring inspection should take place? The approaches to determining the frequency of fixed time activities and condition monitoring inspections are different. However, before the approaches are discussed, it should be noted that MTBF should NOT be used to determine the frequency… EVER (sorry, the rant is over).

Fixed Time Maintenance – The frequency for fixed time maintenance activities should be determined using a Weibull analysis, which will provide an ideal frequency to perform these types of activities. There may also be regulatory requirements which specify the frequency of these activities.
Condition Monitoring – The frequency for condition monitoring activities should be determined by using the P-F Curve and P-F Interval. This approach requires an understanding of the capability of the monitoring technology, the defect being monitored, degradation rates, and the ability of the organization to react to the information gathered during the monitoring program. This will be discussed further in next week’s post.

I hope this has provided some clarity around how you should be using the six failure patterns in your maintenance strategy. Do you have specific activities in your program to address age-related, random and infant mortality failures? If you only have fixed time maintenance activities in your program, what are you leaving on the table?


by James Kovacevic

The Importance of a Learning Culture

Ensuring Performance and Long Term Sustainability of Your Maintenance & Reliability Program

Imagine working in an organization that does not provide training or has zero tolerance for taking a risk, trying something new and failing. Or one where you are expected to have all of the answers and never need any assistance. Sound familiar? If it does, how is the performance of your plant? Chances are it is not as good as it could be. This example illustrates well what a learning culture does not look like.

“A learning culture is a set of organizational values, conventions, processes, and practices that encourage individuals—and the organization as a whole—to increase knowledge, competence, and performance.” A learning culture is vital to the long-term sustainability of any maintenance & reliability program and improving plant performance.

If you don’t have an organization that believes in training, or risk taking or learning from failure, what do you do? You can take steps to build a learning culture. The first step is to recognize the concern. The concern could be around cost, past returns on training, or experience that says the employee will leave after receiving the training. Whichever it is, it must be addressed.

Also, any organization can start to develop a learning culture by doing the following;

Formalize training and development plans for each individual. These plans should include all mandatory training as well as specific training that will allow each person to grow in their current and future positions.

Give recognition to learning by promoting and celebrating those that learn new skills and gain new knowledge. As recognition is given to those with new skills, others will want to participate.

Get feedback on the type, quality, and applicability of the training. This will ensure that relevant and effective training is being provided.

Promote from within. This creates a willingness and desire to learn as the staff knows they have an opportunity to grow within the organization.

Develop a knowledge management process. It should be a formal process with participation required by all.

I recently had the opportunity to work with two great organizations. Both organizations had recognized the need for assistance. They were looking to make improvements in areas in which they had no experience, but they had a willingness to learn. They did not want a “turn key” solution, but instead wanted to build the capability of their internal team and let that team develop and implement the solution.

There was and will be some follow-up support, but here are two organizations that are not only investing in their people with training but allowing them to take the risk, learn and grow. Talk about ownership; these were some of the most passionate people that I have had the pleasure to work with. It is always a pleasure to work with organizations such as this, and I am truly enjoying watching the team come together and grow.

People are the heart of any improvement, so make sure you invest in them and create a learning culture. In closing, I ask you to think about the following: “What if we train the staff and they leave?” The better question is, “What if we don’t train them and they stay?”


by James Kovacevic

Top 10 Reasons Your Planning & Scheduling Program Is Failing

How to see if your Planning & Scheduling program is failing to return value to the organization

Maintenance Planning & Scheduling is one of the most important processes in the maintenance function. Without it, work will not be completed on time, nor will it be efficient. So why is the maintenance planning & scheduling process often ignored, or not implemented successfully?

by James Kovacevic

The Moving Target of Excellence

Why the End of the Maintenance & Reliability Journey Is Never Over


I am often asked what the benchmark is for a particular KPI. At first, I would quickly answer with the target from the SMRP Best Practices Guide. Depending on the organization and its maturity, I would either see their faces light up or see them shut down. If they shut down, whatever momentum was present quickly vanished. If they were already meeting the target (and the KPI and supporting data checked out), the momentum would fade a bit, as they were hitting the target.


by James Kovacevic

Operating at Peak Inherent Capability

Why You Cannot Operate Above the Inherent Capability Sustainably

I recently had the opportunity to teach a Body of Knowledge course, which was full of great questions from the students. One of the questions was about inherent vs. actual availability. This had me thinking about the choice that organizations make on how they choose to run their business and more importantly, their resources.

There are many times when a resource is operated at Peak Inherent capability, with the intention of getting the most out of the resource. While this is a good practice, many organizations try to operate the resource at greater than its inherent capability. So, what does this do to the resource? In the short term, it could mean financial gains or achieving the schedule, but if done for a sustained period of time, it could be detrimental to the resource.


by James Kovacevic

Understanding the SMRP Body of Knowledge

If you have been in maintenance or reliability for a period of time, there is little doubt that you have heard about the SMRP Body of Knowledge. The SMRP Body of Knowledge is more than just a document that outlines topics related to maintenance & reliability. It is the framework on which the CMRP exam is based, and it can be used as a framework to improve your facility’s performance.


by James Kovacevic

Taking Reliability Block Diagrams to the Next Level

Using RBDs to model different systems and circumstances

In the previous post, the basics of a Reliability Block Diagram were covered using simple Series or Parallel paths. In real life, most systems or processes are not that simple and require a different level or type of models, often used in combination with other types.

So, in our continued exploration of RBDs, let’s explore a few different models that may be used.
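
As a refresher on the simple series and parallel paths mentioned above, the two basic rules can be written as a short Python sketch. It assumes block failures are independent and that parallel blocks are all active; the block reliabilities in the example are hypothetical.

```python
from math import prod


def series_reliability(reliabilities):
    """Series path: every block must work, so R = R1 * R2 * ... * Rn."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel path: at least one block must work, so R = 1 - (1 - R1)(1 - R2)...(1 - Rn)."""
    return 1 - prod(1 - r for r in reliabilities)


# Hypothetical system: two redundant pumps (each R = 0.90) in series with one motor (R = 0.95)
pumps = parallel_reliability([0.90, 0.90])  # 0.99
system = series_reliability([pumps, 0.95])
print(f"{system:.4f}")  # 0.9405
```

More complex models, such as k-out-of-n or standby arrangements, are built by combining and extending these two rules.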

by James Kovacevic

Understanding Reliability Block Diagrams

How To Evaluate The Reliability Of A System Or Process

60% of failures and safety issues can be prevented by ensuring there is a robust equipment design and that Maintenance & Reliability is taken into account during the design phase. Equipment should be designed with the following in mind:

Designed for Fault Tolerance
Designed to Fail Safely
Designed with early warning of the failure to the user
Designed with a built-in diagnostic system to identify fault location
Designed to eliminate all or critical failure modes cost effectively, if possible.


by James Kovacevic

The Role of Software In Reliability Engineering

Deciding if Software is Right For Your Program

Let’s face it, the field of reliability engineering is diverse and full of statistics, models and detailed analysis. The detailed calculations, the building of models and the analysis have been performed with great success, both in the past and today. The models built through manual calculation have been successful and have demonstrated the importance of reliability engineering.


by James Kovacevic

Focus on the Important Issues, Not the Many Issues

Utilizing the Pareto Method to Prioritize Improvement Activities

Every maintenance department has limited time, money and resources. Sometimes you have 2 of the 3, sometimes just 1. So how do you prioritize the items or issues that will have the biggest impact on your facility? There is a simple, yet vital, principle that can be used in your facility to determine which issues to focus on. This principle started in a garden in Italy with the study of peas… and that observation of peas can have an important impact on your operation.
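
A minimal sketch of a Pareto analysis in Python: rank the issues by impact and keep the smallest set that accounts for a chosen share of the total. The issue names, downtime hours and the 80% cutoff are hypothetical illustrations.

```python
def pareto_vital_few(impacts: dict[str, float], cutoff: float = 0.8) -> list[str]:
    """Return the issues, largest first, whose cumulative share of the total
    impact first reaches `cutoff`."""
    total = sum(impacts.values())
    selected, cumulative = [], 0.0
    for issue, impact in sorted(impacts.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(issue)
        cumulative += impact
        if cumulative / total >= cutoff:
            break
    return selected


# Hypothetical downtime (hours) by issue over one quarter
downtime_hours = {"conveyor jams": 120, "pump seal leaks": 75, "motor trips": 30,
                  "sensor faults": 15, "guard damage": 10}
print(pareto_vital_few(downtime_hours))  # ['conveyor jams', 'pump seal leaks', 'motor trips']
```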


by James Kovacevic

How to Deliver Sustainable Gains in Maintenance Planning

Utilizing the PDCA Methodology in Work Planning

It is well known that maintenance planning & scheduling can deliver significant improvements in the efficiency and effectiveness of the maintenance department. Maintenance planning & scheduling seems simple enough: plan the work and schedule it to be done at the most opportune time. So why do organizations seem to struggle with realizing the benefits of maintenance planning & scheduling? In my experience, I have seen organizations that focus on the scheduling portion of work management while not fully planning the work. Doc Palmer (an authority on Maintenance Planning & Scheduling) has said that you cannot schedule without proper planning. So how is it that they are scheduling work without knowing what needs to be done and what materials are required?
