Accendo Reliability

Your Reliability Engineering Professional Development Site


by Kevin Stewart

Don’t Stop Your RCA Investigation Too Soon

The problem

Recently a power outage caused approximately 2,000 homes to lose power on a very cold day. The newspaper headline read, “All-day outage caused by worn wiring”.

This seems like a reasonable statement and, like many other newspaper headlines, it seems to go a long way toward explaining what caused 2,000 homes and businesses to lose power for 5 ½ hours, and 300 to lose power for a total of 11 ½ hours.

I suspect that many of us take these types of headlines at face value and chalk them up to “it is just the newspaper” or normal journalism.

I try hard to question statements like these. When I did, I thought that most of us have probably had something fail that was traced to worn wires. We all know that a worn wire will cause problems; the headline implies as much. But I think a more important question is: what caused the worn wires?

The problem with stopping too soon is that, if you aren’t careful, you may convince yourself of all sorts of things without questioning whether there is more to the story.

Was there perhaps more to uncover, or at least to think about? Can worn wiring by itself be the cause of this incident? As you probably guessed, I think they stopped too soon.

A simple solution?

Let me ask those who are reading this: “Have you ever seen worn wiring that did not cause an actual incident?” I have, and I’ll bet others have too.

A picture is worth a thousand words, so let’s put one up to discuss.

Figure 1.

Figure 1 presents a graphic of a cause and effect analysis ending with worn wiring. I couldn’t bring myself to put down just “outage caused by worn wiring”; there were too many questions.

Thinking it through, wouldn’t the outage be caused by a lack of power, which would be caused by some type of short, which in turn might have been caused by worn wiring and wires shorting? Lots of questions come up once we start putting our thoughts down on paper.
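To make that chain concrete, here is a minimal sketch that records each effect with the cause behind it and walks backward until nothing deeper has been written down. The dictionary layout and the `trace_causes` name are illustrative assumptions; only the four links come from the questions above.

```python
# Each effect maps to the cause recorded for it. The entries follow the chain
# in the text; the dict layout itself is an illustrative assumption.
CAUSED_BY = {
    "outage": "lack of power",
    "lack of power": "short",
    "short": "worn wiring",
}


def trace_causes(effect, caused_by):
    """Follow the chain from an effect back to the last recorded cause."""
    chain = [effect]
    while chain[-1] in caused_by:
        chain.append(caused_by[chain[-1]])
    return chain


chain = trace_causes("outage", CAUSED_BY)
print(" <- ".join(chain))
# The walk ends only because nothing deeper was recorded, not because
# nothing deeper exists.
print("Still unanswered: what caused the " + chain[-1] + "?")
```

The last entry in the chain is simply where the questions stopped being asked, and that is the point worth challenging.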

Many times, I believe, years of acceptance have conditioned us not to even question these things. Setting aside the intermediate steps that may have been missed, if we stop at worn wiring the obvious solution is to replace the worn wiring.

I’m sure all of your customers, whether they be consumers or internal operations, are glaring at you, or calling on the phone, and asking you to fix this and get them up and running and back to normal.  So the wiring is replaced and everything goes back to normal.

What might the diagram look like if someone says, “Hey, wait a minute, what caused the worn wiring?” I took the liberty of guessing at what the rest of the diagram might look like, just to make a point in this discussion, and have included it below.

Figure 2.

Figure 2 identifies what might be the potential causes if we didn’t stop at worn wiring.  Some interesting things potentially pop up past the worn wiring stopping point.

In any investigation, it is up to the group to determine when to stop, and unfortunately, I see too many that stop too soon. I want to reiterate that I have taken liberties with the possible causes, just to make a point in this article, since I wasn’t involved in the investigation.

It is important to note that if you stop at worn wiring you will not see these causes, and not seeing what is shown in Figure 2 might cause you to miss some effective solutions.

I want to be clear that I am not arguing against replacing the worn wiring; it is necessary, it is a perfectly viable solution, and it needs to be done. The first order of business is to get things back to normal. The question is: will this keep the problem from recurring?

I think the answer to that depends on some definitions.

What is a solution?

The whole purpose of doing an RCA is to find the underlying causes of an incident so that we can propose solutions to prevent it from recurring. To continue, some discussion of solutions is appropriate, so I ask the question: “What makes a good solution to an incident?”

“If you can’t measure it you can’t manage it” is a quote that floats around and is sometimes incorrectly attributed to Dr. Deming.

This is probably not true in all cases, but I believe that if you can put a measure on it, you can manage or improve it. Let’s define some criteria for what makes a good solution, so we can manage it.

Four criteria that can be used to measure the effectiveness of a solution are:

  1. Does this solution prevent recurrence?
  2. Is this solution within your control?
  3. Does this solution meet your goals and objectives?
  4. Does this solution cause other unacceptable problems that you are aware of?

Most likely, replacing the worn wiring, the solution offered above, meets all of the criteria given, with the possible exception of the first one.
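The four criteria can be treated as a simple checklist. The sketch below is one possible way to record the answers for a proposed solution; the `SolutionCheck` class and its field names are hypothetical, and the answers entered for the wiring replacement are an assumed reading of the discussion above, not findings from the actual investigation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SolutionCheck:
    """Answers to the four criteria for one proposed solution (hypothetical structure)."""
    name: str
    prevents_recurrence: bool
    within_control: bool
    meets_goals: bool
    causes_unacceptable_problems: bool

    def failed_criteria(self) -> List[str]:
        """List the criteria this solution does not satisfy."""
        failed = []
        if not self.prevents_recurrence:
            failed.append("prevents recurrence")
        if not self.within_control:
            failed.append("within your control")
        if not self.meets_goals:
            failed.append("meets goals and objectives")
        if self.causes_unacceptable_problems:
            failed.append("causes no unacceptable problems")
        return failed

    def is_effective(self) -> bool:
        """A solution passes only when no criterion fails."""
        return not self.failed_criteria()


# Assumed answers for illustration only.
replace_wiring = SolutionCheck(
    name="Replace the worn wiring",
    prevents_recurrence=False,
    within_control=True,
    meets_goals=True,
    causes_unacceptable_problems=False,
)

print(replace_wiring.name, "fails:", replace_wiring.failed_criteria())
```

Writing the answers down this way makes it harder to skip the first criterion just because the other three are easy to satisfy.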

If the underlying causes of the worn wiring are not fixed the problem will probably recur.  In this particular case, the original wiring was installed in 1989 so it had worked for 28 years.  Not a bad run.  What do we do if we say replacement will not prevent recurrence?

Doesn’t the answer depend on the time component that you choose to use? The reliability equation specifies a mission time, so why shouldn’t we apply a time component to solutions?
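As a rough illustration of why the time component matters, the sketch below evaluates reliability at several mission times. It assumes a two-parameter Weibull model, which is my choice for the example and not something stated about this incident, and the parameter values are invented purely for illustration rather than fitted to any data about the wiring.

```python
import math


def weibull_reliability(t_years, eta_years, beta):
    """Probability of surviving to t_years under a two-parameter Weibull model."""
    return math.exp(-((t_years / eta_years) ** beta))


# Hypothetical parameters chosen only for illustration, not fitted to any data.
ETA_YEARS = 30.0  # characteristic life
BETA = 3.0        # shape greater than 1 models wear-out

for mission in (5, 15, 28):
    r = weibull_reliability(mission, ETA_YEARS, BETA)
    print(f"mission time {mission:>2} years: R = {r:.3f}")
```

Under assumptions like these, the same replacement can look like it prevents recurrence over a short mission time and clearly not over a long one, which is why the time horizon needs to be stated when judging a solution.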

How do we fix the problem?

I believe we must consider a phased approach. There are most likely three levels of solutions:

  1. Immediate
  2. Common
  3. Systemic

Let’s use the following descriptions for each of the solution types:

Immediate – solutions that only prevent an identical incident from occurring again on the same equipment, the same type of equipment, or other equipment in the same part of the process.

Common – solutions that eliminate the problem and reduce the likelihood of similar incidents in the future on other equipment throughout the process or facility.

Systemic – solutions that improve management systems, processes, and work culture, and that reduce the likelihood of many other incidents occurring throughout the facility or company.

Specifying the above allows us to say that there may be three types of solutions necessary, depending on the issue.

We may choose to implement one or all three types if we have carried the investigation far enough. In our wiring example, we need an immediate solution to get the system back up and running NOW, per our customers. But if we follow the diagram, we see that there may not be an inspection process, a gap that would apply to many types of equipment within the organization. This might lead to a solution of implementing an inspection process, but that won’t immediately fix the worn wires in our case, which is why we need to replace them.

The diagram also leads us to the possibility that failure modes aren’t being analyzed or looked for, which would have implications at the highest level of the organization. However, implementing this solution won’t fix the immediate issue of the worn wiring either. I think it is tough to argue that establishing a failure mode system is not a valuable long-term solution.
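One way to keep all three levels in view is to tag each proposed solution with its level and check which levels have no solution yet. The sketch below is only an illustration: the `Level` and `missing_levels` names are hypothetical, and assigning the inspection process to the common level and the failure mode system to the systemic level is my assumed reading of the wiring example.

```python
from enum import Enum


class Level(Enum):
    IMMEDIATE = "immediate"
    COMMON = "common"
    SYSTEMIC = "systemic"


# (solution, level) pairs for the wiring example. The level assigned to each
# solution is an assumed reading of the discussion, not an official mapping.
PLAN = [
    ("Replace the worn wiring", Level.IMMEDIATE),
    ("Implement an inspection process", Level.COMMON),
    ("Establish a failure mode system", Level.SYSTEMIC),
]


def missing_levels(plan):
    """Return the solution levels the plan does not cover, in enum order."""
    covered = {level for _, level in plan}
    return [level for level in Level if level not in covered]


# The full plan covers every level.
print([level.value for level in missing_levels(PLAN)])
# Stopping at worn wiring leaves only the immediate level covered.
print([level.value for level in missing_levels(PLAN[:1])])
```

A plan with uncovered levels is not necessarily wrong, but it should be a deliberate choice rather than a side effect of stopping the investigation early.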

Lessons learned

If you stop your RCA investigation too soon you may not identify those causes that will lead to solutions that extend beyond the immediate situation.

While implementing the immediate solution may get you back up and running, looking for common and systemic causes is where the significant opportunity lies to make long-term cultural changes in your company.

Looking at the overall issue, we may want multiple solutions: one to fix the immediate issue, and others to reduce the number of common and systemic issues that exist in our organizations.

The big dollars are saved by not stopping too soon and by finding common and systemic causes, since they are causing multiple problems within your organization.

Filed Under: Articles, on Tools & Techniques, Reliability Reflections Tagged With: Root Cause Analysis (RCA)

About Kevin Stewart

I am an experienced educator and maintenance/reliability professional with 38 years of practical work experience in a variety of roles for ALCOA Primary Metals Group and ARMS Reliability.

© 2025 FMS Reliability