Root cause analysis is one of the best ways to solve difficult or significant problems, but sometimes, root cause analysis efforts fail because the corrective actions weren't effective. If the original problem happens again, or the needed improvements haven't materialized, or a new problem arises because of the corrective actions, you need to figure out what happened and why so you can fix whatever went wrong. Here is a list of the issues you should be considering -- the top 5 reasons for failed root cause analysis.
My previous post about root causes in complex systems looks, in retrospect, a little bit like a rant. That doesn't bother me too much, really... but I wish I had included the following info: one way to go about resolving the mess that complex systems can make of your root cause analysis.
I'm getting so very tired of safety/accident researchers claiming that root cause analysis is an invalid, blame-focused practice that ignores systems and complexity. Most root cause investigators that I know are pretty well oriented towards process, organization, and system issues as the fundamental sources underlying problems and accidents... and even some of our simplest analysis tools (e.g., TWIN) include specific checks for complex-system characteristics/behaviours (e.g., hidden system responses, separation between cause and effect).
There are several discussion groups on LinkedIn dedicated to Root Cause Analysis in one way or another. I follow a couple of them, but the one I like the most has a serious problem. So, being the dynamic (ha) and proactive (haha) person that I am, I created a new one.
What would the Internet be without Wikipedia? I remember when it was still a brand new thing back in 2001... and what an awesome thing it was, and still is: the Open Source philosophy applied to human knowledge itself. Can you imagine not being able to look something up on Wikipedia, if only to get a quick précis of some topic that just caught your interest for a moment?
The quality of a Root Cause Analysis (RCA) and its Corrective Action Plan (CAP) should be evaluated many times over their lifecycle (i.e., from the initial problem or event through to the final verified and sustainable improved state). Reviews occurring earlier in the lifecycle can really consider only the apparent quality of the investigation/analysis effort itself; these early reviews are what I will discuss in this article.
I've always wanted to create a knowledgebase for Root Cause Analysis... something more than a blog or collection of articles. Something like Wikipedia, but without the constant threat of wiki spam, vandals, and random people who just come along and "improve" stuff. Don't get me wrong, I love Wikipedia. However, the topic of Root Cause Analysis there has for years been a battleground of competing interests and people pushing agendas. That makes it very difficult to maintain the desired, even beloved, Neutral Point of View (NPOV) and still have meaningful topic entries with quality content. Instead, what we've gotten after many years of this process is just the scraps that most people can agree with, or don't care about enough to change. So, what I offer instead of that is the following:
I used to read a lot more than I do now... sometimes for pure enjoyment, but also for research about my favourite topics (like Root Cause Analysis, of course). Here are some of the books and reports on my root cause bookshelf right now, things I plan to read during the 4th quarter of 2014. I guess I'm in the mood for accident theory and models, and for big books!
Equipment degrades, malfunctions, or fails outright... hopefully not too frequently, but every breakdown can be painful. So, of course, you will want to figure out exactly what happened, how it happened, and why (in hardware/software terms). You may want to dig a little deeper, though; every equipment malfunction or failure that you have is a valuable datapoint for gauging the health of your overall equipment program. That's why I've started creating a tool that should be able to help find programmatic causes underneath the more easily observable equipment performance issues. Since it is intended to be a Diagnostic for Equipment Programs, and uses Tabulated Heuristics, I call it DEPTH.