Root Cause Analysis – Large or Small Events?
Root Cause Analysis (RCA) can be applied to events of any size or significance. However, it's usually applied to large events, i.e. those with serious consequences. Even so, it can and should be applied to smaller events as well. Statistically, smaller events are more likely to occur than larger events. Thus, application of RCA to small events may identify many significant opportunities for improvement.
Given that smaller events are more likely to occur, should we focus our RCA efforts solely on smaller events? This would have the advantage of ensuring that we have a statistically significant sample from which to draw learning opportunities. Why, then, do we expend so much effort applying RCA to large events if we can get the same (or better) benefits by focusing on small events? This idea could be expressed as follows:
Little events happen all the time. We should analyze each little event. After we have enough observations, we will have a statistically significant sample. This should be the basis for our learning.
Instead, we analyze the big events because they catch our attention. Big events come around only once in a while. We spend a lot of time investigating them. However, we have only one sample point. Therefore, our results have little statistical significance.
By emphasizing investigation of the big events, we are potentially learning the wrong things because we may be placing too much emphasis on issues that have very little statistical significance.
Is this a valid idea? Should we emphasize RCA of small events, and perhaps do away with RCA of large events altogether? I'll try to answer that question in this article.
There is a common belief that large events and small events have the same causes. Therefore, it is assumed that by analyzing small events and applying lessons learned from them, we prevent large events as well. However, using this strategy, do we limit the severity of potential future events?
Suppose we analyze only small events. We'll have a lot of data on common event initiators and latent conditions. As we'll have a lot of data, we'll develop a very good understanding of the events and our corrective actions will be very good. We'll knock down the frequency of these events by a significant amount, perhaps even eliminate them completely.
Again, we have to ask the question, have we limited the severity of potential future events? If we assume that all events, large and small, have the same root causes, then the answer is yes. Is this true though? What makes a small event different from a large event?
Speaking very generally, it's the interaction of various latent conditions. Some of these latent conditions may be deeply embedded in the operations of our systems. They may be very subtle conditions that will not be activated very often. With a low probability of occurrence, we won't have much data on them and we may not have any protections against them.
They may be very simple conditions that, under ordinary circumstances, cause no problems for us. It's when circumstances change in unexpected ways that these kinds of conditions become a real danger. An event that might ordinarily terminate with very low consequences could, under less common circumstances, terminate with very serious consequences.
Consider a condition like grinder kickback. This can occur when using a grinder because the grinder "catches" on whatever's being worked on, and the rotational force of the spinning grinder wheel causes the entire tool to kick back toward the operator. Standard safety precautions while using such a tool include maintaining a proper stance and appropriate distance from the grinder. Kickback is a known condition, and under most conditions, is easily compensated for.
Now, throw in a twist. A worker decides that, in a standing or kneeling position, he can't get a good angle on whatever he's grinding. He decides that the best, fastest way to get the job done is to lie down on the floor, and hold the grinder above him to get at the bottom of the piece he's grinding. He has every intention of being very careful. However, he has just removed his ability to avoid a kickback if it occurs. The weight of the grinder is now working against him, as well.
The job starts out fine. Then the grinder catches on something. It kicks back. The worker can't avoid it. The mechanics of the event are such that the grinder moves laterally towards the worker's head. The worker receives an extremely serious laceration to his face.
This is a "large" event. You would never have expected it to happen. The circumstances of the event were unusual. The probability of the event happening again appears to be low. Should we subject this event to a detailed root cause analysis?
Of course we should! We should investigate and analyze the heck out of this event. However, we must not limit ourselves to the question, "Why did the worker use the grinder that way?" We must instead find out what it is about the way we do business that set up this situation and forced the worker to make this choice; that convinced the worker that he needed to do the job this way; and that kept him from taking more time to get a different tool or to rotate the piece he was working on.
I'm not making this up. It actually happened two years ago. The worker required extensive reconstructive surgery to one side of his face. It was pure luck that he didn't lose his nose or one of his eyes.
In conclusion, my belief is that we must investigate and analyze the sporadic, large events. So what if the probability of occurrence is low? Remember that risk is probability times consequences. If the potential consequences are high, we must do what we can to prevent those consequences from occurring -- even if it is a low probability event. Sometimes, a sample of one is more significant than a sample of thousands.
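To make the "probability times consequences" arithmetic concrete, here is a minimal sketch in Python. The probabilities and costs are invented purely for illustration and do not come from any real incident data; the function name `risk` is likewise just an assumption for this example.

```python
def risk(probability: float, consequence: float) -> float:
    """Risk as defined in the text: probability times consequences."""
    return probability * consequence

# Hypothetical figures, chosen only to illustrate the comparison.
# A frequent, minor event: likely in any given year, cheap when it happens.
small_event = risk(probability=0.5, consequence=1_000)

# A rare, severe event: very unlikely in any given year, very costly.
large_event = risk(probability=0.001, consequence=2_000_000)

print(f"small event risk: {small_event:.0f}")
print(f"large event risk: {large_event:.0f}")
print("large event carries more risk:", large_event > small_event)
```

With these assumed numbers, the rare event carries more risk than the common one even though it is hundreds of times less likely. Different numbers could reverse the ordering, which is why the consequences have to be weighed alongside the frequency.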
by Bill Wilson