In retrospect, my previous post about root causes in complex systems looks a little bit like a rant. That doesn't bother me too much, really... but I wish I had included the following information, which describes one way to go about resolving the mess that complex systems can make of your root cause analysis.
I'm getting so very tired of safety/accident researchers claiming that root cause analysis is an invalid, blame-focused practice that ignores systems and complexity. Most root cause investigators that I know are pretty well oriented towards process, organization, and system issues as the fundamental sources underlying problems and accidents... and even some of our simplest analysis tools (e.g., TWIN) include specific checks for complex-system characteristics/behaviours (e.g., hidden system responses, separation between cause and effect).
I used to read a lot more than I do now... sometimes for pure enjoyment, but also for research about my favourite topics (like Root Cause Analysis, of course). Here are some of the books and reports on my root cause bookshelf right now, things I plan to read during the 4th quarter of 2014. I guess I'm in the mood for accident theory and models, and for big books!
Problems come in all shapes and sizes. I've been involved in all kinds of investigations, from those dealing with something as mundane as a chronic lack of hot water in a shower facility, to something as critical as a software error that caused non-conservative miscalculations of reactor operating limits. I've even been involved in a fairly significant event myself, which my "friends" keep reminding me about even though such remembrances cause me great pain and embarrassment. Sometimes, though, an event comes along that really drives home the value of doing a thorough incident investigation and root cause analysis.
Is root cause analysis possible for complex systems? Some would say no, claiming that such systems are intractable -- that in complex systems, there is no such thing as causality, only pattern and correlation (see Pollard). Even the language used is different. Where others would say problem, cause, and solution, they say situation, pattern, and approach. These terminology differences aside, are the two viewpoints actually incompatible? Is root cause analysis possible for complex systems?
DrPat reads a lot, and frequently writes about books on Paper Frigate. One recent blog entry caught my eye, a review of Beyond Engineering by Robert Pool. DrPat briefly covers the major sections of the book, then concludes with the following:
In this increasingly technological age, the complexity of technology has grown to the point where no one person can know everything about even a very restricted discipline, at the same time that more and more of societal attention is focused on how these complex systems interact. Pool's book is a good first step on the road to the re-engineering of engineering itself, and an excellent argument that such a sweeping change is essential.
In 1931, H. W. Heinrich published his findings from a review of hundreds of thousands of safety incidents. His data showed that, on average, for every 300 near-miss events without injury, there would be 29 minor to moderate injuries and 1 major injury or fatality. Similar studies done since 1931 have yielded similar results. The data is deceptively, compellingly simple -- the meaning, however, is not. What is implied by a 300:29:1 ratio of near-misses to moderate injuries to major injuries? Why do we care? Is there some deeper, underlying pattern to this data?
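For readers who want to see the arithmetic behind the 300:29:1 figure, here is a minimal Python sketch. It only restates the ratio quoted above as shares of all recorded events and scales it to a given number of near-misses. The function names and the 600 near-miss example are my own illustration, and treating the ratio as a fixed proportion is an assumption for the sake of the arithmetic, not something Heinrich's data guarantees.

```python
from fractions import Fraction

# The ratio as quoted in the text:
# near-misses : minor-to-moderate injuries : major injuries or fatalities
HEINRICH_RATIO = {"near_miss": 300, "minor_injury": 29, "major_injury": 1}


def shares(ratio):
    """Return each category's exact share of all events in the ratio."""
    total = sum(ratio.values())  # 300 + 29 + 1 = 330
    return {name: Fraction(count, total) for name, count in ratio.items()}


def scaled_counts(ratio, observed_near_misses):
    """Scale the ratio to an observed near-miss count.

    Assumption (illustration only): the ratio holds as a fixed proportion.
    """
    scale = Fraction(observed_near_misses, ratio["near_miss"])
    return {name: float(count * scale) for name, count in ratio.items()}


if __name__ == "__main__":
    for name, share in shares(HEINRICH_RATIO).items():
        print(f"{name}: {float(share):.1%} of all events")
    # Hypothetical example: 600 recorded near-misses
    print(scaled_counts(HEINRICH_RATIO, 600))
```

Put this way, near-misses make up roughly 91% of the events in the ratio, and the single major injury about 0.3%. That says nothing yet about why the proportions look like this, which is the real question.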