The Cult of the Root Cause

[+] bartread|7 years ago|reply

It's also worth pointing out that you may not be able to fix the root cause. The sailing example is great here:

- Can you fix the crack in your hull whilst you're out at sea? Almost certainly not.

- Can you even tell you have a crack in the hull until you've reduced the water in the bilges by running the pumps for a time? Again, quite possibly not.

Treating the symptom is really the only sensible option, unless it's serious enough that you need to put out a Mayday. Again, not a course of action that addresses the root cause but, in some situations, absolutely the right thing to do. To take things up many, many notches, the sinking of the Titanic was an appalling tragedy from which relatively few people were saved, but I guarantee that nobody at all would have been saved if the people on the ship had opted for a series of committee meetings about how to solve the problem of the iceberg. (Not to say there weren't a very large number of hideous blunders in the management of that situation going all the way back to the ship's design and fit-out.)

Moreover, another problem with Five Whys is that, applied heedlessly, it's an extraordinarily arrogant philosophy, because it makes the assumption that you can know the answer to those five whys. Often you can't, at least not without going on a journey, and fixing a few things along the way, and without that, you can simply be wasting your time trying to answer questions that you can't answer in your present situation.

And that really, and somewhat poignantly, cuts to the root cause of why I view the kind of thinking frameworks/fads popular in business with a degree of scepticism: over-applied or misapplied, they paralyse people into inaction, and thereby provide fertile stock for breeding mediocrity.

[+] roenxi|7 years ago|reply

If you get the iceberg as a root cause then you are really doing something wrong; in fact I would suggest the point of a 5-Whys would be to move people past thinking that the iceberg is the problem and get to the actual root cause :P.

Why is the ship sinking? Hit an iceberg. Why did we hit an iceberg? [Equipment or command failure] What is wrong with [maintenance strategy/command structure] that allowed this mistake to slip through? [technical details]

If you have a team of people and something goes wrong, it is overwhelmingly likely that a human could make a decision differently or do some task that is not being done that would mitigate the worst of the damage.

People absolutely are saved by the committee process, in the same way that items tend to roll downhill. Pretending that humans have perfect control over their environment and could have done something more has proven remarkably successful at getting results. It isn't very impressive, it feels very unreasonable, and it isn't going to work on its own, but it is a very useful tool to let people stand up and ask, "Sure, something is going on that is out of our control, so why aren't we ready for it? This happens sometimes and we need to be prepared."

Basically, if you just ask why 5 times without any sense of personal responsibility you'll get stupid results. True of any process. But if an uncontrollable event has impacted your endeavor, it is absolutely worth asking "why are we exposing ourselves to a risk we can't control? could we somehow have avoided this".

[+] jdietrich|7 years ago|reply

The tech industry is badly re-inventing processes that the rest of the world developed decades ago. Other industries have learned to deal with far more complex and serious issues of quality. If there is arrogance in our approach, it's our failure to learn from other industries; lots of people in the tech industry have heard of the Five Whys, but very few have actually read Taiichi Ohno's Workplace Management, W. Edwards Deming's Out Of The Crisis or Walter A. Shewhart's Economic Control of Quality of Manufactured Product.

I frequently recommend this lecture on the Piper Alpha disaster, a fire on an offshore oil rig that killed 167 men. It eloquently summarises the findings of the Cullen Enquiry, a six month study of exactly why the disaster happened and what could be done to improve safety in the offshore industry. The enquiry found a complex and interconnected set of factors encompassing process, training, culture and design. It is densely packed with lessons that can be applied to our industry, not least of which is the idea of conducting intensive and systematic inquiries into major failures.

https://www.youtube.com/watch?v=S9h8MKG88_U

[+] williamdclt|7 years ago|reply

The 5 Whys is just meant as a simple, easy-to-use tool. Of course it won't solve your big picture problems, and nobody is questioning that. You're raising a strawman. Any methodology applied heedlessly will lead to bad results.

[+] taeric|7 years ago|reply

Just want to agree with you. Treating the symptom during an event really is about the best you can do.

I'll also agree with the "you may not be able to know a why." For systems, that can be a good guide on instrumentation to add. Sucks, in that you won't prevent the next event. Good that you should be able to get more from it.

[+] CPLX|7 years ago|reply

Much depends on context. If you're the captain of the Titanic it doesn't make sense to focus on the iceberg as the root cause. If you're the CEO of White Star lines it certainly does.

[+] maccam94|7 years ago|reply

This article is creating a straw man. You don't do a 5 Whys and fix what you think the root cause was, you fix the issues that were most serious. If there are problems too big to tackle immediately, put in short term mitigations and incorporate your new understanding of the system's reliability into your future plans.

[+] hinkley|7 years ago|reply

Doubly so because you don't use the 5 whys during the emergency. You use it during the post mortem. After you've unplugged the burning computer. After you've gotten the ship out of the immediate existential threat. If the computers are burning due to faulty wiring no amount of triage is going to stop that from happening next week. If ships are getting holes because the currents have shifted and bergs are appearing in places where hobby sailors frequent then the maps and some public outreach are the right solution.

I dunno who taught these people about the 5 Whys, but someone (possibly themselves) has done them a tremendous disservice.

[+] jaggederest|7 years ago|reply

More importantly, it's about not starting to fix things until you understand the context. The fix might be at any of the levels, but it's pointless to speculate where the fix goes if you don't understand the whole system/network/causality chain to an adequate degree. An "adequate degree" might be shallow overview, a deep dive, or anywhere in between, but it's still necessary.

[+] yaleman|7 years ago|reply

Couldn't have put it better myself - this article was clearly written by someone who's worked in an entirely dysfunctional environment - or has read an article on "five why's" and missed the whole rest of the problem management framework.

[+] phlakaton|7 years ago|reply

If there is a cult of the root cause, I have yet to meet it.

Here's what doing an exercise similar to 5 Why's gets you:

- An understanding of where issues come from. Whether your plan of action starts from the top, bottom, or middle, taking the time to step back and broaden your perspective before you jump in to fix a problem helps to make sure you're going after the right things for the right reasons.

- A culture of _not_ just picking the most expedient and facile solution every time issues come up, and going with that. In companies I've been in, the pressure is almost always on to find the dumbest, hackiest, absolutely fastest path out of trouble. Spend multiple years solving every problem that way, and you are in deep trouble! It takes institutional courage to push back against that, and having a practice in place to force you to stop and think now and again gives you an opportunity to summon that courage.

- A culture of ownership. This seems a little counterintuitive to me, since if you follow root causes deep enough you're liable to stumble onto people and process problems that are way out of your control and pay grade. Looking at root causes this way, you might think it's a process of passing the buck. However, by shining a light on such things, and finding people to address those things where they have no owner, you can push towards a better collective ownership of the real issues that face your company.

No good management idea is free from abuse, of course. You must exercise taste and judgment in deciding how deep to push with root causes, and what to do with the discoveries. I would think it's rather self-evident that 5 Why's doesn't mean you always ask exactly five questions in a strictly linear pattern. But for heaven's sake, make sure you ask more than one!

[+] cirgue|7 years ago|reply

I would say it's even more basic than that: the 5 whys are a way to push people to gather information and talk to one another before making decisions about solutions. The point is not to achieve perfection, but to consistently not make stupid and easily avoidable mistakes.

[+] 11thEarlOfMar|7 years ago|reply

I've never taken the 5 Whys literally. It's obvious to me that all root causes are not '5 Deep', therefore, this can't be a literal objective. I see it as a metaphor for being an effective problem solver, as a reminder to second guess the cause I've identified and ask myself or my team, "Is there a deeper, underlying cause?".

[+] ssivark|7 years ago|reply

The most interesting cases are complex systems where a fault/event results from a combination of multiple factors. In that case, asking for a "root" cause is unproductive, and a better question to ask would be: how might the system be patched with the least pernicious side effects? This is also why I don't always like the drive for more "accountability". As can be seen in the other HN thread on the front page about data driven medicine and the side effects on the healthcare system, you need to be very careful about which of the causes you decide to intervene on.

When you closely manage something to reduce variation, you also lose any information you might get about the system from the variation of that quantity. This point is nicely made in another post on the same blog: http://reinertsenassociates.com/the-dark-side-of-robustness/

Especially with reflexive systems (involving humans), the appropriate response might sometimes involve performing no intervention, or performing an intervention downstream, to modify its assumptions about what it receives (eg: adding error handling).

[+] osteele|7 years ago|reply

My ops postmortem template tried to elicit breadth. Once you’ve got a forest of causes, you can apply Five Whys to add depth.

There’s some overlap among the following questions. The intent is to elicit observations and ideas, not to uniquely categorize them.

* What are all the factors that could have prevented the incident?

* What are all the factors that could have detected the issue before production?

* What are all the factors that could have detected the issue sooner when it did occur?

* What are all the factors that could have accelerated mitigation? (Including, especially, changes that could have reduced the risks of mitigations considered too risky to apply.)

* What are all the factors that could have accelerated remediation?

* What could have reduced the scope or impact?

It’s common to come out of this with a laundry list that overfits the last incident and, if applied, would increase the complexity of the system and add risk. We’d typically apply one or two fixes, and stockpile the rest to see if any of them would have addressed any future incident. Usually most of the “solutions” turn out to be specific to the single incident that prompted them.
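As a rough illustration, the template above could be kept as a small data structure, so that stockpiled fixes can be checked against later incidents. This is only a minimal Python sketch: the names `QUESTIONS`, `Postmortem`, and `recurring_fixes`, the short question labels, and the abbreviated question wording are all hypothetical and not part of the original template.

```python
from collections import Counter
from dataclasses import dataclass, field

# The breadth questions from the template, keyed by a hypothetical short label.
QUESTIONS = {
    "prevent": "What could have prevented the incident?",
    "detect_early": "What could have detected the issue before production?",
    "detect_sooner": "What could have detected the issue sooner when it occurred?",
    "mitigate": "What could have accelerated mitigation?",
    "remediate": "What could have accelerated remediation?",
    "reduce_scope": "What could have reduced the scope or impact?",
}


@dataclass
class Postmortem:
    incident: str
    # Maps a question label to the candidate fixes raised under it.
    candidates: dict = field(default_factory=dict)

    def add(self, question: str, fix: str) -> None:
        if question not in QUESTIONS:
            raise KeyError(f"unknown question label: {question}")
        self.candidates.setdefault(question, []).append(fix)

    def all_fixes(self) -> set:
        return {fix for fixes in self.candidates.values() for fix in fixes}


def recurring_fixes(postmortems, min_incidents=2):
    """Return fixes proposed in at least `min_incidents` separate incidents."""
    counts = Counter()
    for pm in postmortems:
        counts.update(pm.all_fixes())  # a set, so each incident counts once
    return sorted(fix for fix, n in counts.items() if n >= min_incidents)
```

A fix that keeps reappearing across unrelated incidents is a candidate for one that does not merely overfit the last incident; the threshold of two incidents is an arbitrary choice for the sketch.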

[+] jwatte|7 years ago|reply

5y is about finding ways to prevent the conditions that let incidents happen, not just preventing incidents, though. That's kind of the point of asking 5 why questions. "We fell over because we lost a database server and didn't have enough spare capacity to run peak load on the standby. We don't have full N+1 because it hasn't been funded. It hasn't been funded because the business didn't have a good model for the risk adjusted cost. Action item: add the cost of this outage to the next budget forecast, and add a requirement for risk adjusted cost estimates to all future financial plans."
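One simple reading of "risk adjusted cost" in that chain is an expected-value comparison. A minimal Python sketch, assuming that reading, with entirely made-up figures and hypothetical function names:

```python
def expected_annual_loss(outage_probability_per_year: float, cost_per_outage: float) -> float:
    """Expected yearly loss from an outage: probability times impact."""
    return outage_probability_per_year * cost_per_outage


def standby_is_worth_funding(outage_probability_per_year: float,
                             cost_per_outage: float,
                             standby_cost_per_year: float) -> bool:
    """True when the expected loss avoided exceeds the yearly cost of full N+1.

    Simplifying assumption: the standby removes the outage risk entirely.
    """
    loss = expected_annual_loss(outage_probability_per_year, cost_per_outage)
    return loss > standby_cost_per_year


# Illustrative figures only: a 25% yearly chance of a 400,000 outage,
# compared against 60,000 per year for spare capacity.
loss = expected_annual_loss(0.25, 400_000)
fund_it = standby_is_worth_funding(0.25, 400_000, 60_000)
```

Feeding the actual cost of the last outage into `cost_per_outage` is roughly what the action item asks the budget forecast to do.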

[+] mirceal|7 years ago|reply

My favorite go-to on this one: https://blog.acolyer.org/2016/02/10/how-complex-systems-fail...

The truth is that by “fixing” the root cause you will sometimes destabilize the complex system you are running.

[+] dbenhur|7 years ago|reply

Yes, root cause analysis and corrective action should only be done with Cook's insights in mind.

"Post-accident attribution to a ‘root cause’ is nearly always wrong." "Post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult."

How Complex Systems Fail is short but loaded with value; if you haven't read it, go do so now!

[+] mbesto|7 years ago|reply

> Second, it assumes that the best location to intervene in this chain of causality is at its source: the root cause.

It doesn't though, that's just a built-in assumption for the lazy. The point of Root Cause Analysis and the "5 Whys" isn't necessarily to get to the root and fix the root...it's to provide a framework for traversing a problem set. The point of this methodology is so that you traverse the problem, understanding each step along the way...not that you simply jump to the root and try to fix it blindly.

[+] twelve40|7 years ago|reply

Most of these examples don't seem relevant to RCA at all.

> shifted from pumping, to plugging, to hull repair

Pumping and plugging are immediate response, just like a decision to temporarily shut down the website when a compromise is discovered, or pulling the plug on a smoking computer. What do these have to do with root cause analysis?

[+] jlgaddis|7 years ago|reply

The point was that, sometimes, the best thing is not to worry about finding the root cause -- not now anyways -- but, instead, to "treat the symptom".

That is, when your boat is sinking, immediately focusing on fixing the root cause (the crack in the hull) is not necessarily the best course of action. Instead, treat the obvious symptoms that you can to "stop the bleeding" (plug the hole, pump the water out) and you can deal with the root cause when you're in a better position to do so (back in port, not miles out to sea).

See also: metaphor.

[+] ratacat|7 years ago|reply

I totally appreciate OP's insights here. It can be so tempting to see things linearly. But obviously the real world is anything but. There's a curiously wicked theory out there by David Abram that part of this seeming human predilection for linearity has occurred somehow through neural conditioning involved with adopting systems of writing, compared to the crazy fractal ocean of sensory input that living in a real living landscape begets. The writing has a start. A finish. It progresses in a single visual direction, and what's more, the pieces themselves only represent reality. The letters themselves are completely different from the things they are describing. And they might be one of the first objects in human history that work like this...

Anyway, the book is called the Spell of the Sensuous, and it's dense af, but bursting with fascinating lines of inquiry.

[+] myWindoonn|7 years ago|reply

People are notoriously bad at playing five-whys. If you haven't reached an ideological or metaphysical problem by your third 'why' then you aren't playing well enough.

[+] williamdclt|7 years ago|reply

Once again, that's a problem of work culture. The five-whys is super simple; it's up to people to be smart using it. If I had an ideological problem after a 5 whys, my team would kindly ask me what the fuck I am doing.

[+] motohagiography|7 years ago|reply

My experience with Root Cause Analysis is it's often an exercise for poor managers to deflect accountability by diluting it among more junior people.

However, 5-whys is very useful as a design principle instead of using it to respond to failure.

It goes something like:

0: build a thing

why 1: because customer asked

why 2: because thing is what they think they need

why 3: because it is one solution to a gap in their ability to achieve something.

why 4: because that something is an economic need.

why 5: because market opportunity to fill that gap with something, maybe this thing, maybe something better.

[+] amygdyl|7 years ago|reply

I suspect that this is because a human propensity exists to allocate and apportion blame, particularly in cases where the interactions involved in the immediately prior actions are unclear.

I have actually not even once encountered this 5 Why's method.

Nor had I heard of it before.

But I founded my company in 1996, starting out with almost two hundred years of experience surrounding my incredulous and lucky younger self, including several PhDs and former Fortune 500 board members.

I will hazard that this 5whys technique is fundamentally flawed and easily susceptible to manipulation for procuring a scapegoat.

I only hope that explains why I have never encountered this before. I hope even more that it means something was going right in my business, filtering out and rejecting what I think is, and what definitely comes across to me as, bogus.

[+] jschwartzi|7 years ago|reply

The biggest problem with the 5 whys is that the whole concept is taken out of context. In a manufacturing context every repeatable problem is a problem with your process. So in that context it makes a lot of sense to search for a process change that resolves your problem.

Let's say you make light bulbs, and every fifth bulb comes out misshapen. You would use the 5 whys to trace it back to the molding station, where you discover that bulbs cool at a different rate in one of the machines because the mold uses a better insulator. You could stop there and replace the insulator. But if your job is to increase yield, then you can save the company a lot of future money if you figure out how that mold got there in the first place. You might find that purchasing subbed a cheaper replacement based on an incomplete spec. Or you might find that the supplier recently switched materials.

The point is that when you're trying to establish a controlled, repeatable process, you need to understand where your controls break down.

Once you understand the process problem, then you make a business decision about what to change. It was never meant to be applied to R&D problems. R&D processes are not as concerned with repeatedly doing something correctly. They're concerned with making sure something can be done in the first place. It's a different class of problem.

[+] sethammons|7 years ago|reply

This reminds me of what we've been doing for a while, the "blameless postmortem." The technique is championed by Etsy, and you can read more about it in an article that introduces their debriefing guide:

https://codeascraft.com/2016/11/17/debriefing-facilitation-g...

From the linked PDF from the article:

> “Adaptability and learning. We learn through honest, blameless reflection on lessons and surprises. We believe that traditional root cause analysis makes learning from mistakes difficult. Our blameless post-mortem process is a widely-cited technique that we believe is becoming best practice among organizations that value innovation. Blameless postmortems drive a significant percentage of our development as we analyze what about our production environment was less than optimal and rapidly make corresponding adjustments.” (Etsy, Inc., 2015)

The idea, boiled down, is to inspect timelines, procedures, and actions and develop a narrative of how an incident came to be. The goal is for everyone to walk away with a (better) understanding of everything. With this, people are better armed to put into place solutions.

One example from the text is where an engineer pushes out a change because they thought the build system had zero failures. The push breaks the system and causes a regression that should have been caught in the tests. During the postmortem, the engineer says, "I thought the tests had zero failures. I guess I need to be more diligent in the future." Upon further timeline investigation, it is noted that the tests actually had eight failures, but the font had eights and zeros looking very similar. The fix was not "be more diligent;" the fix is maybe to have a better font or use colors for pass/fail.
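To make that last point concrete, here is a minimal Python sketch of a test summary that does not depend on telling a 0 from an 8. It swaps the suggested color coding for an explicit PASS/FAIL word, which is a variation on the idea rather than anything from the Etsy guide; the function name `format_test_summary` is hypothetical.

```python
def format_test_summary(passed: int, failed: int) -> str:
    """Render a test summary whose pass/fail state does not hinge on one digit."""
    status = "PASS" if failed == 0 else "FAIL"
    # Spell the state out in words so "0 failed" and "8 failed"
    # cannot be confused in a font where 0 and 8 look alike.
    return f"[{status}] {passed} passed, {failed} failed"
```

For example, `format_test_summary(112, 8)` starts with `[FAIL]`, so the engineer never has to read the digit at all.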

Overall, I like the ideas proposed in this blameless postmortem style. It runs counter to the natural tendency to "find a problem and fix it" because it feels like we are talking less about the problem and the fix and talking more about the narrative of the failure. But what I've seen is folks gaining better understanding of how everyone else works, learning about tools and tricks, and about assumptions. And knowing more about the narrative leads to better solutions.

[+] dingaling|7 years ago|reply

> Our blameless post-mortem process

Theirs?

Accident investigation agencies such as the AAIB and NTSB have been following "blameless" processes for decades. Find the causes and save lives. Who pressed the button or forgot to connect the oil line is irrelevant compared to the fact that the failure modes were possible.

[+] maxxxxx|7 years ago|reply

Shouldn't it be the cult of the single root cause? Most problems seem to have several contributing factors. You can almost randomly pick one factor, improve it, and the whole situation will get better.

You see this a lot in public debate like education or health care. Instead of fixing one of the many problems a lot of time and energy is wasted on finding THE root cause that will fix everything.

[+] Moto7451|7 years ago|reply

This was what came to my mind as well. The fire strawman presented exemplifies this. Finding the root cause is simply a mental model. Discovering that there are three and picking the best one to fix (assuming you’re limited by time/money/complexity/etc or it’s undesirable to fix others) is 100% ok. The existence of instances of multiple root causes does not really say anything negative about Root Cause Analysis.

In cases where you can’t discover the root cause (I.e. a plane that explodes and destroys the root cause) you simply have to go as deep as is reasonable and work from there.

If someone is unwilling to be reasonable and accept a number of root causes between M-N then the issue is with them, not Root Cause Analysis.

[+] BaronSamedi|7 years ago|reply

In my view, what we see in public debate on these issues, and indeed most political issues, is a sole focus on the symptoms of the problem and never an analysis of the factors causing these symptoms. Adding to the dysfunction, no analysis of potential second-order effects of the symptom-treating solutions is ever done.

[+] Illniyar|7 years ago|reply

I wouldn't take such advice. For me getting to the root cause of things is one of the core values of a good programmer (and for operations as well).

Cause #2 is also fictitious, the 5 whys never say anything about fixing the problem, only understanding it. In fact, for me, not fixing the problem is just as valid a solution - as long as you know what caused it, you can determine if it's worth fixing it at the root or even at all.

As to the linearity of cause and effect, while it's true that many problems have multiple causes, a solution to a linear problem will prevent alternative causes below it. Besides, the grand majority of issues arising in mature systems arise from a single cause and have linear cause and effect.

[+] placebo|7 years ago|reply

Reminded me of this entertaining clip:

http://vooza.com/videos/the-5-whys/

I doubt that most people who use the five whys really take it to be as simplistic as the author of this article suggested, but to those that do, it's a good wake up call.

In fact, life is even more complex than the article suggests, when you throw in effects of chaos, feedback loops, missing information, unknown influences - just to mention a few. Still, tracking down the order in processes has got humanity quite far (at least as far as being able to predict and engineer accordingly) so it's obviously effective.

[+] mianos|7 years ago|reply

To save you some time watching the vid: it's 'because the Illuminati or something'. It is a great little sketch and does illustrate the major objection to the OP's essay.

[+] himom|7 years ago|reply

Seems hand-waving, thin on value and promoting a consulting business.

It would’ve been better to talk about the real-world including TPMS and the NTSB investigation approach... making cars very reliable and very complex aircraft safer with strict regulations.

[+] anoncept|7 years ago|reply

For a deeper look from the systems engineering side, check out http://mit.edu/psas, specifically the book-length treatment in “Engineering a Safer World”.

For applications of similar ideas to cloud software and devops-style environments, https://www.kitchensoap.com/2012/02/10/each-necessary-but-on... is also helpful!

[+] emgee_1|7 years ago|reply

I think that 5W and 5W2H are means to develop a network of logical reasoning. These diagrams (fishbone, also known as Ishikawa) are used as a means to discuss what attack areas you have concerning a problem. They are used in many ways and in many different fields. They are very useful when designing experiments (DoE). 8D and 10D teams use them extensively in (high tech) manufacturing.

[+] gerbilly|7 years ago|reply

> There are often multiple causes for an effect,

I find that some of the toughest bugs to solve are the ones where the undesirable effect has more than one cause.

To be precise it's the kind of situation where the bug can be triggered by 2, 3 or more independent causes, i.e. each cause is sufficient on its own to cause the bug.

Often when attempting to solve a bug like that I'll find one likely cause, address it but because the bug persists, I end up undoing the fix for cause #1, then finding cause #2 and ping ponging between them till I realise that I have to address multiple causes to make the bug go away.
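That ping-pong is easy to reproduce in a toy model. In this minimal Python sketch the symptom is modelled as a logical OR over independent sufficient causes; the cause names are made up for illustration.

```python
def bug_occurs(active_causes: set) -> bool:
    """The symptom appears if ANY sufficient cause is still active."""
    return len(active_causes) > 0


causes = {"stale cache", "race in writer"}  # hypothetical, independent causes

# Fixing only one cause at a time leaves the symptom in place...
still_broken_after_fix_1 = bug_occurs(causes - {"stale cache"})
still_broken_after_fix_2 = bug_occurs(causes - {"race in writer"})

# ...so each fix looks ineffective in isolation. Only fixing both clears it.
fixed = not bug_occurs(causes - {"stale cache", "race in writer"})
```

Because each individual fix leaves the observable symptom unchanged, reverting it looks reasonable, which is exactly the trap described above.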
