top | item 43151392

hackpelican | 1 year ago

In the places I’ve worked, a war room was always the place where we cut the bleeding and reverted the system to a working state. The RCA was never the intended outcome of a war room, though we’d often reach it in the silence of the meeting bridge while something deployed or rolled back.

Root cause analysis is definitely not a group activity; it’s best done in a place where one can have complete focus.

However, cutting the bleeding requires plenty of communication: weighing different options, having a higher-up sign off on a tradeoff, getting our ops team to coordinate towards some common goal, monitoring the recovery, and so on.

afro88|1 year ago

IIRC, Facebook don't (or didn't) do rollbacks. They always fix forward. I guess hours-long incidents like this are the other edge of that double-edged sword.

claytonjy|1 year ago

Language can be tricky here. If I revert to an older commit, literally rewriting history to remove newer, bad commits, I think we’d all consider that a rollback. But if I instead add a new commit which undoes the bad commits, is that a rollback or a roll forward?
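
To make the distinction concrete, here is a minimal toy sketch in Python. It does not use git; the `Commit` type, the integer `delta`, and both helper functions are hypothetical stand-ins, assuming the deployed state is simply the sum of all changes in the history. Both approaches end in the same state, but only one of them keeps the bad commit in the record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    message: str
    delta: int  # hypothetical stand-in for a change to the system


def state(history):
    """The deployed state is the sum of every delta in the history."""
    return sum(c.delta for c in history)


def rewrite_history(history, keep):
    """Drop everything after the first `keep` commits (history is rewritten)."""
    return history[:keep]


def undo_with_new_commit(history, keep):
    """Append one commit that cancels everything after the first `keep` commits."""
    undone = history[keep:]
    inverse = Commit(
        message="Revert: " + ", ".join(c.message for c in undone),
        delta=-sum(c.delta for c in undone),
    )
    return history + [inverse]


history = [Commit("good A", 1), Commit("good B", 2), Commit("bad C", 10)]

rewritten = rewrite_history(history, keep=2)
undone = undo_with_new_commit(history, keep=2)

print(state(rewritten), len(rewritten))  # same state, shorter history
print(state(undone), len(undone))        # same state, longer history
```

Seen this way, the second approach is a rollback in terms of system state and a roll forward in terms of history, which may be why the terminology gets muddled.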

sunshowers|1 year ago

So interestingly, I think root cause analysis can be a group effort, but I think it has to be done on a remote call where everyone is in front of a big monitor or two, and people can take breaks and such. I've been part of teams that have done root cause analysis over a call (sometimes many calls), and it's been quite effective.