schoolornot | 3 years ago

Which is precisely why I don't understand the purpose of even having postmortems for 95% of outages. If everyone is aware of what went wrong and the issue is unlikely to ever happen again, what is the point?

Well, at companies of the size I work with it is to point fingers, make PMs feel more important, and give people talking points.

adamckay|3 years ago

> If everyone is aware of what went wrong and the issue is unlikely to ever happen again, what is the point?

Because the only way you can make everyone aware is to write it down. Anything else is hearsay.

And going through the process of a thorough postmortem can ensure you do know exactly what went wrong and why, and how you can prevent the same and similar issues from happening again in the future.

Perhaps in this example it serves as documented proof that work on setting up staging databases needs to be prioritised and invested in? Maybe scripts such as this should be reviewed by another engineer before running? Maybe the standard operating procedure is updated so a backup is taken immediately before running any scripts that write to the database? Maybe you create a rule to limit the blast radius in future and do smaller rollouts to 1k users instead of 100k? Maybe scripts should be developed with a dry-run feature?
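
As a rough illustration of the last two suggestions (a dry-run mode and a capped rollout), here is a minimal Python sketch. `FakeUserStore`, `migrate_plans`, the `max_users` parameter, and the plan name are all hypothetical, made up for this example; a real script would talk to an actual database and would likely need a backup step as well.

```python
from dataclasses import dataclass, field
from itertools import islice


@dataclass
class FakeUserStore:
    """Hypothetical in-memory stand-in for a production database."""
    plans: dict = field(default_factory=dict)

    def set_plan(self, user_id: int, plan: str) -> None:
        self.plans[user_id] = plan


def migrate_plans(store, user_ids, new_plan, *, dry_run=True, max_users=1000):
    """Set new_plan for at most max_users users.

    Writes happen only when dry_run is False. Returns the list of user IDs
    that were (or, in a dry run, would have been) changed.
    """
    # Cap the batch size to limit the blast radius of a bad script.
    batch = list(islice(user_ids, max_users))
    for user_id in batch:
        if dry_run:
            print(f"[dry-run] would set user {user_id} to {new_plan!r}")
        else:
            store.set_plan(user_id, new_plan)
    return batch


if __name__ == "__main__":
    store = FakeUserStore()
    # Dry run is the default, so this call writes nothing.
    migrate_plans(store, range(100_000), "pro", max_users=3)
    # The real run has to be requested explicitly.
    migrate_plans(store, range(100_000), "pro", dry_run=False, max_users=3)
    print(store.plans)
```

Making the dry run the default means the destructive path has to be asked for on purpose, and the cap means a mistake touches a small batch rather than every user.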

s1gnp0st|3 years ago

If you're having enough outages that postmortems are burdensome, I don't think the problem is the number of postmortems.

joshuamorton|3 years ago

A postmortem is, in most cases, really just a formal process to ensure everyone (including leadership) knows what went wrong and how to prevent the issue from happening again, with some associated ceremony around actually keeping track of the relevant action items to prevent recurrence.

It's pretty rare that you run into an issue that can't recur, or can't have similarly shaped recurrences.

MaulingMonkey|3 years ago

> If everyone is aware of what went wrong

Big assumption. Chances are at least someone was out on vacation or something. A postmortem can help spread the word - and perhaps calm those downstream who weren't in the loop as much, through a show of clear and open communication and ownership, instead of trying to bury it.

> and the issue is unlikely to ever happen again,

Also a big assumption. You can almost certainly "five whys" (https://en.wikipedia.org/wiki/Five_whys) your way into a broader pattern of something that will happen again. You can then review what worked as expected, what didn't, what should be improved, and what can be left alone even if it will happen again, perhaps because it would be more expensive to fix than to accept the occasional failure as a price of doing business.

dasil003|3 years ago

> everyone is aware of what went wrong

This is never true past a certain company size threshold, and even for smaller companies once you start asking the "five whys" you see there is never just one root cause. Even for straightforward cases where everyone generally agrees, it can still be informative to capture the analysis for future reference and review of reliability patterns.

Incidents are a learning opportunity. If people are pointing fingers then that's a sign of a bigger cultural issue, one which will not be solved by avoiding discussing and documenting incidents.