(no title)
schoolornot | 3 years ago
Well, at companies of the size I work with it is to point fingers, make PMs feel more important, and give people talking points.
adamckay|3 years ago
Because the only way you can make everyone aware is to write it down. Anything else is hearsay.
And going through the process of a thorough postmortem can ensure you do know exactly what went wrong and why, and how you can prevent the same and similar issues from happening again in the future.
Perhaps, in this example, the postmortem serves as documented proof that work on setting up staging databases needs to be prioritised and invested in? Maybe scripts such as this should be reviewed by another engineer before running? Maybe the standard operating procedure gets updated so a backup is taken immediately before running any script that writes to the database? Maybe you create a rule to limit the blast radius in future and do smaller rollouts to 1k users instead of 100k? Maybe scripts should be developed with a dry-run feature?
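To make the last two ideas concrete, here is a rough sketch of a write script with a dry-run default and a cap on how many rows one run may touch. Everything in it is an assumption for illustration: the `users` table, the `active` column, the `deactivate_users` name, and the use of SQLite. It is not the script from the incident.

```python
import sqlite3


def deactivate_users(conn, user_ids, dry_run=True, max_rows=1000):
    """Set active = 0 for the given users and return the number of rows changed.

    Hypothetical example: the table and column names are made up.
    dry_run=True (the default) rolls the change back, so nothing is persisted.
    max_rows caps the blast radius of a single run.
    """
    ids = list(user_ids)
    if len(ids) > max_rows:
        raise ValueError(f"refusing to touch {len(ids)} rows (limit {max_rows})")

    # The UPDATEs run inside one transaction, so they can be undone as a unit.
    cur = conn.executemany(
        "UPDATE users SET active = 0 WHERE id = ?",
        [(user_id,) for user_id in ids],
    )
    affected = cur.rowcount  # summed over all statements by executemany

    if dry_run:
        conn.rollback()  # report what would change, persist nothing
    else:
        conn.commit()
    return affected
```

Calling it with the defaults reports how many rows would change without persisting anything; you have to pass `dry_run=False` explicitly to commit, which makes the destructive path the one that takes extra effort.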
s1gnp0st|3 years ago
joshuamorton|3 years ago
It's pretty rare that you run into an issue that can't recur, or that can't have similarly shaped recurrences.
MaulingMonkey|3 years ago
Big assumption. Chances are at least someone was out on vacation or something. A postmortem can help spread the word - and perhaps calm those downstream who weren't in the loop as much, through a show of clear and open communication and ownership, instead of trying to bury it.
> and the issue is unlikely to ever happen again,
Also a big assumption. You can almost certainly "five whys" (https://en.wikipedia.org/wiki/Five_whys) your way into a broader pattern of something that will happen again. You can then review what worked to expectations, what didn't, what should be improved, and what can be left alone, even if it will perhaps happen again, because it may be more expensive to try to fix than to accept the occasional failure as a price of doing business.
dasil003|3 years ago
This is never true past a certain company size threshold, and even for smaller companies, once you start asking the "five whys", you see there is never just one root cause. Even for straightforward cases where everyone generally agrees, it can still be informative to capture the analysis for future reference and review of reliability patterns.
Incidents are a learning opportunity. If people are pointing fingers then that's a sign of a bigger cultural issue, one which will not be solved by avoiding discussing and documenting incidents.