top | item 38524831

Grambles | 2 years ago

Using the literal words: "You screwed up." -- can you give an example of how that would be helpful in an incident, either during the process or after?

I can't. There's no value in it. What did we, as a team, do that allowed the incident to happen? Yes, John Smith shouldn't have dropped the tables on production, obviously, but does he really not know that, given that he's (presumably) also dealing with the incident response?

If he's truly not aware that was a mistake, there's an underlying transparency issue that goes way beyond telling an individual they screwed up.

pclmulqdq|2 years ago

I can, in that given situation. Your assumption is that John Smith should be told "you screwed up" for dropping the prod tables. He shouldn't - he made a normal mistake. The negligent party here, who should hear the words "you screwed up," is John's boss or tech lead, who decided that those tables should be droppable in the first place, despite the obvious risk.

"Negligent" doesn't just mean "made a mistake." It means something more like "their carelessness led to a mistake."

That person hearing "you screwed up" will cause significant behavior change. I daresay it will encourage them to make the prod tables very hard to drop, and since they are presumably a smart person, when combined with the postmortem of the incident, it will encourage them to look for and proactively fix similar problems, and generally align the team with good DevOps practices.

It is important in all of this that the right person gets the message. I assume you expect that to not happen, since that is one of the theses of "blameless postmortems."

Grambles|2 years ago

I suppose I agree with you more than before, but I still wonder: aside from the fact that a manager or tech lead is ostensibly used to hearing people be angry (or frustrated, or whatever), why is John's boss more deserving of "You screwed up." than the person who dropped them? Yeah, obviously, the prod tables shouldn't have been droppable. John still shouldn't have dropped them. In fact, John, the original developer who implemented them, anyone who altered them since, the tech lead, the manager, and frankly anyone who knew this data was critical could all have raised the alarm.

Ideally a blameless postmortem allows the freedom to identify any of the potential fixes that could've stopped this, and empowers anyone who could've dealt with it to deal with future issues. If you blame the manager, that can implicitly absolve everyone else in the chain.

With that said, I would agree that having a primary owner of things does matter. For that reason, sure, making the manager more aware might help in future. I still think it's a bad idea for org culture though, because many managers will respond to "You screwed up." by trying to ensure future blame finds its way to another target. Instead, I'd prefer approaching the manager with "We could've caught this in [any of the ways we could've caught it].", and if the manager doesn't care at that point, they're just fully incompetent.