top | item 41900811

(no title)

tgavVs | 1 year ago

> From what I've seen in Amazon it's pretty consistent that they do not blame the messenger which is what they consider the person who messed up

Interesting that my experience has been the exact opposite.

Whenever I’ve participated in COE discussions (incident analysis), questions have been focused on highlighting who made the mistake or who didn’t take the right precautions.

discuss

grogenaut|1 year ago

I've bar raised a ton of them. You do end up figuring out what actions by what operator caused what issues or didn't work well, but that's to diagnose what controls/processes/tools/metrics were missing. I always removed the actual people's name as part of the bar raising, well before publishing, usually before any manager sees it. Instead used Oncall 1, or Oncall for X team, Manager for X team. And that's mainly for the timeline.

As a sibling said you were likely in a bad or or one that was using COEs punatively.

mlyle|1 year ago

In the article's case, there's evidence of actual malice, though-- sabotaging only large jobs, over a month's time.

aitchnyu|1 year ago

Whats bar raising in this context?

donavanm|1 year ago

As I recall the coe tool “automated reviewer” checks cover this. It should flag any content that looks like a person (or customer name) before the author submits it.

sokoloff|1 year ago

I’ve run the equivalent process at my company and I absolutely want us to figure out who took the triggering actions, what data/signals they were looking at, what exactly they did, etc.

If you don’t know what happened and can’t ask more details about it, how can you possibly reduce the likelihood (or impact) of it in the future?

Finding out in detail who did it does not require you to punish that person and having a track record of not punishing them helps you find out the details in future incidents.

unknown|1 year ago

[deleted]

geon|1 year ago

Isn't that a necessary step in figuring out the issue and how t prevent it?

Cthulhu_|1 year ago

But when that person was identified, were they personally held responsible, bollocked, and reprimanded or were they involved in preventing the issue from happening again?

"No blame, but no mercy" is one of these adages; while you shouldn't blame individuals for something that is an organization-wide problem, you also shouldn't hold back in preventing it from happening again.

grogenaut|1 year ago

Usually helping prevent the issue, training. Almost everyone I've ever seen cause an outage is so "oh shit oh shit oh shit" that a reprimand is worthless, I've spent more time a) talking them through what they could have done better and, encouraging them to escalate quicker b) assusaging their fears that it was all their fault and they'll be blamed / fired. "I just want you to know we don't consider this your fault. It was not your fault. Many many people made poor risk tradeoffs for us to get to the point where you making X trivial change caused the internet to go down"

In some cases like interns we probably just took their commit access away or blocked their direct push access. Now a days interns can't touch critical systems and can't push code directly to prod packages.

dockerd|1 year ago

That was not the idea of COE ever. Probably you were in bad org/team.

kelnos|1 year ago

Or maybe you were in an unusually good team?

I always chuckle a little when the response to "I had a bad experience" is "I didn't, so you must be an outlier".