top | item 42011499

(no title)

A personal anecdote:

One of my guys made a mistake while deploying some config changes to Production and caused a short outage for a Client.

There's a post-incident meeting and the client asks "what are we going to do to prevent this from happening in the future?" - probably wanting to tick some meeting boxes.

My response: "Nothing. We're not going to do anything."

The entire room (incl. my side) looks at me. What do I mean, "Nothing?!?".

I said something like "Look, people make mistakes. This is the first time that this kind of mistake had happened. I could tell people to double-check everything, but then everything will be done twice as slowly. Inventing new policies based on a one-off like this feels like an overreaction to me. For now I'd prefer to close this one as human error - wontfix. If we see a pattern of mistakes being made then we can talk about taking steps to prevent them."

In the end the conceded that yeah, the outage wasn't so bad and what I said made sense. Felt a bit proud for pushing back :)

discuss

tqi|1 year ago

[preface that this response is obviously operating on very limited context]

"Wanting to tick some meeting boxes" feels a bit ungenerous. Ideally, a production outage shouldn't be a single mistake away, and it seems reasonable to suggest adding additional safeguards to prevent that from happening again[1]. Generally, I don't think you need to wait until after multiple incidents to identify and address potential classes of problems.

While it is good and admirable to stand up for your team, I think that creating a safety net that allows your team to make mistakes is just as important.

[1] https://en.wikipedia.org/wiki/Swiss_cheese_model

terlisimo|1 year ago

I agree.

I didn't want to add a wall of text for context :) And that was the only time I've said something like that to a client. I was not being confrontational, just telling them how it is.

I suppose my point was that there's a cost associated with increasing reliability, sometimes it's just not worth paying it. And that people will usually appreciate candor rather than vague promises or hand-wavy explanations.

braza|1 year ago

I had a similar situation, but in my case was due to a an upstream outage in a AWS Region.

The final assessment in the Incident Review was that we should have a multi-cloud strategy. Our luck that we had a very reasonable CTO that prevented the team do to that.

He said something along the lines that he would not spend 3/4 of a million plus 40% of our engineering time to cover something that rarely happens.

tetha|1 year ago

We've started to note these wont-fixes down as risks and started talking about probability and impact of these. That has resulted in good and realistic discussions with people from other departments or higher up.

Like, sure, people with access to the servers can run <ansible 'all' -m cmd -a 'shutdown now' -b> and worse. And we've had people nuke productive servers, so there is some impact involved in our work style -- though redundancy and gradually ramping up people from non-critial systems to more critical systems mitigates this a lot.

But some people got a bit concerned about the potential impact.

However if you realistically look at the amount of changes people push into the infrastructure on a daily basis, the chance of this occurring seems to very low - and errors mostly happen due to pressure and stress. And our team is already over capacity, so adding more controls on this will slow all of our internal customers down a lot too.

So now it is just a documented and accepted risk that we're able to burn production to the ground in one or two shell commands.

terlisimo|1 year ago

I hear ya, that sounds familiar.

The amount of deliberate damage anyone on my team can do is pretty much catastrophic. But we accept this as risk. It is appropriate for the environment. If we were running a bank, it would be inappropriate, but we're not running a bank.

I pushed back on risk management one time when The New Guy rebuilt our CI system. It was great, all bells and whistles and tests, except now deploying a change took 5 minutes. Same for rolling back a change. I said "Dude, this used to take 20 seconds. If I made a mistake I would know, and fix it in 20 seconds. Now we have all these tests which still allow me to cause total outage, but now it takes 10 minutes to fix it." He did make it faster in the end :)

tpoacher|1 year ago

Good, but I would have preferred a comment about 'process gates' somewhere in there [0]. I.e. rather than say "it's probably nothing let's not do anything" only to avoid the extreme "let's double check everything from now on for all eternity", I would have preferred a "Let's add this temporary process to check if something is actually wrong, but make sure it has a clear review time and a clear path to being removed, so that the double-checking doesn't become eternal without obvious benefit".

[0] https://news.ycombinator.com/item?id=33229338

Aeolun|1 year ago

Nothing more permanent than a temporary process.

When you have zero incidents using the temporary process people will automatically start to assume it’s due to the temporary process, and nobody will want to take responsibility for taking it out.

yarekt|1 year ago

Yep yep, exactly this. When an incident review reveals a fluke that flew past all the reasonable safeguards, a case that the team may have acknowledged when implementing those safeguards. Sometimes those safeguards are still adequate, as you can’t mitigate 100% of accidents, and it’s not worth it to try!

I’d go further to say that it’s a trap to try, it’s obvious that you can’t get 100% reliability, but people still feel uneasy with doing nothing

Baeocystin|1 year ago

> If we see a pattern of mistakes being made then we can talk about taking steps to prevent them.

...but that's not really nothing? You're acknowledging the error, and saying the action is going to be watch for a repeat, and if there is one in a short-ish amount of time, then you'll move to mitigation. From a human standpoint alone, I know if I was the client in the situation, I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.

Don't get me wrong; I agree with your assessment. But don't sell non-technical actions short!

Dylan16807|1 year ago

> You're acknowledging the error,

Which is important but not taking an action.

> and saying the action is going to be watch for a repeat

That watching was already happening. Keeping the status quo of watching is below the level of meaningful action here.

> if there is one in a short-ish amount of time, then you'll move to mitigation.

And that would be an action, but it would be a response to the repeat.

> I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.

They did say roughly those things, worded in a different way. It's not like they planned to say "nothing" and then walk out without elaborating!

terlisimo|1 year ago

The abbreviated story I told was perhaps more dramatic-sounding than it really played out. I didn't just say "Nothing." mic drop walk out :)

The client was satisfied after we owned the mistake, explained that we have a number of measures in place for preventing various mistakes, and that making a test for this particular one doesn't make sense. Like, nothing will prevent me from creating a cron job that does "rm -rf * .o". But lights will start flashing and fixing that kind of blunder won't take long.

hi_hi|1 year ago

If you want to go full corporate, and avoid those nervous laughs and frowns from people who can't tell if you're being serious or not, I recommend dressing it up a little.

You basically took the ROAM approach, apparently without knowing it. This is a good thing. https://blog.planview.com/managing-risks-with-roam-in-agile/

ErrantX|1 year ago

Correct.

Corollary is that Risk Management is a specialist field. The least risky thing to do is always to close down the business (can't cause an incident if you have no customers).

Engineers and product folk, in particular, I find struggle to understand Risk Management.

When juniors ask me what technical skill I think they should learn next my answers is always; Risk Management.

(Heavily recommended reading: "Risk, the science and politics of fear")

Aeolun|1 year ago

> Engineers and product folk, in particular, I find struggle to understand Risk Management.

How do you do engineering without risk management? Not the capitalized version, but you’re basically constantly making tradeoffs. I find it really hard to believe that even a junior is unfamiliar with the concept (though the risk they manage tends to be skewed towards risk to their reputation).

AnimalMuppet|1 year ago

Yeah. Policies, procedures, and controls have costs. They can save costs, but they also have their own costs. Some pay for themselves; some don't. The ones that don't, don't create those procedures and controls.

debacle|1 year ago

Good manager, have a cookie.