terlisimo|1 year ago
One of my guys made a mistake while deploying some config changes to Production and caused a short outage for a Client.
There's a post-incident meeting and the client asks "what are we going to do to prevent this from happening in the future?" - probably wanting to tick some meeting boxes.
My response: "Nothing. We're not going to do anything."
The entire room (incl. my side) looks at me. What do I mean, "Nothing?!?".
I said something like "Look, people make mistakes. This is the first time this kind of mistake has happened. I could tell people to double-check everything, but then everything would be done twice as slowly. Inventing new policies based on a one-off like this feels like an overreaction to me. For now I'd prefer to close this one as human error - wontfix. If we see a pattern of mistakes being made, then we can talk about taking steps to prevent them."
In the end they conceded that yeah, the outage wasn't so bad and what I said made sense. Felt a bit proud for pushing back :)
tqi|1 year ago
"Wanting to tick some meeting boxes" feels a bit ungenerous. Ideally, a production outage shouldn't be a single mistake away, and it seems reasonable to suggest adding additional safeguards to prevent that from happening again[1]. Generally, I don't think you need to wait until after multiple incidents to identify and address potential classes of problems.
While it is good and admirable to stand up for your team, I think that creating a safety net that allows your team to make mistakes is just as important.
[1] https://en.wikipedia.org/wiki/Swiss_cheese_model
terlisimo|1 year ago
I didn't want to add a wall of text for context :) And that was the only time I've said something like that to a client. I was not being confrontational, just telling them how it is.
I suppose my point was that there's a cost associated with increasing reliability, sometimes it's just not worth paying it. And that people will usually appreciate candor rather than vague promises or hand-wavy explanations.
braza|1 year ago
The final assessment in the Incident Review was that we should have a multi-cloud strategy. Luckily we had a very reasonable CTO who prevented the team from doing that.
He said something along the lines that he would not spend 3/4 of a million plus 40% of our engineering time to cover something that rarely happens.
tetha|1 year ago
Like, sure, people with access to the servers can run "ansible all -m command -a 'shutdown now' -b" and worse. And we've had people nuke production servers, so there is some impact involved in our work style -- though redundancy and gradually ramping people up from non-critical systems to more critical systems mitigates this a lot.
But some people got a bit concerned about the potential impact.
However, if you look realistically at the volume of changes people push into the infrastructure on a daily basis, the chance of this occurring seems very low - and errors mostly happen under pressure and stress. And our team is already over capacity, so adding more controls here would slow all of our internal customers down a lot too.
So now it is just a documented and accepted risk that we're able to burn production to the ground in one or two shell commands.
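(For illustration only: if that accepted risk ever needed a cheap mitigation, one option is a wrapper that makes the operator re-type the target before a run against every host. The wrapper name and workflow below are my invention, not the tooling described above.)

```shell
# Hypothetical guard: refuse to run ansible against 'all' unless the
# operator explicitly confirms. Illustrative sketch, not real tooling.
safe_ansible() {
  target="$1"; shift
  if [ "$target" = "all" ]; then
    printf 'Targeting ALL hosts. Type "all" to confirm: '
    read -r confirm
    if [ "$confirm" != "all" ]; then
      echo "Aborted." >&2
      return 1
    fi
  fi
  ansible "$target" "$@"
}
```

The point of a guard like this is that it only adds friction for the one catastrophic case, not for everyday targeted runs.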
terlisimo|1 year ago
The amount of deliberate damage anyone on my team can do is pretty much catastrophic. But we accept this as risk. It is appropriate for the environment. If we were running a bank, it would be inappropriate, but we're not running a bank.
I pushed back on risk management one time when The New Guy rebuilt our CI system. It was great, all bells and whistles and tests, except now deploying a change took 5 minutes. Same for rolling back a change. I said "Dude, this used to take 20 seconds. If I made a mistake I would know, and fix it in 20 seconds. Now we have all these tests which still allow me to cause total outage, but now it takes 10 minutes to fix it." He did make it faster in the end :)
tpoacher|1 year ago
[0] https://news.ycombinator.com/item?id=33229338
Aeolun|1 year ago
When you have zero incidents using the temporary process people will automatically start to assume it’s due to the temporary process, and nobody will want to take responsibility for taking it out.
yarekt|1 year ago
I’d go further and say that it’s a trap to try: it’s obvious that you can’t get 100% reliability, but people still feel uneasy doing nothing
Baeocystin|1 year ago
...but that's not really nothing? You're acknowledging the error, and saying the action is going to be watch for a repeat, and if there is one in a short-ish amount of time, then you'll move to mitigation. From a human standpoint alone, I know if I was the client in the situation, I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.
Don't get me wrong; I agree with your assessment. But don't sell non-technical actions short!
Dylan16807|1 year ago
Which is important but not taking an action.
> and saying the action is going to be watch for a repeat
That watching was already happening. Keeping the status quo of watching is below the level of meaningful action here.
> if there is one in a short-ish amount of time, then you'll move to mitigation.
And that would be an action, but it would be a response to the repeat.
> I'd be a lot happier hearing someone say this instead of a blanket 'nothing'.
They did say roughly those things, worded in a different way. It's not like they planned to say "nothing" and then walk out without elaborating!
terlisimo|1 year ago
The client was satisfied after we owned the mistake, explained that we have a number of measures in place for preventing various mistakes, and that making a test for this particular one doesn't make sense. Like, nothing will prevent me from creating a cron job that does "rm -rf * .o". But lights will start flashing and fixing that kind of blunder won't take long.
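(To make the "lights will start flashing" side concrete: detection can be as simple as a cron-run check that alerts when something critical goes missing. The function name and alert line below are hypothetical stand-ins for real monitoring.)

```shell
# Hypothetical detection sketch: a cron-run check that screams when a
# critical path disappears, rather than trying to prevent every
# possible destructive command.
check_critical() {
  path="$1"
  if [ ! -e "$path" ]; then
    echo "ALERT: $path is missing" >&2   # stand-in for a real pager hook
    return 1
  fi
  return 0
}
```

Detection plus fast recovery is the trade being described here: you accept that blunders can happen and optimize for noticing and fixing them quickly.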
hi_hi|1 year ago
You basically took the ROAM approach, apparently without knowing it. This is a good thing. https://blog.planview.com/managing-risks-with-roam-in-agile/
ErrantX|1 year ago
A corollary is that Risk Management is a specialist field. The least risky thing to do is always to close down the business (you can't cause an incident if you have no customers).
Engineers and product folk in particular, I find, struggle to understand Risk Management.
When juniors ask me what technical skill I think they should learn next, my answer is always: Risk Management.
(Heavily recommended reading: "Risk: The Science and Politics of Fear")
Aeolun|1 year ago
How do you do engineering without risk management? Not the capitalized version, but you’re basically constantly making tradeoffs. I find it really hard to believe that even a junior is unfamiliar with the concept (though the risk they manage tends to be skewed towards risk to their reputation).
debacle|1 year ago