grenbys | 3 years ago
Another massive gap is the rollback: 6:58 – 7:42, 44 minutes! What exactly was going on, and why did it take so long? What were those backup procedures that were mentioned only briefly? Why were engineers stepping on each other's toes? What's the story with reverting reverts?
Adding more automation and tests, and fixing that specific ordering issue, is of course an improvement. But it also adds more complexity, and any automation will ultimately fail some day.
The technical details are all appreciated, but it is going to be something else next time. It would be great to learn more about the human interactions. That's where the resilience of the socio-technical system played out, and I bet there is some room for improvement there.
systemvoltage|3 years ago
Like, does Cloudflare have an emergency escalation procedure? What does that look like? How does the CTO get woken up in the middle of the night? How do they get in touch with the critical, most important engineers? Who noticed Cloudflare was down first? How do quick decisions get made? Do people get on a giant Zoom call, or do emails go around? What if they can't get hold of the most important people who can flip the switches? Do they have a control room like in the movies, with the CTO looking over someone's shoulder calling out "Affirmative, apply the fix," followed by a progress bar painfully creeping toward completion?
matsur|3 years ago
https://sre.google/resources/book-update/managing-incidents/ is Google-focused, but our flavor of incident response is not too far off.
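To make the role split from that material concrete, here is a minimal toy sketch in Python of the incident-command structure it describes (Incident Commander, Operations Lead, Communications Lead). The class names, people, and fields are illustrative assumptions for the sketch, not anyone's actual tooling:

    # Toy model of the incident-command role split from the linked SRE material.
    # All names and fields here are illustrative assumptions, not real tooling.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Role:
        title: str            # e.g. "Incident Commander"
        person: str           # single, explicit owner at any moment
        duties: List[str] = field(default_factory=list)

    @dataclass
    class Incident:
        summary: str
        roles: List[Role] = field(default_factory=list)

        def hand_off(self, title: str, new_person: str) -> None:
            # Explicit hand-offs keep exactly one owner per role.
            for r in self.roles:
                if r.title == title:
                    r.person = new_person
                    return
            raise ValueError(f"no such role: {title}")

    outage = Incident(
        summary="rollback of a bad change",
        roles=[
            Role("Incident Commander", "alice", ["coordinate", "delegate"]),
            Role("Operations Lead", "bob", ["apply the fix"]),
            Role("Communications Lead", "carol", ["status updates"]),
        ],
    )
    outage.hand_off("Incident Commander", "dave")  # e.g. at a shift change

The point of the structure is that the questions upthread (who flips the switch, who talks to whom, who decides) each map to a named role with exactly one owner at any time.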
nijave|3 years ago
Slack: "@here need to connect to <long list of devices> to rollback change asap"