grenbys | 3 years ago
Another massive gap is the rollback: 6:58 – 7:42, 44 minutes! What exactly was going on, and why did it take so long? What were those backup procedures that were mentioned only briefly? Why were engineers stepping on each other's toes? What's the story with reverting reverts?
Adding more automation and tests, and fixing that specific ordering issue, is of course an improvement. But it also adds more complexity, and any automation will ultimately fail some day.
The technical details are all appreciated, but it is going to be something else next time. It would be great to learn more about the human interactions. That's where the resilience of the socio-technical system played out, and I bet there is some room for improvement there.
systemvoltage|3 years ago
Like, does Cloudflare have an emergency escalation procedure? What does that look like? How does the CTO get woken up in the middle of the night? How do they get in touch with the critical, most important engineers? Who noticed Cloudflare was down first? How do quick decisions get made? Do people get on a giant Zoom call, or do emails go around? What if they can't get hold of the most important people who can flip the switches? Do they have a control room like in the movies, with the CTO looking over someone's shoulder calling out "Affirmative, apply the fix," followed by a progress bar painfully creeping toward completion?
matsur|3 years ago
https://sre.google/resources/book-update/managing-incidents/ is Google-focused, but our flavor of incident response is not too far off.
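To make the role split from that material concrete, here is a minimal toy sketch in Python of the incident-command structure it describes (Incident Commander, Operations Lead, Communications Lead). The class names, people, and fields are illustrative assumptions for the sketch, not anyone's actual tooling:

    # Toy model of the incident-command role split from the linked SRE material.
    # All names and fields here are illustrative assumptions, not real tooling.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Role:
        title: str            # e.g. "Incident Commander"
        person: str           # single, explicit owner at any moment
        duties: List[str] = field(default_factory=list)

    @dataclass
    class Incident:
        summary: str
        roles: List[Role] = field(default_factory=list)

        def hand_off(self, title: str, new_person: str) -> None:
            # Explicit hand-offs keep exactly one owner per role.
            for r in self.roles:
                if r.title == title:
                    r.person = new_person
                    return
            raise ValueError(f"no such role: {title}")

    outage = Incident(
        summary="rollback of a bad change",
        roles=[
            Role("Incident Commander", "alice", ["coordinate", "delegate"]),
            Role("Operations Lead", "bob", ["apply the fix"]),
            Role("Communications Lead", "carol", ["status updates"]),
        ],
    )
    outage.hand_off("Incident Commander", "dave")  # e.g. at a shift change

The point of the structure is that the questions upthread (who flips the switch, who talks to whom, who decides) each map to a named role with exactly one owner at any time.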
nijave|3 years ago
Slack: "@here need to connect to <long list of devices> to rollback change asap"