"I don’t know. I wish technical organisations would be more thorough in investigating accidents." - This is just armchair quarterbacking, given that they were forthcoming during the incident and published a detailed post-mortem shortly after. The problem is that, not having been a fly on the wall in the war room, the OP is making massive assumptions about the level of discussion that takes place about these types of incidents long after they have left the collective consciousness of the mainstream.
cairnechou|3 months ago
The author asks for a deep, system-theoretic analysis... immediately after the incident. That's just not how reality works.
When the house is on fire, you put it out and write a quick "the wiring was bad" report so everyone calms down. You don't write a PhD thesis on electrical engineering standards within 24 hours. The deep feedback-loop analysis happens weeks later, usually internally.
kqr|3 months ago
Thinking the consideration of feedback loops requires "deep" analysis is, I suspect, part of the problem! The insufficient feedback shows up at a very shallow level.
cogman10|3 months ago
Reading Cloudflare's description of the problem, this is something that I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.
It's so hard to test for because all the tests would show that there's no problem. This wasn't a code update; it was a config update. Without really extensive performance tests (which, when done well, take a long time!), there really wasn't a way to know that a change that appeared safe wasn't.
I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.
Now, if you want to see a sloppy outage, look at the CrowdStrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.
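To make the config-file point concrete, here is a minimal sketch of the kind of cheap pre-rollout check that can catch a file growing past what its consumer expects. Everything in it is assumed for illustration: the JSON layout, the `check_config` name, and both limits are hypothetical and are not Cloudflare's actual format or tooling. It also only helps when the limit is known in advance, and it is no substitute for real performance tests.

```python
import json

# Hypothetical limits: what the consuming service is assumed to tolerate.
MAX_ENTRIES = 200
MAX_BYTES = 64 * 1024


def check_config(raw: str, max_entries: int = MAX_ENTRIES,
                 max_bytes: int = MAX_BYTES) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Check the raw size first, since that is what the consumer has to load.
    size = len(raw.encode("utf-8"))
    if size > max_bytes:
        problems.append(f"file is {size} bytes, limit is {max_bytes}")

    # Assumed layout: {"entries": [...]}.
    entries = json.loads(raw)["entries"]
    if len(entries) > max_entries:
        problems.append(f"{len(entries)} entries, limit is {max_entries}")

    return problems


if __name__ == "__main__":
    oversized = json.dumps({"entries": [str(i) for i in range(300)]})
    for problem in check_config(oversized):
        print("refusing to roll out:", problem)
```

The idea is to run something like this in the pipeline that publishes the file, so an oversized config is rejected before it reaches any machine that would choke on it.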
jsnell|3 months ago
s1mplicissimus|3 months ago
Oh okay, well, I guess the outage wasn't a real issue then.
kqr|3 months ago
If the analysis has not uncovered the feedback problems (even with large effort, or without it), my argument is that a better method is needed.