

"I don’t know. I wish technical organisations would be more thorough in investigating accidents." - This is just armchair quarterbacking at this point, given that they were forthcoming during the incident and published a detailed post-mortem shortly after. The issue is that, not having been a fly on the wall in the war room, the OP is making massive assumptions about the level of discussion that takes place about these types of incidents long after they have left the collective consciousness of the mainstream.


cairnechou|3 months ago

"Armchair quarterbacking" is spot on.

The author asks for a deep, system-theoretic analysis... immediately after the incident. That's just not how reality works.

When the house is on fire, you put it out and write a quick "the wiring was bad" report so everyone calms down. You don't write a PhD thesis on electrical engineering standards within 24 hours. The deep feedback-loop analysis happens weeks later, usually internally.

kqr|3 months ago

> The deep feedback-loop analysis

Thinking the consideration of feedback loops requires "deep" analysis is, I suspect, part of the problem! The insufficient feedback shows up at a very shallow level.

cogman10|3 months ago

People outside of tech (and some inside) can be really bad at understanding how something like this could slip through the cracks.

Reading Cloudflare's description of the problem, this is something that I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.

The reason it's so hard to test for is that all tests would show that there's no problem. This wasn't a code update, it was a config update. Without really extensive performance tests (which, when done well, take a long time!) there really wasn't a way to know that a change that appeared safe wasn't.

I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.

Now, if you want to see a sloppy outage, look at the CrowdStrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.

jsnell|3 months ago

I don't believe that is an accurate description of the issue. It wasn't that the system got too slow due to a big file, it's that the file getting too big was treated as a fatal error rather than causing requests to fail open.
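To make the distinction concrete, here is a minimal sketch of the two behaviours. It is not Cloudflare's code: `MAX_FEATURES`, `ConfigTooLarge`, and both loader functions are hypothetical names, and the limit value is purely illustrative. In this sketch, "fail open" is modelled as rejecting the oversized file and continuing to serve with the last known good one.

```python
MAX_FEATURES = 200  # hypothetical limit, illustrative value only


class ConfigTooLarge(Exception):
    """Raised when a feature file exceeds the assumed limit."""


def load_features_strict(lines):
    """Fatal variant: an oversized file raises, and the caller goes down with it."""
    features = [ln.strip() for ln in lines if ln.strip()]
    if len(features) > MAX_FEATURES:
        raise ConfigTooLarge(f"{len(features)} features > {MAX_FEATURES}")
    return features


def load_features_fail_open(lines, last_known_good):
    """Fail-open variant: reject the bad file and keep the previous config.

    Returns (features, accepted) so the caller can alert on a rejected update.
    """
    try:
        return load_features_strict(lines), True
    except ConfigTooLarge:
        return last_known_good, False
```

With the second variant a bad update still needs to be noticed and alerted on, but requests keep flowing while someone investigates.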

s1mplicissimus|3 months ago

> This isn't a code update, it was a config update

Oh okay, well I guess the outage wasn't a real issue then

kqr|3 months ago

The article makes no claim about the effort that has gone into the analysis. You can apply a lot of effort and still only produce a shallow analysis.

If the analysis has not uncovered the feedback problems (even with large effort, or without it), my argument is that a better method is needed.