top | item 45976227

thatoneengineer | 3 months ago

The unwrap: not great, but understandable. It would be better to silently run with a partial config while paging oncall on some other channel, but that's a lot of engineering for a case that apparently is supposed to be "can't happen".
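A minimal sketch of that alternative (all names hypothetical, not Cloudflare's actual code): parse what you can, keep count of what you dropped, and let a side channel page oncall when the count is nonzero.

```rust
// Hypothetical sketch: accept a partial config, but surface the damage
// instead of panicking or failing silently.
#[derive(Debug)]
struct Feature {
    name: String,
}

fn parse_line(line: &str) -> Result<Feature, String> {
    let name = line
        .strip_prefix("feature:")
        .ok_or_else(|| "bad line".to_string())?;
    Ok(Feature { name: name.trim().to_string() })
}

/// Parse every line we can; return how many lines were dropped so a
/// supervisor can page oncall whenever `dropped > 0`.
fn parse_partial(raw: &str) -> (Vec<Feature>, usize) {
    let mut features = Vec::new();
    let mut dropped = 0;
    for line in raw.lines() {
        match parse_line(line) {
            Ok(f) => features.push(f),
            Err(_) => dropped += 1, // would also emit an alert metric here
        }
    }
    (features, dropped)
}
```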

The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.

The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.

The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.


vbezhenar | 3 months ago

IMO: there should be an explicit error path for invalid configuration, so the program would abort with a specific exit code and/or message. And there should be a supervisor that detects this behaviour, rolls back to the old working config, and waits a few minutes before trying to apply the new config again (with corresponding alerts, of course).

So basically bad config should be explicitly processed and handled by rolling back to known working config.
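A sketch of that supervisor step (hypothetical names and a toy validity check, not a real implementation): try the new config; if it is rejected, restore the last known-good one and back off before the next attempt.

```rust
use std::{thread, time::Duration};

// Toy stand-in for "the program aborted with a config error".
fn validate(raw: &str) -> Result<(), String> {
    if raw.contains("feature=") {
        Ok(())
    } else {
        Err("invalid config".into())
    }
}

/// Supervisor step: returns whichever config ends up active.
/// On failure it rolls back to `last_good` and sleeps before the
/// caller retries (minutes in production; milliseconds here).
fn supervise(new_cfg: &str, last_good: &str, backoff: Duration) -> String {
    match validate(new_cfg) {
        Ok(()) => new_cfg.to_string(),
        Err(e) => {
            eprintln!("apply failed ({e}); rolling back, waiting before retry");
            thread::sleep(backoff);
            last_good.to_string()
        }
    }
}
```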

jgilias | 3 months ago

You don’t even need all the ceremony. If the config gets updated every 5 minutes, it surely is being hot-reloaded. If that’s the case, the old config is already in memory when the new config is being parsed. If that’s the case, parsing shouldn’t have panicked, but logged a warning, and carried on with the old config that must already be in memory.
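That carry-on behaviour is small in code terms. A minimal sketch (hypothetical types; not Cloudflare's actual loader): on a failed parse, log and keep the config already in memory.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Config {
    version: u64,
}

// Toy parser: the real config is of course more than a version number.
fn parse(raw: &str) -> Result<Config, std::num::ParseIntError> {
    raw.trim().parse().map(|version| Config { version })
}

/// Hot-reload step: the old config is already in memory, so a bad new
/// config is a warning, not a panic.
fn reload(current: &Config, raw: &str) -> Config {
    match parse(raw) {
        Ok(next) => next,
        Err(e) => {
            eprintln!("new config rejected ({e}); keeping v{}", current.version);
            current.clone()
        }
    }
}
```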

bungle | 3 months ago

The system outputting the configuration file failed (it could have checked the size and/or content and stopped right away), but the system importing the file failed too. These things usually sound simple/stupid in hindsight. I am not a fan of everything centralising into a few hands. In a bad situation, they can be weaponised or attacked; and even in a good situation their blast radius is just too big and a bit random, in this case global.

twoodfin | 3 months ago

The query is surely faulty: Even if this wasn’t a huge distributed database with who-knows-what schemas and use cases, looking up a specific table by its unqualified name is sloppy.

But the architectural assumption that the bot file build logic can safely obtain this operationally critical list of features from derivative database metadata, rather than from a single source of truth, seems like a bigger problem to me.

watchful_moose | 3 months ago

It's probably not ok to silently run with a partial config, which could have undefined semantics. An old but complete config is probably ok (or, the system should be designed to be safe to run in this state).

philipwhiuk | 3 months ago

For unwrap, Cloudflare should consider adding lint tooling that prevents unwrap being added to production code.
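Such a lint already exists in stable Clippy; enabling it is a couple of attributes at the crate root (or the equivalent `-D` flags in CI):

```rust
// Deny in CI: Clippy's restriction lints flag every `.unwrap()` and
// `.expect()` call so they can't land in production code unnoticed.
#![deny(clippy::unwrap_used)]
#![deny(clippy::expect_used)]
```

Setting `allow-unwrap-in-tests = true` in `clippy.toml` keeps test code ergonomic while still blocking unwraps elsewhere.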

groundzeros2015 | 3 months ago

It’s a feature, not a bug. Assert your assumptions and crash on a bad one.

Crashing is not an outage. It’s a restart and a stack trace for you to fix.

nijave | 3 months ago

Quite surprising that a single bad config file brought down their entire global network across multiple products.

Xunjin | 3 months ago

I share the same opinion; as others pointed out, the status page going down was probably caused by bots checking it.