(no title)
thatoneengineer | 3 months ago
The lack of a canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful, though, which in some ways they weren't.
The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.
The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.
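For anyone who skipped the postmortem: the feature list was pulled from ClickHouse's system.columns without scoping to a single database, so a permissions change made every feature come back twice and doubled the generated file. A build-time guard along these lines (a sketch only; the feature names and the limit are illustrative, Rust for concreteness) would have turned that into a failed build instead of a bad file shipped to prod:

```rust
use std::collections::HashSet;

/// Hypothetical build-time guard for the generated feature list. Per the
/// public postmortem, the metadata query lacked a `WHERE database = ...`
/// scope, so system.columns returned every feature twice after a
/// permissions change made a second database visible.
fn validate_features(features: &[String], max_features: usize) -> Result<(), String> {
    let mut seen = HashSet::new();
    for f in features {
        if !seen.insert(f.as_str()) {
            return Err(format!("duplicate feature '{f}': metadata query unscoped?"));
        }
    }
    if features.len() > max_features {
        return Err(format!("{} features exceeds limit of {max_features}", features.len()));
    }
    Ok(())
}

fn main() {
    // Duplicated rows, as when the same tables become visible in two databases.
    let doubled: Vec<String> = ["ua_score", "ua_score", "ja4_hash", "ja4_hash"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    assert!(validate_features(&doubled, 200).is_err());
}
```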
vbezhenar | 3 months ago
So basically a bad config should be explicitly detected and handled by rolling back to the last known-working config.
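A minimal sketch of that, treating a bad file as an expected input rather than an exceptional one (all names hypothetical):

```rust
/// Hypothetical last-known-good config holder: a candidate config that fails
/// validation is rejected, and the previous working one keeps serving.
struct ConfigManager {
    active: Vec<String>, // last known-good feature list
}

impl ConfigManager {
    fn try_update(&mut self, candidate: Vec<String>, max_features: usize) {
        if candidate.len() > max_features {
            // Bad config is an expected input: log it and keep the old one.
            eprintln!(
                "rejected config: {} features > limit {max_features}; keeping last known-good",
                candidate.len()
            );
            return;
        }
        self.active = candidate;
    }
}

fn main() {
    let mut mgr = ConfigManager { active: vec!["ua_score".to_string()] };
    mgr.try_update(vec!["f".to_string(); 400], 200); // oversized file: rejected
    assert_eq!(mgr.active.len(), 1); // still serving the previous config
}
```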
twoodfin | 3 months ago
But the architectural assumption that the bot-file build logic can safely derive this operationally critical feature list from database metadata, rather than from a single source of truth (SSOT), seems like a bigger problem to me.
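A hedged sketch of the alternative: keep an explicit, code-reviewed registry as the source of truth, and treat anything derived from database metadata as a cache that must be cross-checked against it (names invented):

```rust
use std::collections::BTreeSet;

// Hypothetical SSOT: an explicit, reviewed registry of features, rather than
// whatever a metadata query happens to return on a given day.
const FEATURE_REGISTRY: &[&str] = &["ja4_hash", "ua_score"];

/// The list derived from database metadata must match the registry exactly;
/// any drift fails the build instead of shipping silently.
fn check_against_registry(derived: &[String]) -> Result<(), String> {
    let want: BTreeSet<&str> = FEATURE_REGISTRY.iter().copied().collect();
    let got: BTreeSet<&str> = derived.iter().map(String::as_str).collect();
    if want != got {
        return Err(format!(
            "feature drift: missing {:?}, unexpected {:?}",
            want.difference(&got).collect::<Vec<_>>(),
            got.difference(&want).collect::<Vec<_>>(),
        ));
    }
    Ok(())
}

fn main() {
    let derived = vec!["ja4_hash".to_string(), "bot_score_v2".to_string()];
    assert!(check_against_registry(&derived).is_err()); // drift detected
}
```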
groundzeros2015 | 3 months ago
Crashing is not an outage. It’s a restart and a stack trace for you to fix.
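For context on what crashed here: reportedly an unwrap() on a Result when the oversized feature file exceeded a preallocated limit, so every proxy panicked on the same input at once. A sketch of that failure shape next to the fail-soft alternative (names and numbers invented):

```rust
// Hypothetical reconstruction: loading a feature file with a hard
// preallocated limit, as described in the public write-up.
fn load_features(raw: &[&str], limit: usize) -> Result<Vec<String>, String> {
    if raw.len() > limit {
        return Err(format!("{} features exceeds preallocated limit {limit}", raw.len()));
    }
    Ok(raw.iter().map(|s| s.to_string()).collect())
}

fn main() {
    let oversized = ["f"; 300];

    // Crash-on-bad-input: this is the unwrap() shape; one malformed global
    // file takes down every instance simultaneously.
    // let features = load_features(&oversized, 200).unwrap();

    // Fail-soft: surface the error, keep the previous config, and reserve the
    // restart-plus-stack-trace model for genuine bugs rather than bad inputs.
    match load_features(&oversized, 200) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("keeping last known-good config: {e}"),
    }
}
```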