> In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.
It still astounds me that the big dogs do not phase config rollouts. Code is data, configs are data; they are one and the same. It was the same issue with the giant CrowdStrike outage last year: they were rawdogging configs globally, a bad config made it out there, and everything went kaboom.
You NEED to phase config rollouts like you phase code rollouts.
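To make that concrete, here is a minimal sketch of what phasing a config rollout could look like: push to a small slice of the fleet, let it bake, check error rates, and only widen the blast radius if things look healthy. The stage fractions, bake time, and the apply/metrics/rollback helpers are illustrative stand-ins, not any real vendor's API.

```python
# Minimal sketch of a phased config rollout, mirroring a phased code rollout.
# All helpers are stand-ins for real deploy/monitoring hooks (assumptions,
# not an actual Cloudflare or CrowdStrike API).
import time

STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet per stage
BAKE_SECONDS = 600                 # soak time before widening the blast radius
ERROR_BUDGET = 0.001               # abort if the error rate exceeds 0.1%

def apply_config(host, config):    # stand-in: push the config to one host
    pass

def error_rate(hosts):             # stand-in: query metrics for these hosts
    return 0.0

def rollback(hosts):               # stand-in: restore the previous config
    pass

def rollout(new_config, fleet):
    applied = set()
    for fraction in STAGES:
        cohort = fleet[: max(1, int(len(fleet) * fraction))]
        for host in cohort:
            if host not in applied:
                apply_config(host, new_config)
                applied.add(host)
        time.sleep(BAKE_SECONDS)             # let metrics accumulate
        if error_rate(applied) > ERROR_BUDGET:
            rollback(applied)
            raise RuntimeError(f"rollout aborted at {fraction:.0%} of the fleet")
```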
The big dogs absolutely do phase config rollouts as a general rule.
There are still two weaknesses:
1) Some configs are inherently global and cannot be phased. There's only one place to set them. E.g. if you run a webapp, this would be the load balancer's config, as opposed to the config on each individual webserver.
2) Some configs have a cascading effect -- even though a config is applied to only 1% of servers, it affects the other servers they interact with, and the bad state spreads across the entire network (one common guard against this case is sketched below).
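For the cascading case, one widely used mitigation is to make every consumer of the config defensive: validate a newly distributed config before swapping it in, and keep serving from the last known-good copy if it fails. A rough sketch, with hypothetical file paths and sanity checks:

```python
# Sketch of consumer-side protection against a bad config that fans out
# globally: validate before swapping, and fall back to the last known-good
# copy instead of crashing. Paths and checks are hypothetical.
import json
import shutil

ACTIVE = "/etc/service/config.json"
LAST_GOOD = "/etc/service/config.last_good.json"   # assumed to already exist

def parse(path):
    with open(path) as f:
        return json.load(f)

def validate(config):
    # Hypothetical sanity checks: required keys, size limits, value ranges.
    return isinstance(config, dict) and "features" in config

def load_config(candidate_path):
    try:
        config = parse(candidate_path)
        if not validate(config):
            raise ValueError("candidate config failed validation")
        shutil.copy(candidate_path, ACTIVE)
        shutil.copy(candidate_path, LAST_GOOD)   # promote to known-good
        return config
    except Exception:
        # Refuse the new config rather than crash; keep serving with the
        # last configuration that was known to work.
        return parse(LAST_GOOD)
```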
I think it's uncharitable to jump to the conclusion that, just because there was a config-based outage, they don't do phased config rollouts. And even more uncharitable to compare them to CrowdStrike.
At a company I am no longer with, I argued much the same when we rolled out "global CI/CD" for our IaC. You made one change, committed and pushed, and wham, it was on 40+ server clusters globally. I hated it. The principal engineer was enamored with it, "cattle not pets" and all that, but the result was that things slowed down considerably, because anyone working with it became terrified of making big changes.
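The alternative doesn't have to be heavyweight. One possible way to keep the "cattle" model while restoring some confidence is to promote the same IaC change cluster by cluster with a gate in between. The sketch below is illustrative only, with run_apply standing in for whatever the real pipeline shells out to (terraform apply, ansible-playbook, etc.).

```python
# Hypothetical alternative to "one push hits 40+ clusters at once": apply the
# same IaC change cluster by cluster, with a gate between stages.
import subprocess
import sys

CLUSTERS = ["dev", "staging", "eu-west-1", "us-east-1"]   # illustrative order

def run_apply(cluster):
    # Stand-in for the real apply step, scoped to a single cluster.
    return subprocess.run(["echo", f"apply {cluster}"], check=True)

for cluster in CLUSTERS:
    run_apply(cluster)
    answer = input(f"{cluster} looks healthy -- continue? [y/N] ")
    if answer.lower() != "y":
        sys.exit(f"stopping rollout after {cluster}")
```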
Because adversaries adapt quickly, they have a system that deploys their counter-adversary bits quickly, without phasing -- no matter whether they call them code or configs. See also: CrowdStrike.
Configuration changes are dangerous for CF, it seems, and this one knocked $NET down almost 4% today. I wonder what the industry-wide impact is for each of these outages?
>Configuration changes are dangerous for CF, it seems, and this one knocked $NET down almost 4% today. I wonder what the industry-wide impact is for each of these outages?
This is becoming the "new normal." It seems like every few months, there's another "outage" that takes down vast swathes of internet properties, since they're all dependent on a few platforms and those platforms are, clearly, poorly run.
This isn't rocket surgery here. Strong change management, QA processes, and active business continuity planning/infrastructure would likely have caught this (or at least contained it), as is clear from other large platforms that we don't even think about because outages are so rare.
Like airline reservations systems[0], credit card authorization systems from VISA/MasterCard, American Express, etc.
Those systems (and others) have outages in the "once a decade" range, or even much, much longer. Are the folks over at SABRE and American Express that much smarter and better than Cloudflare/AWS/Google Cloud/etc.? No. Not even close. What they are is careful, as they know their business depends on making sure their customers can use their services anytime/anywhere, without issue.
The level of "Stockholm Syndrome"[1] expressed by many posting in this thread amazes me: relief that it wasn't "an attack," and essentially blaming themselves for not having the right tools (API keys, etc.) to recover from the gross incompetence of, this time at least, Cloudflare.
I don't doubt that I'll get lots of push back from folks claiming, "it's hard to do things at scale," and/or "there are way too many moving parts," and the like.
Other organizations, like the ones I mention above, don't screw their customers every 4-6 months with (clearly) insufficiently tested configuration and infrastructure changes.
Yet many here seem to think that's fine, even though such outages are often crushing to their businesses. But if the customers of these huge providers don't demand better, they'll only get worse. And that's not (at least in my experience) a very deep or profound idea.
[0] https://en.wikipedia.org/wiki/Airline_reservations_system
[1] https://en.wikipedia.org/wiki/Stockholm_syndrome