
Partial Cloudflare outage on October 25, 2022

115 points | mfiguiere | 3 years ago | blog.cloudflare.com

55 comments

[+] ec109685|3 years ago|reply
Would love to learn more about why a tracing function mutates the request (clears headers). That seems like a foot gun that would be impossible for anyone not intimately familiar with the function's implementation to avoid.
[+] jgrahamc|3 years ago|reply
Internal retro is going to be digging into that question (amongst others). I imagine we’ll update the post with further details.
[+] Thorrez|3 years ago|reply
Mutating the headers might make sense if it needs to inject a trace ID into the headers. That doesn't explain why it would clear all headers though.
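The failure mode described above, injecting a trace ID but accidentally replacing the whole header set, can be sketched in a few lines. This is illustrative Python, not Cloudflare's actual code (which is internal and unpublished); the request shape and header names are assumptions:

```python
import uuid

def trace_request(request):
    """Inject a trace ID without touching the caller's other headers.
    `request` is a hypothetical dict with a "headers" mapping."""
    # Correct: add one header, preserve the rest.
    request["headers"]["X-Trace-Id"] = str(uuid.uuid4())
    return request

def trace_request_buggy(request):
    """The bug class in question: rebuilds the header mapping from
    scratch and silently drops every pre-existing header."""
    request["headers"] = {"X-Trace-Id": str(uuid.uuid4())}
    return request
```

Both versions "work" for the tracing system itself, which is why the second one is so hard to catch without a test that asserts the rest of the request survives.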
[+] dannyw|3 years ago|reply
Love Cloudflare or not, the transparency and frankness from the CTO in comments sections like these, is a rare and welcome sight.
[+] teknopaul|3 years ago|reply
I can imagine why this took a while to fix.

It was not DNS.

[+] stevewatson301|3 years ago|reply
The incident report is very light on details. For example, it mentions that adding instrumentation caused the issue, but gives no details as to why they'd need to remove headers or how they'd contain it in the future, except that they'd need to "fail fast" in this situation.

(You can obviously catch it with a blue-green deployment, which they do mention, but at that point some small portion of traffic has already been affected.)

[+] jgrahamc|3 years ago|reply
This incident report was out within hours of the failure. The team is now doing an internal retro to look at decision speed, rollback speed, and the circumstances around that function having a side effect. We'll update the post with more detail.
[+] dspillett|3 years ago|reply
> For example, it mentions adding instrumentation caused the issue, but no details as to why they'd need to remove headers

They probably don't need to at all, but I can think of several ways it might unintentionally happen.

Perhaps someone chose to implement the modification as “separate headers & body, modify headers, put the request back together” and something went wrong that made the reassembly fail quietly (so the now-malformed request passes to the next part without an exception being raised) or not get called at all in some circumstances. Alternatively, a badly coded routine to remove a header (perhaps stripping the instrumentation information at the point in the process where it is no longer useful) may have removed more than it should.

How such things get past a code review without someone suggesting a less error-prone method might be a question that gets asked in the internal investigation. I assume somewhere like CF has many checks between initial code modifications and things dropping into production, unless they've really taken “fail fast” to heart!
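One hedged sketch of the "badly coded removal routine" guess above: a cleanup step that is supposed to strip a single internal header but matches too broadly and takes legitimate headers with it. The header names here are made up for illustration:

```python
def strip_internal_headers(headers):
    """Buggy sketch: meant to drop only the internal trace header once
    it is no longer useful, but the prefix match removes every X-
    header, including ones downstream systems actually need."""
    return {k: v for k, v in headers.items() if not k.startswith("X-")}

def strip_internal_headers_fixed(headers):
    """Safer version: remove exactly the one header by name."""
    return {k: v for k, v in headers.items() if k != "X-Internal-Trace"}
```

The buggy version raises no exception and returns a perfectly valid (just emptier) header set, which matches the "fails quietly" scenario.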

[+] stall84|3 years ago|reply
Yes, they probably could fill in plenty more detail, but overall I was impressed with the incident report on its own. I don't think it's super common for companies their size to release a timeline/report on the fkup unless they're coerced to.
[+] eis|3 years ago|reply
It seems to me that the rollout process is flawed.

A change should be rolled out to a small fraction of nodes and then monitored for an extended period. 5% of requests returning errors would be easy to spot. Only when a new release has run stably on that portion for some time should it proceed to a bigger subset. At CF's size you probably want to do this in several steps.

We also learn that it took customer reports to get the investigation rolling, but 5xx errors are easy to monitor, so this points at internal monitoring being lacking, even though it's hard for me to believe that they don't have an eye on this already.

It's not the first time that a deploy has brought Cloudflare (partially) down. From the timeline we see there are several hours between the investigation starting and the rollout being stopped. A rollback should be the first thing considered, even before looking at what the actual issue is.

Ideally you have someone sitting next to a red rollback button during a rollout whose only job is to keep an eye on all automatic and customer error reports. :)
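The staged rollout described above can be sketched roughly like this. The stage fractions, error budget, and both callables are illustrative, not anyone's actual release tooling:

```python
STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of the fleet per stage
ERROR_BUDGET = 0.001               # abort if the error rate exceeds this

def staged_rollout(nodes, deploy, error_rate):
    """`deploy` applies the release to one node; `error_rate` samples
    the fleet-wide error rate after each stage. Returns ("complete", n)
    or ("rolled_back", n), where n is how many nodes were updated."""
    done = 0
    for frac in STAGES:
        target = int(len(nodes) * frac)
        for node in nodes[done:target]:
            deploy(node)
        done = target
        # Bake time goes here: watch the newly updated nodes for a
        # while before widening the blast radius.
        if error_rate() > ERROR_BUDGET:
            return ("rolled_back", done)
    return ("complete", done)
```

The key property is that a bad release only ever touches the current stage's fraction of nodes before the error budget trips.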

[+] ec109685|3 years ago|reply
They did a canary deploy in a data center that didn't have hierarchical caching enabled, so they missed this code path.
[+] hayst4ck|3 years ago|reply
87 minutes sounds large.

   2022-10-25 17:03: CDN Release is rolled back in Atlanta and root cause is confirmed.
   2022-10-25 17:28: Peak impact with approximately 5% of all HTTP requests resulting in an error with status code 530.
   2022-10-25 17:38: An accelerated rollback continues with large data centers acting as Upper tier for many customers.
   2022-10-25 18:04: Rollback is complete in all Upper Tiers.
   2022-10-25 18:30: Rollback is complete.
But this seems like the 87 minutes referred to.

The 530 graph appears to show a steep drop between 17:38 and 17:45, which looks like the "rollback" speed to me.

I think they likely drained many of the lower tier caches/pops/edges, after which rollback speed doesn't matter very much since the machines being rolled back likely weren't servicing traffic.

> large data centers acting as Upper tier for many customers.

To me this sounds like a very awkwardly worded way to say "we shifted traffic".

Clarification on whether traffic was shifted or tiers were drained would be nice.

[+] hayst4ck|3 years ago|reply
Additionally, this was a code-release-based outage(?). It seems like a graph annotated with releases (or A/B testing changes) would have made this outage somewhat trivial to identify.

  2022-10-25 14:39: Multiple teams become involved in the investigation as more customers start reporting increases in errors.
  2022-10-25 17:03: CDN Release is rolled back in Atlanta and root cause is confirmed.
This is what I find worrying.

How does cloudflare track their code releases and does cloudflare annotate their 4 major graphs (errors, rps, utilization, latency) with lines showing when things were pushed to prod?
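One minimal way to support the kind of correlation asked about above is to record release timestamps and look up the most recent deploy before an error spike. Everything below (names, release IDs) is hypothetical, not a description of Cloudflare's tooling:

```python
import bisect

deploys = []  # (timestamp, release_id), kept sorted by timestamp

def record_deploy(ts, release_id):
    """Log each production push as it happens."""
    bisect.insort(deploys, (ts, release_id))

def last_deploy_before(ts):
    """Given the timestamp of an error spike, return the most recent
    release pushed before it: the first suspect to roll back."""
    i = bisect.bisect_left(deploys, (ts, ""))
    return deploys[i - 1][1] if i else None
```

Plotting those same timestamps as vertical lines on the errors/rps/utilization/latency graphs gives the annotation the commenter is asking for.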

[+] daniel_iversen|3 years ago|reply
Is this why Apple's iMessage was down for a bit - are they maybe using CloudFlare?
[+] sideproject|3 years ago|reply
Unfortunately, my sites are still not ok. I'm getting HTTP 409 errors on some of the domains that CNAME to my Cloudflare domains. I've looked everywhere and I still cannot get this resolved.
[+] jgrahamc|3 years ago|reply
Can you email me (jgc) details?
[+] jackblemming|3 years ago|reply
Surprised it both took them so long to decide to rollback and that the rollback lasted so long.
[+] mritzmann|3 years ago|reply
> [..] and that the rollback lasted so long

That does not surprise me. There are companies working with CI/CD systems whose tests easily take more than an hour. And until they are completed, they cannot start releasing (enforced via system permissions). This is a good thing, but the possibility of an emergency rollback should be considered beforehand and maybe (it always depends) exceptions made for such purposes.

And sometimes it makes more sense to revert slowly to make sure it doesn't get worse.

[+] yjftsjthsd-h|3 years ago|reply
Yeah,

> Once identified we expedited a rollback which completed in 87 minutes.

I appreciate that it's a massive infrastructure spanning the entire globe, but... 87 minutes to revert a change? I wouldn't want the job of fixing it, but that doesn't seem good enough when the impact is that bad.

[+] wahnfrieden|3 years ago|reply
Related: anyone know the easiest free/self hosted alternative to cloudflare tunnels?
[+] yamtaddle|3 years ago|reply
I'm struggling to understand exactly what it does (via my employer I'm an admin of a CloudFlare account and we use some features outside DNS and proxy-CDN, but far from all of them) but... I wanna say Wireguard?

Of course, anything else is probably gonna have worse UX, because it won't have all the infra pre-built and auto-configured for you.

Also you'll need at least one publicly-accessible server to be the Internet-facing side of it, and that part would necessarily not be free, unless you can get by on some free-tier VM somewhere.

[+] Thaxll|3 years ago|reply
Is the 5% because you inject traces into 5% of requests?
[+] INeedMoreRam|3 years ago|reply

[deleted]

[+] rubenv|3 years ago|reply
If you feel entitled to be negative, please demonstrate that you are perfect and never make a mistake.

Be a human: mistakes and problems happen, that's a fact of life. What matters is how you act upon them.

[+] iancmceachern|3 years ago|reply
A bit of an aside, but maybe not.

I live on the same block as the Cloudflare headquarters. They have a big common space that has open windows to the street. They have some big projectors that play what are basically screensavers on the wall in this common space, visible from the street. For months now, one of these big wall-sized projections has had in large white letters "this copy of Office is not legit, please contact Microsoft to purchase a legit license of Windows" (I paraphrase a bit). This seems like something they should address, no?

[+] nottorp|3 years ago|reply
I strongly doubt they don't have enough office licenses. What's probably happening is no one can be bothered to fix the user hostile copy protection.

I bet cracked versions of Office don't display that message, so it's an indication that this one is actually legit :)

[+] fogolon|3 years ago|reply
It doesn't surprise me that they are heavy on the marketing and light on the detail. It also fits with every outage they've had that was large enough for others to notice: the root cause is always some embarrassingly simple cock-up that only an intern would make.
[+] iancmceachern|3 years ago|reply
I'm replying to my own comment to note that I'm getting downvoted for calling out a big company for what seems like software theft. That seems odd as I'm seemingly not the one in the wrong, I'm just sharing a simple observation.
[+] rat9988|3 years ago|reply
Have you ever considered contacting them instead of public shaming?
[+] londons_explore|3 years ago|reply
> Once identified we expedited a rollback which completed in 87 minutes

Roll forwards and back should usually be done slowly and carefully.

However, during a major outage, there is a good reason to do the rollback very fast, i.e. within one minute. Fast rollbacks end any outage sooner. And if your service is already down, you have nothing to lose and everything to gain.

That provides an interesting infrastructure problem. Fast rollbacks probably involve restarting a large number of servers, having a large number of cold caches, etc. You can find new bugs that only rear their heads during a fast rollback. Things like docker registries getting overloaded, kubernetes control plane being overloaded, other services becoming unhealthy due to a large influx of requests, etc.

The traditional approach is "don't take risks, don't do a fast rollback". But that can double or more the length of your downtime, especially if you roll back the wrong thing first.

I'd therefore encourage preparing ahead of time for fast rollbacks. The dev/test environment should automatically do 'big bang' rollbacks every week. Production rollbacks should even be practiced from time to time, getting faster and faster each time, keeping an eye on metrics to look for instability.
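A hedged sketch of what a practiced "big bang" rollback might look like: revert every node concurrently and collect stragglers for retry, rather than proceeding wave by wave. The `revert` callable is an illustrative per-node action, not real tooling:

```python
import concurrent.futures

def fast_rollback(nodes, revert, parallelism=64):
    """Revert all nodes at once with bounded parallelism. Returns the
    nodes whose rollback failed, so they can be retried or drained
    instead of blocking the fleet-wide revert."""
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=parallelism) as ex:
        futures = {ex.submit(revert, n): n for n in nodes}
        for fut in concurrent.futures.as_completed(futures):
            try:
                fut.result()
            except Exception:
                failed.append(futures[fut])
    return failed
```

Running exactly this path in dev/test every week is what surfaces the "only during a fast rollback" bugs (overloaded registries, cold caches) before they show up in production.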

[+] oefrha|3 years ago|reply
> And if your service is already down, you have nothing to lose and everything to gain.

They had a 5% request failure rate at the peak of this outage. So a botched rollback could definitely do more harm than good if the impact isn’t 100% understood.

[+] vitus|3 years ago|reply
There have been YouTube outages that were exacerbated precisely because the SRE teams explicitly trained for the disaster scenario where everything is hard down and needs to be brought back as quickly as possible.
[+] avereveard|3 years ago|reply
> The impact lasted for almost six hours in total.

Faster monitoring would have helped too: they detected the issue after 4.5 hours and took one more to fix it.

[+] quickthrower2|3 years ago|reply
Another chaos monkey case? Just do a fast rollback every day your D10 rolls a 1.