top | item 46268418

(no title)

l1n | 2 months ago

Also an engineer on this incident. This was a network routing misconfiguration - an overlapping route advertisement caused traffic to some of our inference backends to be blackholed. Detection took longer than we’d like (about 75 minutes from impact to identification), and some of our normal mitigation paths didn’t work as expected during the incident.

The bad route has been removed and service is restored. We’re doing a full review internally with a focus on synthetic monitoring and better visibility into high-impact infrastructure changes to catch these faster in the future.

discuss

ammut|2 months ago

If you have a good network CI/CD pipeline and can trace the time of deployment to when the errors began, it should be easy to reduce your total TTD/TTR. Even when I was parsing logs years ago and matching them up against AAA authorization commands issued, it was always a question of "when did this start happening?" and then "who made a change around that time period?"

giancarlostoro|2 months ago

I don't know if you guys do write ups, but cloudflare's write ups on outages is in my eyes the gold standard the entire industry should follow.

Arcuru|2 months ago

When I was at Big Corp, I loved reading the internal postmortems. They were usually very interesting and I learned a lot. It's one of the things I miss about leaving.

A tech company that publishes the postmortems when possible always get a +1 in my eyes, I think it's a sign of good company culture. Cloudflare's are great and I would love to see more from others in the industry.

boondongle|2 months ago

A big reason for that is it comes from the CEO. Other providers have a team and then at least 2 to 3 layers of management above them and a dotted line legal counsel. So the goal posts randomly shift from "more information" to "no information" over time based on the relationships of that entire chain, the customer heat of the moment, and personality.

Underneath a public statement they all have extremely detailed post-mortems. But how much goes public is 100% random from the customer's perspective. There's no Monday Morning QB'ing the CEO, but there absolutely is "Day-Shift SRE Leader Phil"

bflesch|2 months ago

Cloudflare deploys stuff on Fridays, and it directly affected shopify, one of their major ecommerce customers. Until they fix their internal processes all writeups should be seen as purely marketing material.

999900000999|2 months ago

Was this a typo situation or a bad process thing ?

Back when I did website QA Automation I'd manually check the website at the end of my day. Nothing extensive, just looking at the homepage for piece of mind.

Once a senior engineer decided to bypass all of our QA, deploy and took down prod. Fun times.

spike021|2 months ago

Depending on how long someone's been in the industry it's more a question of if, not when, an outage will occur due to someone deciding to push code haphazardly.

At my first job one of my more senior team members would throw caution to the wind and deploy at 3pm or later on Fridays because he believed in shipping ASAP.

There were a couple times that those changes caused weekend incidents.

userbinator|2 months ago

In these times, it could be "the AI did it".

weird-eye-issue|2 months ago

[deleted]

tayo42|2 months ago

I was kind surprised to see details like that in a comment, but clicked on your personal website and see your a Co-founder, so I guess no one is going to repremand you lol

wouldbecouldbe|2 months ago

Trying to understand what this means.

Did the bad route cause an overload? Was there a code error on that route that wasn’t spotted? Was it a code issue or an instance that broke?

bc569a80a344f9c|2 months ago

It says network routing issue.

Network routes consist of a network (a range of IPs) and a next hop to send traffic for that range to.

These can overlap. Sometimes that’s desirable, sometimes it is not. When routers have two routes that are exactly the same they often load balance (in some fairly dumb, stateless fashion) between possible next hops, when one of the routes is more specific, it wins.

Routes get injected by routers saying “I am responsible for this range” and setting themselves as the next hop, others routers that connect to them receive this advertisement and propagate it to their own router peers further downstream.

An example would be advertising 192.168.0.0/23, which is the range of 192.168.0.0-192.168.1.255.

Let’s say that’s your inference backend in some rows in a data center.

Then, through some misconfiguration, some other router starts announcing 192.168.1.0/24 (192.168.1.0-192.168.1.255). This is more specific, that traffic gets sent there, and half of the original inference pod is now unreachable.

mattdeboard|2 months ago

it means their servers were unreachable due to network misconfig.

colechristensen|2 months ago

The details and promptness of reporting are much appreciated and build trust, so thanks!