palcu | 2 months ago

Hello, I'm one of the engineers who worked on the incident. We have mitigated the incident as of 14:43 PT / 22:43 UTC. Sorry for the trouble.

l1n|2 months ago

Also an engineer on this incident. This was a network routing misconfiguration - an overlapping route advertisement caused traffic to some of our inference backends to be blackholed. Detection took longer than we’d like (about 75 minutes from impact to identification), and some of our normal mitigation paths didn’t work as expected during the incident.

The bad route has been removed and service is restored. We’re doing a full review internally with a focus on synthetic monitoring and better visibility into high-impact infrastructure changes to catch these faster in the future.
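One common way an overlapping advertisement can blackhole traffic is longest-prefix matching: a more specific prefix wins over the broader one that was carrying the traffic. The sketch below illustrates only that general mechanism. The comment above does not say which prefixes or next hops were involved, so every address and name here is hypothetical.

```python
import ipaddress

def pick_route(routes, dest):
    """Return the (network, next_hop) whose prefix is the longest match for dest, or None."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda r: r[0].prefixlen)

# Hypothetical routing table: one broad prefix pointing at the backends.
routes = [
    (ipaddress.ip_network("10.0.0.0/16"), "inference-backends"),
]
before = pick_route(routes, "10.0.5.7")

# Assumption for illustration: an overlapping, more specific route is advertised by mistake.
routes.append((ipaddress.ip_network("10.0.5.0/24"), "blackhole"))
after = pick_route(routes, "10.0.5.7")
unaffected = pick_route(routes, "10.0.9.1")

print(before[1], after[1], unaffected[1])
```

Only destinations inside the narrower prefix change next hop, which is consistent with "some" backends being affected rather than all of them.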

ammut|2 months ago

If you have a good network CI/CD pipeline and can trace the time of deployment to when the errors began, it should be easy to reduce your total TTD/TTR. Even when I was parsing logs years ago and matching them up against AAA authorization commands issued, it was always a question of "when did this start happening?" and then "who made a change around that time period?"
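The "when did this start?" then "who changed something around then?" lookup can be sketched roughly as follows. This is a minimal illustration, not a description of any real pipeline: the change-log fields, IDs, authors, timestamps, and the 30-minute window are all assumptions.

```python
from datetime import datetime, timedelta

def changes_before_onset(changes, onset, window=timedelta(minutes=30)):
    """Return changes deployed in the `window` leading up to `onset`, newest first."""
    hits = [c for c in changes if timedelta(0) <= onset - c["deployed_at"] <= window]
    return sorted(hits, key=lambda c: c["deployed_at"], reverse=True)

# Hypothetical change log and error onset time, for illustration only.
changes = [
    {"id": "chg-101", "author": "alice", "deployed_at": datetime(2025, 1, 1, 9, 0)},
    {"id": "chg-102", "author": "bob", "deployed_at": datetime(2025, 1, 1, 11, 50)},
    {"id": "chg-103", "author": "carol", "deployed_at": datetime(2025, 1, 1, 12, 20)},
]
error_onset = datetime(2025, 1, 1, 12, 5)

suspects = changes_before_onset(changes, error_onset)
for c in suspects:
    print(c["id"], c["author"], c["deployed_at"].isoformat())
```

Changes deployed after the onset are excluded, and the most recent change before it is listed first, since that is usually the first one worth checking.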

giancarlostoro|2 months ago

I don't know if you guys do write-ups, but Cloudflare's write-ups on outages are, in my eyes, the gold standard the entire industry should follow.

999900000999|2 months ago

Was this a typo situation or a bad process thing?

Back when I did website QA automation, I'd manually check the website at the end of my day. Nothing extensive, just looking at the homepage for peace of mind.

Once, a senior engineer decided to bypass all of our QA, deployed, and took down prod. Fun times.

tayo42|2 months ago

I was kind of surprised to see details like that in a comment, but I clicked on your personal website and see you're a co-founder, so I guess no one is going to reprimand you lol

wouldbecouldbe|2 months ago

Trying to understand what this means.

Did the bad route cause an overload? Was there a code error on that route that wasn’t spotted? Was it a code issue or an instance that broke?

colechristensen|2 months ago

The details and promptness of reporting are much appreciated and build trust, so thanks!

giancarlostoro|2 months ago

Any chance you guys could do write-ups on these incidents similar to how Cloudflare does? For all the heat some people give them, I trust Cloudflare more with my websites than a lot of other companies because of their dedication to transparency.

l1n|2 months ago

We're considering this!

nickpeterson|2 months ago

The one time you desperately need to ask Claude and it isn’t working…

dan_wood|2 months ago

Can you divulge more on the issue?

Just curious, as a developer and DevOps engineer. It's all quite interesting where and how things go wrong, especially with large deployments like Anthropic's.

binsquare|2 months ago

I yearn for the nitty gritty details too

dgellow|2 months ago

Hope you have a good rest of your weekend

Chance-Device|2 months ago

Thank you for your service.

g-mork|2 months ago

it's still down get back to work