top | item 45643848

time0ut | 4 months ago

Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.

The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.

Good reminder that you are only as strong as your weakest link.

SOLAR_FIELDS|4 months ago

This reminds me of the time that Google’s Paris data center flooded and caught fire a few years ago. We weren’t actually hosting compute there, but we were hosting compute in a nearby AWS EU datacenter, and the DNS resolver for our Google services elsewhere happened to be hosted in Paris (or, more accurately, it routed to Paris first because it was the closest). The temp fix was pretty fun: that was the day I found out that /etc/hosts can easily be modified globally across deployments in Kubernetes, AND that it was compelling enough to actually want to do it. Normally you would never want an /etc/hosts entry controlling routing in kube like this, but this temporary kludge shim was the perfect level of abstraction for the problem at hand.
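(For anyone curious, the mechanism being described is presumably Kubernetes `hostAliases`, which injects static entries into every container's /etc/hosts in a pod. A minimal sketch of a Deployment fragment; the IP and hostname are placeholders, not the actual values from this incident:)

```yaml
# Deployment pod-template fragment: hostAliases appends static entries
# to /etc/hosts in every container of the pod, pinning a hostname to a
# known-good IP and bypassing the broken resolver path.
# 203.0.113.10 and the hostname below are illustrative placeholders.
spec:
  template:
    spec:
      hostAliases:
        - ip: "203.0.113.10"
          hostnames:
            - "example.googleapis.com"
```

(Rolling this out is just a normal `kubectl apply` / rolling update, which is part of why it makes a workable emergency shim.)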

citizenpaul|4 months ago

> temporary kludge shim was the perfect level of abstraction for the problem at hand.

That's some nice manager-deactivating jargon.

jordanb|4 months ago

Couldn't you just patch your coredns deployment to specify different forwarders?
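(For context, CoreDNS forwarding is configured in the Corefile, typically stored in the `coredns` ConfigMap in `kube-system`. A hedged sketch of what such a change might look like; the upstream IP is a placeholder:)

```
# Corefile fragment: the "forward" plugin sends names the cluster can't
# resolve to an explicit upstream instead of the node's /etc/resolv.conf.
# 198.51.100.53 below is an illustrative placeholder upstream.
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . 198.51.100.53
    cache 30
}
```

(Editing the ConfigMap and letting CoreDNS reload would redirect all upstream resolution, which is broader than a single /etc/hosts pin but avoids touching workload specs.)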

1970-01-01|4 months ago

I remember Facebook had a similar story when they botched their BGP update and couldn't even access the vault. If you have circular auth, you don't have anything when somebody breaks DNS.

crote|4 months ago

Wasn't there an issue where they required physical access to the data center to fix the network, which meant having to tap in with a keycard to get in, which didn't work because the keycard server was down, due to the network being down?

bcrl|4 months ago

That's similar to the total outage of all Rogers services in Canada back on July 7th 2022. It was compounded by the fact that the outage took out all Rogers cell phone service, making it impossible for Rogers employees to communicate with each other during the outage. A unified network means a unified failure mode.

Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!

ttul|4 months ago

There is always that point you reach where someone has to get on a plane with their hardware token and fly to another data centre to reset the thing that maintains the thing that gives keys to the thing that makes the whole world go round.

beefnugs|4 months ago

So sick of billion-dollar companies not hiring that one more guy.

vladvasiliu|4 months ago

> Identity Center and only put it in us-east-1

Is it possible to have it in multiple regions? Last I checked, it only accepted one region. You needed to remove it first if you wanted to move it.

raverbashing|4 months ago

Security people and ignoring resiliency and failure modes: a tale as old as time

AndrewKemendo|4 months ago

Correct. That does make it a centralized failure mode and everyone is in the same boat on that.

I’m unaware of any common and popular distributed IDAM that is reliable

barbazoo|4 months ago

Wow, you really *have* to exercise the region failover to know if it works, eh? And that confidence gets weaker the longer it’s been since the last failover I imagine too. Thanks for sharing what you learned.

shdjhdfh|4 months ago

You should assume it will not work unless you test it regularly. That's a big part of why having active/active multi-region is attractive, even though it's much more complex.

ej_campbell|4 months ago

Most likely even that wouldn't have caught this, unless they verified they had no incidental tie-ins with us-east-1.

jpollock|4 months ago

The last place I worked actively switched traffic over to the backup nodes regularly (at least monthly) to ensure we could do it when necessary.

We learned that lesson by having to do emergency failovers and having some problems. :)

shawabawa3|4 months ago

For what it's worth, we were unable to log in with root credentials anyway.

I don't think any method of auth was working for accessing the AWS console.

kondro|4 months ago

Sure it was, you just needed to log in to the console via a different regional endpoint. No problems accessing systems from ap-southeast-2 for us during this entire event, just couldn’t access the management planes that are hosted exclusively in us-east-1.

reenorap|4 months ago

It's a good reminder actually that if you don't test the failover process, you have no failover process. The CTO or VP of Engineering should be held accountable for not making sure that the failover process is tested multiple times a month and should be seamless.

sroussey|4 months ago

If you don’t regularly restore a backup, you don’t have one.

hinkley|4 months ago

Too much armor makes you immobile. Will your security org be held to account for this? This should permanently slow down all of their future initiatives, because it’s clear they have been running “faster than possible” for some time.

Who watches the watchers?

ej_campbell|4 months ago

It's totally ridiculous that AWS doesn't make it multi-region by default, or at least warn you heavily that your multi-region service is tied to a single region for identity.

The usability of AWS is so poor.

skywhopper|4 months ago

They don’t charge anything for Identity Center and so it’s not considered an important priority for the revenue counters.

ct520|4 months ago

I always find it interesting how many large enterprises have all these DR guidelines but fail to ever test them. Glad to hear that everything came back all right.

ransom1538|4 months ago

People will continue to purchase Multi-AZ and multi-region even though you have proved what a scam it is. If the east region goes down, ALL of Amazon goes down; feel free to change my mind. STOP paying double rates for multi-region.

ozim|4 months ago

Sounds like a lot of companies need to update their BCP after this incident.

michaelcampbell|4 months ago

"If you're able to do your job, InfoSec isn't doing theirs"