top | item 25210612

opmac | 5 years ago

It is kind of perplexing that AWS dogfoods its own status page. I remember during the massive S3 outage a few years ago that their status page remained green almost the entire time because the red/green/blue icons for the status were stored in... wait for it... S3.

You'd think they would have learned from that.

Twirrim|5 years ago

They did. It came up in the post incident report, and senior leadership kicked off work to have it run on its own distinct infrastructure so that this wouldn't happen again.

If you look at where the content on https://status.aws.amazon.com/ is actually hosted, you'll see that things like the status icons are all under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif, https://status.aws.amazon.com/images/status0.gif, etc.

If you look at the source code for the site, you'll again see that everything is hosted from the same domain.
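For anyone who wants to check this themselves, here is a minimal sketch of one way to do it: parse a page's HTML and list any assets served from a different host than the page itself. The sample HTML, the `/css/site.css` path, and `cdn.example.com` are made up for illustration; only the two `status*.gif` paths come from the real site. Note that a matching hostname only shows the assets aren't pulled from another domain; it says nothing about what infrastructure sits behind that domain.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts and stylesheets on a page."""

    # Which attribute carries the URL for each asset-bearing tag.
    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative paths against the page URL.
                self.assets.append(urljoin(self.base_url, value))


def external_assets(html, base_url):
    """Return asset URLs whose hostname differs from the page's hostname."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    page_host = urlparse(base_url).hostname
    return [u for u in collector.assets if urlparse(u).hostname != page_host]


# Hypothetical page content, not a copy of the real status page.
SAMPLE_HTML = """
<html><head><link rel="stylesheet" href="/css/site.css"></head>
<body>
<img src="/images/status0.gif">
<img src="https://status.aws.amazon.com/images/status1.gif">
<img src="https://cdn.example.com/images/status2.gif">
</body></html>
"""

if __name__ == "__main__":
    for url in external_assets(SAMPLE_HTML, "https://status.aws.amazon.com/"):
        print("off-domain asset:", url)
```

With the sample above, only the `cdn.example.com` image is flagged; an empty result would mean every referenced asset is on the page's own host.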

One of their main goals was to ensure that it could never go wrong that way again.

sleepybrett|5 years ago

Except they posted this:

> 7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in error rates for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

(SHD being the Service Health Dashboard)

opmac|5 years ago

K so they avoided that problem, but something similar has obviously gone wrong again, considering that Kinesis had been partially or fully down for almost an hour before the status page got its first update.

And the fact remains that an outage of AWS's own infrastructure is currently impacting AWS's ability to post status updates on its own status dashboard. It just seems so... amateurish.

ti_ranger|5 years ago

> It is kind of perplexing that AWS dogfoods its own status page.

> You'd think they would have learned from that.

They did.

The page has been updated numerous times since the start of this incident.

opmac|5 years ago

From the status page:

> This issue has also affected our ability to post updates to the Service Health Dashboard.

Just seems so ridiculous that they have trouble reporting the impaired status of their system due to... the impaired status of that same system.

aledalgrande|5 years ago

it was 1.5 hours before the first service was put on yellow