AWS Cognito is having issues and health dashboards are still green

492 points| rcardo11 | 5 years ago |status.aws.amazon.com | reply

349 comments

[+] PragmaticPulp|5 years ago|reply
We hired an engineer out of Amazon AWS at a previous company.

Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.

After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.

FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.

[+] uji|5 years ago|reply
Have worked at AWS before, and I can attest to this. Whenever we had an outage, our director and senior manager would take a call on whether to update the dashboard or not.

Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.

As a dev oncall, we used to get 20 sev2s per day (an oncall ticket which needs to be handled within 15 mins), so most of the time things are broken; it's just that it's not visible to external customers through the dashboard.

[+] outworlder|5 years ago|reply
No idea what happens on AWS as I don't work there, but I have another perspective on this.

There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_. That sounded backwards, so I dug a bit more.

Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.

That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.

We ended up removing the public dashboard and using other mechanisms to notify customers.

[+] Aperocky|5 years ago|reply
That's the opposite of my experience at AWS. It's likely that the culture at AWS has changed over the past few years; it's also likely that there's a difference in culture between teams.
[+] Ensorceled|5 years ago|reply
"Shooting the messenger" is so common that we, well, have a phrase for it.
[+] JMTQp8lwXL|5 years ago|reply
Eventually, anyone in that role would get fired. No service has an established 100% availability uptime when measured over its complete existence (I welcome any assertions challenging this, if anyone has them).
[+] ind3mp0tent|5 years ago|reply
I have heard stories like these before, but it wasn’t clear to me that this is apparently a broader issue at AWS (reading the other comments). While I think that very short outages in line with SLAs need not necessarily go public or have a post mortem, it is astonishing to see that some teams/managers go to such lengths to hide this at the „primus“ of hyperscalers.

I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...

But back to topic, when should we update status pages? On every incident? Or when SLAs are violated?

[+] mathattack|5 years ago|reply
This sounds like a managerial incentives problem.

If a person or company’s compensation depends on not fessing up to problems, they won’t fess up to them.

[+] jmartens|5 years ago|reply
Blaming people/employees is bad. That said, the idea of not updating a status page quickly, to reflect reality, is a problem at almost every SaaS company in the world. As others have said, status page changes are political and impact marketing; they have very little to do with providing good, timely information to customers.
[+] one2know|5 years ago|reply
At Amazon, admitting to a problem is guaranteed to lead to having to open a COE (correction of error), which means meetings with executives, an inevitable "least effective" rating, a development plan, scapegoating, a PIP, and firing.
[+] hypervisorxxx|5 years ago|reply
Or it could be a fluke. GCP went down in such a way that the dashboard updates were not independent of the regions that went down: the dashboard was down because the region it was reporting on was also the region it was deployed in. I think that ended up being the root cause, but yes, that was another Sunday on call where I didn't go to the gym and sat in front of my computer waiting for updates that never came. What's worse is when they say they will post the next update at a certain time and then no update is made.

Even if you don't know what to say, still post an update saying that, so the rest of us can report to our teams and make decisions about our own work lives and personal lives.

[+] piewzko|5 years ago|reply
Now is probably a good time to plug some of the open source alternatives to vendor locked in identity solutions:

- https://github.com/ory

- https://github.com/dexidp/dex

- https://github.com/authelia/authelia

- https://github.com/keycloak/keycloak

- https://www.gluu.org/

- https://github.com/accounts-js/accounts

[+] kevindong|5 years ago|reply
I'd expect Amazon to be better able to maintain uptime than a self-hosted option at most (but not all) companies.
[+] lukevp|5 years ago|reply
Fusionauth is pretty cool. I’ve worked with the team a bit on the .net core support.
[+] xyst|5 years ago|reply
I'm surprised companies still want to build their own identity system or pay companies (ping, auth0) to host it for them

ory looks like a really good project

[+] technics256|5 years ago|reply
Anyone have thoughts on their experience with keycloak?
[+] agustif|5 years ago|reply
add AccountsJS, a small nice modular typescript/js lib for building account systems easily
[+] rcardo11|5 years ago|reply
> This is also causing issues with Amplify, API Gateway, AppStream2, AppSync, Athena, Cloudformation, Cloudtrail, Cloudwatch, Cognito, DynamoDB, IoT Services, Lambda, LEX, Managed BlockChain, S3, Sagemaker, and Workspaces.

Well, this is a major outage

[+] Schweigi|5 years ago|reply
Indeed, we had the first AWS Kinesis issues already at 13:50 (UTC). Now it's still ongoing after two hours. The status page didn't even update in the first 45 min or so...
[+] manishsharan|5 years ago|reply
Thanks for this. My Lambda@Edge function was not working and I thought I broke something in my permissions, even though I had not touched them for at least a month. This is the very "helpful" error message:

The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.

[+] booleanbetrayal|5 years ago|reply
This is also affecting Fargate (at least EKS) in that its scheduling system is broken. No way to get new pods.
[+] jpp|5 years ago|reply
We're also seeing issues with Fargate ECS -- the task we had with auto-scaling scaled down to 0. The one we had with a fixed number of workers is fine.
[+] odiroot|5 years ago|reply
It's always a DNS issue.
[+] bart_spoon|5 years ago|reply
I'm also seeing weirdness with Batch. It's working, but the dashboards aren't showing job statuses accurately and jobs aren't always terminating.
[+] cpufry|5 years ago|reply
and this is what's disclosed to the public
[+] durkie|5 years ago|reply
seeing issues with scaling up/down in elastic beanstalk too
[+] holler|5 years ago|reply
yep, iot in us-east-1 not working for me
[+] driverdan|5 years ago|reply
Five hours later and nothing has changed. For a company like Amazon this should be unacceptable.

Before someone replies and says use a different AZ, that's not possible for everyone. If you use a 3rd party service that is hosted on us-east-1 you can't do anything about it. For example, many Heroku services are broken because of this.

[+] ttam|5 years ago|reply
I can imagine that there are literally 100s of engineers involved in trying to fix this ASAP, since this is not only bringing down the systems of external customers, but also critical internal systems, plus the bad PR.

All on the eve of thanksgiving.

[+] Bombthecat|5 years ago|reply
I think the deeper problem is the interconnectivity between services and their APIs. It's too complicated to maintain...
[+] turdnagel|5 years ago|reply
Isn't it common practice to host your status board on someone else's infrastructure?

In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.

[+] throwaway343432|5 years ago|reply
Large-scale events (LSEs) are becoming more and more common. It'll keep getting worse.

AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.

AWS will be in deep trouble when/if GCE fixes their customer support.

[+] s_dev|5 years ago|reply
Can anyone explain why status pages are so difficult? There are even startups like status.io dedicated to this one thing.

It really does seem that whenever there is an outage, more often than not the status page is showing all green traffic lights, which makes it useless as a tool to corroborate what's happening.

How did AWS status page compare with status.io/aws?

[+] zxcvbn4038|5 years ago|reply
I think we are learning everything that uses AWS Kinesis internally which is cool. It’s always fascinating to learn how AWS works on the backend.
[+] drfritznunkie|5 years ago|reply
Cognito is one of the most frustrating AWS services I have to work with, it is almost, but not quite, entirely unlike an SP.

We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.

Has anyone figured out how to set up Cognito in multiple regions without the hijinks of having the customer set up trusts for each region? Not to mention, while multiple trusts are I think possible with ADFS (not that I've tested it), I'm pretty sure that Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...

[+] _0o6v|5 years ago|reply
> It's not posted on SHD as the issue has impacted our ability to post there.

Is that not a massive catch-22 for a service dashboard?

[+] camhart|5 years ago|reply
"This issue has also affected our ability to post updates to the Service Health Dashboard."

Last sentence of the alert at the top of the page.

[+] vishesh92|5 years ago|reply
> "This issue has also affected our ability to post updates to the Service Health Dashboard."

This is why I prefer 3rd party monitoring systems to track health of my internal monitoring systems.
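
A minimal sketch of that idea, assuming a small probe script that runs on infrastructure separate from the systems it checks. The function names (`http_ok`, `overall_status`) and the green/yellow/red rule are illustrative assumptions, not any vendor's API:

```python
import urllib.request
from typing import Callable, Dict


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError):
        # Covers connection errors, timeouts, HTTP 4xx/5xx, and malformed URLs.
        return False


def overall_status(checks: Dict[str, Callable[[], bool]]) -> str:
    """Run every named check and fold the results into one status string.

    Assumed rule: all checks pass -> green, some fail -> yellow, all fail -> red.
    """
    results = {name: check() for name, check in checks.items()}
    failed = sorted(name for name, ok in results.items() if not ok)
    if not failed:
        return "green"
    if len(failed) < len(results):
        return "yellow: " + ", ".join(failed)
    return "red"
```

In use, each check would wrap a real endpoint, for example `{"dashboard": lambda: http_ok("https://status.example.com/")}` (a placeholder URL), and the script would be scheduled from a different provider or region than the one being monitored.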

[+] pluc|5 years ago|reply
Rule #1 of status pages: never put your status page on the same infrastructure it monitors.
[+] dsagal|5 years ago|reply
Banner on top of https://status.aws.amazon.com/ just had an update from 8:36AM PST -- just removed -- even though it's only 7:42AM PST. I guess it's really manual firefighting there.
[+] unilynx|5 years ago|reply
There's a lot more going on over there...

- 7 cloudfront distributions created today are still in "InProgress", a few already for more than one hour

- The support case I created about it doesn't show up in my support portal. Direct link to it does work though

[+] swasheck|5 years ago|reply
Ah yes. It's the annual AWS Thanksgiving Holiday major us-east outage.