We hired an engineer out of Amazon AWS at a previous company.
Whenever one of our cloud services went down, he would go to great lengths to avoid updating our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat-out refused to ever admit that the cloud services were down.
After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.
FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.
Have worked at AWS before, and I can attest to this. Whenever we had an outage, our director and senior manager would take a call on whether to update the dashboard or not.
Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.
As a dev on call, we used to get 20 sev2s per day (an oncall ticket that needs to be handled within 15 minutes), so most of the time things are broken; it's just not visible to external customers through the dashboard.
No idea what happens on AWS as I don't work there, but I have another perspective on this.
There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_. That sounded backwards, so I dug a bit more.
Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.
That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.
We ended up removing the public dashboard and using other mechanisms to notify customers.
That's the opposite of my experience at AWS. It's likely that the culture at AWS has changed over the past few years; it's also likely that there's a difference in culture between teams.
Eventually, anyone in that role would get fired. No service has maintained 100% availability when measured over its complete existence (I welcome any assertions challenging this, if anyone has any).
I have heard stories like these before, but it wasn't clear to me that this is apparently a broader issue at AWS (reading the other comments). While I think that very short outages within SLAs need not necessarily go public or get a post mortem, it is astonishing to see that some teams/managers go to such lengths to hide this at the "primus" of hyperscalers.
I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...
But back to topic, when should we update status pages? On every incident? Or when SLAs are violated?
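One way to take the politics out of that decision is to encode it. Below is a minimal sketch of a threshold-based policy; the field names, thresholds, and error-budget figure are all assumptions for illustration, not anyone's real SLA:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    customer_visible: bool   # can external customers observe the impact?
    error_rate: float        # fraction of requests failing, 0.0-1.0
    duration_min: float      # how long the incident has lasted, in minutes

# Hypothetical policy: anything customer-visible gets at least a note;
# sustained or severe impact escalates the posted status.
def status_for(incident: Incident, slo_error_budget: float = 0.001) -> str:
    if not incident.customer_visible:
        return "green"                     # internal-only: no public update
    if incident.error_rate >= 0.5 or incident.duration_min >= 60:
        return "red"                       # major outage
    if incident.error_rate > slo_error_budget:
        return "yellow"                    # degraded, SLA budget burning
    return "green"
```

With something like this written down, "should we update the page?" becomes a review of the thresholds rather than a per-incident judgment call by whoever has the most political exposure.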
Blaming people/employees is bad. That said, the idea of not updating a status page quickly, to reflect reality, is a problem at almost every SaaS company in the world. As others have said, status page changes are political and impact marketing, they have very little to do with providing good, timely information to customers.
At Amazon, admitting to a problem is guaranteed to lead to having to open a COE (Correction of Error), which means meetings with executives, an inevitable "least effective" rating, a development plan, scapegoating, a PIP, and firing.
Or it could be a fluke. GCP once went down in such a way that the dashboard was not independent of the failing region: it was deployed in the very region it was reporting on, so when that region went down, the dashboard went down with it. I think that ended up being the root cause. But yes, another Sunday on call I didn't go to the gym and sat in front of my computer waiting for updates that never came. What's worse is when they say they will post the next update at a certain time and then no update is made.
Even if you don't know what to say, post an update saying exactly that, so the rest of us can report to our teams and make decisions about our own work and personal lives.
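That kind of "no news yet" update is trivial to generate mechanically. A rough sketch, where the wording and the 30-minute cadence are assumptions, not any vendor's actual format:

```python
def placeholder_update(started_at: float, now: float, interval_min: int = 30) -> str:
    """Render the 'we know, we're looking' message even when there is
    nothing new to report, so readers can plan around the outage.

    started_at and now are epoch seconds."""
    elapsed = int((now - started_at) // 60)
    next_update = interval_min - (elapsed % interval_min)
    return (f"We are investigating an issue (ongoing for {elapsed} min). "
            f"Next update in at most {next_update} min, even if nothing has changed.")
```

The key property is the last clause: committing to a cadence means a missed update is itself a signal, instead of leaving customers refreshing indefinitely.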
> This is also causing issues with Amplify, API Gateway, AppStream2, AppSync, Athena, Cloudformation, Cloudtrail, Cloudwatch, Cognito, DynamoDB, IoT Services, Lambda, LEX, Managed BlockChain, S3, Sagemaker, and Workspaces.
Indeed, we saw the first AWS Kinesis issues as early as 13:50 (UTC). It's still ongoing two hours later, and the status page didn't even update in the first 45 minutes or so...
Thanks for this. My Lambda@Edge function was not working and I thought I had broken something with my permissions, even though I hadn't touched that for at least a month. This is the very "helpful" error message:
> The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
> If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
We're also seeing issues with Fargate on ECS -- the task we had with auto-scaling scaled down to 0. The one we had with a fixed number of workers is fine.
Five hours later and nothing has changed. For a company like Amazon this should be unacceptable.
Before someone replies and says to use a different region: that's not possible for everyone. If you use a 3rd-party service that is hosted in us-east-1, you can't do anything about it. For example, many Heroku services are broken because of this.
I can imagine that there are literally hundreds of engineers involved in trying to fix this ASAP, since this is bringing down not only the systems of external customers but also critical internal systems, plus the bad PR.
Isn't it common practice to host your status board on someone else's infrastructure?
In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.
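The general fix is to make the status publisher depend on nothing in the monitored stack. Here is a hedged sketch of the shape of such a watchdog, with the probes and the publishing step injected so that in practice they can live on entirely separate infrastructure; all names are illustrative:

```python
def publish_status(probes, publish):
    """Render and publish a status document from independent health probes.

    probes:  dict mapping component name -> zero-arg callable returning
             True if that component looks healthy from the outside.
    publish: callable taking the rendered status document (e.g. writing
             a static file to a host the monitored stack can't take down)."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False   # an unreachable probe counts as down
    doc = "\n".join(f"{name}: {'up' if ok else 'DOWN'}"
                    for name, ok in sorted(results.items()))
    publish(doc)
    return results
```

Because `publish` is just a callable, the rendered document can go to a static page on a completely separate provider -- avoiding exactly the circular dependency (status page hosted on the thing it reports on) described above.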
Large-scale events (LSEs) are becoming more and more common. It'll keep getting worse.
AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.
AWS will be in deep trouble when/if GCE fixes their customer support.
Can anyone explain why status pages are so difficult? There are even startups like status.io dedicated to this one thing.
It really does seem that any time there is an outage, more often than not the status page is showing all green traffic lights, making it useless as a tool to corroborate what's happening.
How did the AWS status page compare with status.io/aws?
Cognito is one of the most frustrating AWS services I have to work with, it is almost, but not quite, entirely unlike an SP.
We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.
Has anyone figured out how to set up Cognito in multiple regions without the hijinks of having the customer set up trusts for each region? Not to mention, while multiple trusts are, I think, possible with ADFS (not that I've tested it), I'm pretty sure Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...
The banner on top of https://status.aws.amazon.com/ just had an update from 8:36AM PST -- just removed -- even though it's only 7:42AM PST. I guess it really is manual firefighting over there.
If a person or company’s compensation depends on not fessing up to problems, they won’t fess up to them.
- https://github.com/ory
- https://github.com/dexidp/dex
- https://github.com/authelia/authelia
- https://github.com/keycloak/keycloak
- https://www.gluu.org/
- https://github.com/accounts-js/accounts
We're like Stripe for SSO/SAML auth. Docs here: https://workos.com/docs
Here's our HN launch: https://news.ycombinator.com/item?id=22607402
Disclosure: I'm an employee of FusionAuth, and while there is a forever free community edition, it is free as in beer, not as in speech.
ory looks like a really good project
Well, this is a major outage.
All on the eve of Thanksgiving.
Is that not a massive catch-22 for a service dashboard?
https://news.ycombinator.com/item?id=3707590
Last sentence of the alert at the top of the page.
https://downdetector.co.uk/status/visa/map/
I am unable to order my Papa Johns pizza
https://imgur.com/u5QSszv
This is why I prefer 3rd party monitoring systems to track health of my internal monitoring systems.
- 7 CloudFront distributions created today are still "InProgress", a few for more than an hour already
- The support case I created about it doesn't show up in my support portal. A direct link to it does work, though.
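For the stuck "InProgress" case, about all a client can do is poll with a timeout. A generic sketch follows; the "Deployed"/"InProgress" strings mirror CloudFront's distribution states, but this polling helper is hypothetical, not part of any AWS SDK:

```python
import itertools

def wait_for_deployed(get_status, max_polls=5):
    """Poll get_status() until it returns 'Deployed'.

    Returns the number of polls it took, or raises TimeoutError if the
    distribution is still not deployed after max_polls attempts."""
    for attempt in itertools.count(1):
        if get_status() == "Deployed":
            return attempt
        if attempt >= max_polls:
            raise TimeoutError(f"still not Deployed after {max_polls} polls")
        # in real use, back off between polls, e.g. time.sleep(min(2 ** attempt, 60))
```

Injecting `get_status` keeps the retry logic separate from the API call; in practice it would wrap whatever SDK call returns the distribution's status.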