item 26880147

Auth0 Has been down for almost 4 hours now

195 points | inssein | 5 years ago | reply

Seems I can't link to the incident (gets marked as a deadlink), but here it is: https://status.auth0.com/incidents/zvjzyc7912g5?u=3qykby4vypfp

101 comments

[+] UglyToad|5 years ago|reply
So I've been mulling this stupid thought for a while (and disclaimer: it's extremely useful for these outage stories to make it to the front page, to help out everyone who is getting paged with P1s).

But, does it really matter?

I read people reacting strongly to these outages, suggesting that due diligence wasn't done before choosing a 3rd party for this or that, or that a system engineered to reach anything less than 100% uptime is professional negligence.

However, off the top of my head, we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better, and why does a few hours of downtime ultimately matter?

I think it's partly living somewhere where a volcano on the next island over can shut down connections to the outside world for almost a week. Life doesn't have an SLA. Systems should aim for reasonable uptime, but at the end of the day they come back online at some point and we all move on. Just catch up on emails or something. I dislike the culture of demanding hyper-perfection, and the expectation that we should be prepared to work unhealthy shift patterns to avoid a moment of downtime in UTC-11 or something.

My view, increasingly, is that these outages are healthy, since they force us to confront the fallibility of the systems we build and accept that chaos wins out in the end, even if just for a few hours.

[+] fastball|5 years ago|reply
Yes and no; some things are actually time-sensitive.

For example, I'm building a note-taking / knowledge base platform, and we were having some reliability issues last year when our platform and devops process was still a bit nascent. We had a user that was (predictably) using our platform to take notes / study for an exam, which was open book. On the day of her exam our servers went down and she was justifiably anxious that things wouldn't be back before it was time for her exam to start. Luckily I was able to stabilize everything before then and her exam went great in the end, but it might not have happened that way.

Of course most on HN would probably point out that this is obviously why your personal notes should always be hosted / backed up locally, but I of course took this as a personal mission to improve our reliability so that our users never had to deal with this again. And since then I'm proud to say we've maintained 99.99% uptime[1]. So yes, there are definitely many situations where we can and should take a more laid back approach, but sometimes there are deadlines outside of your control and having a critical piece of software go offline exactly when you need it can be a terrible experience.

[1] https://status.supernotes.app/

[+] _jal|5 years ago|reply
> But, does it really matter?

This is a great line of thought, I'd encourage everyone to take it. There's a huge amount of crap people get up to that is mostly about performative debt balancing - people feel that they're owed something just because <fill in the blank>, when it really didn't matter. Just another gross aspect of a culture overly reliant on litigation for conflict management.

But: the question is meaningless without qualifying, for whom?

Because I can absolutely imagine situations where an Auth0 outage could be extremely damaging, expensive, or both. Same for a lot of other services.

> Life doesn't have an SLA

Nope. Which is a part of the reason why people spend money on them for certain specific things. It is just another form of insurance against risk.

[+] SahAssar|5 years ago|reply
For a lot of stuff I agree, but the problem is that (some of) these platforms advertise themselves as being built so that this should not happen. Less cynical engineers will then build critical solutions that depend on these platforms, assuming that they can and have successfully mitigated the risk of downtime. Sometimes the tools to manage/communicate/fix the service downtime are even dependent on the service being up.

The lesson is more that everything fails all of the time and the more interconnected and dependent we make things the more they fail. That is not something that can be solved with another SaaS as multiple downtimes, hacks, leaks and shutdowns have shown time and time again.

[+] bombcar|5 years ago|reply
The problem is that the "small guy" is held to a high standard that the "big guy" isn't held to. If AWS shits itself for a day nothing will happen, if your small SaaS goes down for an hour you'll lose customers and people will yell at you.

And more importantly, if YOU try to use something "not big" and it goes down, it's on YOU - but if you're using Azure and it goes down, it's "what happens".

[+] phpnode|5 years ago|reply
I think you're underestimating the scope of the impact and just how vital software is in the modern world. It's not just that people can't login to a system, it's that they simply can't get their work done, and some of that work is really very time sensitive and important. Auth0 is depended on by hundreds of thousands of companies. Tens of millions of people will have been impacted by this outage today.
[+] sega_sai|5 years ago|reply
I appreciate this view, but I'm in academia, and with covid19 we are teaching remotely, doing exams remotely, etc. If the systems are down, that can have a really disruptive effect on students being unable to submit homework/exams, or on us delivering lectures. And that potentially applies to the whole university (thousands of people).
[+] dataflow|5 years ago|reply
Not regarding this specific incident, but to reply to this:

> However from the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better and why does a few hours of downtime ultimately matter?

I've been mulling this for a while too, and I think I might have some responses that address your thought somewhat:

- Amazon/Google/Microsoft/etc. services have huge blast radii. If you build your own system independently, then of course you probably won't achieve as high an SLA, but from the standpoint of users, they (usually) have alternative/independent services they can still use simultaneously. That decoupling can drastically reduce the negative impact on users, even if the individual uptimes are far worse than the global one.

- Sometimes it turns out problems were preventable, and only occurred because someone deliberately decided to bypass some procedures. These are always irritating regardless of the fact that nobody can reach 100% uptime. And I think sometimes people get annoyed because they feel there's a non-negligible chance this was the cause, rather than (say) a volcano.

- People really hate it when the big guys go down, too.

[+] NicoJuicy|5 years ago|reply
> why does a few hours of downtime ultimately matter?

In our case ( Azure downtime), because none of our customer systems would work.

This includes people on the road, who need to do something every 5 minutes on their PDA (sometimes 100 people simultaneously in a big city).

So yes, it matters.

[+] daniel-grigg|5 years ago|reply
Even if that were true for a single system in isolation, it breaks apart quickly as the number of services you're dependent on increases. Then that relatively rare 1% downtime starts to compound until, every day, 'something' is broken.
[+] OldHand2018|5 years ago|reply
> Why are any of us going to do any better and why does a few hours of downtime ultimately matter?

The answer is surprisingly simple.

Most outages are the unintended result of someone doing something. When you are doing things yourself, you schedule the “doing something” for times when an outage would matter least.

If you are the kind of place where there is no such time, you mitigate. Backup systems, designing for resiliency, hiring someone else, etc.

[+] pan69|5 years ago|reply
I agree with you. Sometimes things break, such is life. What I don't fully understand is when people choose to outsource a critical part of their infrastructure and then complain when it happens to be down for a bit. It was a trade-off they made.
[+] sneak|5 years ago|reply
> But, does it really matter?

I think an important consideration here is that a huge amount of time, money, and resources is spent on making sure the computers stay powered and cooled in all manner of situations. We contract redundant diesel delivery for generators, we buy and install gigantic diesel generator systems which are used for just minutes per year, huge automatic grid transfer switches, redundant fiber optic loops, dynamic routing protocols, N+1 this and double-redundant that. It's tremendously expensive in terms of money, human time, and physical/natural resources.

The point is that we are always striving to plan for failures, and engineering them out. When there is a real life actual outage, it means, necessarily, based on the huge amount of time and money and resources invested in planning around disaster/failure resilience, that the plan has a bug or an error.

Somebody had a responsibility (be it planning, engineering, or otherwise) that was not appropriately fulfilled.

Sure, they'll find it, and update their plan, and be able to respond better in the future - but the fundamental idea is that millions (billions?) have been spent in advance to prevent this from happening. That's not nothing.

[+] russellendicott|5 years ago|reply
I can definitely get on-board with this. When AWS or Azure has some outage they pull me into calls and ask me what to do. These vendors are so large it's like asking me for my advice on the weather. Everything is screwed, man. Just hunker down and go read a book or something.
[+] greycol|5 years ago|reply
I agree with this sentiment. Though there is of course a bit of a problem when you're dealing with people who don't.

I'd also highlight that when the big players go down, people 'know' it's not your fault; when a small 3rd-party provider goes down and takes part of your service with it, it's 'because you didn't do due diligence' or were trying to save a buck. Similar in a way to the adage 'no one ever got fired for buying IBM'.

[+] matwood|5 years ago|reply
> why does a few hours of downtime ultimately matter

I think people know this implicitly, but it's good to think about it explicitly. Does downtime matter, and how much is acceptable? Every system should have a decided answer to that question. Because ultimately uptime costs money, and many who are complaining about this outage are likely not paying anywhere near what it would cost to truly deliver 5+ 9s or Space Shuttle-level code quality.
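For a sense of scale, the downtime each extra nine allows can be computed directly (a quick back-of-the-envelope sketch, not tied to any particular SLA):

```python
# Allowed downtime per year implied by an availability target ("the nines").
def downtime_per_year(availability_pct: float) -> float:
    """Return the allowed downtime in minutes per year for a given target."""
    minutes_per_year = 365.25 * 24 * 60  # ~525,960 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year(nines):.1f} min/year")
```

Three nines leaves roughly 8.8 hours of budget a year; five nines leaves about five minutes, which is where the cost curve gets steep.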

[+] oliwarner|5 years ago|reply
That's a lovely viewpoint to be able to take about one's own priorities, but one that's hard to sell to the person at the entity that's ultimately paying all your bills.

Yes, people should relax a bit, but those incidents you cite did cost those companies customers. That's okay for Amazon. But a small B2B service provider can't as easily absorb the loss.

[+] sneak|5 years ago|reply
> Just catch up on emails or something.

Hard to do when you can't authenticate to the email webapp.

[+] shakezula|5 years ago|reply
We build these massively distributed, micro-concerned, mega-scaled systems, and at every step we recognize everything and anything can go wrong at any given moment, mulling over these problems on a daily basis.

And then it /does/ and all of us lose our shit haha.

[+] ivan888|5 years ago|reply
This is a really interesting point that I hadn't considered before.

It's similar to ubiquitous next day delivery conditioning people to find anything longer unacceptable, when cheap next day is quite new and not even the norm yet.

[+] spondyl|5 years ago|reply
Ah, a comment where I can put on my SRE (Site Reliability Engineering) hat :)

You're completely right that 100% availability is unreasonable and often not even required, despite what a customer or site operator may believe.

Just a quick aside: availability (can an end user reach your thing) is often confused with uptime (is your thing up). If I operate a load balancer that your service sits behind and my load balancer dies, your service is up, but not available to those on the other side of said load balancer.

With that in mind, Hacker News could theoretically be up 100% of the time, but if I go through a tunnel while scrolling it on my phone then, from my perspective as a user, it is no longer 100% available; it is (100% minus the period I was without signal) available.
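That uptime-vs-availability distinction can be made concrete with a toy probe log (the data and field names here are made up for illustration):

```python
# Uptime vs availability from probe results. "server_up" records whether the
# service process was running; "reachable" records whether a client-side probe
# actually got a response through the full path (load balancer, network, etc.).
probes = [
    {"server_up": True,  "reachable": True},
    {"server_up": True,  "reachable": True},
    {"server_up": True,  "reachable": False},  # LB died; service still "up"
    {"server_up": False, "reachable": False},  # actual outage
]

uptime = sum(p["server_up"] for p in probes) / len(probes)
availability = sum(p["reachable"] for p in probes) / len(probes)
print(f"uptime={uptime:.0%} availability={availability:.0%}")
# uptime=75% availability=50%
```

Same service, two different numbers, depending on which side of the load balancer you measure from.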

The point here is that a whole host of unreliable things happen in everyday life, from your router playing up to sharks biting the undersea cables.

With that in mind, you then want to go and figure out a reasonable level of service to provide to your end users (ask for their input!) that reflects reality.

It's worth noting too that Google (I don't love 'em, but they pioneered the field) will actually intentionally disrupt services if they're "too available", so as to keep those downstream on their toes. It's not actually good for anyone if a service has 100% availability, because downstream teams start making too many assumptions.

I can recommend reading the SLOs portion of the Google SRE book if you're curious to see more: https://sre.google/sre-book/service-level-objectives/

In short, an SLO is just an SLA without the legal part: a guarantee of a certain level of service, often made internally from one team to another.

Ideally these objectives reflect the level of service your customers (internal or external) expect from your service.

> Chubby [Bur06] is Google’s lock service for loosely coupled distributed systems. In the global case, we distribute Chubby instances such that each replica is in a different geographical region.

> Over time, we found that the failures of the global instance of Chubby consistently generated service outages, many of which were visible to end users. As it turns out, true global Chubby outages are so infrequent that service owners began to add dependencies to Chubby assuming that it would never go down. Its high reliability provided a false sense of security because the services could not function appropriately when Chubby was unavailable, however rarely that occurred.

> The solution to this Chubby scenario is interesting: SRE makes sure that global Chubby meets, but does not significantly exceed, its service level objective. In any given quarter, if a true failure has not dropped availability below the target, a controlled outage will be synthesized by intentionally taking down the system.

> In this way, we are able to flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.
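The mechanic the excerpt describes is usually framed as an error budget: the SLO implies a fixed allowance of downtime per quarter, and Chubby's synthetic outages spend whatever real failures haven't. A rough sketch (the quarter length and figures are illustrative, not Google's actual numbers):

```python
# Error-budget sketch: given an SLO and observed downtime this quarter,
# how much budget remains? A "Chubby-style" planned outage would burn the
# remainder so availability doesn't significantly exceed the target.
def remaining_budget_minutes(slo_pct: float, observed_down_min: float,
                             quarter_min: float = 90 * 24 * 60) -> float:
    budget = quarter_min * (1 - slo_pct / 100)
    return budget - observed_down_min

# A 99.9% SLO over a 90-day quarter allows ~129.6 minutes of downtime.
left = remaining_budget_minutes(99.9, observed_down_min=20)
print(f"{left:.1f} min of error budget left this quarter")
```

If `left` is still large near quarter-end, that's the signal to synthesize an outage and flush out the unreasonable dependencies.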

[+] slackerIII|5 years ago|reply
Huh, that's interesting timing. I co-host a podcast that walks through notable outages, and just yesterday we released an episode about Auth0's 2018 outage: https://downtimeproject.com/podcast/auth0-silently-loses-som...

Last time was due to several factors, but initially because of silently losing some indexes during a migration. I'm very curious what happened this time -- we'll definitely do a followup episode if they publish a postmortem.

[+] ryandvm|5 years ago|reply
Not going to lie, of all the things to farm out to a 3rd party, auth/users always struck me as the dumbest.
[+] Raidion|5 years ago|reply
I mean, it has the same benefit as other SaaS: you get to avoid building something and can spend that dev time on building something that solves a unique problem. You also get to focus 100% on your own app or site's problems and features, knowing that the entirety of Auth0 is focused on keeping your authentication working. I can promise Auth0 is better at building scalable, secure, and resilient authentication solutions than most dev teams, and I've been on a team that built out an enterprise-grade IDAM solution handling 1000s of logins/hour and 100k requests/hour.

If it's data security or something else that's your concern, you can host the data in your own database with their enterprise package.

General disclaimer: I'm a paying Auth0 customer but just use it for authentication, and it saved me a hundred hours of work for a pretty reasonable price.

[+] streblo|5 years ago|reply
Auth is both simple and hard to get right. It's virtually the same everywhere. One group of people getting it right is better than every company trying to figure it out for themselves. It's exactly the right thing to farm out to a 3rd party.

Only on HN will you be told "you're an idiot if you outsource your auth" and "you're an idiot if you roll your own auth" by the same group of people.

[+] dmlittle|5 years ago|reply
It depends on your needs. What if you provide an SSO solution in your product, your customer is using Okta (or any other IdP), and that IdP goes down? There's nothing you can really do then unless you have other means of authentication.
[+] jtsiskin|5 years ago|reply
To me it's the exact opposite - it seems like a prime candidate to be a third-party service.

It’s something easy to get wrong, and has a long tail of work which is extremely generic (supporting all the different social logins, two factor authentication, password reset emails, email verification, sms phone number verification, rate limiting, etc...)

[+] lazyasciiart|5 years ago|reply
Really? Because of all the things I don't want [myself or my colleagues] to write, a secure authentication management system that connects with multiple providers is up there.
[+] sneak|5 years ago|reply
Really the only case where it makes sense to farm something like this out is to Google (if Google and the US military aren't in your threat model) because Google's G Suite login system (which can be used as an IdP) is, as far as I can tell, the exact same one they use for @google.com.

Incentives are perfectly aligned there, and if anyone can keep a system running and secure (to everyone except the US military which can compel them), it's them.

[+] gjsman-1000|5 years ago|reply
I literally, today, had a demo of SSO for my organization and was panicking over what went wrong when it wasn't working, so I had to skip it.
[+] Jack000|5 years ago|reply
Auth0's pricing has always seemed really strange - 7000 active users for free but only 1000 on the lowest paid tier ($23/month). This means if you don't care about the extra features, once you exceed 7k you need to jump up to the $228/month plan.
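To illustrate the jump, here's a toy plan picker using the figures from this (five-year-old) comment; the tier names and the upper tier's user limit are assumptions for the sake of the example, not Auth0's actual pricing:

```python
# Toy plan picker. Free tier covers 7,000 active users; the $23/mo tier only
# covers 1,000; past 7,000 the next option is the $228/mo plan. Tier names
# and the 50,000 cap are made up for illustration.
def cheapest_plan(active_users: int) -> tuple[str, int]:
    plans = [("free", 7000, 0), ("starter", 1000, 23), ("pro", 50000, 228)]
    eligible = [(name, price) for name, limit, price in plans
                if active_users <= limit]
    return min(eligible, key=lambda p: p[1])  # cheapest eligible plan

print(cheapest_plan(5000))  # ('free', 0)
print(cheapest_plan(8000))  # ('pro', 228)
```

The oddity the comment points out falls straight out of the numbers: the cheapest paid tier is never the rational choice once you've outgrown free.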
[+] trog|5 years ago|reply
My first Auth0 experience was a couple weeks ago when I had a quick crack at testing it out to see if it would be a suitable candidate to migrate a bunch of WordPress sites (currently all with their own separate, individual user accounts) onto.

I didn't spend a lot of time on it but initially figured it would be easy because they had what seemed to be a well-written and comprehensive blog post[1] on the topic, as well as a native plugin.

But I found a few small discrepancies with the blog post and the current state of the plugin (perhaps not too surprising; the blog post is 2 years old now and no doubt the plugin has gone through several updates).

I found the auth0 control panel overwhelming at a glance and didn't want to spend the time to figure it all out - basically laziness won here, but I feel like they missed an opportunity to get a customer if they'd managed to make this much more low effort.

I moved on to something else (had much better luck with OneLogin out of the box!), but then got six separate emails over the next couple weeks from a sales rep asking if I had any questions.

I'm sure it's a neat piece of kit in the right hands or with a little more elbow grease but I was a bit disappointed with how much effort it was to get up and running for [what I thought was] a pretty basic use case.

1. https://auth0.com/blog/wordpress-sso-with-auth0/

[+] aleyan|5 years ago|reply
Is it worthwhile to do authentication via SaaS instead of a local library?

For the password use case, it seems nice that you don't have to store client secrets (e.g. salted password hashes) on your own infra. However, instead of authentication happening between your own servers and the user's browser, there is now an additional hop to the SaaS, and you need to learn about JWTs etc. At my previous company, moving a Django monolith to authenticate via Auth0 was a multi-month project and a multi-thousand-line increase in code/complexity. And we weren't storing passwords to begin with, because we were using one-time login email links.
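For a sense of what "learn about JWT" entails, here's a minimal HS256 verify/decode using only the stdlib. This is a sketch, not production code: real deployments should use a vetted library, and hosted providers typically sign with RS256 (asymmetric), which this doesn't cover:

```python
import base64
import hashlib
import hmac
import json

def b64url_decode(s: str) -> bytes:
    """Decode base64url, re-adding the padding JWTs strip off."""
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_hs256(token: str, secret: bytes) -> dict:
    """Verify an HS256-signed JWT and return its payload claims."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison to avoid leaking signature bytes via timing.
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

Even this toy omits expiry (`exp`), audience, and issuer checks, which is part of why the integration work adds up.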

Maybe SaaS platforms are worth it for social login? I haven't tried that, but I am not convinced that auth0 or some one else can help me connect with facebook/twitter/google better than a library can.

[+] TameAntelope|5 years ago|reply
It's terrifying to store credentials. I'll take 4 hours of downtime once in a blue moon over lost nights of sleep over potential security breaches.

I just can't even imagine why you would these days; there are even "local" options that act as "local 3rd-party auth providers".

[+] temikus|5 years ago|reply
Generally it’s not the auth itself that is the problem but RBAC, multi-factor auth, integrations, etc.

We’ve looked at Auth0 and Okta because we wanted to see if we can save some dev time devising RBAC and supporting a lot of different auth integrations. Ended up doing it in house since the quote was unacceptable (essentially a mid-level dev salary per year)

[+] keithnz|5 years ago|reply
Out of interest, what are people's experiences with self-hosted identity management options? I've been evaluating Keycloak recently, and it seems pretty good.
[+] pdx6|5 years ago|reply
The Auth0 team is probably distracted by their Okta onboarding. When I was onboarding at Okta after they bought the startup I was working at, I had to support both systems to bring myself up to speed fast -- and being double on-call caused some outages.
[+] 1cvmask|5 years ago|reply
What was the startup? Stormpath perhaps?
[+] coopreme|5 years ago|reply
How does Auth0 compare to keycloak? Is it similar?
[+] mattbnr32|5 years ago|reply
just successfully authenticated a few times
[+] f430|5 years ago|reply
isn't the whole purpose of using Auth0 so that this stuff never happens?
[+] whydoineedthis|5 years ago|reply
no, it's that it happens less, and when it does it's not so much your engineers' problem, as you have already paid someone to fix it.

also, security practices are supposedly better and more robust there than at your average place.

i think those two things are the value adds.