SteveNuts | 3 months ago

I have a serious question, not trying to start a flame war.

A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.

B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.

Operations budget cuts/layoffs? Replacing critical components/workflows with AI? Just overall growing pains, where a service has outgrown what it was engineered for?

Thanks

wnevets|3 months ago

> A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.

FWIW, Microsoft is convinced that moving GitHub to Azure will fix these outages.

einsteinx2|3 months ago

The same Azure that just had a major outage this month?

tombert|3 months ago

Microsoft is a company that hasn't even figured out how to get system updating working consistently on their premier operating system in three decades. It seems unlikely to me that somehow moving to Azure is going to make anything more stable.

bovermyer|3 months ago

Microsoft is also convinced that its works are a net benefit for humanity, so I would take that with a grain of salt.

junon|3 months ago

Been on GitHub for a long time. It feels like they're more frequent. It used to be yearly, if at all, that GitHub was noticeably impacted. Now it's monthly, and recently, seemingly weekly.

0x457|3 months ago

Definitely not how I remember it. First, I remember seeing the unicorn page multiple times a day some weeks. There were also times when webhook delivery didn't work, so CircleCI users couldn't kick off any builds.

What changed is how many services GitHub has that can be having issues.

chadac|3 months ago

I suspect that the Azure migration is influencing this one. Just a bunch of legacy stuff being moved around along with Azure not really being the most reliable on top... I can't imagine it's easy.

zackify|3 months ago

There have been 5 incidents between Actions and push/pull issues just this month. It is more frequent.

cmrdporcupine|3 months ago

In the early days of GitHub (like before 2010) outages were extremely common.

kkarpkkarp|3 months ago

> If it's becoming more common, what are the reasons?

Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is some truth in this. At some point there might be some tiny grain of AI involved that starts an avalanche ending like this.

smsm42|3 months ago

It certainly feels that way, though it may be an instance of availability bias. Not sure what's causing it - maybe extra load from AI bots (certainly a lot of smaller sites complain about it, maybe major providers feel the pain too), maybe some kind of general quality erosion... It's certainly something that is waiting for serious research.

myth_drannon|3 months ago

Looking around, I noticed that many senior, experienced individuals were laid off, sometimes replaced by juniors/contractors without institutional knowledge or experience. That's especially evident in ops/support, where the management believes those departments should have a smaller budget.

pm90|3 months ago

GitHub isn't in the same reliability class as the hyperscalers or Cloudflare; it's comically bad now, to the point that at a previous job we invested in building a read-only cache layer specifically to prevent GitHub outages from bringing our system down.
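For anyone wondering what such a cache layer might look like, here is a minimal sketch of the general "serve the last known good copy" idea. It is not the system described above; the class name `LastGoodCache` and the injected `fetch` callable are hypothetical, and a real layer would also need persistence, expiry, and alerting.

```python
from typing import Callable, Dict


class LastGoodCache:
    """Read-only fallback: serve the last successful response when upstream fails."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        # `fetch` is a hypothetical callable that talks to the upstream service
        # and raises an exception when the upstream is unavailable.
        self._fetch = fetch
        self._last_good: Dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        try:
            value = self._fetch(key)
        except Exception:
            # Upstream is down: fall back to the cached copy if we have one.
            if key in self._last_good:
                return self._last_good[key]
            raise  # Nothing cached for this key, so surface the failure.
        # Upstream is healthy: remember this response for future outages.
        self._last_good[key] = value
        return value
```

The trade-off is staleness: during an outage, callers get whatever was fetched last, which is usually acceptable for reads but not for anything that must be current.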

never_inline|3 months ago

I think most systems should not depend on https://github.com at run time (as opposed to build time - build failures should not bring your system down)?

tingletech|3 months ago

Years ago on Hacker News I saw a link about probability describing a statistical technique that one could use to answer the question of whether a specific type of event was becoming more common or not. Maybe related to the birthday paradox? The gist that I remember is that sometimes a rare event will seem to be happening more often, when in reality there is some cognitive bias that makes it non-intuitive to make that call without running the numbers. I think it was a blog post that went through a few different examples, and maybe only one of them was actually happening more often.

ambicapter|3 months ago

If the events are independent, you could use a binomial distribution. Not sure if you can consider these kinds of events to be independent, though.
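A rough sketch of what running the numbers with a binomial model could look like, under the independence assumption questioned above. The baseline rate and the observed counts are made up for illustration and are not real outage data; `binomial_tail` is a helper name chosen here.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical inputs, for illustration only:
baseline_rate = 1 / 12    # assumed: historically 1 month in 12 had a major outage
months_observed = 12      # assumed: we look at the most recent 12 months
months_with_outage = 4    # assumed: 4 of those months had a major outage

# Probability of seeing at least this many outage months if the underlying
# rate had not changed and months were independent of each other.
p_value = binomial_tail(months_observed, months_with_outage, baseline_rate)
print(f"P(at least {months_with_outage} outage months by chance) = {p_value:.4f}")
```

A small value suggests the rate really went up rather than it just feeling that way, but only if the independence assumption roughly holds. Correlated failures, such as one provider's outage taking down several services at once, would make this test overconfident.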

sunshine-o|3 months ago

1/ Most of the big corporations moved to big cloud providers in the last 5 years. Most of them started 10 years ago but it really accelerated in the last 5 years. So there is for sure more weight and complexity on cloud providers, and more impact when something goes wrong.

2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s.

There was a time banks had all the smart people, then the telcos had them, etc. But people get older and too comfortable, layers of bad incentives and politics accumulate, and you just become a dysfunctional big mess.

grayhatter|3 months ago

End of year, pre-holiday break, code/project completion for perf review rush.

Be good to your site reliability engineers for the next few months... it's downtime season!

Wowfunhappy|3 months ago

I’m more interested in how this and the Cloudflare outage occurred on the same day. Is it really just a coincidence?

never_inline|3 months ago

I think they're becoming more common because AI -> FOMO -> tighter deadlines on projects "since you can use AI to accelerate your work", which is often not how it works, and the last 10% of reliability work is forgotten.

dlenski|3 months ago

> Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now?

I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers.

Three specific examples: Netflix and Dropbox do run their own datacenters and servers; Strava runs on AWS.

> If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.

I worked at AWS from 2020-2024, and saw several of these outages so I guess I'm "in the know."

My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them:

- The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams.

- Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.)

- Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople — or even to me — that any of these providers are much more or less reliable than the others.

averageRoyalty|3 months ago

I suspect there is more tech out there. 20 years ago we didn't have smartphones. 10 years ago, 20 Mbit on mobile was a good connection. Gigabit is common now, infrastructure no longer has the hurdles it used to, AI makes coding and design much easier, phones are ubiquitous and usage of them at all times (at the movies, out at dinner, driving) has become super normalised.

I suspect (although have not researched) that global traffic is up, by throughput but also by session count.

This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (used to be websites) you use most didn't rely on a single AZ in AWS or you were on your phone less.

I think as a society it just has more impact than it used to.

swed420|3 months ago

> B. If it's becoming more common, what are the reasons?

Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections.

Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.

There are tons of studies to back this line of reasoning.

__MatrixMan__|3 months ago

I think it's cancer, and it's getting worse.

xmprt|3 months ago

One possibility is increased monitoring. In the past, issues that happened weren't reported because they went under the radar. Whereas now, those same issues which only impact a small percentage of users would still result in a status update and postmortem. But take this with a grain of salt because it's just a theory and doesn't reflect any actual data.

A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI-written code starts to become "legacy" faster than regular code.

Kostic|3 months ago

At least with GitHub it's hard to hide when you get "no healthy upstream" on a git push.