SteveNuts | 3 months ago
A. Are these major issues with cloud/SaaS tools becoming more common, or is it just that they get a lot more coverage now? It seems like we see major issues across AWS, GCP, Azure, Github, etc. at least monthly now and I don't remember that being the case in the past.
B. If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
Operations budget cuts/layoffs? Replacing critical components/workflows with AI? Just overall growing pains, where a service has outgrown what it was engineered for?
Thanks
wnevets|3 months ago
FWIW Microsoft is convinced moving Github to Azure will fix these outages
Lammy|3 months ago
https://www.zdnet.com/article/ms-moving-hotmail-to-win2000-s...
https://jimbojones.livejournal.com/23143.html
einsteinx2|3 months ago
tombert|3 months ago
bovermyer|3 months ago
junon|3 months ago
0x457|3 months ago
What changed is how many GitHub services can be having issues.
chadac|3 months ago
zackify|3 months ago
cmrdporcupine|3 months ago
kkarpkkarp|3 months ago
Someone answered this morning, during the Cloudflare outage, that it's AI vibe coding, and I tend to think there is some truth in this. At some point there might be some tiny grain of AI involved that starts an avalanche ending like this.
AIorNot|3 months ago
https://techrights.org/n/2025/08/12/Microsoft_Can_Now_Stop_R...
Ever since Musk greenlighted firing people again, CEOs can't wait to pull the trigger.
smsm42|3 months ago
myth_drannon|3 months ago
pm90|3 months ago
never_inline|3 months ago
tingletech|3 months ago
ambicapter|3 months ago
sunshine-o|3 months ago
2/ Then we cannot expect big tech to stay as sharp as in the 2000s and 2010s.
There was a time when banks had all the smart people, then the telcos had them, etc. But people get older and too comfortable, layers of bad incentives and politics accumulate, and you just become a big dysfunctional mess.
grayhatter|3 months ago
Be good to your Stability reliability engineers for the next few months... it's downtime season!
Wowfunhappy|3 months ago
never_inline|3 months ago
dlenski|3 months ago
I think that "more coverage" is part of it, but also "more centralization." More and more of the web is centralized around a tiny number of cloud providers, because it's just extremely time-intensive and cost-prohibitive for all but the largest and most specialized companies to run their own datacenters and servers.
Three specific examples: Netflix and Dropbox do run their own datacenters and servers; Strava runs on AWS.
> If it's becoming more common, what are the reasons? I can think of a few, but I don't know the answer, so if anyone in-the-know has insight I'd appreciate it.
I worked at AWS from 2020 to 2024 and saw several of these outages, so I guess I'm "in the know."
My somewhat-cynical take is that a lot of these services have grown enormously in complexity, far outstripping the ability of their staff to understand them or maintain them:
- The OG developers of most of these cloud services have moved on. Knowledge transfer within AWS is generally very poor, because it's not incentivized, and has gotten worse due to remote work and geographic dispersion of service teams.
- Managers at AWS are heavily incentivized to develop "new features" and not to improve the reliability, or even security, of their existing offerings. (I discovered numerous security vulnerabilities in the very-well-known service that I worked for, and was regularly punished-rather-than-rewarded for trying to get attention and resources on this. It was a big part of what drove me to leave Amazon. I'm still sitting on a big pile of zero-day vulnerabilities in ______ and ______.)
- Cloud services in most of the world are basically a 3-way oligopoly between AWS, Microsoft/Azure, and Google. The costs of switching from one provider to another are often ENORMOUS due to a zillion fiddly little differences and behavior quirks ("bugs"). It's not apparent to laypeople, or even to me, that any of these providers are much more or less reliable than the others.
averageRoyalty|3 months ago
I suspect (although have not researched) that global traffic is up, by throughput but also by session count.
This contributes to a lot more awareness. Slack being down wasn't impactful when most tech companies didn't use Slack. An AWS outage was less relevant when the 10 apps (used to be websites) you use most didn't rely on a single AZ in AWS, or when you were on your phone less.
I think as a society it just has more impact than it used to.
swed420|3 months ago
Among other mentioned factors like AI and layoffs: mass brain damage caused by never-ending COVID re-infections.
Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.
There are tons of studies to back this line of reasoning.
__MatrixMan__|3 months ago
xmprt|3 months ago
A lot of people are pointing to AI vibe coding as the cause, but I think more often than not, incidents happen due to poor maintenance of legacy code. But I guess this may be changing soon as AI-written code starts to become "legacy" faster than regular code.
Kostic|3 months ago