How did we get to a place where either Cloudflare or AWS having an outage means a large part of the web going down? This centralization is very worrying.
Oddly this centralization allows a complete deferral of blame without you even doing anything: if you’re down, that’s bad. But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.
It also reduces your incentive to change: if “the internet is down”, people will put down their devices and do something else. Even if your website is up, they’ll assume it isn’t.
I’m not saying this is a good thing but I’m simply being realistic about why we ended up where we are.
As a user I do care, because I waste so much time on Cloudflare's "prove you are human" blocking page (why do I have to prove it over and over again?), and I frequently run into websites that block me entirely based on some bad IP blacklist used alongside Cloudflare.
This is essentially the entire IT excuse for going to anything cloud. I see IT engineers all the time justifying that the downtime stops being their problem and they stop being blamed for it. There's zero personal responsibility in trying to preserve service, because it isn't "their problem" anymore. Anyone who thinks the cloud makes service more reliable is absolutely kidding themselves, because everyone who made the decision to go that way already knows it isn't true; it just won't be their problem to fix.
If anyone in the industry actually cared about reliability and took personal stake in their system being up, everyone would be back on-prem.
> But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.
In my direct experience, this isn't true if you're running something even vaguely mission-critical for your customers. Your customer's workers just know that they can't do their job for the day, and your customer's management just knows that the solution they shepherded through their organization is failing.
100% this. While in my professional capacity I'm all in for reliability and redundancy, as an individual, I quite like these situations when it's obvious that I won't be getting any work done and it's out of my control, so I can go run some errands, read a book, or just finish early.
Which "user" are you referring to? Cloudflare users or end product users?
End product users have no power, they can complain to support and maybe get a free month of service, but the 0.1% of customers that do that aren't going to turn the tide and have anything change.
Engineering teams using these services also get "covered" by them - they can finger point and say "everyone else was down too."
Eh? It's because they are offering a service too good to refuse.
The internet these days is fucking dangerous and murderous as hell. We need Cloudflare just to keep services up due to the deluge of AI data scrapers and other garbage.
More like "don't have a choice". It's not like the service provider is going to lose you to the competition, because before you can switch, it will be back.
Frankly it's a blessing, always being able to blame the cloud that management forced the company to migrate to because it would be "cheaper" (which half of the time turns out to be false anyway).
> It also reduces your incentive to change, if “the internet is down” people will put down their device and do something else. Even if your web site is up they’ll assume it isn’t.
I agree. When people talk about the enshittification of the internet, Cloudflare plays a significant role.
Many reasons but DDoS protection has massive network effects. The more customers you have (and therefore bandwidth provision) the easier it is to hold up against a DDoS, as DDoS are targeting just one (usually) customer.
So there are massive economies of scale. A small CDN with (say) 10,000 customers and 10 Mbit/s per customer can handle a 100 Gbit/s DDoS (way too simplistic, but hopefully you get the idea) - way too small.
If you have the same traffic provisioned on average per customer and have 1 million customers, you can handle a DDoS 100x the size.
Only way to compete with this is to massively overprovision bandwidth per customer (which is expensive, as those customers won't pay more just for you to have more redundancy because you are smaller).
In a way (like many things in infrastructure) CDNs are natural monopolies. The bigger you get -> the more bandwidth and PoP you can have -> more attractive to more customers (this repeats over and over).
It was probably very astute of Cloudflare to realise that offering such a generous free plan was a key step in this.
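The back-of-the-envelope math above can be sketched directly; the numbers are the illustrative ones from the comment, not real Cloudflare figures:

```python
def absorbable_ddos_gbps(customers: int, mbps_per_customer: float) -> float:
    """Total bandwidth a CDN can throw at a DDoS aimed at one customer,
    assuming (simplistically) that all provisioned capacity can be pooled."""
    return customers * mbps_per_customer / 1000  # Mbit/s -> Gbit/s

small = absorbable_ddos_gbps(10_000, 10)       # 100.0 Gbit/s
large = absorbable_ddos_gbps(1_000_000, 10)    # 10,000.0 Gbit/s
print(large / small)                           # 100.0 -- same per-customer provisioning, 100x the headroom
```

The point of the sketch: capacity scales linearly with the customer base while the attack still targets (usually) one customer, which is where the network effect comes from.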
In a CDN, customers consume bandwidth; they do not contribute it. If Cloudflare adds 1 million free customers, they do not magically acquire 1 million extra pipes to the internet backbone. They acquire 1 million new liabilities that require more infrastructure investment.
All you are doing is echoing their pitch book. Of course they want to skim their share of the pie.
In my opinion, DDoS is possible only because there is no network protocol for a host to control traffic filtering at upstream providers (deny traffic from certain subnets or countries). If there were, everybody would prefer to write their own systems rather than rely on a harmful monopoly.
Yeah, I went to HN after the third web page didn't work. I am not just worried about the single point of failure, I am much more worried about this centralization eventually shaping the future standards of the web and making it de facto impossible to self-host anything.
Well that and the fact that when 99% goes through a central party, then that central party will be very interesting for authoritarian governments to apply sweeping censorship rules to.
It is already nearly impossible, or very expensive, in my country to get a public IP address (even IPv6) which you could host on. The world is moving heavily towards central dependence on these big cloud providers.
It is not as bad as Cloudflare or AWS because certificates will not expire the instant there is an outage, but consider that:
- It serves about 2/3 of all websites
- TLS is becoming more and more critical over time. If certificates fail, the web may as well be down
- Certificate lifetimes are becoming shorter and shorter: 90 days now, with Let's Encrypt considering 6-day certificates and 47 days planned as the industry-wide maximum
- An outage is one thing, but should a compromise happen, that would be even more catastrophic
Let's Encrypt is a good guy now, but remember that Google used to be a good guy in the 2000s too!
(Disclaimer: I am tech lead of Let's Encrypt software engineering)
I'm also concerned about LE being a single point of failure for the internet! I really wish there were other free and open CAs out there. Our goal is to encrypt the web, not to perpetuate ourselves.
That said, I'm not sure the line of reasoning here really holds up? There's a big difference between this three-hour outage and the multi-day outage that would be necessary to prevent certificate renewal, even with 6-day certs. And there's an even bigger difference between this sort of network disruption and the kind of compromise that would be necessary to take LE out permanently.
So while yes, I share your fear about the internet-wide impact of total Let's Encrypt collapse, I don't think that these situations are particularly analogous.
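That gap can be made concrete with a rough sketch, assuming clients first attempt renewal at about two-thirds of the certificate lifetime (roughly what common ACME tooling like certbot does by default):

```python
def renewal_slack_days(lifetime_days: int) -> int:
    """Days a CA could be unreachable before certificates start expiring,
    if clients first try to renew at ~2/3 of the certificate lifetime."""
    first_attempt = lifetime_days * 2 // 3
    return lifetime_days - first_attempt

print(renewal_slack_days(90))  # 30 days of slack with today's 90-day certs
print(renewal_slack_days(47))  # 16 days with the planned 47-day maximum
print(renewal_slack_days(6))   # still ~2 days even with 6-day certs
```

Even in the worst case here, a three-hour outage is an order of magnitude short of causing expirations, which is the asymmetry the comment above points at.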
Agree, I’ve thought about this one too. The history of SSL/TLS certs is pretty hacky anyway in my opinion. The main problem they solve really should have been solved at the network layer, with ubiquitous IPsec and key distribution via DNS; most users just blindly trust whatever root CAs ship with their browser or OS, and the ecosystem has been full of implementation and operational issues.
Let’s Encrypt is great at making the existing system less painful, and there are a few alternatives like ZeroSSL, but all of this automation is basically a pile of workarounds on top of a fundamentally inappropriate design.
Ever since the AWS craze started a decade ago, developers have moved away from dedicated servers (which are actually cheaper, go figure), and that shift is what's causing all this mess.
It's genuinely insane that many companies design a great number of fallbacks at the software level, but almost no thought goes into the hardware/infrastructure level. Common sense dictates that you should never host everything on a single provider.
I tried as hard as I could to stay self hosted (and my backend is, still), but getting constant DDoS attacks and not having the time to deal with fighting them 2-3x a month was what ultimately forced me to Cloudflare. It's still worse than before even with their layers of protection, and now I get to watch my site be down a while, with no ability to switch DNS to point back to my own proxy layer, since CF is down :/
With the state of constant attack from AI scrapers and DDoS bots, you pretty much need a CDN from someone now if you run a serious business service. The poor guys with single on-prem boxes serving static HTML can /maybe/ weather some of this storm alone, but not all of it.
I self hosted on one of the company’s servers back in the late 90s. Hard drive crashes (and a hack once, through an Apache bug) had our services (http, pop, smtp, nfs, smb, etc ) down for at least 2-3 days (full reinstall, reconfiguration, etc).
Then, with regular VPSs I also had systems down for 1-2 days. Just last week the company that hosts NextCloud for us was down the whole weekend (from Friday evening) and we couldn’t get their attention until Monday.
So far these huge outages that last 2-5 hours are still lower impact for me, and require me to take less action.
I like the idea of having my own rack in a data center somewhere (or sharing the rack, whatever) but even a tiny cost is still more than free. And even then, that data center will also have outages, with none of the benefits of a Cloudflare Pages, GitHub Pages, etc.
> developers have gone away from Dedicated servers (which are actually cheaper, go figure)
It depends on how you calculate your cost. If you only include the physical infrastructure, a dedicated server is cheaper. But with a dedicated server you lose a lot of flexibility. Need more resources? Just scale up your EC2 instance; with a dedicated server there is a lot more work involved.
Do you want a 'production-ready' database? With AWS you can just click a few buttons and have an RDS instance ready to use. To roll out your own PG installation you need someone with a lot of knowledge (how to configure replication? backups? updates? ...).
So if you include salaries in the calculation, the result changes a lot. And even if you already have some experts on your payroll, putting them to work deploying a PG instance means you can't use them to build other things that may generate more value for your business than the premium you pay to AWS.
Cloud hosters are that hardware fallback. They started out offering better redundancy and scaling than your homemade breadbox, but it seems they lost something along the way, and now we have this.
Maintenance cost is the main issue for on-prem infra. Nowadays add things like DDoS protection and/or scraping protection, which can require a dedicated team, or force your company to rely on some library or open-source project that is not guaranteed to be maintained forever (unless you give them support, which I believe in)... Yeah, I can understand why companies shift off of on-prem nowadays.
... dedis are cheaper if you are right-sized. If you are wrong-sized they just plain crash, and you may or may not be able to afford the upgrade.
I was at Softlayer before I was at AWS, and what catalyzed the move was the time I needed to add another hard drive to a system and somehow they screwed it up. I couldn't put in a trouble ticket to get it fixed because my database record in their trouble-ticket system was corrupted. The next day I moved my stuff to AWS, and the day after that they had a top sales guy talk to me to try to get me to stay, but it was too late.
You jest, but this actually does exist. Multiple CDNs sell multi-CDN load balancing (divide traffic between 2+ CDNs per variously-complicated specifications, with failover) as a value add feature, and IIRC there is at least one company for which this is the marquee feature. It's also relatively doable in-house as these things go.
This might sound crazy as a software engineer, but I actually like the occasional "snow day" where everything goes down. It's healthy for us to all disconnect from the internet for a bit. The centralization unintentionally helps facilitate that. At least, that's my glass half full perspective.
I can understand that sentiment. Just don't lose sight of the impact it can have on every day people. My wife and I own a small theatre and we sell tickets through Eventbrite. It's not my full time job but it is hers. Eventbrite sent out an email this morning letting us know that they are impacted by the outage. Our event page appears to be working but I do wonder if it's impacting ticket sales for this weekend's shows.
So while those of us in tech might like a "snow day", there are millions of small businesses and people trying to go about their day-to-day lives who get cut off because of someone else's fuck-ups when this happens.
> This might sound crazy as a software engineer, but I actually like the occasional "snow day" where everything goes down
As a software engineer, I get it. As a CTO, I spent this morning triaging with my devops AI (actual Indian) to find some workaround (we found one) while our CEO was doing damage control with customers (non-technical field) who were angry that we were down and they were losing business by the minute.
sometimes I miss not having a direct stake in the success of the business.
I'm guessing you're employed and your salary is guaranteed regardless. Would you have the same outlook if you were the self-employed founder of an online business and every minute of outage was costing you money?
Except, y'know, where people's lives and livelihoods depend on access to information and being able to do things at an exact time. AWS and Cloudflare are disqualifying themselves from hospitals, the military, and whatnot.
How did we get to a place where Cloudflare being down means we see an outage page, but on that page it tells us explicitly that the host we're trying to connect to is up, and it's just a Cloudflare problem?
If it can tell us that the host is up, surely it can just bypass itself to route traffic.
People use CloudFlare because it's a "free" way for most sites to not get exploited (WAF) or DDoSed (CDN/proxy) regularly. A DDoS can cost quite a bit more than a day of downtime, even just a thundering herd of legitimate users can explode an egress bill.
It sucks there's not more competition in this space but CloudFlare isn't widely used for no reason.
AWS also solves real problems people have. Maintaining infrastructure is expensive as is hardware service and maintenance. Redundancy is even harder and more expensive. You can run a fairly inexpensive and performant system on AWS for years for the cost of a single co-located server.
It's not only centralization in the sense that your website will be down if they are down; it is also a centralized MITM proxy. If you transfer sensitive data like chats over Cloudflare-"protected" endpoints, you also allow CF to transparently read and analyze it in plain text. It must be very easy for state agencies to spy on the internet nowadays; they would just ask CF to redirect traffic to them.
Because it's better to have a really convenient and cheap service that works 99% of the time than a resilient one that is more expensive or more cumbersome to use.
It's like github vs whatever else you can do with git that is truly decentralized. The centralization has such massive benefits that I'm very happy to pay the price of "when it's down I can't work".
Most developers don't care to know how the underlying infrastructure works (or why) and so they take whatever the public consensus is re: infra as a statement of fact (for the better part of the last 15 years or so that was "just use the cloud"). A shocking amount of technical decisions are socially, not technically enforced.
This topic is raised every time there is an outage at Cloudflare, and the truth of the matter is they offer an incredible service, and there is no competition big enough to challenge it. By definition their services are so good BECAUSE their adoption rate is so high.
It's very frustrating of course, and it's the nature of the beast.
Compliance. If you wanna sell your SaaS to a big corpo, their compliance teams will feel you know what you're doing if they read AWS or Cloudflare in your architecture, even if you do not quite know what you're doing.
Because DDoS is a fact of life (and even if you aren't targeted by DDoS, the bot traffic probing you to see if you can be made part of the botnet is enough to take down a cheap $5 VPS). So we have to ask - why? Personally, I don't accept the hand-wavy explanation that botnets are "just a bunch of hacked IoT devices". No, your smart lightbulb isn't taking down Reddit. I slightly believe the secondary explanation that it's a bunch of hacked home routers. We know that home routers are full of things like suspicious oopsie definitely-not-government backdoors.
IMO, centralization is inevitable because the fundamental forces drive things in that direction. Clouds are useful for a variety of reasons (technical, time to market, economic), so developers want to use them. But clouds are expensive to build and operate, so there are only a few organizations with the budget and competency to do it well. So, as the market matures you end up with 3 to 5 major cloud operators per region, with another handful of smaller specialists. And that’s just the way it works. Fighting against that is to completely swim upstream with every market force in opposition.
There is this tendency to phrase questions (or statements) as
"when did 'we' ".
These decision are made individually not centrally. There is no process in place (and most likely there will never be) that will be able to control and dictate if people decide one way of doing things is the best way to do it. Even assuming they understand everything or know of the pitfalls.
Even if you can control individually what you do for the site you operate (or are involved in) you won't have any control on parts of your site (or business) that you rely on where others use AWS or Cloudflare.
I would be less worried if Cloudflare and AWS weren't involved in many more things than simply running DNS.
AWS - someone touches DynamoDB and it kills the DNS.
Cloudflare - someone touches functionality completely unrelated to DNS hosting and proxying and, naturally, it kills the DNS.
There is this critical infrastructure that just becomes one small part of a wider product offering, worked on by many hands, and this critical infrastructure gets taken down by what is essentially a side-effect.
It's a strong argument to move to providers that just do one thing and do it well.
Re: Cloudflare it is because developers actively pushed "just use Cloudflare" again and again and again.
It has been dead to me since the SSL cache vulnerability thing and the arrogance with which senior people expected others to solve their problems.
But consider how many people still do stupid things like use the default CDN offered by some third party library, or use google fonts directly; people are lazy and don't care.
We take the idea of the internet always being on for granted. Most people don’t understand the stack and assume that when sites go down it’s isolated, and although I agree with you, it’s just as much complacency and lack of oversight and enforcement delays in bureaucracy as it is centralization. But I guess that’s kind of the umbrella to those things… lol
Well, centralisation without rapid recovery and practices that provide substantial resiliency… that would be worrying.
But I dare say the folks at these organisations take these matters incredibly seriously, and the centralisation problem is largely one of risk efficiency.
I think there is no excuse, however, not to have multi-region state and pilot-light architectures, just in case.
A lot (and I mean a lot) of people in IT like centralization specifically because it’s hard to blame people for doing something that everyone else is doing.
This was always the case. There was always an "us-east" in some capacity, under Equinix, etc. Except it used to be the only "zone," which is why the internet is still so brittle despite having multiple zones. People need to build out support for different zones. Old habits die hard, I guess.
> How did we get to a place where either Cloudflare or AWS having an outage means a large part of the web going down?
As always, in the name of "security". When are we going to learn that anything done, either by the government or by a corporation, in the name of security is always bad for the average person?
It's weird to think about, so bear with me. I don't mean this sardonically or misanthropically. But it's "just the internet." It's just the internet. It doesn't REALLY matter in a large enough macro view. It's JUST the internet.
It's because single points of traffic concentration are the most surveillable architecture, so FVEY et al economically reward with one hand those companies who would build the architecture they want to surveil with the other hand.
Currently at the public library and I can't use the customer inventory terminals to search for books. They're just a web browser interface to the public facing website, and it's hosted behind CF. Bananas.
Don't forget the CrowdStrike outage: one company had a bug that brought down almost everything. Who would have thought there are so many single points of failure across the entire Internet.
For most services it's safer to host from behind Cloudflare, and Cloudflare is considered more highly available than a single IaaS or PaaS, at least in my headcanon.
The same reason we have centralization across the economy. Economies of scale are how you make a big business successful, and once you are on top, it's hard to dislodge you.
Agreed. More worrying is that it appears the standard practice of separating domain and nameserver administration has been lost to one-stop-shop marketing.
And all of these outages happening not long after most of them dismissed a large amount of experienced staff while moving jobs offshore to save in labor costs.
Short-term economic forces, probably. Centralization is often cheaper in the near term. The cost of designing in single-point failure modes gets paid later.
I think some of the issues in the last outage actually affected multiple regions. IIRC internally some critical infrastructure for AWS depends on us-east-1 or at least it failed in a way that didn't allow failover.
5 mins. of thought to figure out why these services exist?
Dialogue about mitigations/solutions? Alternative services? High availability strategies?
Nah! It's free to complain.
Me personally, I'd say those companies do a phenomenal job by being a de facto backbone of the modern web. Also Cloudflare, in particular, gives me a lot of things for free.
It's not really. People are just very bad at putting the things around them into perspective.
Your power is provided by a power utility company. They usually serve an entire state, if not more than one (there are smaller ones too). That's "centralization" in that it's one company, and if they "go down", so do a lot of businesses. But actually it's not "centralized", in that 1) there are actually many different companies across the country/world, and 2) each company "decentralizes" most of its infrastructure to prevent massive outages.
And yes, power utilities have outages. But usually they are limited in scope and short-lived. They're so limited that most people don't notice when they happen, unless it's a giant weather system. Then if it's a (rare) large enough impact, people will say "we need to reform the power grid!". But later when they've calmed down, they realize that would be difficult to do without making things worse, and this event isn't common.
Large internet service providers like AWS, Cloudflare, etc, are basically internet utilities. Yes they are large, like power utilities. Yes they have outages, like power utilities. But the fact that a lot of the country uses them, isn't any worse than a lot of the country using a particular power company. And unlike the power companies, we're not really that dependent on internet service providers. You can't really change your power company; you can change an internet service provider.
Power didn't used to be as reliable as it is. Everything we have is incredibly new and modern. And as time has passed, we have learned how to deal with failures. Safety and reliability have increased throughout critical industries as we have learned to adapt to failures. But that doesn't mean there won't be failures, or that we can avoid them all.
We also have the freedom to architect our technology to work around outages. All the outages you have heard about recently could be worked around, if the people who built on them had tried:
- CDN goes down? Most people don't absolutely need a CDN. Point your DNS at your origins until the CDN comes back. (And obviously, your DNS provider shouldn't be the same as your CDN...)
- The control plane goes down on dynamic cloud APIs? Enable a "limp mode" that persists existing infrastructure to serve your core needs. You should be able to service most (if not all) of your business needs without constantly calling a control plane.
- An AZ or region goes down? Use your disaster recovery plan: deploy infrastructure-as-code into another region or AZ. Destroy it when the az/region comes back.
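The first workaround can be sketched as a small failover routine. The record name, IP addresses, and `update_dns` callback below are hypothetical stand-ins for whatever your DNS provider's API actually offers:

```python
def choose_target(cdn_healthy: bool, origin_ip: str, cdn_ip: str) -> str:
    """Serve through the CDN when it's up; fall back to the origin otherwise."""
    return cdn_ip if cdn_healthy else origin_ip

def failover(cdn_healthy: bool, update_dns,
             origin_ip: str = "203.0.113.10", cdn_ip: str = "198.51.100.7") -> None:
    # update_dns(name, ip) is whatever call your (non-CDN!) DNS provider exposes.
    update_dns("www.example.com", choose_target(cdn_healthy, origin_ip, cdn_ip))

records = {}
failover(cdn_healthy=False, update_dns=lambda name, ip: records.__setitem__(name, ip))
print(records)  # {'www.example.com': '203.0.113.10'}
```

In practice you would also drop the record's TTL well ahead of time and expose the origin directly (with its own minimal rate limiting), both of which this sketch leaves out.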
...and all of that just to avoid a few hours of downtime per year? It's likely cheaper to just take the downtime. But that doesn't stop people from piling on when things go wrong, questioning whether the existence of a utility is a good idea.
Users are never a consideration today anyway.
As long as HN is up and running, everything is going to be O.K.!
In this case, the internet should be down more often.
oh no
Which changes nothing about you actually being down; you're only down more. CF proxies always sucked - not your domain, not your domain...
This:
> if you’re down, that’s bad. But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.
is just marketing. If you are down with some other websites it is still bad.
When have users been asked about anything?
This is like a bad motivational-speaker talk: heavy exhortations with a dramatic lack of actual reasoning.
Systems are difficult, people. It is the "incentives" of parties and lock-in by tech design and vendors, not a lack of individual effort.
Not every company can be an expert at everything.
But perhaps many of us could buy a different CDN than the major players if we want to reduce the likelihood of mass outages like this though.
Eventually?
GuB-42|3 months ago
It is not as bad as Cloudflare or AWS because certificates will not expire the instant there is an outage, but considers that:
- It serves about 2/3 of all websites
- TLS is becoming more and more critical over time. If certificates fail, the web may as well be down
- Certificate lifetimes are becoming shorter and shorter, now 90 days, but Let's Encrypt is now considering 6 days, with 47 days being planned as a minimum
- An outage is one thing, but should a compromise happen, that would be even more catastrophic
Let's Encrypt is a good guy now, but remember that Google used to be a good guy in the 2000s too!
phasmantistes|3 months ago
I'm also concerned about LE being a single point of failure for the internet! I really wish there were other free and open CAs out there. Our goal is to encrypt the web, not to perpetuate ourselves.
That said, I'm not sure the line of reasoning here really holds up? There's a big difference between this three-hour outage and the multi-day outage that would be necessary to prevent certificate renewal, even with 6-day certs. And there's an even bigger difference between this sort of network disruption and the kind of compromise that would be necessary to take LE out permanently.
So while yes, I share your fear about the internet-wide impact of total Let's Encrypt collapse, I don't think that these situations are particularly analogous.
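The renewal-slack arithmetic behind this point can be sketched roughly. The assumption that renewal starts when one third of the lifetime remains is illustrative (it matches certbot's default behavior, not a universal rule):

```python
# Sketch: how long a CA outage must last before on-schedule certificates
# start expiring, for various certificate lifetimes. Renewal is assumed to
# begin when 1/3 of the lifetime remains -- an illustrative assumption.

def outage_budget_days(lifetime_days: float, renew_fraction: float = 1/3) -> float:
    """Days a CA can be down before an on-schedule cert expires.

    A cert renewed on schedule still has lifetime * renew_fraction
    of validity left; that remainder is the outage budget.
    """
    return lifetime_days * renew_fraction

for lifetime in (90, 47, 6):
    print(f"{lifetime:>2}-day certs: ~{outage_budget_days(lifetime):.1f} days of CA outage tolerated")
```

Under these assumptions, even 6-day certificates leave roughly a two-day budget, which is why a three-hour outage sits nowhere near the renewal cliff.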
seniorThrowaway|3 months ago
Let’s Encrypt is great at making the existing system less painful, and there are a few alternatives like ZeroSSL, but all of this automation is basically a pile of workarounds on top of a fundamentally inappropriate design.
b00ty4breakfast|3 months ago
pixel_popping|3 months ago
It's genuinely insane that many companies design a great number of fallbacks at the software level while almost no thought goes into the hardware/infrastructure level. Common sense dictates that you should never host everything with a single provider.
geerlingguy|3 months ago
imglorp|3 months ago
elondaits|3 months ago
Then, with regular VPSs I also had systems down for 1-2 days. Just last week the company that hosts NextCloud for us was down the whole weekend (from Friday evening) and we couldn’t get their attention until Monday.
So far these huge outages that last 2-5 hours are still lower impact for me, and require me to take less action.
MattSayar|3 months ago
nzach|3 months ago
It depends on how you calculate your cost. If you only include the physical infrastructure, a dedicated server is cheaper. But with a dedicated server you lose a lot of flexibility. Need more resources? With AWS you just scale up your EC2 instance; with a dedicated server there is a lot more work involved.
Do you want a 'production-ready' database? With AWS you can click a few buttons and have an RDS instance ready to use. To roll out your own PG installation you need someone with a lot of knowledge (how to configure replication? backups? updates? ...).
So if you include salaries in the calculation, the result changes a lot. And even if you already have experts on your payroll, putting them to work deploying a PG instance means you can't use them to build other things that might generate more value for your business than the premium you pay AWS.
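The salary point can be made concrete with a back-of-the-envelope comparison. Every figure below is a made-up placeholder; the only claim is that engineer hours, not hardware, dominate the total:

```python
# Illustrative total-cost comparison between a managed database (e.g. RDS)
# and a self-hosted one on cheaper dedicated hardware. All numbers are
# hypothetical placeholders, not real pricing.

def yearly_tco(infra_per_month: float, ops_hours_per_month: float,
               hourly_rate: float) -> float:
    """Yearly infrastructure cost plus the engineer time spent operating it."""
    return 12 * (infra_per_month + ops_hours_per_month * hourly_rate)

managed = yearly_tco(infra_per_month=600, ops_hours_per_month=4, hourly_rate=100)
selfhosted = yearly_tco(infra_per_month=150, ops_hours_per_month=30, hourly_rate=100)

print(f"managed:     ${managed:,.0f}/yr")
print(f"self-hosted: ${selfhosted:,.0f}/yr")
```

With these (hypothetical) inputs the cheaper hardware loses once replication, backups, and updates consume engineer hours; change the hours and the conclusion flips, which is exactly why the answer depends on how you calculate your cost.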
slightwinder|3 months ago
powerpixel|3 months ago
PaulHoule|3 months ago
I was at Softlayer before I was at AWS, and what catalyzed the move was the time I needed to add another hard drive to a system and somehow they screwed it up. I couldn't put in a trouble ticket to get it fixed because my database record in their trouble-ticket system was corrupted. The next day I moved my stuff to AWS, and the day after that they had a top sales guy talk to me to try to get me to stay, but it was too late.
lforster|3 months ago
nexttk|3 months ago
"I added a load-balancer to improve system reliability" (happy)
"Load balancer crashed" (smiling-through-the-pain)
amalcon|3 months ago
kevin_thibedeau|3 months ago
MichaelZuo|3 months ago
sotix|3 months ago
gspencley|3 months ago
So while those of us in tech might like a "snow day", there are millions of small businesses and people trying to go about their day-to-day lives who get cut off because of someone else's fuck-ups when this happens.
cultofmetatron|3 months ago
As a software engineer, I get it. As a CTO, I spent this morning triaging with my devops AI (actual Indian) to find a workaround (we found one) while our CEO did damage control with customers (in a non-technical field) who were angry that we were down and losing business by the minute.
Sometimes I miss not having a direct stake in the success of the business.
hashim|3 months ago
ljm|3 months ago
But there are systems that depend on Cloudflare, directly or not, and when they go down it can have a serious impact on somebody's livelihood.
majani|3 months ago
amw-zero|3 months ago
swyx|3 months ago
mobiuscog|3 months ago
If it can tell us that the host is up, surely it can just bypass itself to route traffic.
ralferoo|3 months ago
Congratulations, you've successfully completed Management Training 101.
ec109685|3 months ago
lbreakjai|3 months ago
tacker2000|3 months ago
a012|3 months ago
kaonwarb|3 months ago
giantrobot|3 months ago
It sucks there's not more competition in this space but CloudFlare isn't widely used for no reason.
AWS also solves real problems people have. Maintaining infrastructure is expensive as is hardware service and maintenance. Redundancy is even harder and more expensive. You can run a fairly inexpensive and performant system on AWS for years for the cost of a single co-located server.
seydor|3 months ago
neop1x|3 months ago
alkonaut|3 months ago
It's like github vs whatever else you can do with git that is truly decentralized. The centralization has such massive benefits that I'm very happy to pay the price of "when it's down I can't work".
kahrl|3 months ago
Very worrying indeed.
rglover|3 months ago
bilekas|3 months ago
It's very frustrating of course, and it's the nature of the beast.
blazinglyfast|3 months ago
bikamonki|3 months ago
phendrenad2|3 months ago
drob518|3 months ago
gist|3 months ago
These decisions are made individually, not centrally. There is no process in place (and most likely there never will be) that can control or dictate how people do things once they decide one way is the best way to do it, even assuming they understand everything or know all the pitfalls.
And even if you can control what you do for the site you operate (or are involved in), you have no control over the parts of your site (or business) that rely on others who use AWS or Cloudflare.
ljm|3 months ago
AWS - someone touches DynamoDB and it kills the DNS.
Cloudflare - someone touches functionality completely unrelated to DNS hosting and proxying and, naturally, it kills the DNS.
There is this critical infrastructure that just becomes one small part of a wider product offering, worked on by many hands, and this critical infrastructure gets taken down by what is essentially a side-effect.
It's a strong argument to move to providers that just do one thing and do it well.
exasperaited|3 months ago
It has been dead to me since the SSL cache vulnerability thing and the arrogance with which senior people expected others to solve their problems.
But consider how many people still do stupid things like use the default CDN offered by some third party library, or use google fonts directly; people are lazy and don't care.
abtinf|3 months ago
telepromptereye|3 months ago
an-allen|3 months ago
But I dare say the folks at these organisations take these matters incredibly seriously and the centralisation problem is largely one of risk efficiency.
I think there is no excuse, however, to not have multi region on state, and pilot light architectures just in case.
cj|3 months ago
A lot (and I mean a lot) of people in IT like centralization specifically because it’s hard to blame people for doing something that everyone else is doing.
iso1631|3 months ago
I'd be horrified. That's not the internet or computing industries I grew up with, or started working in.
But as long as the SPY keeps hitting > 10% returns each year, everyone's happy.
chb|3 months ago
mvkel|3 months ago
bsoles|3 months ago
As always, in the name of "security". When are we going to learn that anything done, either by the government or by a corporation, in the name of security is always bad for the average person?
butlike|3 months ago
rcarmo|3 months ago
expedition32|3 months ago
hhthrowaway1230|3 months ago
People not being ready for Cloudflare/[insert hyperscaler] to possibly be down is the only real fault here.
Lammy|3 months ago
kilpikaarna|3 months ago
gmiller123456|3 months ago
poemxo|3 months ago
bawolff|3 months ago
chasing0entropy|3 months ago
strict9|3 months ago
ridgeguy|3 months ago
kordlessagain|3 months ago
paulddraper|3 months ago
And it’s hard to protect against DDoS without something like Cloudflare.
Look at the posts here.
Even the meager HN “hug of death” will take things down
peacebeard|3 months ago
rtkwe|3 months ago
unknown|3 months ago
[deleted]
unknown|3 months ago
[deleted]
BurningFrog|3 months ago
nntwozz|3 months ago
ronald_petty|3 months ago
burnt-resistor|3 months ago
glitchc|3 months ago
ekianjo|3 months ago
k12sosse|3 months ago
baq|3 months ago
joeiq|3 months ago
whstl|3 months ago
Companies can always do as they please and people will rationalize anything.
moralestapia|3 months ago
Dialogue about mitigations/solutions? Alternative services? High availability strategies?
Nah! It's free to complain.
Me personally, I'd say those companies do a phenomenal job by being a de facto backbone of the modern web. Also Cloudflare, in particular, gives me a lot of things for free.
fithisux|3 months ago
The target these days is the user.
The make-believe worm.
thenthenthen|3 months ago
0xbadcafebee|3 months ago
Your power is provided by a power utility company. They usually serve an entire state, if not more than one (there are smaller ones too). That's "centralization" in that it's one company, and if they "go down", so do a lot of businesses. But actually it's not "centralized", in that 1) there are actually many different companies across the country/world, and 2) each company "decentralizes" most of its infrastructure to prevent massive outages.
And yes, power utilities have outages. But usually they are limited in scope and short-lived. They're so limited that most people don't notice when they happen, unless it's a giant weather system. Then if it's a (rare) large enough impact, people will say "we need to reform the power grid!". But later when they've calmed down, they realize that would be difficult to do without making things worse, and this event isn't common.
Large internet service providers like AWS, Cloudflare, etc, are basically internet utilities. Yes they are large, like power utilities. Yes they have outages, like power utilities. But the fact that a lot of the country uses them, isn't any worse than a lot of the country using a particular power company. And unlike the power companies, we're not really that dependent on internet service providers. You can't really change your power company; you can change an internet service provider.
Power didn't used to be as reliable as it is. Everything we have is incredibly new and modern. And as time has passed, we have learned how to deal with failures. Safety and reliability has increased throughout critical industries as we have learned to adapt to failures. But that doesn't mean there won't be failures, or that we can avoid them all.
We also have the freedom to architect our technology to work around outages. All the outages you have heard about recently could be worked around, if the people who built on them had tried:
- CDN goes down? Most people don't absolutely need a CDN. Point your DNS at your origins until the CDN comes back. (And obviously, your DNS provider shouldn't be the same as your CDN...)
- The control plane of your dynamic cloud APIs goes down? Enable a "limp mode" that keeps existing infrastructure serving your core needs. You should be able to service most (if not all) of your business needs without constantly calling a control plane.
- An AZ or region goes down? Use your disaster-recovery plan: deploy infrastructure-as-code into another AZ or region, and destroy it when the original comes back.
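The first workaround above (CDN down, so point DNS at your origins) can be sketched as a health probe plus a pure routing decision. The DNS update itself is provider-specific, so it is left out; the origin IPs and CDN hostname below are placeholder examples:

```python
# Sketch of the "point DNS at your origins until the CDN comes back" fallback.
# Probing and deciding are shown; pushing the chosen records through your DNS
# provider's API is provider-specific and omitted here.
import urllib.request

ORIGIN_IPS = ["203.0.113.10", "203.0.113.11"]  # example origin addresses

def cdn_healthy(probe_url: str, timeout: float = 5.0) -> bool:
    """Probe the CDN edge; any non-5xx HTTP response counts as 'up'."""
    try:
        with urllib.request.urlopen(probe_url, timeout=timeout) as resp:
            return resp.status < 500
    except OSError:
        return False

def choose_targets(cdn_up: bool, cdn_cname: str, origins: list[str]) -> dict:
    """Pure decision logic: CNAME to the CDN when healthy, A records to origins otherwise."""
    if cdn_up:
        return {"type": "CNAME", "values": [cdn_cname]}
    return {"type": "A", "values": origins}
```

For this to help in practice, the DNS records need short TTLs set up in advance, which is part of the preparation the comment is arguing for.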
...and all of that just to avoid a few hours of downtime per year? It's likely cheaper to just take the downtime. But that doesn't stop people from piling on when things go wrong, questioning whether the existence of a utility is a good idea.
cyanydeez|3 months ago
Are people really this confused?
TacticalCoder|3 months ago
[deleted]