top | item 46168481

(no title)

mixedbit | 2 months ago

This is architectural problem, the LUA bug, the longer global outage last week, a long list of earlier such outages only uncover the problem with architecture underneath. The original, distributed, decentralized web architecture with heterogeneous endpoints managed by myriad of organisations is much more resistant to this kind of global outages. Homogeneous systems like Cloudflare will continue to cause global outages. Rust won't help, people will always make mistakes, also in Rust. Robust architecture addresses this by not allowing a single mistake to bring down myriad of unrelated services at once.

discuss

tobyjsullivan|2 months ago

I’m not sure I share this sentiment.

First, let’s set aside the separate question of whether monopolies are bad. They are not good but that’s not the issue here.

As to architecture:

Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

But there’s a more interesting argument in favour of the status quo.

Assuming cloudflare’s uptime is above average, outages affecting everything at once is actually better for the average internet user.

It might not be intuitive but think about it.

How many Internet services does someone depend on to accomplish something such as their work over a given hour? Maybe 10 directly, and another 100 indirectly? (Make up your own answer, but it’s probably quite a few).

If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.

It’s not really bad end user experience that every service uses cloudflare. It’s more-so a question of why is cloudflare’s stability seeming to go downhill?

And that’s a fair question. Because if their reliability is below average, then the value prop evaporates.

ccakes|2 months ago

> If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

The point is that it doesn’t matter. A single site going down has a very small chance of impacting a large number of users. Cloudflare going down breaks an appreciable portion of the internet.

If Jim’s Big Blog only maintains 95% uptime, most people won’t care. If BofA were at 95%.. actually same. Most of the world aren’t BofA customers.

If Cloudflare is at 99.95% then the world suffers

hectormalot|2 months ago

> On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.

I think the parent post made a different argument:

- Centralizing most of the dependency on Cloudflare results in a major outage when something happens at Cloudflare, it is fragile because Cloudflare becomes the single point of failure. Like: Oh Cloudflare is down... oh, none of my SaaS services work anymore.

- In a world where this is not the case, we might see more outages, but they would be smaller and more contained. Like: oh, Figma is down? fine, let me pickup another task and come back to Figma once it's back up. It's also easier to work around by having alternative providers as a fallback, as they are less likely to share the same failure point.

As a result, I don't think you'll be blocked 100 hours a year in scenario 2. You may observe 100 non-blocking inconveniences per year, vs a completely blocking Cloudflare outage.

And in observed uptime, I'm not even sure these providers ever won. We're running all our auxiliary services on a decent Hetzner box with a LB. Say what you want, but that uptime is looking pretty good compared to any services relying on AWS (Oct 20, 15 hours), Cloudflare (Dec 5 (half hour), Nov 18 (3 hours)). Easier to reason about as well. Our clients are much more forgiving when we go down due to Azure/GCP/AWS/Cloudflare vs our own setup though...

dfex|2 months ago

> If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year. > On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.

Putting Cloudflare in front of a site doesn't mean that site's backend suddenly never goes down. Availability will now be worse - you'll have Cloudflare outages* affecting all the sites they proxy for, along with individual site back-end failures which will of course still happen.

* which are still pretty rare

randmeerkat|2 months ago

> If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

I’m tired of this sentiment. Imagine if people said, why develop your own cloud offering? Can you really do better than VMWare..?

Innovation in technology has only happened because people dared to do better, rather than giving up before they started…

fallous|2 months ago

"My architecture depends upon a single point of failure" is a great way to get laughed out of a design meeting. Outsourcing that single point of failure doesn't cure my design of that flaw, especially when that architecture's intended use-case is to provide redundancy and fault-tolerance.

The problem with pursuing efficiency as the primary value prop is that you will necessarily end up with a brittle result.

kjgkjhfkjf|2 months ago

That's an interesting point, but in many (most?) cases productivity doesn't depend on all services being available at the same time. If one service goes down, you can usually be productive by using an alternative (e.g. if HN is down you go to Reddit, if email isn't working you catch up with Slack).

geysersam|2 months ago

On the other hand, if one site is down you might have alternatives. Or, you can do something different until the site you needed is up again. Your argument that simultaneous downtime is more efficient than uncoordinated downtime because tasks usually rely on multiple sites being online simultaneously is an interesting one. Whether or not that's true is an empirical question, but I lean toward thinking it's not true. Things failing simultaneously tends to have worse consequences.

nialse|2 months ago

Paraphrasing: We are setting aside the actual issue and looking for a different angle.

To me this reads as a form of misdirection, intentional or not. A monopolist has little reason to care about downstream effects, since customers have nowhere else to turn. Framing this as roll your own versus Cloudflare rather than as a monoculture CDN environment versus a diverse CDN ecosystem feels off.

That said, the core problem is not the monopoly itself but its enablers, the collective impulse to align with whatever the group is already doing, the desire to belong and appear to act the "right way", meaning in the way everyone else behaves. There are a gazillion ways of doing CDN, why are we not doing them? Why the focus on one single dominant player?

wat10000|2 months ago

That’s fine if it’s just some random office workers. What if every airline goes down at the same time because they all rely on the same backend providers? What if every power generator shuts off? “Everything goes down simultaneously” is not, in general, something to aim for.

atmosx|2 months ago

CloudFlare doesn’t have a good track record. It’s the third party that caused more outages for us than any other third party service in the last four years.

Nextgrid|2 months ago

> If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

I disagree; most people need only a subset of Cloudflare's features. Operating just that subset avoids the risk of the other moving parts (that you don't need anyway) ruining your day.

Cloudflare is also a business and has its own priorities like releasing new features; this is detrimental to you because you won't benefit from said feature if you don't need it, yet still incur the risk of the deployment going wrong like we saw today. Operating your own stack would minimize such changes and allow you to schedule them to a maintenance window to limit the impact should it go wrong.

The only feature Cloudflare (or its competitors) offers that can't be done cost-effectively yourself is volumetric DDoS protection where an attacker just fills your pipe with junk traffic - there's no way out of this beyond just having a bigger pipe, which isn't reasonable for any business short of an ISP or infrastructure provider.

smsm42|2 months ago

> If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

That's a wrong way of looking at it though. For 99.99% individual sites, I wouldn't care if they were down for weeks. Even if I use this site, there are very few sites that I need to use daily. For the rest of them, if one of them randomly goes down I probably would never know or notice, because I didn't need it then. However, when single-point-of-failure provider, like Cloudflare, goes down, you bet I notice. I must notice, because my work would be affected, my CI/CD pipelines will start failing, my newsfeeds will stop, I will notice it in dozens of places - because everybody uses it. The aggregated fails-per-unit-of-time may be less but the impact of each fail is way, way more, and the probability of it impacting me is approaching certainty.

So for me, as an average internet user, it would be much better if all the world wouldn't go down at once, even if the instances of particular things going down would be more frequent - provided they are randomly distributed in time and not concentrated. If just one thing goes down, I could do another thing. If everything goes down, I can only sit and twiddle my thumbs until it's back up.

sunrunner|2 months ago

> If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

This doesn’t guarantee availability of those N services themselves though, surely? N services with a slightly lower availability target than N+1 with a slightly higher value?

More importantly, I’d say that this only works for non-critical infrastructure, and also assumes that the cost of bringing that same infrastructure back is constant or at least linear or less.

The 2025 Iberian Peninsula outage seems to show that’s not always the case.

clickety_clack|2 months ago

If you’re using 10 services and 1 goes down, there’s a 9/10 chance you’re not using it and you can switch to work on something else. If all 10 go down you are actually blocked for an hour. Even 5 years ago, I can’t recall ever being actually impacted by an outtage to the extent that I was like “well, might as well just go get something to eat because everything is down”.

lxgr|2 months ago

> If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

The consequence of some services being offline is much, much worse than a person (or a billion) being bored in front of a screen.

Sure, it’s arguably not Cloudflares fault that these services are cloud-dependent in the first place, but even if service just degrades somewhat gracefully in an ideal case, that’s a lot of global clustering of a lot of exceptional system behavior.

Or another analogy: Every person probably passes out for a few minutes in their live at one point or another. Yet I wouldn’t want to imagine what happens if everybody got that over with at the very same time without warning…

tonyhb|2 months ago

Cloudbleed. It’s been a fun time.

embedding-shape|2 months ago

> Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

Why is that the only option? Cloudflare could offer solutions that let people run their software themselves, after paying some license fee. Or there could be many companies people use instead, instead of everyone flocking to one because of cargoculting "You need a CDN like Cloudflare before you launch your startup bro".

chamomeal|2 months ago

When I’m working from home and the internet goes down, I don’t care. My poor private-equity owned corporation, think of the lost productivity!!

But if I was trying to buy insulin at 11 pm before benefits expire, or translate something at a busy train station in a foreign country, or submit my take-home exam, I would be freeeaaaking out.

The cloudflare-supported internet does a whole lot of important, time-critical stuff.

gerdesj|2 months ago

All of my company's hosted web sites have way better uptimes and availability than CF but we are utterly tiny in comparison.

With only some mild blushing, you could describe us as "artisanal" compared to the industrial monstrosities, such as Cloudflare.

Time and time again we get these sorts of issues with the massive cloudy chonks and they are largely due to the sort of tribalism that used to be enshrined in the phrase: "no one ever got fired for buying IBM".

We see the dash to the cloud and the shoddy state of in house corporate IT as a result. "We don't need in-house knowledge, we have "MS copilot 365 office thing" that looks after itself and now its intelligent - yay \o/

Until I can't, I'm keeping it as artisanal as I can for me and my customers.

WD-42|2 months ago

In other words, the consolidation on Cloudflare and AWS makes the web less stable. I agree.

amazingman|2 months ago

Usually I am allergic to pithy, vaguely dogmatic summaries like this but you're right. We have traded "some sites are down some of the time" for "most sites are down some of the time". Sure the "some" is eliding an order of magnitude or two, but this framing remains directionally correct.

3rodents|2 months ago

Would you rather be attacked by 1,000 wasps or 1 dog? A thousand paper cuts or one light stabbing? Global outages are bad but the choice isn’t global pain vs local pleasure. Local and global both bring pain, with different, complicated tradeoffs.

Cloudflare is down and hundreds of well paid engineers spring into action to resolve the issue. Your server goes down and you can’t get ahold of your Server Person because they’re at a cabin deep in the woods.

Lamprey|2 months ago

It's not "1,000 wasps or 1 dog", it's "1,000 dogs at once, or "1 dog at once, 1,000 different times". Rare but huge and coordinated siege, or a steady and predictable background radiation of small issues.

The latter is easier to handle, easier to fix, and much more suvivable if you do fuck it up a bit. It gives you some leeway to learn from mistakes.

If you make a mistake during the 1000 dog siege, or if you don't have enough guards on standby and ready to go just in case of this rare event, you're just cooked.

psunavy03|2 months ago

If you've allowed your Server Person to be a single point of failure out innawoods, that's an organizational problem, not a technological one.

Two is one and one is none.

gblargg|2 months ago

Why would there be a centralized outage of decentralized services? The proper comparison seems to be attacked by a dog or a single wasp.

jchw|2 months ago

In most cases we actually get both local and global pain, since most people are running servers behind Cloudflare.

delusional|2 months ago

What you've identified here is a core part of what the banking sector calls the "risk based approach". Risk in that case is defined as the product of the chance of something happening and the impact of it happening. With this understanding we can make the same argument you're making, a little more clearly.

Cloudflare is really good at what they do, they employ good engineering talent, and they understand the problem. That lowers the chance of anything bad happening. On the other hand, they achieve that by unifying the infrastructure for a large part of the internet, raising the impact.

The website operator herself might be worse at implementing and maintaining the system, which would raise the chance of an outage. Conversely, it would also only affect her website, lowering the impact.

I don't think there's anything to dispute in that description. The discussion then is if cloudflares good engineering lowers the chance of an outage happening more than it raises the impact. In other words, the things we can disagree about is the scaling factors, the core of the argument seems reasonable to me.

JumpCrisscross|2 months ago

> Homogeneous systems like Cloudflare will continue to cause global outages

But the distributed system is vulnerable to DDOS.

Is there an architecture that maintains the advantages of both systems? (Distributed resilience with a high-volume failsafe.)

NicoJuicy|2 months ago

You should really check Cloudflare.

There is not a single company that makes their infrastructure as globally available like Cloudflare.

Additionally, the downtime of Cloudflare seems to be objectively less than the others.

Now, it took 25 minutes for 28% of the network.

While being the only ones to fix a global vulnerability.

There is a reason other clouds wouldn't touch the responsiveness and innovation that Cloudflare brings.

ivanjermakov|2 months ago

Robust architecture that is serving 80M requests/second worldwide?

My answer would be that no one product should get this big.

rekrsiv|2 months ago

On the other hand, as long as the entire internet goes down when Cloudflare goes down, I'll be able to host everything there without ever getting flack from anyone.

johncolanduoni|2 months ago

Actually, maybe 1 hour downtime for ~ the whole internet every month is a public good provided by Cloudflare. For everyone that doesn’t get paged, that is.

terminalshort|2 months ago

It's not as simple as that. What will result in more downtime, dependency on a single centralized service or not being behind Cloudflare? Clearly it's the latter or companies wouldn't be behind Cloudflare. Sure, the outages are more widespread now than they used to be, but for any given service the total downtime is typically much lower than before centralization towards major cloud providers and CDNs.

Klonoar|2 months ago

> Rust won't help, people will always make mistakes, also in Rust.

They don't just use Rust for "protection", they use it first and foremost for performance. They have ballpark-to-matching C++ performance with a realistic ability to avoid a myriad of default bugs. This isn't new.

You're playing armchair quarterback with nothing to really offer.

UltraSane|2 months ago

They badly need smaller blast radius and to use more chaos engineering tools.

cbsmith|2 months ago

I find this sentiment amusing when I consider the vast outages of the "good ol' days".

What's changed is a) our second-by-second dependency on the Internet and b) news/coverage.

theoldgreybeard|2 months ago

Notwithstanding that most people using Cloudflare aren't even benefiting from what it actually provides. They just use it...because reasons.

lxgr|2 months ago

Not too long ago, critical avionics were programmed by different software developers and the software was run on different hardware architectures, produced by different manufacturers. These heterogeneous systems produced combined control outputs via a quorum architecture – all in a single airplane.

Now half of the global economy seems to run on same service provider, it seems…

psychoslave|2 months ago

That's a reflect of social organisation. Pushing for hierarchical organisation with a few key centralising nodes will also impact business and technological decisions.

steelblueskies|2 months ago

Reductionist, but it's a backup problem.

Data matters? Have multiple copies, not all in the same place.

This is really no different, yet we don't have those redundancies in play.

Host, and paths.

Every other take is ultimately just shuffling justification around the least bad for everyone lack of backups for cost saving.

m00dy|2 months ago

Obviously Rust is the answer to these kind of problems. But if you are cloudflare and have an important company at a global scale, you need to set high standarts for your rust code. Developers should dance and celebrate end of the day if their code compiles in rust.

chickensong|2 months ago

You're not wrong, but where's the robust architecture you're referring to? The reality of providing reliable services on the internet is far beyond the capabilities of most organizations.

coderjames|2 months ago

I think it might be a organizational architecture that needs to change.

> However, we have never before applied a killswitch to a rule with an action of “execute”.

> This is a straightforward error in the code, which had existed undetected for many years

So they shipped an untested configuration change that triggered untested code straight to production. This is "tell me you have no tests without telling me you have no tests" level of facepalm. I work on safety-critical software where if we had this type of quality escape both internal auditors and external regulators would be breathing down our necks wondering how our engineering process failed and let this through. They need to rearchitect their org to put greater emphasis on verification and software quality assurance.

jonhess|2 months ago

Yeah, redundancy and efficiency are opposites. As engineers, we always chase efficiency, but resilience and redundancy are related.

rossjudson|2 months ago

You have a heterogeneous, fault-free architecture for the Cloudflare problem set? Interesting! Tell us more.

cyanydeez|2 months ago

Bro, but how do we make shareholder value if we don't monopolize and enshittify everything