
Glory is only 11MB/sec away (2023)

241 points | todsacerdoti | 1 year ago | thmsmlr.com

157 comments

[+] mattbee|1 year ago|reply
My little old hosting business lived and died on this particular hill, I just didn't realise what was happening at the time.

When we grew up in the early 2000s, our bigger sales usually featured complicated stacks. They had redundant load-balancing, redundant firewalls - way more than many customers ever needed (but they did ask for it). The failover often cost more in management complexity than it saved when a box died in the "right" way to trigger the planned failover event.

We sold ourselves on cleverness when people asked for that. It worked to grow our business to 30 staff, a data centre in our home city, life was good! We responded to AWS with an API-based cloud hosting platform. But sales still peaked in 2012.

Customers wanted even more complex solutions than the ones we were selling - partially or wholly based on AWS. But - we figured - the hardware we were buying was hugely powerful compared to 10 years previously, and sites weren't that much more complicated. The bigger customers would (surely!) want fewer, less complicated boxes as a result. Unfortunately that is not selling on cleverness, that is selling on price, and we never understood the financial ambition needed for that pivot. Nobody trusted a single cheap server, and even if they bought two, where was the scalability? It worked well enough to keep revenue flat, but we obviously couldn't compete on building managed service stacks and software ecosystems quicker than Amazon.

When the new technical challenges had long dried-up, we sold in 2018.

My thinking (and so most of the company's product design) came from being bootstrapped where the possibility of an uncapped hosting bill seemed like an insane risk to take. Who would take it? (wait - what - why was everyone taking it??!)

AWS are embedded not just because VC makes their high-priced products feasible, but because their particular brand of cleverness is embedded in a generation of software developers. It obviously works! But the knowledge of when you might not need their cloud (or what the alternatives could ever be) feels quite a niche thing now.

[+] andyjohnson0|1 year ago|reply
Interesting story. Thanks for posting.

Unfortunately AWS's complexity and cleverness is catnip for developers, and its support for résumé-driven development is second to none.

[+] bostik|1 year ago|reply
There are a couple of things wrong with the numbers.

1. Traffic is not evenly spread. The figures from the article (400M page loads per month) are subject to a recursive 80/20 rule. 80% of the requests (320M) are served within 20% of the time (6 days). Within those high-volume days, 80% of the requests (256M) are served in 20% of the time (~29h). And if you're serving particularly spiky traffic patterns, then 80% of those requests (~205M) come in during just 20% of the time window (5.8h) -- 5.8h is a little shy of 21000 seconds. 205M / 21k is about 9.7k requests per second.

That's still doable on a single system, but it's no longer trivial. Especially if you want to run off a single DB with no read replicas. And while the total amount of traffic served remains the same, the necessary bandwidth cap for peak loads gets far above the optimistically averaged 11MB/sec.
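The recursive 80/20 arithmetic above is easy to check with a few lines. The 400M/month figure comes from the article; the 30-day month and the three-fold recursion are this comment's assumptions:

```python
# Back-of-envelope for the recursive 80/20 rule described above.
# Assumption: 400M page loads over a 30-day month.
requests = 400_000_000
seconds = 30 * 24 * 3600

avg_rps = requests / seconds  # the optimistic flat average

# Apply 80/20 three times: 80% of the traffic lands in 20% of the time,
# recursively, as the comment walks through.
for _ in range(3):
    requests *= 0.8
    seconds *= 0.2
peak_rps = requests / seconds

print(round(avg_rps))   # ~154 req/s on average
print(round(peak_rps))  # ~9.9k req/s in the spikiest stretch
```

The small difference from the comment's ~9.7k is just rounding (20736s vs the "a little shy of 21000" approximation); either way, peak load is roughly 60x the flat average.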

2. Unidirectional end-to-end latency only applies to streaming data. A cold start in the real world (no HTTP/3) first requires establishing the underlying TCP connection, which is three one-way trips, then the TLS handshake, which is a minimum of two more round trips, and only then do you get to send the actual HTTP request... which still needs a round trip for the response.
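Counting trips the way this comment does, the cold-start cost can be sketched as follows. The trip counts (3 one-way trips for the TCP handshake, 2 round trips for TLS 1.2, 1 for TLS 1.3, 1 for the request/response) follow the comment; the 40ms one-way latency is an illustrative assumption, not a measured number:

```python
# Rough cold-start budget given a one-way latency `one_way` (seconds).
# One-way trips: TCP SYN/SYN-ACK/ACK = 3, TLS 1.2 handshake = 2 round
# trips = 4 (TLS 1.3 needs one round trip fewer), HTTP request +
# response = 2 more.
def cold_start_seconds(one_way, tls13=False):
    trips = 3 + (2 if tls13 else 4) + 2
    return trips * one_way

print(f"{cold_start_seconds(0.040):.2f}s")        # TLS 1.2: 0.36s
print(f"{cold_start_seconds(0.040, True):.2f}s")  # TLS 1.3: 0.28s
```

So even a modest 40ms one-way latency eats a third of the one-second budget mentioned below before the first byte of the body arrives.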

If you want to serve real humans, everything observable has to happen in less than one second[0]. After that there's a steep drop-off as users assume your system is broken and just close the tab.

Disclosure: in previous life I helped to run a betting exchange. The traffic patterns are extremely spiky, latency requirements are demanding and trading volume is highly concentrated in just a tiny fraction of the overall event window. For any activity involving live trades, we had to get the results on their screen within 100ms from the moment they initiated the action. That means their network roundtrip latency ate into our event processing budget.

0: https://www.nngroup.com/articles/response-times-3-important-...

[+] swiftcoder|1 year ago|reply
Point 1 is the kicker, I think, for something like a business insider clone. If a big twitter account links to one of your articles, you may get the whole damn lot of average monthly page loads over the next few minutes.

And you absolutely can deal with that sort of load in single large server configurations, but now we're not just building a webserver, we're building a pretty hardcore frontend load balancer (that happens to have an embedded webserver).

They may well charge through the nose for it, but a hell of a lot of engineering has gone into AWS' load balancers and network infrastructure, so that the rest of us don't have to become experts in that whole segment of the stack.

[+] bcaxis|1 year ago|reply
> 2. Unidirectional end-to-end latency only applies to streaming data.

Agreed that his "across the world" example is a bit silly, because it doesn't take connection establishment into account.

His primary point is still reasonable. How many services need worldwide reach? Did you build for multiple languages, too?

If you're in the US, or in the EU, a nice centralized server will have <=30 ms of latency to the entire region you're serving.

Edge is overvalued unless you do have true global needs, and then you also have to manage global database(s).

[+] robjan|1 year ago|reply
> 500 Internal Server Error

Looks like the author is getting more than 11MB/s traffic. Here's an archived version: https://archive.is/UVpg0

[+] floating-io|1 year ago|reply
I feel like this is the wrong way to look at it.

A better way IMO is: don't scale prematurely.

Build things as you need them. In the vast majority of cases, even CDNs are an unnecessary cost (presuming you're not paying the exorbitant cloud provider bandwidth tax). If you start to see performance issues, then deal with it as needed.

And if your workhorse suddenly grows that coveted single horn?

That's a problem you want to have!

[+] Manfred|1 year ago|reply
With AWS you also buy a scapegoat because it's much easier to explain to superiors or investors when a large cloud service has downtime than it is to explain the same cumulative downtime caused by human error in your team.
[+] waldrews|1 year ago|reply
But why SQLite, if you're going for vertical scaling? Nobody's stopping you from self hosting Postgres (or Supabase!) on the same server as your app, and I can't think of any disadvantage other than more effort to set up.

Now if you're willing to go DB-less, keep your whole global state in literal memory on one big server (no round-tripping to Redis or whatever, actual in-process objects), just occasionally snapshotting that memory to disk (this part's tricky), and use a compiled, multi-threaded language -- then you can saturate a Gbit or bigger NIC and literally serve the world from one box. I kind of wish I had a real use case for that architecture.
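That "state in ordinary process memory, occasionally snapshotted to disk" shape can be sketched in a few lines. This is a toy sketch of the idea only - a real version needs a write-ahead log and fsync to bound data loss between snapshots, and all the names here are illustrative:

```python
import json, os, tempfile, threading

# All global state lives as ordinary in-process objects, guarded by a lock.
state = {"counters": {}}
lock = threading.Lock()
PATH = os.path.join(tempfile.gettempdir(), "state-snapshot.json")

def snapshot():
    with lock:
        blob = json.dumps(state)
    # Write to a temp file, then atomically rename, so a crash mid-write
    # never leaves a half-written snapshot behind (this is the "tricky
    # part" the comment mentions).
    fd, tmp = tempfile.mkstemp(dir=tempfile.gettempdir())
    with os.fdopen(fd, "w") as f:
        f.write(blob)
    os.replace(tmp, PATH)

def restore():
    global state
    with open(PATH) as f:
        state = json.load(f)

with lock:
    state["counters"]["hits"] = 42
snapshot()
state = {}   # simulate a restart
restore()
print(state["counters"]["hits"])  # 42
```

In the compiled, multi-threaded language the comment envisions, the same pattern applies; the snapshot just serializes native structs instead of JSON.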

[+] thomascountz|1 year ago|reply
Why not SQLite? :) Of course the answer is always "it depends," but lately, I've seen the general "SQLite isn't a real database" ethos challenged more and more. Outside of standard relational persistence patterns, there can be significant feature differences that could very well mean Postgres is the better option. However, for some architectural patterns, SQLite could come out ahead! For a content-heavy application like BusinessInsider, the Baked Data pattern with SQLite might very well offer better cost and latency performance!

simonw (datasette) has built troves and troves of tools and writings about SQLite's production use for content-heavy and/or data rich websites: https://simonwillison.net/2021/Jul/28/baked-data/

[+] klabb3|1 year ago|reply
> But why SQLite, if you're going for vertical scaling?

From the benchmarks I’ve seen, because it’s significantly faster, specifically round trip times. This would make sense since SQLite is in-process and doesn’t need serialization[1]. Which in turn offers a second, optional advantage within reach - serial processing of operations, which is significantly easier to test, reason about, and build supporting cache layers around.

[1]: If you don’t use a Unix socket you also have networking overhead - but you said same server so I’ll leave this as a side-note since it’s extremely common to put postgres on a different machine for isolation. In fact, it’s one of the main advantages with networked dbs.
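The in-process point is easy to see with a quick timing sketch using Python's bundled sqlite3 module. Absolute numbers vary by machine, so treat the figure as illustrative:

```python
import sqlite3, time

# An in-process database: a query is a function call, not a network
# round trip, so there is no serialization or socket hop to pay for.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO kv VALUES ('a', '1')")

N = 10_000
start = time.perf_counter()
for _ in range(N):
    row = db.execute("SELECT v FROM kv WHERE k = 'a'").fetchone()
per_query = (time.perf_counter() - start) / N

print(row)
print(f"{per_query * 1e6:.1f} us per query")  # often single-digit microseconds
```

Compare that with even a loopback TCP round trip to a separate database process, which is typically tens to hundreds of microseconds before the query itself runs.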

[+] Sammi|1 year ago|reply
If I'm running on a single machine then sqlite comes out ahead of postgres/mysql. Sqlite has everything I need plus the simplicity and speed is superior. Sqlite can do terabytes of data, can handle multiple readers, has live streaming backup, and is a pretty well rounded sql implementation in general.

I would only consider postgres/mysql if I outgrew vertically scaling a single box.

[+] thomascountz|1 year ago|reply
I'm thinking that a lot of commenters are—understandably—feeling the need to defend the status quo in the name of availability and reliability since the author zeroed-in on latency, bandwidth, and spend.

My takeaway is not a debate against the merits of cloud in the face of trade offs, but against the necessity of—the now ubiquitous—cloud architecture pattern (and related lock-in). The "this-vs-that" is a rhetorical device to introduce an alternative. Of course which solutions are right for different use cases will depend on so many different things, and it's those things that keep engineers employed! :)

That said, we can put our engineering hats on to solve for SRE concerns within the pattern proposed by the author; I'm thinking the "what about availability when your one server goes down?!" is a straw man in the sense that we have different ways of solving the availability story than our Ubiquitous System, and of course the solution must depend on what's actually relevant.

[+] nostrebored|1 year ago|reply
How is it a strawman? It’s what anyone sane operating on premise or in the cloud would ask about this in design review.
[+] klabb3|1 year ago|reply
Here’s something you can do for a little extra step that offers a lot more on a frugal budget: put your API and SQLite db together. Ideally, the API uses a low-overhead binary serialization format and persistent conns. Then you can use edge stuff from whatever VC firm that is currently burning money and enjoy the free tier (currently though, Cloudflare workers is really generous - and free egress).

The trick is that SQLite, even single-threaded, can handle a massive amount of qps. So you can pump large amounts of ops through in serial, which is trivial to reason about and allows for in-memory caching on the API side with trivial invalidation. And by still keeping your web serving separate, you can enjoy edge performance for handshakes, and skip the db altogether for static pages. (The article underestimates the RTT issue - it’s very real - real-world apps need more round trips than you think.)

Most of the CPU cost outside the db comes from parsing, deserializing, copying data, and TLS (both the number of conns and encrypting data). By offloading the big chunks of these you can easily get 10s of thousands of writes/s on an entry-level machine. Reads are faster.
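A minimal sketch of that single-writer pattern: every write funnels through one queue into the one thread that owns the SQLite connection, so operations apply in a well-defined serial order (names here are illustrative, not from the comment):

```python
import queue, sqlite3, threading

writes = queue.Queue()
result = {}

def writer():
    # One thread owns the connection; inserts apply in arrival order,
    # which makes cache invalidation on the API side trivial.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
    while (item := writes.get()) is not None:
        db.execute("INSERT INTO events (body) VALUES (?)", (item,))
    db.commit()
    result["count"] = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]

t = threading.Thread(target=writer)
t.start()
for i in range(1000):
    writes.put(f"event-{i}")
writes.put(None)  # sentinel: drain the queue, then stop
t.join()
print(result["count"])  # 1000
```

Batching many inserts per transaction, as sketched here, is also where SQLite's write throughput claims come from; one transaction per insert is far slower because of fsync.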

That said, I think it’s always worth benchmarking the typical bottlenecks especially io. Providers are lying and misleading a lot, so just run your own on their free tier. Just make sure to have some integration test/bench ready in case you need to switch up.

[+] mg|1 year ago|reply

    You need to be on the edge, they say.
    Be close to your users. Minimize latency.
How much of an issue is latency in practice?

Here is an example. My book recommendation project Gnooks, which I run on a server in Germany:

https://www.gnooks.com

Does it feel too slow for anybody?

Over the last years, I have gotten many thousands of suggestions from users on this project. Yet, as far as I can remember, nobody has ever raised the topic of latency - even though the largest group of users is from the USA.

[+] zokier|1 year ago|reply
The article does not even mention availability. A service running on a single box will have downtime, both planned and unplanned. And you should think about RPO/RTO too: when (not if!) the box blows up, how long does it take to recover, and how much data will you lose?

> No matter how you design your site, SPA, SSR, some hybrid in between you can’t get around that if there is at least one database query involved in rendering your page, you have to go back to your database in us-east-1.

https://aws.amazon.com/rds/aurora/global-database/

https://aws.amazon.com/dynamodb/global-tables/

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Conce...

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/...

And so on.. Have fun kludging together the equivalents yourself.

[+] floating-io|1 year ago|reply
Availability is overblown these days IMO.

Ten or twenty years ago I would have agreed with you: the internet was new, and when the site went down, people blamed the site, and things got nasty fast. Nowadays, though?

People are much more likely to blame their internet provider first, or just plain try again later. They're used to the unreliable nature of the internet now.

Unless you're Google, but how many businesses are Google?

As someone else on this thread said, simpler systems are less likely to fail anyway. Keep backups, replicate your DB to a cold DR site if you're truly worried. That will handle the majority of non-FAANG businesses in the majority of non-FAANG cases.

And if that failure occurs? So long as you don't lose data, it will be a short-lived blip for most businesses. Even being down for a few days can be weathered if you have good PR, provided that it's a once in a blue moon sort of event. For that matter, data losses can be weathered in many cases.

What was that saying? Each nine of reliability doubles the cost or some such?

That needs to be accounted for when determining the ROI. How reliable does your business actually need to be?

That depends on your target market. The default answer from IT, however, is an automatic "100%!". This is wrong, IMO.

[+] simfree|1 year ago|reply
Running on bare metal nets you serious latency reductions and I/O bandwidth increases compared to the products you linked to, and you have significantly less complexity in a single- or dual-server model. That is why operations like Let's Encrypt chose this dual physical server architecture to power their global services at web scale: https://letsencrypt.org/2021/01/21/next-gen-database-servers...

Very low internal latency from an on-server or directly adjacent database lets your software run queries orders of magnitude faster and spend fewer resources per client served, resulting in a speedier, snappier experience for your end users than any of these managed database services can provide in practice.

[+] chmod775|1 year ago|reply
> The article does not even mention availability. Service running on a single box will have downtime.

So will your service running on AWS or equivalent. Complexity brings its own pitfalls and I don't think there's a web service in existence that never screwed this up.

That is if it's not AWS itself having a screw-up.

[+] Sammi|1 year ago|reply
You will have major availability problems on AWS.

VPS is not worse off here.

[+] Fannon|1 year ago|reply
A lot of good arguments which I think are often overlooked. "We need to go into the cloud" often feels more like a peer-pressure thing. It might be difficult to get a Hetzner server through your company's ordering processes, compared to self-service renting something from AWS or the like (which may be an equally big reason why they're so successful?).

But the article completely misses out on the aspect of availability, which would be quite risky with a single server.

Why not mix & match smartly? A CDN is already "the cloud". Host your static content (fully pre-generated) in the cloud and only fall back to dynamically generated/rendered responses when you really have to. Then you get good latency, high availability, and still low costs.

[+] bcaxis|1 year ago|reply
I think it's more an architecture question. If one box works, stop building on services that force horizontal thinking, and pricing, from the get go.

You can solve one box availability with box 2 (hot backup) - all within the same architecture and price structure.

[+] aragilar|1 year ago|reply
While I somewhat agree with the premise, "Plop one in Virginia and you can get the English speaking world in under 100ms of latency." is just false (unless you only count the US+UK as "English speaking", ignoring the rest of the Commonwealth...).
[+] Turing_Machine|1 year ago|reply
Well, there's Canada, Ireland, and the English-speaking islands of the Caribbean, also.

The equatorial African nations aren't that much farther away than Great Britain.

Of course that does still leave out Southern Africa, India and Australia.

[+] MichaelRo|1 year ago|reply
It's implied from context that he means "US-English" :)
[+] erik_seaberg|1 year ago|reply
I like the power that we see with cheap commodity hardware, but we mustn't forget that it's crap. He talks about Litestream as a continuous backup but doesn't get into how long it takes to bring up a hot spare and whether he's automated that. I'm glad to see Litestream has synchronous replication on their roadmap, but in the meantime writes can be irretrievably lost and I don't know whether transaction boundaries are preserved.
[+] bcaxis|1 year ago|reply
> I like the power that we see with cheap commodity hardware, but we mustn't forget that it's crap.

Google grew up (early days) using cheap white box consumer PCs while "best practice" was expensive server boxes.

It's a tried and true method of budget hosting.

The mini PC world is exploding and they make for a solid low cost, low power server platform.

It also makes having hot and cold spares cheap and easy.

[+] amne|1 year ago|reply
This is all good advice .. for your hobby project. But the moment that server goes down and you have stakeholders draining your phone's battery you'll realize that scaling also brings availability as a very nice bonus. Sure, 99% of the time it looks like you're wasting resources. Until you need it and you do the math and realize you're way ahead in revenue not lost.
[+] sroussey|1 year ago|reply
My experience over the decades is an insider dirty secret: downtime doesn’t really affect revenue.
[+] jiggawatts|1 year ago|reply
I used to think this way until I noticed that every clustering technology I had ever used had caused more outages than it had prevented.
[+] bcaxis|1 year ago|reply
Start by building a business that isn't differentiated by how many 9s you have. Something customers want so badly that a few hours of inconvenient downtime doesn't move the needle at all.

In this situation, blowing up your system complexity to maybe get another 9 makes no sense. Then the revenue change is pretty irrelevant for modest downtime.

People underestimate single-server uptime. If availability is really that important, buy a hot backup. Put it in another region. Done.

[+] Sammi|1 year ago|reply
AWS has major availability issues. VPS is not worse off here. If anything it is easier to debug.
[+] groggo|1 year ago|reply
I'm scared of running stuff on servers.

Jk. But kinda. I have a little hobby website. Mostly for fun, I keep rewriting it in different languages, stacks, and deployment methods. At one point it was on a $5 AWS server. It continuously ran out of memory and crashed. Then it crashed for some other reason, I think because it ran out of disc space from log files. Then the postgres database got encrypted for ransom because I stupidly left it open without a password.

So now I use things like Fly.io and Firebase. And they work great for hobby stuff. I'd like for my projects to grow, and this article makes a good case why I should be competent enough to run them on a server myself.

At work I help run a much larger website that we run with K8s and a managed db. The idea of directly running that on servers seems equally daunting.

But I know it shouldn't be that way. Thanks for the reminder.

[+] Brian_K_White|1 year ago|reply
There are people with a specific huge vested interest in making sure you always feel that way. So that alone is a good reason to try to resist that feeling.

Not just for the obvious reason to grow your abilities, because you can say exactly the same about everything and you can't become expert in everything.

But simply because someone makes a lot of money off of you if you feel like you need them, and they are big enough to do many indirect things - to change the entire environment so that you feel you need them, never even question it, and suffer what is essentially ostracisation if you ever do question it.

Making those baby sysadmin mistakes is perfectly fine. Everyone must make them. Are you ever going to forget to secure access to a db after that? That is not only one specific config but an entire class of problem that you are alert to now.

Not only is it ok because it's low stakes and then you know better for high stakes at work after that, but really even at work it should be normal to suffer a breakage once in a while, because at work it's even more important that you know how to recover when it inevitably happens anyway even without making mistakes. I don't mean break things on purpose, I just mean if you never suffer a problem, you never become prepared to deal with a problem. That is not good.

Besides, no matter what you still suffer equivalent problems, cloud or no cloud. What's the difference between your db going down from a hardware fault or misconfig, or a cloud account getting killed because of a billing or tos error?

Also, k8s actually makes an otherwise manageable system into a daunting one.

All in all, more and better reasons to be brave than to be afraid.

[+] neurostimulant|1 year ago|reply
What doesn't kill you makes you stronger. You apply what you learn to make your next server setup better. Your db got encrypted by ransomware? Then for your next setup you'll make sure to configure the database to only listen for local network connections, and have daily offsite backups. Out of disk space due to huge log files? Then your next setup will have automatically rotated log files. Your apps died without you noticing? Then your next setup will feature health checks and alerting. Repeat this enough times and you'll eventually have a bulletproof setup.

Of course, not wanting to go through all of this is a valid choice too, especially if you have no interest in running your own services.

[+] jdub|1 year ago|reply
> Plop one in Virginia and you can get the English speaking world in under 100ms of latency.

Never start a fist fight with an Australian. Let alone 26 million of us at the same time.

[+] akdor1154|1 year ago|reply
Well he gets a 150ms reaction time advantage to the incoming punches so he might be ok.
[+] bombcar|1 year ago|reply
I think the standard response is that Aussies certainly speak something, but that it’s arguable if it’s English or not. ;)
[+] aragilar|1 year ago|reply
Or the kiwis!
[+] newzisforsukas|1 year ago|reply
Meanwhile, I was just served a 503
[+] iainmerrick|1 year ago|reply
Yeah, might be an idea to move this site to S3 so it works.
[+] ssl-3|1 year ago|reply
Didn't we already do all of this 30-ish years ago?

And I'm not trying to suggest that we haven't learned a lot since then, but:

There was a time when the corpo web server/Internet computer existed as a pizza-box Sun Microsystems machine on someone's desk.

And sure, bandwidth was a lot less than 11MB/sec back then, but that's not a stretch at all for today's modern equivalent to that expensive pizza box.

But I'm old, and man do I sure as fuck remember the Web being a generally-unreliable turd back then. It was common for a website to not work today, or for an ISP's solitary email server to be down for a week or more.

It sure felt more real (and I even built a couple of those email servers), but it was also obviously very broken some of the time in ways that people don't generally accept today -- especially with 400 million visits per month, which was largely unfathomable at the time.

[+] inopinatus|1 year ago|reply
Reliability steps were made when we updated the “sparcstation 1s in a university lab” hosting architecture to what marketing called a “hot-swappable blade server with integral UPS” i.e. a bunch of second-hand thinkpads on a rack shelf above the Livingston Portmaster
[+] chefandy|1 year ago|reply
Like many others, folks around here are often blind to nostalgia's impact on their perception of 'good ol' days' tech. Just because something was simpler to write or administer doesn't mean it was better for users - and there are a lot more users now.
[+] graemep|1 year ago|reply
A single server in a data centre now is a lot more reliable than a machine sitting on someone's desk back then. The location probably makes the biggest difference - reliable network connections.
[+] redundantly|1 year ago|reply
That was a pretty good read. The edging reference made me chuckle.
[+] nostrebored|1 year ago|reply
A lot of thoughts —

Most businesses are not running a SPA which only makes a database call. Most infrastructure I’ve seen is not web-facing. Managing this from a consistent place reduces operations overhead.

The idea that you’re SSRing everything instead of making a few API calls is strange to me. Most businesses in the top 1000 will have optimized for caching when at all possible. They’re also not paying sticker price to CDNs.

Offloading work to CDNs comes with inherent operational benefit.

This all seems great until you have your infra fail. Stories of unrecoverable outages abound. Even if you run in this architecture, you should always always always be using an external party for backup and log retention.

More of the above… this doesn’t interact well with the real world, but everything old is new again and many new CTOs /founding engineers haven’t seen how infra breaks and how operations impact throughput.