
The 5-Hour CDN

417 points | robfig | 4 years ago | fly.io

85 comments

[+] simonw|4 years ago|reply
This article touches on "Request Coalescing" which is a super important concept - I've also seen this called "dog-pile prevention" in the past.

Varnish has this built in - good to see it's easy to configure with NGINX too.

One of my favourite caching proxy tricks is to run a cache with a very short timeout, but with dog-pile prevention baked in.

This can be amazing for protecting against sudden unexpected traffic spikes. Even a cache timeout of 5 seconds will provide robust protection against tens of thousands of hits per second, because request coalescing/dog-pile prevention will ensure that your CDN host only sends a request to the origin a maximum of once every five seconds.

I've used this on high traffic sites and seen it robustly absorb any amount of unauthenticated (hence no variety on a per-cookie basis) traffic.
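A minimal NGINX sketch of this micro-caching pattern (the cache path, zone name, and upstream name are illustrative):

```nginx
# Illustrative 5-second micro-cache with request coalescing.
proxy_cache_path /var/cache/nginx keys_zone=microcache:10m max_size=1g;

server {
    listen 80;

    location / {
        proxy_cache microcache;
        proxy_cache_valid 200 5s;        # cache successful responses for 5 seconds
        proxy_cache_lock on;             # coalesce concurrent misses into one origin fetch
        proxy_cache_lock_timeout 10s;    # how long waiters queue behind that fetch
        proxy_cache_use_stale updating;  # serve stale content while refreshing
        proxy_pass http://origin_backend;
    }
}
```

With `proxy_cache_lock on`, a burst of simultaneous misses for one URL results in a single upstream request; everyone else waits for (and shares) that response.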

[+] sleepy_keita|4 years ago|reply
Back when I was just getting started, we were doing a lot of WordPress stuff. A client contacted us, "oh yeah, later today we're probably going to have 1000x the traffic because of a popular promotion". I had no idea what to do so I thought, I'll just set the varnish cache to 1 second, that way WordPress will only get a maximum of 60 requests per minute. It worked pretty much flawlessly, and taught me a lot about the importance of request coalescing and how caches work.
[+] sciurus|4 years ago|reply
I'll echo what Simon said; we share some experiences here. There's a potential footgun, though, that anyone getting started with this should know about:

Request coalescing can be incredibly beneficial for cacheable content, but for uncacheable content you need to turn it off! Otherwise you'll cause your cache server to serialize requests to your backend for it. Let's imagine a piece of uncacheable content takes one second for your backend to generate. What happens if your users request it at a rate of twice a second? Those requests are going to start piling up, breaking page loads for your users while your backend servers sit idle.

If you are using Varnish, the hit-for-miss concept addresses this. However, it's easy to implement wrong when you start writing your own VCL. Be sure to read https://info.varnish-software.com/blog/hit-for-miss-and-why-... and related posts. My general answer to getting your VCL correct is writing tests, but this is a tricky behavior to validate.

I'm unsure how nginx's caching handles this, which would make me nervous about using the proxy_cache_lock directive for locations with a mix of cacheable and uncacheable content.
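For reference, the hit-for-miss idiom in Varnish VCL looks roughly like this (a sketch of the built-in behaviour, not production VCL; the uncacheability test is illustrative):

```vcl
sub vcl_backend_response {
    # If the backend marks the response uncacheable, remember that fact
    # ("hit-for-miss") for 2 minutes, so later requests for this object
    # bypass request coalescing instead of queueing behind each other.
    if (beresp.http.Set-Cookie ||
        beresp.http.Cache-Control ~ "(private|no-store)") {
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
        return (deliver);
    }
}
```

The key detail is `beresp.uncacheable = true` combined with a positive TTL: Varnish caches the *decision* not to cache, which is what lets subsequent requests for that object go to the backend in parallel rather than serially.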

[+] mnutt|4 years ago|reply
In varnish, if you have some requirements flexibility you can enable grace mode in order to serve stale responses but update from the origin, and avoid long requests every [5] seconds.

Not quite the same layer, but in node.js I’m a fan of the memoize(fn)->promise pattern where you wrap a promise-returning function to return the _same_ promise for any callers passing the same arguments. It’s a fairly simple caching mechanism that coalesces requests and the promise resolves/rejects for all callers at once.

[+] skunkworker|4 years ago|reply
I've implemented this manually in some golang web applications I've written. It really helps when you have an expensive cache-miss operation, as it can stack the specific requests so that once the original request is served, all of the stacked requests are served with the cached copy.
[+] cortesoft|4 years ago|reply
"Thundering herd" problem is how I have always heard it called.
[+] philsnow|4 years ago|reply
unrelated to CDNs but IIRC vitess did/does query coalescing too -- if it starts to serve a query for "select * from users where id = 123" and then another 20 connections all want the same query result, vitess doesn't send all 21 select queries to the backend, it sends the first one and then has all the connections wait on the backend response, then serves the same response to them all.
[+] anonymoushn|4 years ago|reply
Do you know if varnish's request coalescing allows it to send partial responses to every client? For example, if an origin server sends headers immediately then takes 10 minutes to send the response body at a constant rate, will every client have half of the response body after 5 minutes?

Thanks!

[+] dbbk|4 years ago|reply
Is this the same idea as `stale-while-revalidate`?
[+] jabo|4 years ago|reply
Love the level of detail that Fly's articles usually go into.

We have a distributed CDN-like feature in the hosted version of our open source search engine [1] - we call it our "Search Delivery Network". It works on the same principles, with the added nuance of also needing to replicate data over high-latency networks between data centers as far apart as São Paulo and Mumbai, for example. Brings with it another fun set of challenges to deal with! Hoping to write about it when bandwidth allows.

[1] https://cloud.typesense.org

[+] mrkurt|4 years ago|reply
I'd love to read about it.
[+] amirhirsch|4 years ago|reply
This is cool and informative and Kurt's writing is great:

The briny deeps are filled with undersea cables, crying out constantly to nearby ships: "drive through me"! Land isn't much better, as the old networkers shanty goes: "backhoe, backhoe, digging deep — make the backbone go to sleep".

[+] tptacek|4 years ago|reply
We can't take credit for the backhoe thing; that really is an old networking shanty.
[+] babelfish|4 years ago|reply
fly.io has a fantastic engineering blog. Has anyone used them as a customer (enterprise or otherwise) and have any thoughts?
[+] joshuakelly|4 years ago|reply
Yes, I'm using it. I deploy a TypeScript project that runs in a pretty straightforward node Dockerfile. The build just works - and it's smart too. If I don't have a Docker daemon locally, it creates a remote one and does some WireGuard magic. We don't have customers on this yet, but I'm actively sending demos and rely on it.

Hopefully I'll get to keep working on projects that can make use of it because it feels like a polished 2021 version of Heroku era dev experience to me. Also, full disclosure, Kurt tried to get me to use it in YC W20 - but I didn't listen really until over a year later.

[+] jbarham|4 years ago|reply
One of my side projects is a DNS hosting service, SlickDNS (https://www.slickdns.com/).

I moved my authoritative DNS name servers over to Fly a few months ago. After some initial teething issues with Fly's UDP support (which were quickly resolved) it's been smooth sailing.

The Fly UX via the flyctl command-line app is excellent, very Heroku-like. Only downside is it makes me mad when I have to fight the horrendous AWS tooling in my day job.

[+] mike_d|4 years ago|reply
I run my own worldwide anycast network and still end up deploying stuff to Fly because it is so much easier.

The folks who actually run the network for them are super clueful and basically the best in the industry.

[+] cgarvis|4 years ago|reply
just started to use them for an elixir/phoenix project. multi-region with distributed nodes just works. feels almost magical after all the aws work I've done the past few years.
[+] corobo|4 years ago|reply
I read their blogs and I visit their site every new project I start but it just hasn't clicked with me yet.

Tinkering has been great but the addon style pricing scares the jeebs out of me (my wallet), I just assume I can't afford it for now and spin up a DO droplet. The droplet is probably more expensive for my use case but call it ADHD tax haha, at least it's capped

[+] alopes|4 years ago|reply
I've used them in the past. All I can say is that the support was (and probably still is) fantastic.
[+] vmception|4 years ago|reply
>The term "CDN" ("content delivery network") conjures Google-scale companies managing huge racks of hardware, wrangling hundreds of gigabits per second. But CDNs are just web applications. That's not how we tend to think of them, but that's all they are. You can build a functional CDN on an 8-year-old laptop while you're sitting at a coffee shop.

huh yeah never thought about it

I blame how CDNs are advertised for the visual disconnect

[+] lupire|4 years ago|reply
It's misleading.

CDN software might be simple in the basic happy case, but you still need a Network of nodes to Deliver the Content.

[+] daniel_iversen|4 years ago|reply
Years ago I was involved with some high performance delivery of a bunch of newspapers, and we used Squid[1] quite well. One nice thing you could do as well (but it's probably a bit hacky and old school these days) was to "open up" only parts of the web page to be dynamic while the rest was cached (or have different cache rules for different page components)[2]. With some legacy apps (like some CMS') this can hugely improve performance while not sacrificing the dynamic and "fresh looking" parts of the website.

[1] http://www.squid-cache.org/ [2] https://en.wikipedia.org/wiki/Edge_Side_Includes

[+] 3np|4 years ago|reply
As someone who’s mostly clueless about BGP but has a fair grasp of all the other layers mentioned, I’d love to see posts like this go more in depth on it for folks like myself.
[+] youngtaff|4 years ago|reply
Some of the things they miss in the post are that Cloudflare uses a customised version of Nginx, and likewise Fastly for Varnish (don't know about Netlify and ATS)

Out of the box nginx doesn't support HTTP/2 prioritisation, so building a CDN with nginx doesn't mean you're going to be delivering as good a service as Cloudflare

Another major challenge with CDNs is peering and private backhaul, if you're not pushing major traffic then your customers aren't going to get the best peering with other carriers / ISPs…

[+] mike_d|4 years ago|reply
HTTP/2 prioritization is a lot of hype for a theoretical feature that yields little real-world performance benefit. When a client is rendering a page, it knows what it needs in what order to minimize blocking. The server doesn't.
[+] jusssi|4 years ago|reply
> 3. Be like a game server: Ping a bunch of servers and use the best. Downside: gotta own the client. Upside: doesn't matter, because you don't own the client.

"If you can run code on it, you can own it". Your front page could just be a tiny loader js that fires off a fetch() for a zero byte resource to all your mirrors, and then proceeds to load the content from the first responder.

[+] marcosdumay|4 years ago|reply
Now you just have the bad latency of the non-cached content, plus the ok latency of your CDN.
[+] cpascal|4 years ago|reply
> DNS: Run trick DNS servers that return specific server addresses based on IP geolocation. Downside: the Internet is moving away from geolocatable DNS source addresses. Upside: you can deploy it anywhere without help.

Can anyone expand on how/why "the Internet is moving away from geolocatable DNS source addresses"?

[+] mritzmann|4 years ago|reply
Some public/recursive DNS servers like Cloudflare (1.1.1.1) do not tell the authoritative DNS server the IP address or subnet of the requestor. Your ISP's DNS server usually does. This makes CDN-via-DNS more difficult, as it is not always clear where the request comes from (Cloudflare itself does not need this; they do everything with anycast).
[+] Rd6n6|4 years ago|reply
Sounds like a fun weekend project
[+] ksec|4 years ago|reply
It is strange that you put a time duration in front of CDN (content delivery network), because given all the recent incidents with Fastly, Akamai and Bunny, I read it as 5-hour Centralised Downtime Network.
[+] intricatedetail|4 years ago|reply
Does Nginx still not support cache invalidation? If you setup long TTL, is there a way to remove some files from cache without nuking entire cache and restarting an instance?
[+] 33degrees|4 years ago|reply
It's supported, but only for NGINX Plus. You can kind of work around it by using proxy_cache_bypass though
[+] cortesoft|4 years ago|reply
The hard part of building a CDN is not setting up an HTTP cache, it is setting up an HTTP cache that can serve thousands of different customers.
[+] mrkurt|4 years ago|reply
Making a service multitenant is more complex, yes. But many companies roll their own CDNs. There are lots of good reasons to do that, and it's a problem that can be reduced to a single developer for understanding.
[+] legrande|4 years ago|reply
I like to blog from the raw origin and not use CDNs, because if a blogpost is changed I have to manually purge the CDN cache, which can happen a lot. Also CDNs have the caveat that if they're down, page loads can become very slow while the browser tries to fetch the asset.
[+] tshaddox|4 years ago|reply
If you’re okay with every request having the latency all the way to your origin, you can have the CDN revalidate its cache on every request. Your origin can just check date_updated (or similar) on the blog post to know if the cache is still valid without needing to do any work to look up and render the whole post.

To further reduce load and latency to your origin, you can use stale-while-revalidate to allow the CDN to serve stale cache entries for some specified amount of time before requiring a trip to your origin to revalidate.

[+] raro11|4 years ago|reply
I set an s-maxage of at least a minute. Keeps my servers from being hugged to death while not having to invalidate manually.
[+] cortesoft|4 years ago|reply
You can fix this with proper cache headers
[+] parentheses|4 years ago|reply
Author has a great sense of humor. I love it!
[+] mbStavola|4 years ago|reply
Fly is great and I love reading their blog posts.

Just hoping they come back around on CockroachDB-- I feel like it's a match made in heaven for what they're providing.

[+] tptacek|4 years ago|reply
We love CockroachDB. There are people tinkering with it on Fly.io. I think anything formal would involve our companies talking to each other, which we're happy to do, but everybody is busy all the time. :)
[+] amelius|4 years ago|reply
Waiting for IPFS to shake this all up.