
Stop using low DNS TTLs

66 points| swills | 1 month ago |blog.apnic.net

45 comments


csense|29 days ago

DNS is something you rarely change, but it has costly consequences if you mess it up: it can bring down an entire domain and keep it down until the TTL passes.

If you set your TTL to an hour, the cost of DNS mistakes goes up dramatically: a problem you fix immediately still turns into an hour of downtime, and a problem that takes multiple attempts to fix turns into an hour of downtime per iteration.
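A minimal sketch of why a long TTL pins a bad answer (the cache class, names, and documentation-range addresses here are all hypothetical, not from the article):

```python
import time

class TTLCache:
    """Toy resolver cache: an answer stays pinned until its TTL expires."""

    def __init__(self):
        self._store = {}  # name -> (ip, expires_at)

    def put(self, name, ip, ttl, now=None):
        now = time.time() if now is None else now
        self._store[name] = (ip, now + ttl)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self._store.get(name)
        if entry and now < entry[1]:
            return entry[0]  # cached answer, even if the authority changed
        return None          # expired: the next lookup hits the authority

cache = TTLCache()
# A broken record is published with a one-hour TTL at t=0...
cache.put("www.example.com", "192.0.2.1", ttl=3600, now=0)
# ...the operator fixes it at t=60, but a resolver that cached the bad
# answer keeps serving it until the full hour elapses.
print(cache.get("www.example.com", now=120))   # stale answer still served
print(cache.get("www.example.com", now=3601))  # expired, must re-query
```

Fixing the record at the authority does nothing for resolvers that already cached it; only TTL expiry clears the bad answer.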

Setting a low TTL costs an extra packet and round trip per connection; that's too cheap to meter [1].

When I first started administering servers I set TTL high to try to be a good netizen. Then after several instances of having to wait a long time for DNS to update, I started setting TTL low. Theoretically it causes more friction and resource usage but in practice it really hasn't been noticeable to me.

[1] For the vast majority of companies / applications. I wouldn't be surprised to learn someone somewhere has some "weird" application where high TTL is critical to their functionality or unit economics but I would be very surprised if such applications were relevant to more than 5% of websites.

gertop|1 month ago

The irony here is that news.ycombinator.com has a 1 second TTL. One DNS query per page load and they don't care, yay!

a012|1 month ago

Joke's on them, because I use NextDNS with caching, so all TTLs are 3600s.

compumike|29 days ago

The big thing that articles like this miss completely is that we are no longer in the brief HTTP/1.0 era (1996) where every request is a new TCP connection (and therefore possibly a new DNS query).

In the HTTP/1.1 (1997) or HTTP/2 era, the TCP connection is made once and then stays open (Connection: Keep-Alive) for multiple requests. This greatly reduces the number of DNS lookups per HTTP request.

If the web server is configured for a sufficiently long Keep-Alive idle period, then this period is far more relevant than a short DNS TTL.

If the server dies or disconnects in the middle of a Keep-Alive, the client/browser will open a new connection, and at this point, a short DNS TTL can make sense.

(I have not investigated how this works with QUIC HTTP/3 over UDP: how often does the client/browser do a DNS lookup? But my suspicion is that it also does a DNS query only on the initial connection and then sends UDP packets to the same resolved IP address for the life of that connection, and so it behaves exactly like the TCP Keep-Alive case.)
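The connection-reuse argument above can be sketched with a toy model (the numbers and the one-lookup-per-connection assumption are illustrative, not measurements):

```python
# Toy model: each new TCP connection may trigger one DNS lookup, and a
# keep-alive connection serves many requests before it is torn down.
def dns_lookups(requests, requests_per_connection):
    # Ceiling division: one lookup per connection opened.
    return -(-requests // requests_per_connection)

print(dns_lookups(100, 1))   # HTTP/1.0 style: a lookup per request
print(dns_lookups(100, 50))  # keep-alive: lookups drop with connection reuse
```

Under this model, the DNS TTL only matters at the moments a fresh connection is opened, which is exactly the point being made.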

hannasm|29 days ago

  > patched an Encrypted DNS Server to store the original TTL of a response, defined as the minimum TTL of its records, for each incoming query

The article appears to be based on captured live DNS data from a real network. While it may be true that persistent connections reduce the number of DNS lookups, the article's measurements already seem to account for that, unless their network is somehow only using HTTP/1.0.

I agree that a low TTL could help during an outage if you actually wanted to move your workload somewhere else, and I didn't see that mentioned in the article. But I've never actually seen it done in practice; setting the TTL extremely low for some sort of extreme DR scenario smells like an anti-pattern to me.

Consider the counterpoint: a high TTL can keep your service reachable if your DNS server crashes or loses connectivity.

GuinansEyebrows|1 month ago

i was taught this as a matter of professional courtesy in my first job working for an ISP that did DNS hosting and ran its own DNS servers (15+ years ago). if you have a cutover scheduled, lower the TTL at $cutover_time - $current_ttl. then bring the TTL back up within a day or two in order to minimize DNS chatter. simple!

of course, as internet speeds increase and resources are cheaper to abuse, people lose sight of the downstream impacts of impatience and poor planning.
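The timing rule above ("lower the TTL at $cutover_time - $current_ttl") can be sketched like this (the window values are hypothetical examples, not from the comment):

```python
from datetime import datetime, timedelta

def ttl_lowering_deadline(cutover_time, current_ttl_seconds):
    """Latest moment to lower the TTL: any resolver that cached the old
    long-TTL answer just before this point will have expired it by cutover."""
    return cutover_time - timedelta(seconds=current_ttl_seconds)

# Hypothetical maintenance window for a record currently carrying a 24h TTL:
cutover = datetime(2024, 6, 1, 2, 0)
print(ttl_lowering_deadline(cutover, 86400))  # lower the TTL a full day ahead
```

After the cutover settles, the TTL goes back up, as the comment says, to keep resolver chatter down.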

tracker1|1 month ago

I usually set mine to between an hour and a day, unless I'm planning to update/change them "soon" ... though I've been meaning to go from a /29 to /28 on my main server for a while, just been putting off switching all the domains/addresses over.

Maybe this weekend I'll finally get the energy up to just do it.

Neywiny|1 month ago

I guess I'm not sure I understand the solution. I use a low value (idk 15 minutes maybe?) because I don't have a static IP and I don't want that to cause issues. It's just me to my home server, so I'm not adding noticeable traffic like a real company or something, but what am I supposed to do? Is there a way for me to send an update such that all online caches get updated without needing to wait for them to time out?

viraptor|1 month ago

For a private server with not many users this is mostly irrelevant. Use low ttl if you want to, since you're putting basically 0 load on the DNS system.

> such that all online caches get updated

There's no such thing. Apart from millions of dedicated caching servers, each end device has its own cache. You can't invalidate DNS entries at that scope.

zamadatix|1 month ago

I used to get more excited about this, but even when browsers don't do a DNS prefetch (or a complete preload), lookup latency is usually so low on the list of performance-impacting design decisions that it's unlikely to ever outweigh even the slightest advantages (or be worth correcting misperceived advantages) until we all switch to writing really, really, REALLY optimized web solutions.

garciasn|1 month ago

Could it be that folks set it low for initial propagation and then never change it back after setup?

fukawi2|1 month ago

That's not how TTL works. Or do you mean propagation after changing an existing RR?

It's "common" to lower a TTL in preparation for a change to an existing RR, but you need to make sure you lower it at least as long as the current TTL prior to the change. Keeping the TTL low after the change isn't beneficial unless you're planning for the possibility of reverting the change.

A low TTL on a new record will not speed up propagation. Resolvers either have the new record cached or they don't. If it's cached, the TTL doesn't matter because the resolver already has the record (it has propagated). If it isn't cached, the resolver doesn't know the TTL yet, so it doesn't matter whether it's 1 second or 1 month.

deceptionatd|1 month ago

Maybe, but I don't think TTL matters for speed of initial propagation. I do set it low when I first configure a website so I don't have to wait hours to correct a mistake I might not have noticed.

deceptionatd|1 month ago

I have mine set low on some records because I want to be able to change the IP associated with specific RTMP endpoints if a provider goes down. The client software doesn't use multiple A records even if I provide them, so I can't use that approach; and I don't always have remote admin access to the systems in question, so I can't just use straight IPs or a hosts file.

nubinetwork|29 days ago

> However, no one moving to a new infrastructure is going to expect clients to use the new DNS records within 1 minute, 5 minutes or 15 minutes

When you run a website that receives new POSTed information every 60 seconds, you sure do. ;)

arter45|29 days ago

Meaning some kind of API that is periodically polled?

bjourne|1 month ago

I don't understand why the author doesn't consider load balancing and failover legitimate use cases for low TTL. Because it wrecks their argument?

kevincox|1 month ago

Because unless your TTL is exceptionally long you will almost always have a sufficient supply of new users to balance. Basically you almost never need to move old users to a new target for balancing reasons. The natural churn of users over time is sufficient to deal with that.

Failover is different and more of a concern, especially if the client doesn't respect multiple returned IPs.

BitPirate|1 month ago

Why do you need a low TTL for those? You can add multiple IPs to your A/AAAA records for very basic load balancing. And DNS is a pretty bad idea for any kind of failover: you can set a very low TTL, but providers might simply enforce a larger one.
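A sketch of what "very basic load balancing" via multiple address records looks like from a client that honors them (the addresses are documentation-range placeholders, and real clients vary in how they pick):

```python
import itertools

# Hypothetical set of A records returned for one name; a client that honors
# multiple addresses can spread load, or fail over to the next address,
# without any DNS TTL tricks.
a_records = ["192.0.2.10", "192.0.2.11", "198.51.100.7"]

rr = itertools.cycle(a_records)  # naive round-robin client behaviour
picks = [next(rr) for _ in range(6)]
print(picks)  # each address chosen twice across six connections
```

The catch, as noted elsewhere in the thread, is that not all client software actually uses more than the first address returned.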

Bender|1 month ago

Perhaps because most these days are using Anycast [1] for failover. It's faster, and it isn't subject to all the oddities that come with every application having its own interpretation of the DNS RFCs (most notably Java and all its workarounds, which people may or may not be using) and with all the assorted recursive cache servers that have quirks of their own. That makes Anycast a more reliable and predictable choice.

[1] - https://en.wikipedia.org/wiki/Anycast

c45y|1 month ago

Probably an expectation of floating IPs for load balancing instead of DNS.

That's relatively simple inside a network range you control, but I have no idea how it works across different networks in geographically redundant setups.

jurschreuder|1 month ago

It's because updating DNS does not work reliably, so it's always a lot of trial and error, which you can only see after the caches update.

joelthelion|29 days ago

Could you make your changes with a low TTL and switch to a longer one once you are satisfied with the results?

ece|29 days ago

I've never changed the default; Squarespace, GoDaddy, Cloudflare, and Porkbun have all been an hour or so.

effnorwood|1 month ago

Sometimes they need to be low if you use the values to send messages to people.

UltraSane|29 days ago

DNS TTLs are a terrible method of directing traffic because you can't rely on clients to honor it.