
Stop using low DNS TTLs

318 points | fanf2 | 6 years ago | 00f.net

143 comments

[+] teddyh|6 years ago|reply
I operate authoritative name servers for almost 10,000 domains. Originally, I used a default TTL of 2 days, as recommended by RIPE-203¹ (which is also compatible with the recommendations of RFC 1912²), but this was not accepted by users, who didn’t want to wait two days. Therefore, for all records except SOA and NS records, I changed the default TTL to one hour, which I still use as the default value unless a change is scheduled or planned, in which case I lower it to 5 minutes. I do not want to lower it any more, as I’ve heard rumors of buggy resolvers interpreting “too low” TTLs as bad and reverting to some very high default TTL, thereby wrecking my carefully planned DNS changeover. I have, however, not seen any real numbers or good references on what numbers are “too low”, and would like to hear from anyone who might have some information on this.

1. https://www.ripe.net/publications/docs/ripe-203

2. https://tools.ietf.org/html/rfc1912#page-4

[+] onefuncman|6 years ago|reply
Unless you have insight into the end users' DNS deployments, I would say this is the appropriate amount of caution to apply. Besides TTLs simply being low, a frequent issue I had when first migrating to AWS years ago was CNAME-to-CNAME records not resolving for some end users. Primary schools were the worst offenders; I assume some of them still have Novell deployed.
[+] jedberg|6 years ago|reply
The irony of all of this is that those TTLs are almost meaningless as a server operator anyway. Even if you set your TTL to 5 minutes, there are a whole lot of clients that will ignore it.

When I made a DNS switch at reddit, even with a 5 minute TTL, it still took an hour for 80% of the traffic to shift. After a week, only 95% had shifted. After two weeks we still had 1% of traffic going to the old IP.

And after a month there was still some traffic at the old endpoint. At some point I just shut off the old endpoint with active traffic (mostly scrapers with hard coded IPs at that point as far as I could tell).

One of my friends who ran an ISP in Alaska told me that they would ignore all TTLs and set them all to 7 days because they didn't have enough bandwidth to make all the DNS queries to the lower 48.

So yeah, set your TTL to 40 hours. It won't matter anyway. In an emergency, you'll need something other than DNS to rapidly shift your traffic (like a routed IP where you can change the router configs).

[+] pdkl95|6 years ago|reply
> Why are DNS records set with such low TTLs?

The author seems to be missing one of the big reasons ridiculously low TTLs are used: they let passive eavesdroppers discover a good approximation of your browsing history. Passive logging of HTTP has (fortunately) been hindered as most traffic moved to HTTPS, but DNS is still plaintext.

Low TTLs mean a new DNS request happens (approximately) every time someone clicks a link. Seeing which domain names someone is interacting with every 60s (or less!) is enough to build a very detailed pattern-of-life[1]. Remember, it's probably not just one domain name per click; the set of domain names that are requested to fetch the js/css/images/etc for each page can easily fingerprint specific activities within a domain.

Yes, TTLs need to have some kind of saner minimum. Even more important is moving to an encrypted protocol. Unfortunately DOH doesn't solve this problem[2]; it just moves the passive eavesdropping problem to a different upstream server (e.g. Cloudflare). The real solution is an encrypted protocol that allows everyone to do the recursive resolution locally[3].

[1] https://en.wikipedia.org/wiki/Pattern-of-life_analysis

[2] https://news.ycombinator.com/item?id=21110296

[3] https://news.ycombinator.com/item?id=21348328

[+] mike_d|6 years ago|reply
> The author seem to be missing one of the big reasons ridiculously low TTLs are used: it lets passive eavesdroppers discover a good approximation of your browsing history.

I operate DNS for hundreds of thousands of domains. I've tried to reassemble browsing history from DNS logs, and I can tell you it is damn near impossible. You have DNS caches in the browser, the OS, broadband routers, and ISPs/public resolvers to account for - and half of them don't respect TTLs anyways.

The reason people set low TTLs is they don't want to wait around for things to expire when they want to make a change. DNS operators encourage low TTLs because it appears broken to the user when they make a change and "it doesn't work" for anywhere from a few hours to a few days.

[+] rgbrenner|6 years ago|reply
HTTPS does not hinder that type of tracking. In fact, using SNI (which is unencrypted) is more accurate than trying to do it with DNS, since it's sent with every request.
[+] mike_hock|6 years ago|reply
I'm sure the destination IP:443 tells about as accurate a story as the DNS lookups?
[+] rgbrenner|6 years ago|reply
I seem to remember a paper a few years ago that (IIRC) tested this by setting a very low TTL (like 60), changing the value, and seeing how long they continued to receive requests at the old value... and most updated within the TTL, but some took up to (I want to say) an hour. I'm probably getting bits of this wrong though.

I did find this paper: https://labs.ripe.net/Members/giovane_moura/dns-ttl-violatio...

The important violations in that paper are those that increased the TTL. Reducing the TTL increases costs for the DNS provider, but isn't important here. The slowest update was about 2 hours (with the TTL set to 333).

Of those that violated the TTL, we don't know what portion would function correctly with a different TTL (increasing the TTL indicates they're already not following spec). So I wouldn't assume that increasing the TTL would get them to abide by your requested TTL. They're following their own rules, and those could be anything.

Considering how common low TTLs are... you're worrying about a DNS server that's already potentially causing errors for major well known websites.

[+] belorn|6 years ago|reply
It is important to note that this study used active probes asking selected recursive resolvers around the world.

From my own experience when changing records and seeing when the long tail of clients stops calling the old addresses (with the name), it is a really long tail. An extreme example that lasted almost six months was a web spider that just refused to update their DNS records and continued to request websites using the old addresses.

Is there a lot of custom-written code that does its own DNS caching? Yes. Another example is internal DNS servers that shadow external DNS. There is a lot of very old DNS software running year after year. Occasionally at work we stumble onto servers running DNS code clearly handwritten a few decades ago by people with only a vague idea of what the RFCs actually say. Those are not public resolvers of major ISPs, so the above study would not catch them.

Naturally, if you have a public resolver where people are constantly accessing common sites with low TTLs, issues would crop up quickly and the support cost would get the resolver fixed. If it's an internal resolver inside a company where non-work sites are blocked, you might not notice until the company moves to a new web hosting solution and suddenly all employees can't access the new site. An hour later they call the public DNS hosting provider, the provider diagnoses the issue as internal to the customer's network, and finally, several hours later, the faulty resolver gets fixed.

[+] tzs|6 years ago|reply
> Of course, a service can switch to a new cloud provider, a new server, a new network, requiring clients to use up-to-date DNS records. And having reasonably low TTLs helps make the transition friction-free. However, no one moving to a new infrastructure is going to expect clients to use the new DNS records within 1 minute, 5 minutes or 15 minutes. Setting a minimum TTL of 40 minutes instead of 5 minutes is not going to prevent users from accessing the service.

Note that you can still get the benefit of a low TTL during a planned switch to a new cloud provider, server, or network even if you run with a high TTL normally. You just have to lower it as you approach the switch.

For example, let's say you normally run with a TTL of 24 hours. 25 hours before you are going to throw the switch on the provider change, change the TTL to 1 hour. 61 minutes before the switch, change TTL to 1 minute.
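A minimal sketch of that step-down plan in Python (the cutover time, TTL steps, and 60-second safety margin are all made up for illustration): each new TTL has to be published at least one old-TTL before the cutover, so no cache still holds the old value when you throw the switch.

```python
from datetime import datetime, timedelta

def ttl_stepdown(cutover, steps=((24 * 3600, 3600), (3600, 60))):
    """Return (publish_at, new_ttl) pairs. Each new TTL must be
    published at least one old-TTL before the cutover (plus a 60 s
    safety margin here) so every cached copy of the old record has
    expired by the time you switch providers."""
    plan = []
    for old_ttl, new_ttl in steps:
        publish_at = cutover - timedelta(seconds=old_ttl + 60)
        plan.append((publish_at, new_ttl))
    return plan

for when, ttl in ttl_stepdown(datetime(2025, 6, 1, 12, 0)):
    print(when.isoformat(), "-> set TTL to", ttl, "s")
```

After the migration settles you'd raise the TTL back up the same way, just without the waiting.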

[+] devnulloverflow|6 years ago|reply
Wouldn't you be canarying your switch over a period of longer than 24 hours anyway?

I can still imagine a benefit to short TTLs in the sense that you can maybe roll out your canary in a more controlled way. But that's a lot more complicated than the issue of quick switching.

[+] zamadatix|6 years ago|reply
It would have been interesting to see actual delay rather than qualitative results of the nature "<x>% wasn't in cache so this is horrible!". Admins and users don't care whether it's in cache; they care what the impact on operations and load time is. https://www.dnsperf.com/dns-speed-benchmark shows lookup times of 20ms-40ms for my personal domain. Ironically, the same DNS test for 00f.net takes 100ms-150ms.

99% of apps will gladly trade a 30ms increase in session start (assuming the browser's prefetcher hasn't already beaten them to it) to not have to worry about things taking an hour to change. Not all efficiency is about how technically slick something is.

[+] belorn|6 years ago|reply
I just tested 00f.net and got numbers as low as 6ms. Latency is a question of the network path between the client and the server: unless you use anycast, you will get different latency depending on where in the world the client and server reside, and if you do use anycast, it depends on how good the anycast network's contracts and geographic spread are.
[+] Kudos|6 years ago|reply
> I’m not including “for failover” in that list. In today’s architectures, DNS is not used for failover any more.

I mean, my company does this for certain failure scenarios involving our CDNs. Can anyone tell me why we're idiots, or is this just hyperbole?

[+] lykr0n|6 years ago|reply
This is very common (Dynect, NS1, AWS, GCP, etc. all depend on this for monitoring and failover). The author is incorrect.

amazon.com, reddit.com, facebook.com, and others use low TTLs on their domains for this reason. Anyone who can't maintain an anycast infrastructure around the world and doesn't want to depend on Cloudflare will use this method.

[+] bristolianthrw|6 years ago|reply
For literally my entire career in SRE, well over a decade now, I've only interacted with systems that use DNS for this purpose, from small shops to parts of every HN reader's life. That sentence in your quote, and the assertive tone of a post built on such a weak foundation, would be enough to disqualify a hiring candidate for lack of experience, regardless of the software they presented. It simply does not align with reality when you have two logically separate networks and need a mechanism to transition between them.

The only other alternative for that scenario is using anycast addressing, and that has a colorful bag of limitations that are quite different from those of low-TTL DNS (including being out of reach for most).

[+] GordonS|6 years ago|reply
DNS failover is used extensively, especially in the cloud world.

I see nothing wrong with using low DNS TTLs for failover - really don't understand the author's objections here, and them claiming that "DNS is not used for failover any more" significantly discredits them, IMO.

[+] EB66|6 years ago|reply
> I mean, my company does this for certain failure scenarios involving our CDNs. Can anyone tell me why we're idiots, or is this just hyperbole?

I came here to say exactly that. Our company uses DNS entries with low TTL for failover and load balancing purposes as well -- it's a very common approach. Services like AWS Route 53 and CloudFlare make it very easy to setup and low cost. I was surprised that the author didn't give much acknowledgement to this type of usage.

[+] StreamBright|6 years ago|reply
Don't worry many companies use DNS for failover (including Amazon).
[+] rodgerd|6 years ago|reply
You aren't idiots if you're using it where there are no better alternatives - it's preferable to use load balancers etc where available, but there are places where it's very much "DNS or nothing".
[+] rini17|6 years ago|reply
That can be an acceptable price for minimizing the impact of an accidental DNS misconfiguration, which has probably happened to every sysadmin.

Or is there a better way to quickly invalidate DNS caches in case of emergency?

[+] bristolianthrw|6 years ago|reply
No, there isn’t. The specification as implemented requires no invalidation mechanism, which means no such mechanism across all caches exists, nor will it ever. The long tail kills you in such a failure scenario, and remember, people who make kitchen appliances write DNS resolvers.
[+] cortesoft|6 years ago|reply
Yeah, this was my first thought...I am guessing the author has never accidentally pushed out a bad DNS entry and needed to revert/update.

Everyone probably starts with higher TTLs, then the first time they mess up an update they switch to a short one.

The author seems to not appreciate how big a problem a misconfigured, long TTL DNS entry could be for someone.

Other reasons for short TTLs... maybe an IP gets blocked/flagged by a large network and they need to change fast... or a network path is slow and they need to move to a new location.

[+] vitalysh|6 years ago|reply
"The urban legend that DNS-based load balancing depends on TTLs (it doesn’t)"

So what's the solution? We are using AWS ALB/ELB, and the docs state that we should use a low TTL, which makes sense: servers behind the LB scale up and down. What is option B?

[+] sandinmyjoints|6 years ago|reply
In fact, if you use Route 53 with an alias to an ELB, the TTL is hard-coded at 60s -- it is not even configurable. If it were, we'd follow the practice of lowering it prior to changes and raising it again once things are stable, but as it is, that's not an option (moving DNS off AWS would be a hard sell; not because it's terribly hard, but as far as I'm concerned there's not really any value in doing it).
[+] jsizzle|6 years ago|reply
I would maintain that if you are experiencing poor performance for a web site, there are MUCH more fruitful places to look than DNS latency. Third party objects, excessive page sizes, lack of overall optimization based on device are just the tip of the iceberg.
[+] edoceo|6 years ago|reply
For many apps I've worked on, the DB connection setup was always the slow part (use PgBouncer). After that, the queries themselves. DNS, gzipped CSS/JS - that's chasing a red herring.
[+] vitus|6 years ago|reply
> Here’s another example of a low-TTL-CNAME+low-TTL-records situation, featuring a very popular name:

> $ drill detectportal.firefox.com @1.1.1.1

Is captive portal detection not a valid use case for low TTL? The entire point is to detect DNS hijacking of a known domain, which takes longer when you cache the DNS results...

[+] xfitm3|6 years ago|reply
I run authoritative DNS for a very busy domain - 30B queries per month. Originally we had 6 hour TTLs, but now I use 60s. We have had no problems. Uptime and fast failover comes before anything else.
[+] justinsaccount|6 years ago|reply
There was a DNS record, looked up primarily by large supercomputers, that had a 0 TTL. It was used for stats via a UDP packet (because that was non-blocking; never mind that the DNS query itself was blocking). The TTL was set to 0 for "failover", but the record hadn't changed in years. I worked out that our systems alone had caused billions of queries for this name.

After I complained I think they upped the TTL... to 60.

[+] zamadatix|6 years ago|reply
Reminds me of a server pair at the last healthcare place I worked. Between the two of them they'd generate something around 1,200 DNS lookups per second (about 60% of the load on the DNS servers) for their own name. I think the logic was: if the name stopped responding, server A was primary. If the name was responding, the server that owned the IP it resolved to was primary. If the servers wanted to swap primary/secondary, they would issue a DDNS request.

After about 8 years we were restructuring our DNS infrastructure for performance, and I rate-limited those two to 10 or so queries per second each. In that time there must have been 300 billion or so requests from those two boxes alone.
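If I understood the scheme right, the election logic might have looked roughly like this (a reconstruction in Python; the names, addresses, and return values are hypothetical, not the actual code):

```python
import socket

def current_primary(name, my_ip, peer_ip):
    """If the shared name does not resolve, server A is primary by
    default; otherwise whichever server owns the resolved address is
    primary. (Swapping roles would be a DDNS update of `name`.)"""
    try:
        ip = socket.gethostbyname(name)
    except socket.gaierror:
        return "A"        # name gone: fall back to the default primary
    if ip == my_ip:
        return "me"
    if ip == peer_ip:
        return "peer"
    return "unknown"      # someone else answered: investigate
```

Polling this in a tight loop is exactly what produced the 1,200 lookups/second; a heartbeat protocol between the two boxes would have done the same job without touching DNS at all.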

[+] vsviridov|6 years ago|reply
Just anecdotally, from running Pi-hole and looking at the logs, I have some sites being resolved 12K times over 11 days... that's over a thousand requests a day.
[+] jlgaddis|6 years ago|reply
Running

  $ echo min-cache-ttl=300 | \
      sudo tee /etc/dnsmasq.d/99-min-cache-ttl.conf
will likely cut down on the number of forwarded queries by a large amount. Adjust the value (in seconds) to your needs.

Don't forget to run

  $ sudo pihole restartdns
afterwards.
[+] tallanvor|6 years ago|reply
Probably ad or analytic sites? I know one app in particular where every other update seems to result in it sending a request per second to some blocked analytics site.
[+] dorset|6 years ago|reply
Can anyone explain why ping.ring.com needs to have such a low TTLs?

  =-=-=-=-=
  $ drill ping.ring.com @1.1.1.1
  ;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 36008
  ;; flags: qr rd ra ; QUERY: 1, ANSWER: 2, AUTHORITY: 1, ADDITIONAL: 0
  ;; QUESTION SECTION:
  ;; ping.ring.com. IN A
   
  ;; ANSWER SECTION:
  ping.ring.com. 3 IN CNAME iperf.ring.com.
  iperf.ring.com. 3 IN CNAME ap-southeast-2-iperf.ring.com.
   
  ;; AUTHORITY SECTION:
  ring.com. 573 IN SOA ns-385.awsdns-48.com. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
  =-=-=-=-=
I've been trying to find out from Ring support for a few days, and while the support layer has been trying to find out, not much information seems to be getting back. To put this in perspective: in my house with two Ring devices (a doorbell and a chime), I am getting 10,000+ uncached DNS requests a day, which is easily 20x more than the second most requested domain.
[+] zamadatix|6 years ago|reply
The command output has the answer: it's a CNAME to whatever random AWS instance happens to be up and running. They probably let the instances autoscale with load and don't guarantee they'll be around for any amount of time, and rather than configure an additional service for heartbeating they just used DNS.

There are caching nameservers that allow you to override the minimum TTL, but be aware that the device is likely relying on this record being immediately up to date and may not work during a change if an extended TTL is set.

[+] Twirrim|6 years ago|reply
That's actually a pretty long TTL by Amazon standards.

Amazon.com:

    ;; ANSWER SECTION:
    amazon.com.             60      IN      A       176.32.103.205
    amazon.com.             60      IN      A       176.32.98.166
    amazon.com.             60      IN      A       205.251.242.103
Or some AWS services

Glacier:

    ;; ANSWER SECTION:
    glacier.us-east-1.amazonaws.com. 60 IN  A       54.239.30.220
S3 has even shorter:

    ;; ANSWER SECTION:
    s3.ap-northeast-1.amazonaws.com. 5 IN   A       52.219.0.8
Or, say, DynamoDB:

    ;; ANSWER SECTION:
    dynamodb.us-east-1.amazonaws.com. 5 IN  A       52.94.2.72

The main reason to do so is to be nimble: to be able to react to incidents and make changes as fast as you can, and to make certain deployment patterns possible.

From time to time, you need to do something with customer-facing infrastructure: remove the DNS entry, watch the traffic drain over the next 5-10 minutes, do what you need to do on the device, test, and then add it back into DNS, at which point you can watch traffic return to normal levels and verify everything is good.
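A rough sketch of the "watch the traffic drain" step, as a hypothetical Python helper that polls resolution until the old address disappears from answers (in reality you'd watch actual traffic too, not just DNS):

```python
import socket
import time

def wait_for_drain(name, removed_ip, timeout=600, interval=30):
    """After deleting the record, poll until resolvers stop returning
    the old address (or the name stops resolving entirely); only then
    is it reasonably safe to take the old host down for maintenance."""
    deadline = time.time() + timeout
    while True:
        try:
            ips = {info[4][0] for info in socket.getaddrinfo(name, None)}
        except socket.gaierror:
            return True   # name no longer resolves at all
        if removed_ip not in ips:
            return True   # answers no longer include the old host
        if time.time() >= deadline:
            return False  # something is still handing out the old IP
        time.sleep(interval)
```

With a 60 s TTL the timeout can be minutes; with a multi-hour TTL this same pattern means a multi-hour drain window, which is exactly why these records are kept short.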

[+] megous|6 years ago|reply
Thankfully this is one of those things that you don't need to respect. TTLs are just suggested values in the end (the standard may disagree).

I just checked, and I actually have TTL forced to 1 day in dnscrypt-proxy. My internet experience is fine. I guess I never noticed in the last 2 years or so.

[+] gpm|6 years ago|reply
Why does DNS cache expiration need to be in the critical path?

Instead of a browser doing

1. Local DNS lookup (resulting in expired entry)

2. DNS query

3. DNS response

4. HTTP request

why not do

1. Local DNS lookup (resulting in expired entry)

2.1. DNS query

2.2. HTTP request

3. DNS response

4. If DNS response changed and HTTP request failed, HTTP request again

Maybe use two expiration lengths, one that results in flow 2 and a much longer one that results in flow 1.
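A sketch of that optimistic flow in asyncio (the names and stale-refresh policy are illustrative; step 4, retrying the HTTP request if the answer changed, is left out):

```python
import asyncio
import socket
import time

cache = {}  # name -> (ips, expires_at)

async def resolve(name):
    # The actual DNS round trip, done through the event loop's resolver.
    loop = asyncio.get_running_loop()
    infos = await loop.getaddrinfo(name, 443, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

async def lookup_optimistic(name, ttl=60):
    """On a cache hit past its TTL, return the stale answer immediately
    and refresh in the background, keeping the DNS round trip off the
    critical path of the HTTP request (flow 2.1/2.2 above)."""
    now = time.monotonic()
    hit = cache.get(name)
    if hit and now < hit[1]:
        return hit[0]                   # still fresh: use as-is
    if hit:
        async def refresh():
            ips = await resolve(name)
            cache[name] = (ips, time.monotonic() + ttl)
        asyncio.create_task(refresh())  # 2.1: DNS query in the background
        return hit[0]                   # 2.2: proceed with the stale IPs
    ips = await resolve(name)           # cold cache: must block this once
    cache[name] = (ips, now + ttl)
    return ips
```

This is essentially the "stale-while-revalidate" idea applied to DNS: only a cold cache ever blocks, and a second, longer expiry (your flow 1) would cap how stale an answer may get before blocking again.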

[+] dsp|6 years ago|reply
Yeah, this is roughly what the FB apps do. DNS rarely blocks, and changes are seen quickly.
[+] teddyh|6 years ago|reply
Probably because the gain in milliseconds is not worth the code complexity of executing in parallel.
[+] MayeulC|6 years ago|reply
Well, in my case it makes sense, I think: I host my server at home, and have a dynamic IPv4. I don't know when it could change, so I just set the TTL to something low.

Since the traffic is low, though, I can afford to check for an IP change every ~5 min, and although I set a TTL of ~15 min on most services, the main CNAME (an OVH-provided dynamic DNS service, TTL set by them) is set to 60s.

My IPv6 record was set to 1h, but I'll look into increasing it. My mobile phone often pings my server, so I imagine a longer TTL could reduce its battery usage.

[+] jimnotgym|6 years ago|reply
Please excuse any ignorant use of terminology, I am not a DNS expert like others on here, but I can share some experience in the smaller business world.

A company I worked with a couple of years ago was using Dyn as their DNS provider, and one day we got a notification that we had passed the usage limits for our account. This seemed impossible considering our site was getting a couple of hundred unique visitors a day. A few things came out of the analytics.

1) A short TTL on an A record had been left on from a website migration project. The majority of the requests were coming from our internal website administrators. I moved it up to a couple of hours and this went away.

2) We were getting a huge amount of AAAA record hits. I think most modern browsers/OSes try AAAA first. We didn't have IPv6 configured, and therefore the negative response had a TTL set by the minimum field on the SOA record, which was 1 second! Changing this to 60 caused a huge reduction in requests. I suppose I should have set up IPv6, but I didn't.

3) When we sent out stuff to our mailing list, the SPF (or rather TXT) records saw a peak that was off the chart. We had a pretty settled infrastructure, so I moved that TTL to a day (I think, from memory) and it flattened the peak somewhat.

4) There was a large peak in MX requests around 9am. I put this down to people opening their email when they got to work and replying to us. I had to set the TTL to a couple of days (of course) to smooth that one.

I like to think it was worthwhile and improved things for users. I at least had a nice warm glow that I had saved the internet from a bunch of junk requests, and it just felt tidier.
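For reference, the negative-caching behavior in point 2 comes from RFC 2308: resolvers cache an NXDOMAIN/NODATA answer (like a missing AAAA record) for the lesser of the SOA record's own TTL and its MINIMUM field. A one-liner makes the rule explicit:

```python
def negative_ttl(soa_ttl, soa_minimum):
    """Per RFC 2308, a negative answer is cached for the lesser of
    the SOA record's own TTL and its MINIMUM field."""
    return min(soa_ttl, soa_minimum)

print(negative_ttl(86400, 1))   # a 1 s MINIMUM: almost no negative caching
print(negative_ttl(86400, 60))  # raising it to 60 s cuts repeat queries
```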