Why DNS-based Global Server Load Balancing does not work [2004]

21 points | IgorPartola | 15 years ago | tenereillo.com

31 comments

[+] michaelcampbell|15 years ago|reply
If memory serves, Netflix users discovered that Akamai was using this strategy, and everyone coming in from Google's DNS or OpenDNS was getting put into the same 'pool', overloading the pipes in that pool. The recommendation was to use your ISP's DNS, at least for your Netflix devices.

Or have I misunderstood?

(The post should probably have had the date listed; I just realized this article is 7 years old. That doesn't make it wrong or bad, mind you...)

[+] davidu|15 years ago|reply
This isn't true for Google. Akamai has all kinds of archaic network issues that cause problems, but they aren't often OpenDNS related. They certainly do their best, and they know our network map well, since we send it to them whenever it changes.

For Google -- If you use OpenDNS, you will always go to the most appropriate Google datacenter to you (not OpenDNS).

[+] IgorPartola|15 years ago|reply
Fixed the title. The sad thing is that not much has been invented since then to aid with global load balancing.
[+] xtacy|15 years ago|reply
The main reason seems to be clients caching DNS lookups for longer than the TTL field in the DNS response allows. If that behaviour changed across all browsers, wouldn't that solve the problem?

Another "hack" to work around the caching problem would be to use randomized hostnames (say, server-$random.hostname.com), so that every request triggers a fresh lookup that cannot be cached. The tradeoff here is latency vs. availability.

As mentioned in the article, triangulation and backup redirection would work as long as "Site A" can be up serving requests labeled (1).
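The randomized-hostname "hack" above could be sketched roughly like this. This is a minimal illustration, not a real implementation: the base domain and the `server-$random` naming scheme are the hypothetical ones from the comment, and a wildcard DNS entry (`*.hostname.com`) is assumed on the server side.

```python
# Sketch of the cache-busting idea: build a unique hostname per request so
# no resolver or browser cache can serve a stale answer. The base domain is
# hypothetical; a wildcard record (*.hostname.com) is assumed to exist.
import random
import string

def cache_busting_hostname(base: str = "hostname.com") -> str:
    """Build a unique hostname like server-x7f3q2.hostname.com."""
    token = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
    return f"server-{token}.{base}"

name = cache_busting_hostname()
# Every call yields a fresh name, so resolving it (e.g. via
# socket.getaddrinfo(name, 80)) would always reach the authoritative
# server -- which is exactly the latency cost the comment mentions.
```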

[+] timdorr|15 years ago|reply
What about using Anycast IPs? http://en.wikipedia.org/wiki/Anycast
[+] jemfinch|15 years ago|reply
Routing changes break the network connection. If it's buffered and you can reconnect transparently, it might work, but for unbuffered connections (like most web connections are) you'll see increased error rates.
[+] IgorPartola|15 years ago|reply
I think one solution to this is to add a DNS record that tells the client what connection timeout to use for a given port on this IP before trying the next one. As it stands, there is no control over that, and each browser implements it differently.
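Client-side, the behaviour being described (try each address in order, with an explicit per-attempt timeout, before falling back to the next) could be sketched like this. The timeout would come from the hypothetical DNS record; here it is just a parameter, and the function name is made up for illustration.

```python
# Sketch of ordered failover across multiple A records with an explicit
# per-attempt timeout. Today, each browser hard-codes this behaviour;
# the proposal is to let DNS supply the timeout value.
import socket

def connect_with_failover(addresses, port, timeout=1.0):
    """Try each address in order; return a socket to the first that answers."""
    last_error = None
    for addr in addresses:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(timeout)  # the value a DNS record could supply
        try:
            sock.connect((addr, port))
            return sock  # first responsive host wins
        except OSError as exc:
            last_error = exc
            sock.close()
    raise ConnectionError(f"no address responded: {last_error}")
```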
[+] davidu|15 years ago|reply
SRV records could do this, but most browsers can't do their own DNS lookups, so they never see SRV records. Only Chrome can.
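For reference, SRV records (RFC 2782) already carry exactly the per-host preference data that plain A records lack: a priority, a weight, and a port. A hypothetical zone for the failover scenario discussed above might look like:

```
; Hypothetical zone entries. Clients must exhaust all priority-10 targets
; before trying priority 20; weight distributes load among equal priorities.
_http._tcp.example.com.  300 IN SRV 10 60 80 primary.example.com.
_http._tcp.example.com.  300 IN SRV 20 40 80 backup.example.com.
```

The catch, as the comment notes, is that browsers resolve names through the OS stub resolver and never query for SRV records for HTTP.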
[+] yesbabyyes|15 years ago|reply
Since the Amazon outage, I've been wondering about the best way to quickly switch to another host. Having multiple A records is so obvious I can't believe I have neither thought about it nor heard of it before (I've seen it implemented, but never thought about as a failover). I didn't know the browser would switch to the next IP. Interesting!

That said, can you control which host the browser will try first? I.e., can I have 1.1.1.1 as my main host and trust that all clients will connect to it as long as it's up, only connecting to my backup 2.2.2.2 when 1.1.1.1 doesn't respond?

[+] prakash|15 years ago|reply
> Since the Amazon outage, I've been wondering about the best way to quickly switch to another host.

This is exactly the problem we (Cedexis) solve.

Assume your site is www.website.com, which in general points to an A record, or a CNAME pointing at a datacenter/cloud/CDN.

We add an intermediate hostname with a low TTL (20 seconds) on a global anycast network, which can be scripted (write your load-balancing logic in PHP) to hand out one of many CNAMEs based on performance (RTT), load, cost, or anything else you can think of.

Re.: AWS's recent outage, assuming you are running your apps in multiple zones/regions/clouds, we would have noticed the latency and automatically routed traffic away to a different zone/region/cloud.

We collect hundreds of millions of performance measurements daily: http://gigaom.com/cloud/heres-what-amazon-outage-looked-like...

Drop an email and I am happy to explain more and setup folks from HN with a free account: prakash at cedexis.com
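The kind of scripted routing logic described above can be sketched in a few lines (in Python rather than PHP, purely for illustration). The provider names and numbers are made up; the real service would feed this from live measurements.

```python
# Hedged sketch of DNS-level load-balancing logic: given health and RTT
# measurements per provider, hand out the CNAME of the fastest healthy one.
# All providers and figures below are hypothetical.

def choose_cname(measurements):
    """Return the CNAME of the healthy provider with the lowest RTT (ms)."""
    healthy = {name: m for name, m in measurements.items() if m["up"]}
    if not healthy:
        raise RuntimeError("no healthy provider to route to")
    return min(healthy, key=lambda name: healthy[name]["rtt_ms"])

measurements = {
    "us-east.cdn-a.example.net.": {"rtt_ms": 42, "up": True},
    "us-west.cdn-b.example.net.": {"rtt_ms": 87, "up": True},
    "eu.cdn-c.example.net.":      {"rtt_ms": 18, "up": False},  # outage
}
target = choose_cname(measurements)  # fastest *healthy* provider: cdn-a
```

With a 20-second TTL on the intermediate hostname, a provider outage reroutes clients within roughly one TTL, which is the whole point of the scheme.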

[+] justincormack|15 years ago|reply
It should work, though, if you use elastic IPs across different AZs in Amazon, or if you use GeoIP and always hand out an IP for a nearby region. It's not clear whether that is the best solution for Amazon, though.
[+] mrcalzone|15 years ago|reply
I have been wondering about the same thing. Sadly, the article says:

>but returning multiple A records diminishes any possibility of deterministic site selection.

[+] mike_esspe|15 years ago|reply
I tried it; you can't control it, because some clients' DNS resolvers randomize the order of such records.