DNS Outage at DigitalOcean

120 points | finne | 10 years ago | status.digitalocean.com

121 comments

[+] tonylemesmer|10 years ago|reply
People hating on DO: "I'm losing thousands every hour". Well, then you should have had some failover in place if it's that valuable.

[1]https://twitter.com/rodrigoespinosa/status/71303563702097100...

[+] crisopolis|10 years ago|reply
I've been reading all the comments on Twitter too... like "err mai gawd I'm switching to AWS because of this". Not having a secondary DNS provider is your own failure, and I highly doubt you'd switch.

Then another... "Today's @digitalocean DNS #outage is a reminder to not trust your entire business to one provider. Spread the love around!"

If your company is e-commerce and makes money by being 99.99% available, it's your own fault for having no fail-over.

another... ".@digitalocean that's two hours without DNS now...my company's websites could be losing thousands of £ in e-commerce! Please, an update!"

[+] colinbartlett|10 years ago|reply
I can't disagree with what you're saying, but I think we are all guilty of this. We expect more out of big name services than might be reasonable. (100% uptime)

How many of us here have failover email services in case Gmail goes down? I think many companies would say they'd lose thousands in productivity if Google Apps suffers an outage yet I'd hazard that very few have failover plans.

[+] IgorPartola|10 years ago|reply
I use dns.he.net for my DNS hosting. It's free up to 50 zones and has been rock solid. The other day I started having some trouble with accessing some of my domain names. Turned out that all of their DNS was down and was returning NXDOMAIN for pretty much any request, including their own domains. Oops. So I emailed their support (which is usually very quick to respond and is better than I have seen with lots of paid products). Well, it then occurred to me that I will not get a response since the MX records for my domain were also hosted with them. Double oops.

On the plus side, in the past 4 years that I've used them this was the very first issue, and they fixed it within a couple of hours.

Anyone have any good recommendations on cheap or free backup DNS hosting?

[+] cleaver|10 years ago|reply
I had one customer on DO DNS and it was a "good enough" solution. Unfortunately, this came right in the middle of a marketing push for last-minute registrations. An annoyance, but not a major financial impact. (Maybe it will give the impression of excess demand. :)

I understand that things break and I should be ready for it. What I found unacceptable were the status updates. Basically, "we're working on it". No clue as to what was going on. A DDoS? Not a DDoS? Routing issues? Corrupt zone files? No clue? Any of those would be helpful as I needed to figure out if I should wait it out, or switch to Route 53.

In the information vacuum, I switched to Route 53. It works.

[+] scurvy|10 years ago|reply
How do you fail over your SOA on .com when the minimum TTL is 1 day?
[+] chronid|10 years ago|reply
DNS is hard. Very hard.

It may seem trivial when it works (hint: it's not), but some of the biggest fuck ups I've seen in my professional life were caused by strange DNS things happening or DNS servers going kaboom.

I feel the pain of the DO engineers trying to mitigate this issue. I really do.

[+] johansch|10 years ago|reply
BS. DNS is a trivial thing to scale, compared to most other web-scale efforts.

Things break when people don't use 20 year old best practices. There is no defense against inexperience and ignorance.

[+] dividuum|10 years ago|reply
> I feel the pain of the DO engineers trying to mitigate this issue. I really do.

Me too. Just last week they had another problem with DNS on the client side of things: Resolving with the Google Public DNS, which most droplets use by default, didn't work reliably. I hope that they post a combined post mortem for both of those incidents.

[+] Thaxll|10 years ago|reply
It's not hard, the problem is everything relies on DNS so when DNS goes down or has problems you have cascading failure.
[+] tyingq|10 years ago|reply
One thing hosting providers could do better would be to split the risk a little by not handing the same dns server name to every client that chooses to have the hosting provider supply dns services.

The reason this might have some upside is that DDoS attacks against a specific DNS server are often intended to target one specific customer of a hosting provider. The attacker doesn't care about the side effects...just the original target.

Say, for example "controversialblog.com" is hosted on DO, and uses DO dns servers. The person attacking "controversialblog.com" looks up the NS records for the domain, and attacks that DNS server. The fact that it's one hostname that serves all of DO is of little interest to the attacker.

So, if DO would come up with say, 10 separate hostnames they could hand out, then this sort of thing would take down 10% of their customers instead of 100%.
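A minimal sketch of the sharding idea above, assuming a stable hash of the customer's domain picks the name server. The hostnames in `SHARDS` and the helper names `assign_shard` and `blast_radius` are made up for illustration; nothing here describes how DO actually assigns name servers.

```python
import hashlib

# Hypothetical pool of name server hostnames (illustration only).
SHARDS = [f"ns-{i}.example-host.net" for i in range(10)]


def assign_shard(domain: str, shards=SHARDS) -> str:
    """Map a customer domain to one shard, stably across runs and machines."""
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def blast_radius(target: str, customers: list, shards=SHARDS) -> float:
    """Fraction of customers that share the attacked domain's shard."""
    hit = assign_shard(target, shards)
    affected = [c for c in customers if assign_shard(c, shards) == hit]
    return len(affected) / len(customers)
```

With ten shards and a reasonably uniform hash, an attack on the name server of one domain should touch roughly a tenth of the customer base. In practice each shard would be a set of several name servers rather than a single hostname.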

[+] traviswingo|10 years ago|reply
Yeah this is pretty unfortunate. We have some big investor meetings today and this unfortunately took our marketing site offline. Hopefully they resolve this soon - it's the first time we've ever experienced an issue with their service.

We really need fail-overs in place...small team problems.

[+] brndn|10 years ago|reply
If you are showing a demo or something, you can still navigate to your DO IP address. Of course, I don't know if other things (images, etc.) on your website also rely on their DNS.
[+] defenestration|10 years ago|reply
We feel the pain as well, as our platform is unreachable. I'm now using another DNS server and changed the nameserver in the domain record. However, the DNS propagation is taking some time. What are you doing at the moment as fail-over?
[+] sashk|10 years ago|reply
My rule: each provider should do a single thing:

- Hosting provider - host sites

- vps/cloud provider - provide VMs

- domain registrar - domain related stuff, but not DNS

- dns provider - host dns

- second dns provider - host dns in case first dns provider fails

So many DNS outages recently and all my projects are up.
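One way to check the "second dns provider" rule is to look at a domain's NS hostnames and count how many distinct providers they belong to. This is a rough sketch: `provider_of` and `has_independent_providers` are hypothetical helpers, and treating the last two labels of a hostname as the provider is an assumption that breaks for suffixes like .co.uk.

```python
def provider_of(ns_host: str) -> str:
    """Crude provider key: the last two labels of the name server hostname.

    Assumption: fine for names like ns1.example.com, wrong for multi-label
    public suffixes such as .co.uk.
    """
    labels = ns_host.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def has_independent_providers(ns_hosts: list) -> bool:
    """True if the NS set spans at least two distinct providers."""
    return len({provider_of(host) for host in ns_hosts}) >= 2
```

A domain delegated only to `ns1.digitalocean.com`, `ns2.digitalocean.com` and `ns3.digitalocean.com` fails the check; adding a name server from any second provider makes it pass.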

[+] copperx|10 years ago|reply
Does Amazon's Route 53 count as a DNS provider, or do you treat it as a hosting provider?
[+] nlivingstone|10 years ago|reply
Have multiple VMs @ Digital Ocean (TOR1), we use Cloudflare for DNS... All sites have remained available and are successfully fulfilling requests.
[+] fredophile|10 years ago|reply
I don't have anything more important than a small personal website but now I'm curious. If you set up a system to handle your main DNS provider failing, how do you test it? Is there a good reference where I can find some best practices on this?
[+] mrideout|10 years ago|reply
Here's my testing recommendation:

1. Pick some subset of your DNS records to monitor, or all of them if you want to be extra thorough. If you are picking a subset, then I'd pick whatever records are most critical to your business.

2. Set up monitoring that queries each of your authoritative name servers for each of the records that you identified in the previous step. The monitoring should notify you if any of the name servers are unresponsive, or return a different response than what's expected.
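The two steps above can be sketched roughly as follows. The actual DNS lookup is left as a parameter (`lookup`) so the comparison logic stays independent of any particular DNS library; `check_records` and the `Lookup` signature are assumptions for illustration, not an existing tool's API.

```python
from typing import Callable, Optional

# A lookup takes (name_server, record_name, record_type) and returns the set
# of answers, or None if the name server did not respond. How the query is
# sent (dig, a DNS library, raw UDP) is up to the caller.
Lookup = Callable[[str, str, str], Optional[set]]


def check_records(expected: dict, name_servers: list, lookup: Lookup) -> list:
    """Query every name server for every monitored record.

    `expected` maps (record_name, record_type) to the set of expected answers.
    Returns a list of problem descriptions; an empty list means all is well.
    """
    problems = []
    for ns in name_servers:
        for (name, rtype), want in expected.items():
            got = lookup(ns, name, rtype)
            if got is None:
                problems.append(f"{ns}: no response for {name} {rtype}")
            elif got != want:
                problems.append(
                    f"{ns}: {name} {rtype} returned {sorted(got)}, "
                    f"expected {sorted(want)}"
                )
    return problems
```

Run it on a schedule and send a notification whenever the returned list is non-empty. Querying each authoritative server directly, rather than going through a caching resolver, is what lets it catch a single bad or dead name server.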

If you'd like to dig into the details of DNS, then O'Reilly's "DNS and BIND" is highly recommended, even if you're not using BIND.

There are a number of quality hosting providers out there. A rule of thumb that I use is this: If a DNS hosting provider doesn't eat their own dog food, don't trust them to handle your DNS. Digital Ocean doesn't use their own name servers for their main website's domain. Neither does Amazon.

Shameless plug: I created a DNS monitoring service that can be used for monitoring each of your name servers: https://www.dnscheck.co/

[+] Rezo|10 years ago|reply
Their status page at https://status.digitalocean.com is also now giving an intermittent "500 Internal Server Error" nginx error, probably from the load. That's why you should use a service like https://www.statuspage.io for your important stuff, even though creating a status page is a fun side-project for a dev team.
[+] crisopolis|10 years ago|reply
So what you're saying is that instead of running their own Status Page on their own infrastructure that's reachable, they should outsource it to statuspage.io and pay another company to do it?
[+] clentaminator|10 years ago|reply
But who monitors the status of statuspage.io?
[+] karlgrz|10 years ago|reply
This is the first DNS outage I've experienced with them in 3+ years, then again I host everything in their NY regions.
[+] crisopolis|10 years ago|reply
I've never experienced an outage of any kind with DO, so also first time. I also host all my droplets in the NYC regions.
[+] josh_carterPDX|10 years ago|reply
Same. They've been pretty reliable. Hoping this doesn't last long or we'll be looking to move off.
[+] josh_carterPDX|10 years ago|reply
I think the most annoying aspect of this outage is their updates. Three updates and they all say the same thing with no meaningful information as to what's causing this. Likely they may not have much information, but you'd think there would be something more than what they've been posting for the past hour. Good times!
[+] coreyp_1|10 years ago|reply
Does anyone know of a good strategy for DNS failover?
[+] jamescun|10 years ago|reply
I would be interested in the post-mortem from this. While DigitalOcean operate their own DNS, it is only made publicly available through CloudFlare's DNS proxying service.
[+] nodesocket|10 years ago|reply
Recommend AWS Route53 very highly. Route53 also allows you to buy domain names and do lots of fancy fail-over, geolocation, and CNAME alias at the apex magic.
[+] samgranieri|10 years ago|reply
A few years ago Slicehost had a DNS outage and the webscrapers I had running were falling over because they couldn't resolve DNS. I had to SSH into 8 boxes and update resolv.conf to add Google DNS and OpenDNS as a backup. (Yes, I should've had centralized config management with chef or puppet or ansible)
[+] doublerebel|10 years ago|reply
No offense to anyone here, but what is DO's SLA? Last time I looked, they did not have one.

DO is cheap for a reason. And that's the same reason I don't host with them, I can get SLA-backed infrastructure for a reasonable price and would have no excuse to my customers or cofounders.

[+] xir78|10 years ago|reply
We have seamless DNS "failover" by running dnsmasq with the all-servers option on all our servers. It causes dnsmasq to query all upstream servers at once, so if any go down it's transparent to our apps. Works perfectly on our 1500 ec2 instances.
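The behavior described above, asking every upstream at once and using whichever answers first, can be illustrated with a small sketch. This is not dnsmasq's implementation; `resolve_from_all` is a hypothetical helper, and each entry in `resolvers` is assumed to be a callable that returns an answer or raises on failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def resolve_from_all(name, resolvers):
    """Ask every resolver concurrently and return the first successful answer.

    Raises RuntimeError only if every resolver fails.
    """
    pool = ThreadPoolExecutor(max_workers=len(resolvers))
    try:
        futures = [pool.submit(resolver, name) for resolver in resolvers]
        errors = []
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # a dead upstream just gets skipped
                errors.append(exc)
        raise RuntimeError(
            f"all {len(resolvers)} resolvers failed for {name}: {errors}"
        )
    finally:
        # Do not block on slower resolvers once an answer is in hand.
        pool.shutdown(wait=False)
```

The trade-off is extra query traffic to every upstream in exchange for not waiting on a timeout when one of them is down.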
[+] r1ch|10 years ago|reply
I thought their DNS was supposed to be rock solid since they use Cloudflare Virtual DNS. Oh well, lesson learned. Back to running my own DNS servers on each droplet, if the DNS is down the droplet is likely down regardless :).
[+] showerst|10 years ago|reply
Feeling the pain here too. What DNS providers do others use and like? Route53?
[+] rbritton|10 years ago|reply
The sites I have that are actually up right now are those routed through CloudFlare.
[+] stevekemp|10 years ago|reply
Route53 is hard not to love; simple to develop against and very very reliable.

I wrap it in git to make updates more straightforward for people unfamiliar with AWS, but even using it directly is very simple from multiple languages. (https://dns-api.com/)

[+] dboreham|10 years ago|reply
Bind, running on VMs. Not hard.
[+] joejoebob|10 years ago|reply
Where I work we use Route53. For my personal domains I just use my registrar, Namecheap.
[+] hornbaker|10 years ago|reply
dnsmadeeasy for around 8 years now