top | item 27920665

Akamai Edge DNS was down

465 points| vhab | 4 years ago |edgedns.status.akamai.com

218 comments

geocrasher|4 years ago

People don't believe me when I say how much DNS matters. So I wrote a song about it.

https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tun...

dolni|4 years ago

> People don't believe me when I say how much DNS matters.

That's weird to me. I have been working in sysadmin/DevOps for over a decade, but it did not take me very long to learn that DNS outages cause massive problems.

wpasc|4 years ago

dns, DNS, dns, dns. The start of every process, dns.

Love this.

southerntofu|4 years ago

Sounds amazing! Do you maybe have a direct link? Soundcloud doesn't want us privacy-conscious users browsing their website :(

ricardo81|4 years ago

Brilliant. NOERROR for this.

Frost1x|4 years ago

This made my day, thanks!

mvanbaak|4 years ago

Awesome! Thank you.

pololee|4 years ago

Thank you!! lol

brianjking|4 years ago

lol, thanks for the laugh.

patleeman|4 years ago

I just teared up

dbsmith83|4 years ago

https://downdetector.com/archive

So many sites down... and unfortunately not one of them is Twitter

cpgeier|4 years ago

Amazing that Downdetector manages to stay up during these kinds of outages. Noticed it has been a little slow, but they really have done a good job keeping it up even though large portions of the internet are down right now.

mcintyre1994|4 years ago

It's interesting that they report an AWS outage but there don't seem to be any issues there. Looks like their methodology is a bit too reliant on those speculative tweets from the first 5 minutes of all these sites going down. https://downdetector.com/status/aws-amazon-web-services/

> So many websites are down, are AWS servers down or something?

> Amazon web services is down which is affecting a lot of company web sites and services. Not sure what is going on.

> Miss us? @aldotcom and a whole bunch of other folks have been knocked off the internet by what appears to be an AWS attack/system failure. We'll be back. ?

grawprog|4 years ago

You got your wish, looks like Twitter's on the list now too.

dheera|4 years ago

Is there a way to tell your system to fall back to the last known IP address if DNS server isn't reachable?

Basically soft-invalidate your local DNS cache, but bring an entry back from the cache graveyard if DNS is down.
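
One way to picture the "cache graveyard" idea is a small wrapper that keeps expired answers around and only serves them when a fresh lookup fails. This is a minimal sketch, not a system-level setting: the class name `StaleFallbackResolver` and its `lookup`, `ttl`, and `clock` parameters are made up for illustration, and the TTL here is a fixed local value, not the record's real TTL. Some recursive resolvers offer comparable "serve stale" behaviour, which is described in RFC 8767.

```python
import socket
import time


class StaleFallbackResolver:
    """Cache lookups; if a fresh lookup fails, serve the last known answer."""

    def __init__(self, lookup=socket.gethostbyname, ttl=60.0, clock=time.monotonic):
        self._lookup = lookup  # injectable so the sketch can run without a network
        self._ttl = ttl        # seconds an entry counts as fresh (assumed, not the DNS TTL)
        self._clock = clock
        self._cache = {}       # hostname -> (address, time stored)

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh cache hit
        try:
            address = self._lookup(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if entry is not None:
                return entry[0]  # expired, but better than nothing
            raise  # never resolved this name, so there is nothing to fall back to
        self._cache[hostname] = (address, now)
        return address
```

The trade-off is the one raised elsewhere in the thread: a stale address is only useful if the server behind it has not moved in the meantime.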

1f60c|4 years ago

> Unfortunately not one of them is Twitter

Please keep comments like this off HN

tjpnz|4 years ago

Just got booted out of Netflix on the PS4 because the console could no longer connect to Sony's license server. Netflix was working just fine by the way.

lxgr|4 years ago

Was the app installed/running using a secondary PSN account by any chance? This shouldn't be happening on a primary account/console pair.

vmception|4 years ago

Ah, that's what's going on. Happened to me as well; I just assumed that Sony was neglecting PS4 performance with its new system, while bogging it down with bloatware.

hackerbrother|4 years ago

Yup, I learned Hulu on Xbox One relies heavily on some Microsoft authentication during a recent Office 365 or Azure outage (not sure which).

tyingq|4 years ago

You can see this on a lot of sites right now. You get the Akamai style error with something like:

  Reference: #11.453a2f17.1393u44848484.3aee33433
At the bottom of a very bland looking error page.

halfmatthalfcat|4 years ago

You could argue Akamai is the blandest of the CDN bunch; their UIs are atrocious.

lowbloodsugar|4 years ago

What's frustrating is that DNS is returning an address, instead of just failing, and so macOS is caching that value (though it might be Cloudflare doing that).

cbeley|4 years ago

I wonder if this is why LastPass is down. It has completely locked me out of my vault. You'd think it'd continue to work offline in a case like this. :/

eunai|4 years ago

I switched to BitWarden and haven't looked back. You can use it on the phone and pc (browser). As well as a desktop client.

zxcvbn4038|4 years ago

When it comes to password managers, 1password is the one to beat. Much better experience in every regard.

davidjgraph|4 years ago

Serious question, has anyone properly solved the issue of DNS as a single point of failure?

sakisv|4 years ago

Depending on where you draw the line of "single point of failure", you could use multiple providers for your DNS.

GOV.UK for example uses both aws and gcp for DNS

grishka|4 years ago

And then there are Cloudflare and other Centralized Downtime Networks as another point of failure.

citrin_ru|4 years ago

It is relatively easy to make DNS highly redundant: just put multiple DNS servers in data centers which are as independent as possible (different geo locations, different ISPs). You can also use different DNS software and different OSes (say BSD+Linux) to exclude correlated bugs. Root DNS servers AFAIK use different software for this reason.

Problems start when you want to easily make frequent changes and introduce complex software to manage DNS zones (and complexity usually comes with bugs).

hk1337|4 years ago

The problem isn't DNS though, is it? The problem is that people don't necessarily use the redundancies on DNS?

The whole reason it takes a domain 24h to fully work with DNS is because it propagates the information to other DNS servers, thus making it not a centralized service.

tyingq|4 years ago

It's an interesting question, as it's always been solved on the server side. All of the current problem is client side. That is, client resolvers that aren't using diverse providers, and only do things like round-robin with long timeouts.
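
The client-side fix hinted at here, diverse resolvers with short timeouts, can be sketched as a simple failover loop. This is an illustration only: `resolve_with_failover` and the `query` callable are hypothetical names, and `query(resolver, name, timeout)` stands in for whatever DNS library actually sends the request to a specific resolver.

```python
def resolve_with_failover(name, resolvers, query, timeout=1.0):
    """Try each resolver in order; return (resolver, answer) from the first that succeeds.

    `query` is a caller-supplied function (an assumption of this sketch) that
    raises OSError or TimeoutError when a resolver does not answer in time.
    """
    errors = []
    for resolver in resolvers:
        try:
            return resolver, query(resolver, name, timeout)
        except (OSError, TimeoutError) as exc:
            errors.append((resolver, exc))  # remember why this resolver was skipped
    raise LookupError(f"all resolvers failed for {name}: {errors}")
```

A short timeout matters as much as the list itself: with a long timeout, a dead first resolver still stalls every lookup before the second one is tried. Note that this only helps when the recursive resolver is the broken part; it does nothing if the authoritative servers for the name are down.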

toddh|4 years ago

You can still hardcode IP addresses. Not sure most people realize DNS isn't actually needed, you know, except for convenience and all that.

topranks|4 years ago

It’s one of the most successful, global, distributed databases of all time.

What’s the single point of failure?

foobarbazetc|4 years ago

Absolutely amazing how many billion $+ companies are single homed for DNS.

I wonder how much they spend on multi-AZ redundant architectures...

orblivion|4 years ago

So here's a weird question: Supposing companies multi-home for DNS, or whatever other essential service, via multiple service providers.

Whatever multi-home means, why can't there just be one service provider that does that? And are we sure that these service providers aren't already doing that as best we might hope for? (For instance, Amazon already has multiple zones, etc.)

I suppose the one thing this can't protect against is some sort of political (broadly defined) threat related to the company itself.

toast0|4 years ago

Using multiple providers for mostly static DNS is easy, pick one as primary and AXFR to the other and notifications and whatever. Or it's not too hard to keep a zone file in source control and sync it to the providers.

Using multiple providers for fancy DNS, like only providing IPs that pass healthchecks or geotargeting users to datacenters, gets pretty hard, because the different providers have similar capabilities but no uniform interface, so you've either got to do it manually, or you have to build out your own abstraction that is probably limiting.

If possible, insourcing DNS makes the most sense to me, because if you can't keep your service online, it's not the worst if your DNS is offline; and if you can keep your service online, you probably won't mess up your DNS too badly.
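
For the simple case mentioned above, a mostly static zone kept in source control and synced to several providers, the core of the sync step is a diff between the desired records and what each provider currently serves. A minimal sketch, assuming records are plain `(name, type, ttl, value)` tuples; `plan_sync` is a hypothetical helper, and pushing the changes through each provider's own API is left out because those interfaces differ.

```python
def plan_sync(desired, actual):
    """Return (to_create, to_delete) so that `actual` ends up matching `desired`.

    Both arguments are iterables of (name, type, ttl, value) tuples.
    A changed record shows up as one delete plus one create.
    """
    desired, actual = set(desired), set(actual)
    return sorted(desired - actual), sorted(actual - desired)


# The zone as committed to source control (addresses are documentation ranges).
zone = [
    ("www.example.com.", "A", 300, "192.0.2.1"),
    ("example.com.", "MX", 3600, "10 mail.example.com."),
]
# What one provider currently serves: the A record is out of date.
provider_records = [
    ("www.example.com.", "A", 300, "192.0.2.9"),
    ("example.com.", "MX", 3600, "10 mail.example.com."),
]
to_create, to_delete = plan_sync(zone, provider_records)
```

Running the same plan against every provider keeps them consistent without relying on zone transfers between them.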

nexuist|4 years ago

Might be survivorship bias. Multi-AZ arch protects against all other failures, so the only one that remains visible to the outside world is DNS.

zxcvbn4038|4 years ago

Most CDNs offer huge incentives for sending them more traffic; a lot of the time you end up in a contract obligated to handle X requests and Y gigabytes of traffic per month. But personally I believe you should never have a single provider for anything - particularly when it’s acceptable for a company to cut you off with no warning or recourse.

topranks|4 years ago

Problem is, if you're on Akamai’s CDN, only Akamai knows where the local caches are. You need to be on their DNS only.

delgaudm|4 years ago

Lastpass is down, so if you use lastpass the effect is significantly compounded.

mcintyre1994|4 years ago

Do they not cache everything locally? I'd have thought a password manager/secure data store would work offline.

nonfamous|4 years ago

It still works in offline mode. You can’t update passwords, but you can retrieve them.

lowbloodsugar|4 years ago

So many sites being reported as down, but change your DNS to something else (e.g. Google 8.8.8.8 and 8.8.4.4) and, after flushing your DNS cache, the sites are available. I was unable to get to ups.com or newegg.com (why yes, I am expecting a new toy), but after switching DNS and flushing DNS cache, I was able to get to both.

Specifically, 1.1.1.1 provided bad addresses (as opposed to no addresses), and removing 1.1.1.1 fixed my problem. By then it had returned a bunch of bad addresses and I had to flush my DNS cache.

aix1|4 years ago

Could you give an example of what you mean by a "bad address" in this context?

thunfisch|4 years ago

Yep, all our EdgeDNS zones as well as DSD edgekeys are just returning SERVFAILs. Many big German websites are down right now.

zhdc1|4 years ago

Several unrelated websites I was trying to visit are down. I figured I would find the answer on HN : )

knaik94|4 years ago

I am surprised financial institutions don't have any regulation for redundancy. The one that stuck out to me is the Navy Federal Credit Union website being down. I have not had any issues logging into mobile though for some of the reported sites.

deckard1|4 years ago

this is prime shit Hacker News says right here. Wait until you learn banks close on Sunday. Or have maintenance windows for their website, ATM, etc.

toomuchtodo|4 years ago

Commercial banks are held to a different operational resiliency standard than financial infrastructure.

(a component of my consulting work is reporting to financial regulators for institutions)

Terretta|4 years ago

> financial institutions don't have any regulation for redundancy

As CTO of a bank, I wasn’t aware of this. So either we wasted a ton of money and time constantly upgrading redundancy and business continuity technologies to satisfy our regulators… or this statement could be mistaken.

christophilus|4 years ago

I'm not sure how easy it would be to regulate. But yeah. I've got a few short term trades in my brokerage account, and outages really throw a wrench into those.

brentm|4 years ago

CapitalOne has a broken login which is pretty surprising to me.

cryptoz|4 years ago

All major Canadian banks were down.

cbono1|4 years ago

Why would Google and Amazon be on the downdetector list or experiencing issues? Don't they have their own DNS / nameservers separate from Akamai?

sathackr|4 years ago

because the way downdetector works is it just basically counts how many people are searching/visiting for <site> down and if it's much higher than typical it flags the site as down.

So if everyone searched "is google down" and visited the link on downdetector that was returned in the search, that would add to the downdetector count for that site.

Downdetector doesn't actually know if the site is up or down.
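
The "much higher than typical" rule described above can be sketched as a comparison against a baseline. This is a guess at the general shape of such a check, not Downdetector's actual method: the function name `looks_down`, the z-score approach, and the threshold of 3 are all assumptions for illustration.

```python
from statistics import mean, pstdev


def looks_down(history, current, threshold=3.0):
    """Flag a site when the current report count is far above its usual level.

    `history` holds report counts from comparable past intervals and
    `current` is the count for the interval being checked.
    """
    baseline = mean(history)
    spread = pstdev(history) or 1.0  # avoid dividing by zero on a flat history
    return (current - baseline) / spread > threshold


usual_reports = [4, 5, 6, 5, 4, 6]
print(looks_down(usual_reports, 6))   # False: within normal variation
print(looks_down(usual_reports, 50))  # True: a spike in "is it down?" reports
```

A check like this measures how many people are asking, not whether the site answers, which is why unaffected services can get flagged when users guess at the cause.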

memco|4 years ago

Was just browsing a website where the first page of a query worked, but visiting page 2 of the results was returning a DNS error. Was curious how and why only part of the site was down, but it looks like this was the problem as now the whole site is down.

katbyte|4 years ago

aren't short DNS TTLs great?

sebyx07|4 years ago

The good parts of centralisation

schemathings|4 years ago

Possibly related .. Verizon peering issues / ASN701 at Equinix NY2 in Secaucus NJ

mvanaltvorst|4 years ago

What role does Akamai Edge DNS play in normal internet traffic? DNS responses usually get cached, as far as I understand. And it is usually possible to change your DNS server to e.g. Google's and circumvent the outage. Does Akamai Edge DNS play a role on the server side?

uncertainrhymes|4 years ago

If you use a CDN to front your traffic, you need the CNAME for www (or whatever) to be pointing at their DNS infrastructure, so they can return whichever closest POP is going to serve your traffic.

e.g. dig @1.1.1.1 www.nvidia.com +trace

... various things from the root ...

www.nvidia.com. 7200 IN CNAME www.nvidia.com.edgekey.net.
;; Received 83 bytes from 208.94.148.13#53(ns5.dnsmadeeasy.com) in 35 ms

So the main DNS is fine, but it'll never get an A record because the last link in the chain is toast -- edgekey being Akamai in this case, but all CDNs do this so they can route traffic. Normally, this is a good thing so they can shift traffic within 30 seconds on their side. Unfortunately, it also means it would take nvidia two hours to point away from Akamai.
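
The failure mode described above can be reproduced with a toy resolver over hard-coded records. Only the first record (the 7200-second CNAME) comes from the dig output; the CDN-side A record, its address, its 30-second TTL, and the operator labels `customer-dns` and `cdn-dns` are invented for this sketch.

```python
class ServFail(Exception):
    """Raised when a link in the resolution chain cannot be answered."""


RECORDS = {
    # From the dig output: served by the site's own DNS provider, TTL 7200 s.
    "www.nvidia.com.": ("CNAME", "www.nvidia.com.edgekey.net.", 7200, "customer-dns"),
    # Hypothetical CDN-side answer; 192.0.2.10 is a documentation address.
    "www.nvidia.com.edgekey.net.": ("A", "192.0.2.10", 30, "cdn-dns"),
}


def resolve(name, records, down=frozenset(), max_hops=8):
    """Follow CNAMEs until an A record is found; fail if any operator is down."""
    for _ in range(max_hops):
        rtype, value, _ttl, operator = records[name]
        if operator in down:
            raise ServFail(f"{operator} did not answer for {name}")
        if rtype == "A":
            return value
        name = value  # CNAME: continue with the target name
    raise ServFail("CNAME chain too long")


print(resolve("www.nvidia.com.", RECORDS))  # 192.0.2.10
try:
    resolve("www.nvidia.com.", RECORDS, down={"cdn-dns"})
except ServFail as exc:
    print("SERVFAIL:", exc)  # the first hop answered, the last one did not

cname_ttl = RECORDS["www.nvidia.com."][2]
print(cname_ttl / 3600, "hours until a changed CNAME is seen everywhere")  # 2.0 hours
```

The last line is the two-hour figure from the comment: resolvers that cached the CNAME may keep following it to the broken zone for up to 7200 seconds after the site owner changes it.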

carlsborg|4 years ago

Looks like this: the affected subdomains are CNAMEd to the akamai CDN, and the Nameserver for those are/were down.

So for example:

Top level domain for nvidia resolved fine..

dig @1.1.1.1 nvidia.com => status: NOERROR, Nameservers are ns6.dnsmadeeasy.com

But the website didn't. dig @1.1.1.1 www.nvidia.com => status: SERVFAIL,

The nameserver for www.nvidia resolved to the Akamai nameserver, which had a problem..

dig @1.1.1.1 www.nvidia.com NS => CNAME e33907.a.akamaiedge.net.

r1ch|4 years ago

The trend these days is DNS TTLs of 60 - 300 seconds, to allow "Cloud agility" or something, so sites are exposed to a much larger risk of authoritative nameservers going down.
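
A rough way to quantify that exposure: if caches refreshed the record at times spread evenly over one TTL before the outage began, the share of caches still holding a valid answer when the outage ends is max(0, 1 - outage / TTL). This is a simplified model under that stated assumption; `fraction_still_cached` is a name made up for the sketch, and the one-hour outage is an illustrative figure.

```python
def fraction_still_cached(ttl_seconds, outage_seconds):
    """Share of caches whose entry outlives an outage of the given length.

    Assumes cache entries were refreshed uniformly over the TTL window
    before the authoritative servers stopped answering.
    """
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return max(0.0, 1.0 - outage_seconds / ttl_seconds)


ONE_HOUR = 3600
for ttl in (60, 300, 7200, 86400):
    share = fraction_still_cached(ttl, ONE_HOUR)
    print(f"TTL {ttl:>5}s: {share:.0%} of caches still valid after a one-hour outage")
```

Under this model, TTLs of 60 to 300 seconds leave no cache valid by the end of an hour-long outage, while a one-day TTL would leave most of them unaffected.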

NeckBeardPrince|4 years ago

> What role does Akamai Edge DNS play in normal internet traffic?

Clearly a big one.

twalichiewicz|4 years ago

Posted this in the thread about the travel websites being down, but it seems Fidelity is entirely impossible to sign in to / trade on right now.

00deadbeef|4 years ago

Figured this out almost 30 minutes before they bothered to update their status page.

00deadbeef|4 years ago

Well it's been an hour now since I first noticed the effects and their service status still has no useful information or ETA for a fix. It's just an "emerging issue".

jonnyone|4 years ago

The affected sites that I use are now working. Check again.

testplzignore|4 years ago

Strange thing about the duration of this outage... From logs I have, it seems to have lasted exactly one hour, from 15:38 to 16:38. Their Twitter account also said "disruption lasted up to an hour", though they incorrectly said it started at 15:46 (did it take 8 minutes for their monitoring to notice?).

That makes me think that whatever the fix was, it had to wait for some one-hour cache to expire before it took effect. I'm very interested to find out what the cache issue was, more so than what the original bug was.

swarnie_|4 years ago

I love seeing these issues reverberate around the internet.

This time I think /r/sysadmin pegged the issue first, great sub.

nowahe|4 years ago

I'm in the middle of a migration from Akamai to Cloudfront, time to take a break I guess

soheil|4 years ago

App Store on MacOS is down!

aliswe|4 years ago

Not only that, their support telephone line (in Sweden) was down as well.

xyzzy21|4 years ago

And people wonder why I try to avoid depending on online anything...

didjathinkmess|4 years ago

Cyberpolygon already? Thought we had at least a month or two

penultimatebro|4 years ago

Shh, normies are not ready for that.

It’s just a completely random DNS outage, nothing more.

SjorsVG|4 years ago

Many bank systems are disrupted by this in the Netherlands

ricardo81|4 years ago

My UK bank (HBOS) seemed to have 'online banking unavailable' though their site was up. No doubt related.

SjorsVG|4 years ago

Many banks in the Netherlands are affected by this.

tru3_power|4 years ago

Any idea on cause? Ddos or hardware failure?

MrRadar|4 years ago

Widespread issues like this on major CDNs tend to be configuration errors.

_joel|4 years ago

So that's why the NHS website is down

jdlyga|4 years ago

Oops, someone unplugged the DNS machine

blondie9x|4 years ago

Looks like it is fixed now!

bpye|4 years ago

This is apparently why I can't book my COVID vaccine appointment...

_joel|4 years ago

Yes, was trying to do the same. Getting this 2nd jab has been a nightmare. Places listed as walk-in with Moderna don't have it, and they ran out of it when I went to get my scheduled jab. Ringing 119 just ends up in a dead line, then this outage. Fun.

throwawaysha|4 years ago

I ran DNS servers, among other things, in the late 90s with better uptime than these "multi-DC/AZ/geo redundant" services everyone uses these days.

topranks|4 years ago

With all due respect, having also run auth DNS servers in the 90s, and seen the inside of Akamai’s CDN/DNS setup more recently, it isn’t remotely at the same level of scale or sophistication.

conqrr|4 years ago

[deleted]

fredski42|4 years ago

I thought DNS was supposed to be resilient

topspin|4 years ago

DNS is designed to be fault tolerant. Such a design, however, is often not leveraged correctly; the implementation of DNS can be and frequently is subject to SPOFs.

simonswords82|4 years ago

I'm sick and tired of these types of services (I'm looking at you too Cloudflare) going down and taking otherwise healthy websites down with them.

ceejayoz|4 years ago

Most websites using Akamai aren't gonna be "otherwise healthy" without the CDN handling most of the load.

tootie|4 years ago

It was fastly last time.

sammy2244|4 years ago

Cloudflare hasn't had an outage in a long time. And when they do, they are upfront about it and post a detailed post-mortem.

gianpaj|4 years ago

https://www.interactivebrokers.co.uk/ , a trading platform, is down as well :(

How am I going to sell my AMC stock...

swarnie_|4 years ago

You don't, you hold the dumb, overpriced stock as a reminder for future, better informed investing.