Out of curiosity, why don't caching DNS resolvers, such as the one I run on my home network, provide an option to retain last-known-good resolutions beyond the authority-provided TTL? In such a configuration, after the TTL expires, the resolver would attempt a refresh from the authority/upstream provider; if that attempt failed, it would fail more gracefully by returning the last-known-good resolution (perhaps with a flag). This behavior would continue until an administrator-specified, and potentially quite generous, maximum TTL expired, after which nodes would finally see resolution fail outright.
Ideally, then, the local resolvers of the nodes and/or the UIs of applications could detect the last-known-good flag on resolution and present a UI to users ("DNS authority for this domain is unresponsive; you are visiting a last-known-good IP provided by a resolution from 8 hours ago."). But that would be a nicety, and not strictly necessary.
Is there a spectacular downside to doing so? Since the last-known-good resolution would only be used if a TTL-specified refresh failed, I don't see much downside.
It'd be nice to have a "backup TTL" included, to allow sites to specify whether, and for how long, they want such caching behavior.
Also, that cache would need to only kick in when the server was unreachable or produced SERVFAIL, not when it returned a negative result. Negative results returned by the authoritative server are correct, and should not result in the recursive resolver returning anything other than a negative result.
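The policy described above ("serve stale only when the upstream fails, never in place of a negative answer") is easy to model. Below is a minimal, hypothetical Python sketch; the class name, the `upstream` callable, and the return shape are all illustrative, not any real resolver's API:

```python
import time

NOERROR, NXDOMAIN, SERVFAIL = "NOERROR", "NXDOMAIN", "SERVFAIL"

class ServeStaleCache:
    """Toy model of a serve-stale resolver cache.

    While an entry is fresh it is served from cache. Once it expires we
    re-query upstream; only if that query *fails* (timeout or SERVFAIL)
    do we fall back to the stale answer, for at most max_stale seconds
    past expiry. Authoritative negative answers (NXDOMAIN) are cached
    and returned as-is: they are correct data, not a failure.
    """

    def __init__(self, upstream, max_stale=86400):
        self.upstream = upstream   # callable: name -> (rcode, answer, ttl)
        self.max_stale = max_stale
        self.cache = {}            # name -> (rcode, answer, expires_at)

    def resolve(self, name, now=None):
        """Return (rcode, answer, is_stale)."""
        now = time.time() if now is None else now
        entry = self.cache.get(name)
        if entry and now < entry[2]:
            return entry[0], entry[1], False      # fresh cache hit

        try:
            rcode, answer, ttl = self.upstream(name)
        except OSError:                            # upstream unreachable
            rcode = SERVFAIL
        if rcode != SERVFAIL:                      # NOERROR and NXDOMAIN both count as answers
            self.cache[name] = (rcode, answer, now + ttl)
            return rcode, answer, False

        # Upstream failed: serve the last-known-good answer if still in grace.
        if entry and now < entry[2] + self.max_stale:
            return entry[0], entry[1], True        # True flags a stale answer
        return SERVFAIL, None, False
```

This is roughly the behavior that was later standardized for DNS as "serve-stale" in RFC 8767, and that options like Unbound's `serve-expired` implement.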
I've been thinking of adding this exact feature to the DNS framework I've been working on (if GitHub were resolving): https://github.com/bluejekyll/trust-dns
> Is there a spectacular downside to doing so? Since the last-known-good resolution would only be used if a TTL-specified refresh failed, I don't see much downside.
Because you would keep old DNS records around forever if a server went away for good. So you need a timeout for that anyway.
HTTP has a good solution for this: RFC 5861's stale-if-error=someTimeInSeconds Cache-Control extension lets the server specify, in addition to the TTL, how long every cache is allowed to continue serving stale data while the origin is unreachable. Probably a good idea to include such a mechanism in DNS, too.
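For reference, extracting that RFC 5861 grace period from a Cache-Control header value is a one-liner; a minimal sketch (the function name is mine):

```python
import re

def stale_if_error_budget(cache_control: str) -> int:
    """Seconds a cache may keep serving stale data on origin errors (0 if unset).

    Parses the RFC 5861 stale-if-error directive out of a Cache-Control
    header value such as "max-age=300, stale-if-error=86400".
    """
    m = re.search(r"(?:^|[,\s])stale-if-error\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(m.group(1)) if m else 0
```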
I seem to remember that DNS has generally been reliable (until recently, I guess), so probably nobody ever thought that would be necessary.
You could write a cron script that generates a date-stamped hosts file from a list of your most-used domain names, and simply use that on your machine(s) if your DNS ever goes down. That's basically a very simple local DNS cache.
If you feel like living dangerously, have it update /etc/hosts directly.
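A minimal sketch of that cron script in Python; the domain list and output filename are placeholders, and the resolver is an injectable parameter so the formatting logic can be exercised without network access:

```python
#!/usr/bin/env python3
"""Date-stamped hosts-file snapshot, per the cron idea above."""
import socket
from datetime import date

DOMAINS = ["github.com", "news.ycombinator.com"]   # your top-used names

def hosts_snapshot(domains, resolve=socket.gethostbyname):
    lines = []
    for name in domains:
        try:
            lines.append(f"{resolve(name)}\t{name}")
        except OSError:
            pass           # skip names that fail to resolve right now
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Writes e.g. hosts-20161021; symlink or copy over /etc/hosts as needed.
    with open(f"hosts-{date.today():%Y%m%d}", "w") as out:
        out.write(hosts_snapshot(DOMAINS))
```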
I think a problem you might be overlooking is that DNS lookups aren't just failing; they are also very slow while a DDoS attack on the authoritative servers is underway. This introduces a latency shock to the system, which causes cascading failures.
Everything will break the moment one of the websites you access makes a server-side request to another service (think logging services, server clusters, database servers, etc.); they all reference either IPs or, more likely, domains.
I wanted to provide an update on the PagerDuty service. At this time we have been able to restore the service by migrating to our secondary DNS provider. If you are still experiencing issues reaching any pagerduty.com addresses, please flush your DNS cache. This should restore your access to the service. We are actively monitoring our service and are working to resolve any outstanding issues. We sincerely apologize for the inconvenience and thank our customers for their support and patience. Real-time updates on all incidents can be found on our status page and on Twitter at @pagerdutyops and @pagerduty. In case of outages with our regular communications channels, we will update you via email directly.
In addition you can reach out to our customer support team at [email protected] or +1 (844) 700-3889.
Tim Armandpour, SVP of Product Development, PagerDuty
I had the privilege of being on-call during this entire fiasco today, and I have to say I was really disappointed. It's surprising how broken your entire service was when DNS went down: I couldn't acknowledge anything, my secondary on-call was getting paged because it looked like I wasn't responding, and I was getting phone calls for alerts that weren't even showing up in the web client. Overall, it caused chaos.
I appreciate the update, but your service has been unavailable for hours already. This is unacceptable for a service whose core value is to ensure that we know about any incidents.
Sorry if this sounds dickish, but renting 3 servers @ $75 apiece from 3 different dedicated server companies in the USA, putting TinyDNS on them, and using them as backup servers, would have solved your problems hours ago.
Even a single quad-core server with 4 GB of RAM running TinyDNS could serve 10K queries per second, extrapolating (and assuming hardware improvements) from this 2001 test, which showed nearly 4K/second on 700 MHz PIII CPUs: https://lists.isc.org/pipermail/bind-users/2001-June/029457....
EDIT to add: and lengthening TTLs temporarily would mean that those 10K queries would quickly lessen the outage, since each answer might last for 12 hours; and large ISPs like Comcast would cache the answers for all their customers, so a single successful query delivered to Comcast would have (some amount of) multiplier effect.
"Challenges" is exactly the sort of Dilbertesque euphemism that you should never say in a situation like this.
Calling it a "challenge" implies that there is some difficult, but possible, action that the customer could take to resolve the issue. Since that is not the case, this means either you don't understand what's going on, or you're subtly mocking your customers inadvertently.
Try less to make things sound nice and MBAish, and try more to just communicate honestly and directly using simple language.
Running multiple DNS providers is not actually that difficult and certainly not cost prohibitive. I am sure after this, we will see lots of companies adding multiple DNS providers and switching to AWS Route53 (which has always been solid for me).
The PagerDuty outage is the real low point of this whole situation. The email alerts from PagerDuty that should have flagged the outage in the first place were only delivered hours later, after the whole mess had cleared.
I'm a GitHub employee and want to let everyone know we're aware of the problems this incident is causing and are actively working to mitigate the impact.
"A global event is affecting an upstream DNS provider. GitHub services may be intermittently available at this time." is the content from our latest status update on Twitter (https://twitter.com/githubstatus/status/789452827269664769). Reposted here since some people are having problems resolving Twitter domains as well.
I'm curious why you don't host your status page on a different domain/provider. When checking this morning why GitHub was down, I couldn't reach the status page either.
If this is consistently a problem, why doesn't GitHub have fallback domains (on different TLDs) that use different DNS providers? Or even just code the site to work with static IPs. I tried GitHub's IP and it didn't load, but that could be for an unrelated reason.
Another status update from GitHub: "We have migrated to an unaffected DNS provider. Some users may experience problems with cached results as the change propagates."
We're maintaining yellow status for the foreseeable future while the changes to our NS records propagate. If you have the ability to flush caches for your resolver, this may help restore access.
pornhub.com:
Name Server: ns1.p44.dynect.net
Name Server: ns2.p44.dynect.net
Name Server: ns3.p44.dynect.net
Name Server: ns4.p44.dynect.net
Name Server: sdns3.ultradns.biz
Name Server: sdns3.ultradns.com
Name Server: sdns3.ultradns.net
Name Server: sdns3.ultradns.org
ultradns.biz:
Name Server: PDNS196.ULTRADNS.ORG
Name Server: ARI.ALPHA.ARIDNS.NET.AU
Name Server: ARI.BETA.ARIDNS.NET.AU
Name Server: ARI.GAMMA.ARIDNS.NET.AU
Name Server: ARI.DELTA.ARIDNS.NET.AU
Name Server: PDNS196.ULTRADNS.NET
Name Server: PDNS196.ULTRADNS.COM
Name Server: PDNS196.ULTRADNS.BIZ
Name Server: PDNS196.ULTRADNS.INFO
Name Server: PDNS196.ULTRADNS.CO.UK
github.com:
Name Server: ns2.p16.dynect.net
Name Server: ns-1283.awsdns-32.org.
Name Server: ns-1707.awsdns-21.co.uk.
Name Server: ns-421.awsdns-52.com.
Name Server: ns1.p16.dynect.net
Name Server: ns4.p16.dynect.net
Name Server: ns3.p16.dynect.net
Name Server: ns-520.awsdns-01.net.
Journalist and security researcher Brian Krebs believes this DDoS is payback for research into questionable "DDoS mitigation services" that he and Dyn's Doug Madory conducted; Doug presented the results at NANOG just yesterday. Read more: https://krebsonsecurity.com/2016/10/ddos-on-dyn-impacts-twit...
I'm wondering, from a regulatory perspective, what might be done to mitigate DDoS attacks in the future?
From comments made on this and other similar posts in the past, I've gathered the following:
1) Malicious traffic often uses a spoofed IP address, which is detectable by ISPs. What if ISPs were not allowed to forward such traffic?
2) There is no way for a service to exert back pressure. What if there was? e.g. send a response indicating the request was malicious (or simply unwanted due to current traffic levels), and a router along the way would refuse to send follow up requests for some time. There is HTTP status code 429, but that is entirely dependent on a well-behaved client. I'm talking about something at the packet level, enforced by every hop along the way.
3) I believe a substantial portion of the traffic is suspected to come from compromised IoT devices. What if IoT devices were required to continually pass some sort of health check before being allowed to make other HTTP requests? This could be enforced at the hardware/firmware level (much harder to change with malware): say, by sending a signature of the currently running binary (or binaries) to a remote server that gives a thumbs up/down.
Although I don't like to recommend Google products, they provide a public DNS-over-HTTPS interface that should be useful for people who want to add specific entries to their /etc/hosts files: https://dns.google.com/query?name=github.com&type=A&dnssec=t...
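As a sketch, Google's JSON resolve endpoint (currently https://dns.google/resolve) returns answers that can be turned into hosts-file lines; the parsing below assumes that response shape (`type` 1 = A record), and the IP in the test is a documentation address, not GitHub's real one:

```python
import json
from urllib.request import urlopen

def hosts_lines(doh_json, name):
    """Extract /etc/hosts lines from a Google DoH JSON answer (A records only)."""
    answers = json.loads(doh_json).get("Answer", [])
    return [f"{a['data']}\t{name}" for a in answers if a.get("type") == 1]

if __name__ == "__main__":
    name = "github.com"
    with urlopen(f"https://dns.google/resolve?name={name}&type=A") as resp:
        print("\n".join(hosts_lines(resp.read().decode(), name)))
```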
"digikey.com", the big electronic part distributor, is currently inaccessible. DNS lookups are failing with SERVFAIL. Even the Google DNS server (8.8.8.8) can't resolve that domain. Their DNS servers are "ns1.p10.dynect.net" through "ns4.p10.dynect.net", so it's a Dyn problem.
This will cause supply-chain disruption for manufacturers using DigiKey for just-in-time supply.
(justdownforme.com says the site is down, but downforeveryoneorjustme.com says it's up. They're probably caching DNS locally.)
If you're having issues with people accessing your running Heroku apps, it's likely because you're running your DNS through herokussl.com (with their SSL endpoint product) which is hosted on Dyn.
If you can update your DNS to CNAME directly to the ELB behind it, it should at least make your site accessible.
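For example, that change is a single CNAME swap; both hostnames below are made up for illustration:

```
; before: www pointed at the Dyn-hosted Heroku SSL endpoint
; www.example.com.  300  IN  CNAME  tokyo-1234.herokussl.com.
; after: point straight at the ELB behind it
www.example.com.    300  IN  CNAME  my-app-elb-1234567890.us-east-1.elb.amazonaws.com.
```

One caveat: a CNAME can only be used on a subdomain like www, not at the zone apex.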
Just to be clear, this is a DDoS against Dynect's NS hosts, right?
I'm confused because of the use of "dyn dns", which to me means dns for hosts that don't have static ip addresses.
I'm actually surprised that so many big-name sites rely on Dynect (which I hadn't heard of), and more importantly that they don't seem to list another provider's NS hosts as their 2nd through 4th entries.
davidu | 9 years ago:
It's called SmartCache.
bluejekyll | 9 years ago:
If you have any feedback, I'd love to hear it.
DivineTraube | 9 years ago:
https://tools.ietf.org/html/rfc5861
jasimp | 9 years ago:
I don't want to say much more, since it's my job and I don't want to give too much away.
EDIT: https://www.google.com/patents/US8583801
jedisct1 | 9 years ago:
https://github.com/jedisct1/edgedns
scrollaway | 9 years ago:
https://www.schneier.com/blog/archives/2016/09/someone_is_le...
Edit: And to be clear: I don't mean to imply there's any connection :)
jssjr | 9 years ago:
Latest status message: https://twitter.com/githubstatus/status/789565863649304576
dEnigma | 9 years ago:
1. Tried to download the "Unknown Horizons" binary (a game featured recently on Hacker News); the GitHub link doesn't work.
2. Think "OK, might be an old link", google their GitHub repository; GitHub appears down.
3. Try accessing the GitHub status website; it's down.
4. Intrigued, try to visit the GitHub status Twitter account; Twitter is down.
Really weird experience; normally, during an attack, at least the second news source I try about a downed website works.
foobarbecue | 9 years ago:
"Popular tech site Hacker News reported many other sites were affected including Etsy, Spotify, Github, Soundcloud, and Heroku." -- http://fortune.com/2016/10/21/internet-outages/
chromaton | 9 years ago:
$ dig @8.8.8.8 www.paypal.com
; <<>> DiG 9.8.1-P1 <<>> @8.8.8.8 www.paypal.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 17925
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.paypal.com.    IN    A

;; Query time: 29 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Oct 21 12:35:33 2016
;; MSG SIZE  rcvd: 32
jtmarmon | 9 years ago:
So far: Twitter, Etsy, SoundCloud, Spotify, GitHub, PagerDuty... crazy that this can even happen.
Animats | 9 years ago:
This is worth reading. It has links to copies of the code and names the known control servers. Quite a bit is known now about how this thing works.
The bots talk to control servers and report servers. The attacker appears to communicate with the report servers over Tor.
[1] http://blog.level3.com/security/grinch-stole-iot/