1) Any chance this bug would have manifested as "connection reset" errors when accessing HN? I exchanged email with Dan a couple months ago trying to figure out why about 10% of my requests were failing, but we never figured out the root cause before (after some weeks) the problem went away.
2) As others have pointed out, doubling the number of hash buckets seems like a bandaid. But other than the scolding comment, is there any reason not to go to an appropriately sized hash? If you know in advance that you are going to have 16K addresses (ie, not the use case the original code anticipated), it would seem beneficial to choose a data structure that has a fighting chance of providing good performance.
3) This seems like a wonderful argument _against_ running your high performance DNS server on the same machine as your other services. Would containerization have helped here, possibly with each OS pinned to a set of cores? Is the cost of splitting it off onto a separate physical machine prohibitive? At the optimization level you are aiming for, "process isolation" seems like a pretty leaky abstraction.
4) Going farther down the path of reducing jitter, have you considered running a tickless (NOHZ_FULL) kernel? Perhaps you are already doing it, but quieting down the cores running your important services can make a significant difference. I've been exploring this direction, and have found it rewarding. Info on one way to measure is here: https://kernel.googlesource.com/pub/scm/linux/kernel/git/fre...
1) Not really. Connection resets are unrelated here. Are you by any chance running on a slow network? We did some debugging of problems with slow clients recently.
2) Correct. Increasing the number of buckets is a bandaid. Improving the Linux code for inet_lookup_listener is not a trivial task. But for some reason Linux did optimize the similar function for UDP. Wonders of the network stack.
3) Not really, DNS can coexist with other services. The bug was having many TCP listening sockets on the same port without a good reason, really. There is no DDoS benefit in splitting receive queues for TCP.
Containers won't help, the packets go to the same instance of LHTABLE.
4) Scheduling jitter is a subject for a separate story.
> 2) As others have pointed out, doubling the number of hash buckets seems like a bandaid. But other than the scolding comment, is there any reason not to go to an appropriately sized hash? If you know in advance that you are going to have 16K addresses (ie, not the use case the original code anticipated), it would seem beneficial to choose a data structure that has a fighting chance of providing good performance.
I am not familiar with the code, but it looks to me like having a separate hash table for address-bound sockets like you suggest would work.
I suggest using some sort of extensible hash table (possibly with a bounded maximum size so things do not become crazy) to avoid statically sizing it. That way it is not too big on systems where it is rarely used and not too small on systems where it is heavily used. The type of extensible hash table should be picked for compatibility with RCU to avoid causing a regression in multicore scaling. A quick Google search suggests that a split-ordered list might work here: https://lwn.net/Articles/573431/
As for the hash function, I would suggest adding the IP address and a random value calculated at boot time to the sum calculated by the current hash function before it truncates, running that sum through a next() function from a fast PRNG, and then truncating the result to whatever the extensible hash table wants at the given moment. As for potential PRNG next() function donors, the xorshift128+ at the following site is fairly nice: http://xorshift.di.unimi.it/
The idea behind using a PRNG's next() function would be to avoid artifacts in how IP addresses are assigned that would cause abnormally many hash table collisions. The idea behind adding a random number to the sum before the hash function is to ensure that people could not pick IP address + port combinations that hash to the same location as part of an algorithmic complexity attack. The potential for an algorithmic complexity attack here is relatively minor given that the problem occurs in normal use, but fixing it is rather inexpensive, so someone might as well do it when changing this.
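To make the idea concrete, here is a minimal Python sketch of that scheme. The `mix64` finalizer (borrowed from splitmix64, a close cousin of the xorshift generators mentioned above), the example `BOOT_SALT` constant, and the `bucket` helper are all illustrative names, not kernel code; in a real implementation the salt would be drawn from a secure random source at boot:

```python
# Illustrative sketch of a salted, mixed listener hash (not kernel code).
MASK = (1 << 64) - 1

def mix64(x):
    """splitmix64-style finalizer: scrambles even sequential inputs well."""
    x &= MASK
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK
    x ^= x >> 31
    return x

# Example salt; a real kernel would generate this randomly at boot so
# attackers cannot precompute colliding (IP, port) pairs.
BOOT_SALT = 0x9E3779B97F4A7C15

def bucket(ip, port, n_buckets, salt=BOOT_SALT):
    """Add IP, port and salt, mix, then truncate to the current table size."""
    return mix64(ip + port + salt) % n_buckets

# 64 sockets all bound to port 53 on consecutive IPs: a port-only hash
# puts every one of them in a single bucket, while the mixed hash
# spreads them across much of a 64-bucket table.
ips = range(0x0A000001, 0x0A000001 + 64)   # 10.0.0.1 .. 10.0.0.64
spread = {bucket(ip, 53, 64) for ip in ips}
print(len(spread))  # many distinct buckets, not 1
```

With a port-only hash these 64 listeners would all land in one bucket; including the address and salt in the input spreads them out and makes the bucket assignment unpredictable to outsiders.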
Anyway, this does not do anything for the wildcard case, but barring gratuitous use of SO_REUSEPORT, the number of sockets should be small. Cloudflare is using address-bound sockets, so it probably would not matter for them.
Why did increasing the size of the hash help? Wouldn't the %64 (or whatever the new value was) just send all the port 53 sockets into the same bucket again? It seems you'd need a different hash function that provides more uniformity.
For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, :53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.
I suspect that's the real fix. Now all those (16k) bound addresses aren't creating hash table entries, so other connections that happen to use a port that hashes to 21 (or 53 after enlarging the table) aren't being shoved into a hash bucket that starts with 16k entries already in it.
The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.
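The difference is easy to see with plain sockets. This Python sketch (using an ephemeral port rather than privileged port 53) shows one wildcard listener accepting a connection addressed to a specific local IP, which is what lets thousands of per-address TCP sockets collapse into a single listening-table entry:

```python
import socket

# One listener bound to the wildcard address ("bind to star").
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("", 0))          # "" == 0.0.0.0; port 0 picks an ephemeral port
srv.listen(8)
port = srv.getsockname()[1]

# A client connecting to a *specific* local address still reaches it:
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
conn, peer = srv.accept()
print("accepted from", peer)

conn.close()
cli.close()
srv.close()
```

One socket in the listening hash serves every local address, instead of one entry per bound IP.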
It would reduce the probability of collision. Sure, port 53 will always go to the unlucky bucket. But port 16275 may or may not collide depending on the hash size.
As others have answered, it doesn't help fix this problem in any way. I'm assuming the team doing this investigation wasn't willing to rewrite `__inet_lookup_listener`. I guess the quick fix became the concern at that point. (Compared to rewriting kernel modules.)
In the internet protocol suite, there is something called ARP. That will cause a small latency spike whenever an ARP cache entry expires.
The host wants to send a datagram to some IP address Y somewhere. It knows that the route for that IP goes through some local gateway with IP address X, so it must actually send the datagram to X, for which it needs the MAC. Normally, it knows the MAC of that gateway, because X is associated with the MAC in the ARP cache.
From time to time, the ARP cache entry expires. In that case, before the packet can be sent to the remote host Y, an ARP query must be performed: an ARP "who has X?" broadcast request is generated, to which the gateway will reply. Only then can that host send the packet to Y, through X.
This extra ARP exchange shouldn't take anywhere near 100 milliseconds, of course. But that is beyond the control of the host that is querying.
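The timing effect can be illustrated with a toy cache model (pure simulation; nothing here touches real ARP). Lookups are fast while the entry is cached, and each expiry inserts one extra resolution round-trip:

```python
# Toy model of an ARP-style cache with TTL expiry (illustration only).
class NeighborCache:
    def __init__(self, ttl, resolve, clock):
        self.ttl = ttl            # seconds an entry stays valid
        self.resolve = resolve    # slow path: broadcast "who has X?"
        self.clock = clock        # injected clock, for determinism
        self.entries = {}         # ip -> (mac, expiry_time)

    def lookup(self, ip):
        entry = self.entries.get(ip)
        now = self.clock()
        if entry and entry[1] > now:
            return entry[0]       # fast path: cache hit
        mac = self.resolve(ip)    # slow path: this is the latency spike
        self.entries[ip] = (mac, now + self.ttl)
        return mac

now = [0.0]
resolutions = []

def fake_resolve(ip):
    resolutions.append(ip)        # each call stands in for an extra RTT
    return "aa:bb:cc:dd:ee:ff"

cache = NeighborCache(ttl=60.0, resolve=fake_resolve, clock=lambda: now[0])
cache.lookup("192.0.2.1")   # miss: one resolution
cache.lookup("192.0.2.1")   # hit: no resolution
now[0] = 61.0               # entry expires
cache.lookup("192.0.2.1")   # miss again: the periodic spike
print(len(resolutions))     # 2
```

The spike recurs on the TTL period rather than per packet, which matches the "occasional" flavor of this kind of latency.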
I'm not entirely convinced that increasing the size of LHTABLE solves anything. True, it may remove some collisions in the hash table, but note that 63925 % 64 = 53. Given that the two slow ports listed seem to be arbitrary and assigned to customers, they're probably just a symptom of the overload on 53/UDP. I'm not suggesting they chose 64 (that's unclear), but whatever they chose probably just shifts the problem elsewhere. Increasing it /would/ inherently reduce the frequency of the events, though, so you can call that a win.
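The arithmetic in question is easy to check. A short sketch, assuming the bucket is simply `port % table_size` as the post describes:

```python
TABLE_OLD, TABLE_NEW = 32, 64

print(53 % TABLE_OLD)     # 21: the "unlucky" bucket at size 32
print(53 % TABLE_NEW)     # 53: still one fixed bucket at size 64
print(63925 % TABLE_NEW)  # 53: this port shares port 53's bucket either way

# Doubling the table halves how many of the 65536 possible ports share
# port 53's bucket, but it can never empty that bucket.
old_collisions = [p for p in range(65536) if p % TABLE_OLD == 53 % TABLE_OLD]
new_collisions = [p for p in range(65536) if p % TABLE_NEW == 53 % TABLE_NEW]
print(len(old_collisions), len(new_collisions))  # 2048 1024
```

So the resize is a constant-factor reduction in collision frequency, not a structural fix, which is consistent with both the "bandaid" framing and the "call that a win" framing.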
A naïve solution would be to choose a bucket based on the destination port as well as the source port if one is available (e.g. TCP). This might help balance load affecting particular local ports since we can assume the source port for TCP will be random enough. However, it doesn't solve the problem - it'll just hide it. Random spikes in latency for connections to random customers? Sounds undesirable.
A reasonable solution might be to work out a way to map gateway 53/UDP to a diverse set of ports which are bound to rrdns processes on the boxes which currently have 16K IP addresses. For UDP packets, this would be possible by doing on-wire modifications to the transport header and recalculating any checksums. Perhaps that just shifts the burden though.
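The checksum part of that rewriting is cheap. Here is a sketch using the standard Internet one's-complement checksum (RFC 1071) over a toy UDP-style header; rewriting the destination port and recomputing keeps the packet valid. A real middlebox would likely use the incremental update of RFC 1624 rather than a full recompute, and the 8-byte header here ignores the IP pseudo-header for simplicity:

```python
import struct

def inet_checksum(data: bytes) -> int:
    """RFC 1071 one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length buffers
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total > 0xFFFF:        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Toy 8-byte UDP header: src port, dst port, length, checksum (0 while computing).
src, dst, length = 40000, 53, 8
hdr = struct.pack("!HHHH", src, dst, length, 0)
packet = struct.pack("!HHHH", src, dst, length, inet_checksum(hdr))

# A receiver summing the whole header (checksum included) gets zero:
print(inet_checksum(packet))  # 0

# Rewrite the destination port, then recompute so the packet stays valid.
new_dst = 10053  # hypothetical per-process port
hdr2 = struct.pack("!HHHH", src, new_dst, length, 0)
packet2 = struct.pack("!HHHH", src, new_dst, length, inet_checksum(hdr2))
print(inet_checksum(packet2))  # 0
```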
You can't include the source port in the hash, because this is a table of listening sockets, i.e. no connection has been established yet and the socket needs to see packets from ANY source as long as they go to the right destination.
You could suggest including the bound destination IP in the hash, but then you'd also need a separate hashtable for sockets that are bound to any IP (instead of being bound to a specific IP).
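A sketch of that two-table lookup order, with hypothetical names (`exact` keyed by local IP and port, `wildcard` keyed by port alone), assuming specific binds win over wildcard binds as they do for Linux listening sockets:

```python
# Hypothetical two-table listener lookup (illustration, not kernel code).
exact = {}     # (local_ip, port) -> socket, for address-bound listeners
wildcard = {}  # port -> socket, for 0.0.0.0 listeners

def add_listener(sock, port, local_ip=None):
    if local_ip is None:
        wildcard[port] = sock
    else:
        exact[(local_ip, port)] = sock

def lookup_listener(dst_ip, dst_port):
    # Specific binds shadow wildcard binds, so check the exact table first.
    sock = exact.get((dst_ip, dst_port))
    if sock is not None:
        return sock
    return wildcard.get(dst_port)

add_listener("dns-1.2.3.4", 53, local_ip="1.2.3.4")
add_listener("dns-any", 53)                  # bind-to-star fallback
print(lookup_listener("1.2.3.4", 53))   # the address-bound listener
print(lookup_listener("5.6.7.8", 53))   # falls through to the wildcard
```

The cost is the second lookup on the miss path, but the 16K address-bound sockets stop polluting the table that every other lookup walks.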
Why not look into changing the hash keys so that they are not bound to the listening port, or alternatively put a new lookup and hash table in place for this specific purpose?
While bind to star works, it feels like you answered an operational concern but left the design consideration on the table.
I wonder whether this problem would still occur with IPv6, and whether IPv6 has design aspects that would negate this issue.
Would the lookup be different, or would the same design aspect produce what many would class as an edge case, one that, as companies grow, is only going to become more common?
It's present in my netstat, from `net-tools 2.10-alpha`. Man page says it's shorthand for `--protocol=inet` (as opposed to `inet6`) which shows only IPv4 sockets.
In order to reduce the number of packets hitting the bad bucket. With an LHTABLE of size 32, port 16277 will hit the bad bucket (16277 % 32 == 21, the same as port 53). With other sizes, port 16277 may not hit the naughty bucket.
That would actually be pretty cool, but the data would have to be redistributed across the new set of buckets when the parameter was changed. That could get really messy...
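A sketch of that redistribution cost: when the bucket count changes, every entry must be rehashed and may land in a different bucket. A real kernel table would additionally need an RCU-safe handover between the old and new arrays, which this toy version ignores:

```python
def build_buckets(ports, n_buckets):
    """Place each port into bucket port % n_buckets."""
    buckets = [[] for _ in range(n_buckets)]
    for p in ports:
        buckets[p % n_buckets].append(p)
    return buckets

def resize(buckets, new_size):
    """Rehash every entry into a fresh table: O(total entries)."""
    return build_buckets([p for b in buckets for p in b], new_size)

ports = [53, 117, 16277, 63925]
old = build_buckets(ports, 32)
new = resize(old, 64)

print(old[21])   # all four ports share bucket 21 at size 32
print(new[53])   # 53, 117 and 63925 still share a bucket at size 64
print(new[21])   # 16277 moved to a different bucket
```

Entries scatter unpredictably across the new array, so readers must never observe a half-migrated table; that handover is the messy part.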
But yes, it seems like an unnecessary change.
I'm curious whether Cloudflare has investigated using DragonFly BSD, given that it has a lockless network stack.