Hidden latency in the Linux network stack

[+] nkurz|10 years ago|reply

Stray thoughts:

1) Any chance this bug would have manifested as "connection reset" errors when accessing HN? I exchanged email with Dan a couple months ago trying to figure out why about 10% of my requests were failing, but we never figured out the root cause before (after some weeks) the problem went away.

2) As others have pointed out, doubling the number of hash buckets seems like a bandaid. But other than the scolding comment, is there any reason not to go to an appropriately sized hash? If you know in advance that you are going to have 16K addresses (ie, not the use case the original code anticipated), it would seem beneficial to choose a data structure that has a fighting chance of providing good performance.

3) This seems like a wonderful argument for _against_ running your high performance DNS server on the same machine as your other services. Would containerization have helped here, possibly with each OS pinned to a set of cores? Is the cost of splitting it off onto a separate physical machine prohibitive? At the optimization level you are aiming for, "process isolation" seems like a pretty leaky abstraction.

4) Going farther down the path of reducing jitter, have you considered running a tickless (NOHZ_FULL) kernel? Perhaps you are already doing it, but quieting down the cores running your important services can make a significant difference. I've been exploring this direction, and have found it rewarding. Info on one way to measure is here: https://kernel.googlesource.com/pub/scm/linux/kernel/git/fre...

[+] majke|10 years ago|reply

1) Not really. Connection resets have nothing to do here. Are you by any chance running on a slow network? We did some debugging of problems slow clients recently.

2) Correct. Increasing number of buckets is a bandaid. Improving the linux code for inet_lookup_listener is not a trivial task. But for some reason Linux did optimize similar function for UDP. Wonders of the network stack.

3) Not really, dns can coexist with other services. The bug was having many TCP listening sockets on the same port without a good reason really. There is no DDoS benefit in splitting receive queues for TCP.

Containers won't help, the packets go to the same instance of LHTABLE.

4) Scheduling jitter is a subject for separate story.

[+] Florin_Andrei|10 years ago|reply

If the issue is in the kernel, containers won't solve it.

[+] ryao|10 years ago|reply

> 2) As others have pointed out, doubling the number of hash buckets seems like a bandaid. But other than the scolding comment, is there any reason not to go to an appropriately sized hash? If you know in advance that you are going to have 16K addresses (ie, not the use case the original code anticipated), it would seem beneficial to choose a data structure that has a fighting chance of providing good performance.

I am not familiar with the code, but it looks to me like having a separate hash table for address-bound sockets like you suggest would work.

I suggest using some sort of extensible hash table (possibly with a bounded maximum size so things do not become crazy) to avoid statically sizing it. That way it is not too big on systems where it is not often used and not too small on systems where it is heavily used. The type of extensible hash table should be picked for compatibility with RCU to avoid causing a regression in multicore scaling. A quick google search suggests that a split-ordered list might work here:

https://lwn.net/Articles/573431/

As for the hash function, I would suggest adding the IP address and a random value calculated at boot time to the sum calculated by the current hash function before it currently truncates, run that sum through a next() function from a fast PRNG and then truncate the result to whatever the extensible hash table wants at the given moment. As for potential PRNG next() function donors, the xorshift128+ at the following site is fairly nice:

http://xorshift.di.unimi.it/

The idea behind using a PRNG's next() function would be to avoid artifacts in how IP addresses are assigned that would cause abnormally many hash table collisions. The idea behind adding a random number to the sum before the hash function is to ensure that people could not pick IP address + port combinations that hash to the same location as part of an algorithmic complexity attack. The potential for an algorithmic complexity attack here is relatively minor given that the problem occurs in normal use, but fixing it is rather inexpensive, so someone might as well do it when changing this.

Anyway, this does not do anything for the wild card case, but barring gratuitous use of SO_REUSEPORT, the number of sockets should be be small. Cloudflare is using address-bound sockets, so it probably would not matter for them.

[+] charliedevolve|10 years ago|reply

Why did increasing the size of the hash help? Wouldn't the %64 (or whatever the new value was) just send all the port 53 sockets into the same bucket again? It seems you'd need a different hash function that provides more uniformity.

[+] kbenson|10 years ago|reply

For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, :53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.

I suspect that's the real fix. Now all those (16k) bound addresses aren't creating hash table entries, so other connections that happen to use a port that hashes to 21 (or 53 after enlarging the table) aren't being shoved into a hash bucket that starts with 16k entries already in it.

The enlarging of the hash table I think is less a fix for this problem (although it would halve the number of later connections being put in the bucket), and more just a good fix they happened to do at the same time.

[+] majke|10 years ago|reply

It would reduce the probability of collision. Sure, port 53 will always go to unlucky bucket. But port 16275 may or may not collide depending on thehash size.

[+] xigency|10 years ago|reply

As others have answered, it doesn't help fix this problem in any way. I'm assuming the team doing this investigation wasn't willing to rewrite `__inet_lookup_listener`. I guess the quick fix became the concern at that point. (Compared to rewriting kernel modules.)

But yes, it seems like an unnecessary change.

[+] unknown|10 years ago|reply

[deleted]

[+] readams|10 years ago|reply

They might hash the source port also. Maybe they didn't mention this.

[+] kazinator|10 years ago|reply

In the internet protocol suite, there is something called ARP. That will cause a small latency spike whenever an ARP cache entry expires.

The host wants to send a datagram to some IP address Y somewhere. It knows that the route for that IP goes through some local gateway with IP address X, so it must actually send the datagram to X, for which it needs the MAC. Normally, it knows the MAC of that gateway, because X is associated with the MAC in the ARP cache.

From time to time, the ARP cache entry expires. In that case, before the packet can be sent to the remote host Y, an ARP query must be performed: an ARP "who has X?" broadcast request is generated, to which the gateway will reply. Only then can that host send the packet to Y, through X.

This extra ARP exchange shouldn't take anywhere near 100 milliseconds, of course. But that is beyond the control of the host that is querying.

[+] deegles|10 years ago|reply

I love these mysterious bug deep dives. I should start aggregating them somewhere...

[+] ecma|10 years ago|reply

I'm not entirely convinced that increasing the size of LHTABLE solves anything. True, it may remove some collisions in the hash table but note that 63925 % 64 = 53. Given that the two slow ports listed seem to be arbitrary and assigned to customers, they're probably just a symptom of the overload on 53/UDP. I'm not suggesting they chose 64, that's unclear, but whatever they chose probably just shifts problem the elsewhere. Increasing it /would/ inherently reduce the frequency of the events though so you can call that a win.

A naïve solution would be to choose a bucket based on the destination port as well as the source port if one is available (e.g. TCP). This might help balance load affecting particular local ports since we can assume the source port for TCP will be random enough. However, it doesn't solve the problem - it'll just hide it. Random spikes in latency for connections to random customers? Sounds undesirable.

A reasonable solution might be to work out a way to map gateway 53/UDP to a diverse set of ports which are bound to rrdns processes on the boxes which currently have 16K IP addresses. For UDP packets, this would be possible by doing on-wire modifications to the transport header and recalculating any checksums. Perhaps that just shifts the burden though.

[+] eridius|10 years ago|reply

You can't include the source port in the hash, because this is a table of listening sockets, i.e. no connection has been established yet and the socket needs to see packets from ANY source as long as they go to the right destination.

You could suggest including the bound destination IP in the hash, but then you'd also need a separate hashtable for sockets that are bound to any IP (instead of being bound to a specific IP).

[+] Gratsby|10 years ago|reply

Why not look into changing the hash keys such that they are not bound to listening port or alternatively put a new lookup and hash in place for this specific purpose?

While bind to star works, it feels like you answered an operational concern but left the design consideration on the table.

[+] forgotmypassw|10 years ago|reply

Even though I'm not using CloudFlare myself, I always enjoy reading their blog posts, these guys are amazing hackers.

[+] alberth|10 years ago|reply

What about DragonflyBSD?

I'm curious is if Cloudflare has investigated using DrafonflyBSD, given that it has a lockless network stack.

[+] caf|10 years ago|reply

This wasn't a locking issue.

[+] shiftoutbox|10 years ago|reply

I think Matt Dillon would love to see that . Also I would love to see the analysis of their situation on a UNIX , BSD's and even Windows .

[+] ishtu|10 years ago|reply

Can anyone suggest a good reading on differences in linux and freebsd network stacks?

[+] lazyant|10 years ago|reply

here's a diagram for Linux http://www.linuxfoundation.org/images/1/1c/Network_data_flow...

[+] pyvpx|10 years ago|reply

just the code, for anything recent and useful. I'm not aware of anything else, sadly.

[+] Zenst|10 years ago|reply

I wonder how if this was IPv6 if this problem would still entail and if IPv6 has design aspects that would negate this issue?

Would the lookup be different or the design aspect that produces what many would class a edge case instance, one that as companies grow, is only going to become more common.

[+] triplenineteen|10 years ago|reply

It seems like they are using a non-standard netstat switch:

  $ netstat -ep4ln --udp

Is that '4' a typo or something?

[+] dice|10 years ago|reply

It's present in my netstat, from `net-tools 2.10-alpha`. Man page says it's shorthand for `--protocol=inet` (as opposed to `inet6`) which shows only IPv4 sockets.

[+] Negitivefrags|10 years ago|reply

It means only list ipv4 sockets. It's pretty standard across many command line network tools.

[+] lttlrck|10 years ago|reply

It would be helpful to show the improvement for the two changes made individually instead of lumping them together.

[+] js2|10 years ago|reply

What did you increase the size of LHTABLE to? I don't see that mentioned in the post.

[+] majke|10 years ago|reply

In order to reduce the number of packets hitting the bad bucket. With LHTABLE of size 32 port 16257 will hit bad bucket. With other sizes port 16257 may not hit the naughty bucket.

[+] known|10 years ago|reply

Can't we fix/configure it in /etc/sysctl.conf

[+] wrigby|10 years ago|reply

That would actually be pretty cool, but the data would have to be redistributed across the new set of buckets when the parameter was changed. That could get really messy...

60 comments