Why do we use the Linux kernel's TCP stack? (2016)

[+] brendangregg|5 years ago|reply

The Linux network stack has been rapidly evolving; nowadays we have XDP for a fast path through the kernel. But there's also a lot of moving parts that ensure high throughput on noisy networks, for different application workload patterns, to avoid bufferbloat, and to do this fairly across multiple processes/containers. I'd imagine one can develop a user-space stack that appeared better in a simple iperf microbenchmark in a lab, but would lose for real-world applications and networks (unless your real-world use case was very narrow).

I'd recommend understanding the stack before choosing to replace it (the old programmer's advice about understanding a gate before removing it). I summarized the TCP send path in my next book (sysperf2, with help from Linux network engineers); it involves:

  - Socket send buffers
  - Pacing
  - TSO
  - Congestion controls
  - Nagle
  - TSQ
  - qdisc (optional)
  - GSQ
  - BQL
  - TSO (NIC)

That's just the TCP send path.

I'm happy to expand all these acronyms, but if any are new to you, that's kind-of my point. These parts have a purpose, and starting from scratch may lead you to eventually re-invent them all as you encounter the problems they solve.

[+] techman9|5 years ago|reply

This is super useful! I've been hacking recently on qdisc classifiers, and trying to set a filter up with BPF, but there's a paucity of documentation on the subject. The tc tools are sorta documented here (https://man7.org/linux/man-pages/man8/tc-bpf.8.html), but it's quite hard to find docs on the underlying syscalls and the behavior of `bpf_prog_type_sched_cls`.

I bought a copy of BPF performance tools to learn more about the BPF interface, and it's really useful but quite focused on performance! I wish there was a resource of similar depth and breath about eBPF for classification/XDP.

[+] dilyevsky|5 years ago|reply

Big fan of your books and blog! Would you say usefulness of dpdk and xdp has diminished purely from performance standpoint? I hear rumors you can do line rate on modern linux kernels these days but haven’t had a chance to try it personally yet. I think at the time maglev was originally developed google was still on 2.6 kernels...

[+] kstenerud|5 years ago|reply

> These parts have a purpose, and starting from scratch may lead you to eventually re-invent them all as you encounter the problems they solve.

But is that really such a bad thing? Right now your choices are take it or leave it, or change operating systems. What if we could instead choose libraries suited to purpose, written by experts, like we do with encryption or compression or allocation or any number of common tasks?

You could then have a default implementation suitable for 80% of use cases, and specialized implementations to handle more complex requirements. You could even have per-NIC exclusive device access for a single application!

[+] bogomipz|5 years ago|reply

I didn't recognize GSQ and Google wasn't much help unfortunately. Could you say a few words about that and also maybe what "pacing" is? Thanks.

[+] ksec|5 years ago|reply

>The Linux network stack has been rapidly evolving

Are FreeBSD no longer getting any of those? It seems even Network Stack talks from Netflix are now Linux Focused.

It also makes me wonder..... does any of these matter once BPF takes over, quite literally.

[+] amelius|5 years ago|reply

> These parts have a purpose, and starting from scratch may lead you to eventually re-invent them all as you encounter the problems they solve.

But if you invent them in a different order, you might just end up with a better system ...

[+] notacoward|5 years ago|reply

There's no simple slam-dunk on either side of this issue, but I was a bit surprised that some of the more common arguments that I've heard in favor of kernel networking didn't get mentioned.

The #1 reason I've heard is features. When you use the kernel stack you have a dizzying array of protocols, congestion-control algorithms, and other features available to you. You have all sorts filtering and rerouting and rate-control features, from old iptables to netfilter and eBPF. There's all sorts of familiar monitoring. User-space stacks never duplicate all of this, nor should they. They should support their own use cases, no more, but that means there will still be a lot of people who are better off with the kernel stack.

The #2 reason I've heard is that the kernel stack is known to work in a broad variety of environments. User-space stacks are often used in constrained environments where the device only has to communicate in certain ways with certain peers across a certain network topology. The kernel stack is known to work in a vast array of combinations simultaneously with less need for local debugging. There are also security aspects here. For example, a good TCP/IP stack will have features to mitigate DoS attacks both inbound and outbound, whereas some user-space stacks do things like "forget to implement" proper congestion control so they can DoS others. Whatever vulnerabilities do exist in the kernel stack tend to become known and fixed rather quickly. In your user-space stack? Good luck.

I'm not actually arguing that more people should use the kernel stack. There are plenty of good reasons to go the other route when appropriate. I'm just trying to present arguments I've seen and which should be considered when making that choice. The OP seems to look at things from a bit of a "single machine, single purpose" POV, which naturally skews the outcome toward user-space stacks, but in an era of virtualization and containers we need to consider multi-purpose machines as well.

[+] ohazi|5 years ago|reply

These are the same reasons you should opt for an SoC that can run Linux over a "wifi microcontroller" if your embedded project needs networking and has the cost and power budget to support it. The Linux community has spent the last 25 years finding and fixing bugs that you absolutely will encounter if you use the Microchip or Espressif or SiLabs networking stack.

[+] regularfry|5 years ago|reply

Wouldn't both of these be neutralised somewhat if you took a rump-kernel approach of some sort? Use a real kernel network stack, just run it in user-space with whatever parts you don't need ripped out?

[+] Melkman|5 years ago|reply

I think an other reason to use the Linux stack is that it's performance is good enough for most every task. Writing your own stack will cost you a lot of time and is only advantageous if there is a big improvement in performance and that improvement is really needed. A bit like a software libraries and frameworks. If you write your application without frameworks you can probably speed up your application quite a bit. But it will cost you a lot of extra development time.

[+] dan-robertson|5 years ago|reply

I find #2 quite compelling.

I also find the monitoring differences are important. If you have a userspace stack then things like lsof or packet capturing will work differently. If your stack is polling rather than getting interrupts you likely won’t even be able to look at cpu usage or load averages (each polling loop will run at 100% cpu looking for new network activity)

[+] jvns|5 years ago|reply

Marek Majkowski wrote a response to this: "Why we use the Linux kernel's TCP stack" https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp...

[+] rrss|5 years ago|reply

I wonder if cloudflare is using mellanox libvma these days. The article notes they were using ef_vi on solarflare NICs at the time, but it looks like they switched to mellanox NICs in their 9th and 10th generation servers.

https://github.com/Mellanox/libvma

https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-ser...

[+] tekknolagi|5 years ago|reply

This is mentioned in the article.

[+] jeffbee|5 years ago|reply

The kernel TCP stack made a lot of sense when processes were almost as large as machines. When you had 1 CPU and 32MB of memory to run a web server then why wouldn't you use the kernel stack? But these days the host is as large as what we used to consider entire networks. You can easily have 256 CPU cores in a box now, and that box will be running hundreds of different processes at a time. Not all of those processes will be served equally well by whatever constants are defined in the kernel, and by whatever system-wide tuning parameters are available. And yet, there's still the problem that the box has one or a few hardware network interfaces, and something has to manage access to those resources. The architecture described at [1] is an interesting solution to this problem which avoids kernel networking, mediates NIC access, and provides custom stacks and parameters to each application.

1: Snap: a microkernel approach to host networking https://blog.acolyer.org/2019/11/11/snap-networking/

[+] cesaref|5 years ago|reply

As there was mention of SolarFlare cards, i'd just like to say that I had much success in a previous life with their products for reducing latency in trading systems - the products are really robust, and OpenOnload works really well.

At the most basic level you can run an existing binary under Onload (which basically uses an LD_LIBRARY_PATH to direct the socket system calls to their own stack) and get a significant performance improvement, especially if you are careful about how the card is configured, and to ensure your app is running on cores on the same socket that the network card is connected to.

Taking it further, you can use a zero copy API to get inbound packets whilst they are still in the network card buffer, and do very cool things with both hardware timestamping and pre-caching of packets on the network card memory (sort of like a message template) which can reduce the latency to sending responses if you have a set of stock responses you want to supply.

Anyhow, do check them out if you are into that sort of thing.

[+] hawk_|5 years ago|reply

i am into that kind of thing :-) if i wanted to try this kind of stuff at home, i.e. buying any necessary hardware myself, how much do you reckon the whole setup would cost? any suggestions on used NICs/switches to try this with?

[+] secondcoming|5 years ago|reply

I assume it's LD_PRELOAD instead of LD_LIBRARY_PATH ?

[+] IgorPartola|5 years ago|reply

A bunch of years ago I ended up writing a userpsace C program to essentially split incoming UDP packets to multiple destinations. I could have probably done that with iptables or some such but partially this was a fun exercise in writing a small C program that did big things. The company I worked for relied on these UDP packets as they were a core part of our business and as we migrated from our platform version 1 to the newly developed version 2 we needed both systems to receive the same stream of data. My little program I called wedge did that really well, down to spoofing the sender’s IP address. This allowed us to make a seamless migration of a huge number of customers without any downtime.

I say all this because I highly recommend playing with Linux’s raw packet interface and learning the structure of IP packets to do so. You can really gleam a ton of interesting info about it that way. Writing your own TCP stack was outside of what I needed but even the comparatively much simpler task of routing UDP packets was really fun and rewarding.

[+] vlovich123|5 years ago|reply

When you saw raw packets due you mean the libc network APIs or raw sockets?

It sounded like the latter but for anyone unaware such sockets require root/special privileges. It’s cool but of more limited general value and can be a security exploit as a user space exploit can launch really nasty network attacks that are otherwise impossible (hence why it’s not made available to user space).

[+] the8472|5 years ago|reply

> The TCP standard is evolving, and if you have to always use your kernel’s TCP stack, that means you can NEVER EVOLVE.

... if you don't update your kernel.

[+] throwaway894345|5 years ago|reply

That quote was in context of Android vendors who don’t update their kernel. So you’re making the same point as the article.

[+] chmod775|5 years ago|reply

Since the HN crowd will be mostly running HTTP(S) servers, the answer is that you probably should just be using the kernel's TCP stack, because however slow it is, it'll only use a tiny fraction of the CPU time your HTTP(S) server will be using, especially if you're doing SSL termination on the same machine.

Rolling a custom TCP stack is only interesting when the amount of work per TCP connection is minimal and you're in the territory of handling half a million or more concurrent connections. Think of caching servers etc.

However if you had that many concurrent connections to some website you'd already be comfortably sitting in the Alexa top 1k of internet services.

[+] ncmncm|5 years ago|reply

I do kernel bypass in production.

In my case, I am getting UDP (multicast) packets at a sometimes ferocious rate, that need various kinds of processing done, and all captured, with nanosecond timestamps, to disk.

The key is to get the kernel not involved at all. You use NICs from Solarflare (now a Xilinx company, which they used to be a customer of); or Exablaze (now owned by Cisco); Netronome; used to be, Mellanox (now owned by NVidia); or even Napatech (expensive). The NIC driver sets up a ring buffer in DMA memory and just starts dumping packets into it in real time. Each packet gets a little bit of metadata: a nanosecond timestamp, byte count, checksum. The NIC might be filtering by IP address or port, to distribute incoming packets to different ring buffers. The ring buffer is typically a few megabytes, enough for packets to be there for a few ms before they get overwritten.

Your program has to watch this ring buffer for updates, and do whatever it needs to do before the packet gets overwritten. I memcpy them to a big-ass ring buffer, say 8GB of hugepages, and then other unprivileged processes can pick over them for interesting bits, with more leeway for stalls, and can be started and stopped independently.

The process watching the NIC has to be protected against interruptions from the kernel, which involves a mess of kernel boot options -- nohz_full, isolcpus, rcu_nocbs -- because kernels are very jealous of their privilege to stall any process and steal its core for their own purposes. The program needs to do no system calls after startup, and not write to any mmapped memory backed by actual disks (/dev/shm and /dev/hugepages are ok), or the kernel will stall it anyway, boot flags notwithstanding.

Typically each NIC maker has its own kernel-bypass driver and library, often open-source, that understands its ring buffer. Usually they provide a .so your program can LD_PRELOAD to divert regular socket calls into their library, that you will ignore unless you want to, e.g., send out TCP traffic.

NICs have unique features. Intel and others implement a more-or-less portable DPDK interface to their library. Solarflare provides an Onload implementation, and pretty smart hardware filters. ExaNIC has less-smart filters, but delivers packets 120 bytes at a time, so you can start work on a packet before it has all arrived. A Netronome NIC can run eBPF code on a core in the NIC, against packets not even copied to host memory yet. Napatech lets you mess with the filters from the command line, while it's running, and can send packets on a nanosecond-resolution schedule.

Most let you queue up packets and trigger sending certain ones on a dime. You could have a dozen packets with different possible choices, and send only the one you later determine is right.

Lately there is a kernel service, AF_XDP, that is supposed to be a portable, kernel-maintainer approved way to do some of what the proprietary libraries do. I haven't tried it.

Getting reliable nanosecond-resolution timestamps is tricky. Nowadays everything is referenced to atomic clocks on GPS satellites. So, you need a GPS receiver, and a way for the NIC to know what it says. ExaNICs have a receiver on board. Often there is a connector for "PPS" input, expecting a clock rising or falling edge at a known offset from the second boundary. A protocol, PTP, provides ~microsecond resolution, but burns one of your 10Gbps ports. Some switches will process PTP, PPS, or GPS, and tag packets with various non-standard annotations.

If you need to do trickier things, several NICs have FPGAs you can program yourself.

[+] commandlinefan|5 years ago|reply

Linux’s TCP stack (and any other TCP stack you’re going to run across) is general-purpose: it’s designed to handle nearly anything you might want to throw at it. All that genericity comes at a price. If you want something to be hyper-efficient, it has to be specialized to the case you’re putting it to, which usually means you have to write most of it yourself. It’s really sad how many developers actively reject writing software just because somebody else wrote something else sort of similar once.

[+] aeontech|5 years ago|reply

Previously discussed on HN here with many more interesting comments

https://news.ycombinator.com/item?id=12021195

[+] nabla9|5 years ago|reply

Really good answers there.

Linux Kernel TCP stack is not just a faithful implementation of the spec. It's also all that hard learned experience about Undocumented Real Internet that is constantly evolving and what must be dealt with.

A bug in TCP stack can cause lots of problems for users. Taking responsibility of that is also huge task.

[+] kureikain|5 years ago|reply

If anyone wants to see how to write a simple userspace tcp stack, here is a good one

https://www.saminiir.com/lets-code-tcp-ip-stack-1-ethernet-a...

[+] dmurray|5 years ago|reply

Previous discussion (linked by the author) https://news.ycombinator.com/item?id=12021195

[+] ignoramous|5 years ago|reply

Related discussion: https://news.ycombinator.com/item?id=19418997

[+] oscargrouch|5 years ago|reply

In my opinion OS's should only give you mapped buffers of memory, where it defines the source.. a network driver or a file in the disk, a gpu memory, or simply the heap and the userspace applications would deal with.

OS's are doing too much in my opinion and the problem is that the status-quo of applications expect them to do it, so theres no clear way out of this.

If im not mistaken, theres a paper with the 'exokernel' design from the late nineties, built from a NetBSD, that describes a little bit about this approach, if you are curious.

Our OS's should be like that, and i bet they would be so much better giving they would be able to concentrate in being good in a much smaller spectrum, while applications would probably negotiate less with the kernel, giving us a better performance and stability overall.

TPC and sockets would be a library, and once we need to prototype new shiny stuff, it would be much better to get the needed adoption by just linking to the appropriate library.

Projects like QUIC and gRPC would be much more viral, and i bet they could even appear earlier if some part of the stack were not frozen right on the kernel.

I guess, this will become more evident with time, as a lot of those technologies on the kernel will become deprecated, while in the heat of modernism it look a great idea to stuck that thing that "everybody uses" on kernel space.

[+] secondcoming|5 years ago|reply

I looked at DPDK for our GCP instances. It seems a bit fussy and, from my understanding of it, requires a second VPC for the second network interface.

Also, one of the user-space stacks seems to be available via binary blob only.

I'm hoping io_uring will be good enough for higher QPS.

One concern with custom HTTP handling is that HTTP/2 and HTTP/3 seem to be non-trivial to hand-roll. HTTP/1 is a pretty simple request-reply design.

[+] deathanatos|5 years ago|reply

> HTTP/2 and HTTP/3 seem to be non-trivial to hand-roll. HTTP/1 is a pretty simple request-reply design.

HTTP/1 has enough edge-cases to balance that scale back out:

* chunked transfer encoding; there can be extensions with chunks

* trailers (headers that appear after the body)

* Expect & 100-continue can mean >1 "response" for a single request; also, you can't process this header for a HTTP/1.0 client, as it didn't exist then. (And you can't send it inadvertently to a /1.0 server, as you'll never get a 100-continue back!)

* HEAD is an absolute anomaly in terms of handling the response-body; you have to know you sent the HEAD, as the response will likely contain something like "Content-Length: 100" but there won't be 100 bytes of body to follow.

* If you handle obs-folded headers, obs-folded headers. Though, they're now an optional part of the specification, as they're obsolete. (But in that case, one still has to correctly 400 the request.)

Some of these exist, of course, in some manner in HTTP/2, but they're often more straight-forward to parse due to the binary nature of the protocol.

[+] tsimionescu|5 years ago|reply

HTTP is normally handled entirely in user space, both on the client and the server side, so I don't think it has much to add to this discussion.

[+] aloknnikhil|5 years ago|reply

> It seems a bit fussy and, from my understanding of it, requires a second VPC for the second network interface

Curious, why would it need a second VPC? DPDK, albeit on AWS, allows me to attach a second NIC to the instance on the same VPC.

[+] darksaints|5 years ago|reply

What tasks of a TCP stack actually need to coordinate between processes, or in other words, actually need a kernel?

From what I understand, complete kernel bypass networking is only really feasible if you only have a single process that is using the network. And with a monolithic kernel, the TCP stack needs the kernel for port allocation and load balancing the demands on the network interface.

Okay, that makes sense, but why can't most of the processing be offloaded to userspace, with the kernel just doing those specific tasks? Port allocation should happen very rarely, and throttling should only ever happen when the network interface is saturated. Wouldn't it be better to limit the kernel to just those tasks, and offload the rest to userspace?

[+] the_only_law|5 years ago|reply

Could it perhaps be even faster just to build the entire stack in hardware perhaps with an FPGA accelerator or even a custom card? I know TOE[1] is a thing, but from what I read Linux doesn't really support it.

[1]https://en.m.wikipedia.org/wiki/TCP_offload_engine

[+] notacoward|5 years ago|reply

This has been done many, many times. It works extremely well for some applications, but also often becomes an operational nightmare. Let's say that you do manage to create a hardware implementation that's fully equivalent to the kernel one at some point in time, even in terms of things like configurability and monitoring. Something that has never happened before AFAIK. Then technology changes, your needs change, or there's a security issue. Now you have to go through a special update process - assuming an update is even available - to keep up. This might not be a problem for a small deployment, but in a small deployment it probably wasn't worth going the specialized route anyway.

In a large deployment, anything that doesn't fit into the common update/remediation workflow is going to require special accommodation in code. Is the engineering cost plus the hardware cost worth it to the customer? Sometimes still yes, but more often no. There are many examples of companies who found out the hard way that the market for this kind of thing isn't big enough to recoup their own development and other costs.

P.S. This is very similar to the arguments for/against hardware RAID controllers. For whatever reasons, rightly or wrongly, those are also steadily losing popularity. Software really is eating the world.

P.P.S. In some cases, e.g. Amazon, the "smart NIC" approach is the common workflow, so the color of this argument changes. OTOH, it's also worth noting that the kinds of network filtering/virtualization/whatever that Amazon does is very specific to them and has nothing to do with any standard. They dedicate staff to support it. Bespoke ASIC/FPGA approaches aren't the same as a market in which you can sell them.

[+] tsimionescu|5 years ago|reply

I think implementing a general-purpose TCP/IP stack entirely in an FPGA is not likely to be very good return on investment, especially if you'll want stuff like filtering, (selective) packet capture etc.

However, specialized stacks for certain purposes are a relatively common thing. For example, most L23 network testing solutions use FPGAs to generate and receive massive amounts of (almost) stateless traffic (think line rate for 400GE using just 2 machines), while performing certain kinds of analysis on it (latency, loss). But these are usually just Ethernet frames or maybe IP datagrams with random payloads.

[+] luizfelberti|5 years ago|reply

I think this means that the Linux kernel only doesn't support the proprietary vendor-specific ASIC offloading mechanisms.

Linux already supports offloading even arbitrary eBPF/XDP programs to compatible NICs, which is already extensively used for DDoS mitigation (Cloudflare) or even Load Balancing (Facebook).

I'd be surprised if it didn't leverage these mechanisms to offload the kernel's own network stack as well, but haven't really checked to confirm...

[+] maven29|5 years ago|reply

https://www.theregister.com/2020/09/25/smartnic_dpu/

I was under the impression that these "SmartNICs" were already commonplace in prominent public clouds. Is this a completely unrelated use-case?

[+] dijit|5 years ago|reply

That is the case, but you need some DMA in the OS, so it’s up to the operating system driver to expose some portion of memory to data coming in with these means.

The difference between non-DMA and DMA is staggering.

[+] yencabulator|5 years ago|reply

HTTP/3 will completely change that landscape. It's essentially a TCP-stack-equivalent inside each process.

Lots of userspace TCP-stack-equivalents will be running soon, though they'll probably consists of just various versions of a couple of codebases -- your average webserver won't implement it's own low-level stack.

[+] dnautics|5 years ago|reply

I remember when the Beos moved networking from userland to the kernel because userland was too slow. Are we coming full circle?

[+] shekharshan|5 years ago|reply

To implement file system in userspace or TCP stack in userspace how does the process invoke the device driver? I am sure the Kernel steps in at some point. What system calls would you invoke from userspace TCP stack to access the NIC?

121 comments