A high-speed network driver written in C, Rust, Go, C#, Java

444 points | Sindisil | 6 years ago | github.com

163 comments

[+] kyrra|6 years ago|reply
There is discussion about C# and Java being faster than Go, but one interesting thing to note is that both C# and Java have to use C to interface with the kernel.

Java: https://github.com/ixy-languages/ixy.java/blob/master/ixy/sr...

C#: https://github.com/ixy-languages/ixy.cs/blob/master/src/ixy_...

Java needs a bit more C to make it work. C# only seems to need it for DMA access. But when you look at the Go code, they got away with being pure Go, using the syscall and unsafe packages. So that's at least one plus for Go.

(The main readme calls this out, but it's at least one thing worth mentioning here too.)

As a Java coder for my day-job, I do like the breakdown they have of the performance of the different GCs for their Java implementation. https://github.com/ixy-languages/ixy-languages/blob/master/J...
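For a sense of what that pure-Go approach looks like: the driver mmaps the NIC's PCIe resource file and reads registers through raw pointers, no cgo involved. A minimal sketch, using an ordinary temp file as a stand-in for the real /sys/bus/pci/devices/.../resource0 (the path, register offset, and value here are all illustrative, not taken from the ixy code):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// readReg32 reads a 32-bit little-endian device register at byte offset
// off from a memory-mapped region, the way a pure-Go userspace driver
// would after mmapping the PCIe BAR resource file.
func readReg32(mem []byte, off int) uint32 {
	return *(*uint32)(unsafe.Pointer(&mem[off]))
}

func main() {
	// Stand-in for the PCIe BAR: an ordinary 4 KiB temp file.
	f, err := os.CreateTemp("", "bar0")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	f.Write([]byte{0x78, 0x56, 0x34, 0x12}) // 0x12345678, little-endian
	f.Truncate(4096)

	// No C in sight: syscall.Mmap maps the file directly.
	mem, err := syscall.Mmap(int(f.Fd()), 0, 4096,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(mem)

	fmt.Printf("0x%08x\n", readReg32(mem, 0)) // 0x12345678 on little-endian hardware
}
```

On real hardware the same mmap call would presumably target the NIC's sysfs resource file, which is how a driver can stay in pure Go.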

[+] pjmlp|6 years ago|reply
You can perfectly do the same in C#, just like they did with Go.

We do it all the time in low-level Windows coding within .NET. Why they didn't beats me; most likely they weren't knowledgeable enough about .NET's capabilities.

As for Java, hopefully with Projects Valhalla, Panama, and Metropolis, Java will finally get the performance-oriented language features that should have been part of Java 1.0.

[+] emmericp|6 years ago|reply
Neither C# nor Java has C in the hot path; C# uses unsafe mode, Java uses sun.misc.Unsafe. The JNI/C# native calls are either in the initialization or in an alternate implementation used for comparison.
[+] dep_b|6 years ago|reply
I can't really believe the Swift implementation needs to be that slow. Objective-C used to be 100% C compatible, and Swift has more or less complete bridging to C because of the need to use those APIs.

Objective-C was often called slow because iterating an NSArray was much slower than doing it in C. Well, if you needed to do it fast in Objective-C, you wouldn't do it using the user-friendly and safe (for 1984) higher-level objects.

I think only Rust really allows you to write really safe and still really fast code though.

[+] oaiey|6 years ago|reply
The TechEmpower benchmarks cover this in a more professional way. When coded right, the frameworks are typically on par in the plaintext tests (where only processing matters). In the end, these drivers were mostly thesis projects.
[+] kartickv|6 years ago|reply
It would be good to have them measure aspects other than performance: how long it took to build each, whether there was a learning curve because the language was unfamiliar, how secure the resulting code is, etc.
[+] tylerl|6 years ago|reply
If you can't see or interpret the graphs (mobile browser, etc.) here's a quick description of the relative performance in terms that might be useful even without the graphs.

Bidirectional forwarding, packets per second: Here, the batch size matters; small batches have a lower packet rate across the board. Each language's throughput increases with batch size up to some point, and then the chart goes flat. Python is by far the slowest, not even diverging from the zero line. C is consistently the fastest, but flattens out at the 16-packet batch size at 27 Mpps. Rust is consistently about 10% slower than C until C flattens out; Rust then catches up at the 32-packet batch size, and both stay flat at 27 Mpps. Go is ever so slightly faster than C# until the 16-packet batch size, where they cross (at 19 Mpps); after that, C# is consistently about 2 Mpps faster than Go. At the 256-packet batch size, C# reaches 27 Mpps and Go 25 Mpps. Java is faster than C# and Go at very low batch sizes, but at 4 packets per batch Java slows down (10 Mpps) and quickly reaches its peak of 11 to 12 Mpps. OCaml and Haskell follow a similar curve, with Haskell consistently about 15% slower than Java and OCaml somewhere between the two. Finally, Swift and JavaScript are indistinguishable from each other, both about half the speed of Haskell across the board.

Latency at the 90, 99, 99.9, 99.99, etc. percentiles, at 1 Mpps: All have near-zero latency at the 90th percentile, then JavaScript latency quickly jumps to 150us, and jumps again at the 99.99th percentile to 300us. C# is the next to increase: starting at the 99th percentile there's a steady climb until it hits 40us at the 99.99th percentile, then a steady increase to about 60us. Haskell keeps it at about 10us until the 99.99th percentile, then steadily increases to about 60us, with a sudden spike at the end to 250us. Java latency remains low until the 99.95th percentile, then quickly spikes, reaching a max of 325us. Next, OCaml spikes at around the 99.99th percentile, reaching a max of about 170us. Then comes Swift, with a maximum of about 70us. Finally, C, Rust, and Go have the lowest latency. Rust and C are indistinguishable; Go's latency diverges to about 20% higher than the other two at the 99.999th percentile mark, where it wavers, eventually hitting around 25us while C and Rust hit about 22us.

[+] userbinator|6 years ago|reply
Cross-language comparisons are always interesting to look at; if I had the time, I'd really like to write one in Asm and see how it compares.

I've written NIC drivers for some older chipsets, and IMHO it's not something that's particularly "algorithmic" in computation or could necessarily show off/exercise a programming language well; what's really measured here is probably an approximation to how fast these languages can copy memory, because that's ultimately what a NIC driver mostly does (besides waiting.) To send, you put the data in a buffer and tell the NIC to send it. To receive, the NIC tells you when it has received something, and you copy the data out. Nonetheless, the astonishingly bad performance of the Python version is surprising.

Although I haven't looked at the source in any detail, I know that newer NICs do a lot more of the processing (e.g. checksums) that would've been done in the host software, so that would be another way in which the performance of the host software wouldn't be evident.

One other thing I'd like to see is a chart of the binary sizes too (with and without all the runtime dependencies).

[+] emmericp|6 years ago|reply
Real NIC drivers spend most of their time fiddling with bit fields. It's mostly about translating a hardware-agnostic version of a packet descriptor (mbuf, sk_buffs, ...) into a hardware-specific DMA descriptor.

If your driver copies memory you are doing something wrong.

[+] bsder|6 years ago|reply
> Nonetheless, the astonishingly bad performance of the Python version is surprising.

In the paper, they point out that the Python version is the only one they didn't bother to optimize.

However, my takeaway is that practically everybody can handle north of 1 Gbit/s (2 million packets per second x 64 bytes per packet), even on a 1.6GHz core. I find THAT quite a bit more astonishing, actually.

[+] ummonk|6 years ago|reply
Yeah, for real life applications as well, I shy away from Python for anything where performance might one day be an issue. Most languages can at least get within an order of magnitude of state of the art (at which point ergonomic considerations can matter more), but Python is just incredibly slow in practice.
[+] saurik|6 years ago|reply
That JavaScript and Swift have essentially the same performance here is extremely telling: there are essentially four performance regimes (five if you count Python, but clearly from the graphs you should not ;P), and what would really be interesting--and which this page isn't bothering to even examine?! :(--is what is causing each of these four regimes. I want to know what is so similar about C# and Go that is causing them to have about the same performance, and yet much more performance (at higher batch sizes) than the regime of Java/OCaml/Haskell (a group which can't be explained by their garbage collectors as one of the garbage collectors tested for Java was "don't collect garbage" and it had the same performance). It frankly makes me expect there to be some algorithmic difference between those two regimes that is causing the difference, and it has nothing to do with language/runtime/fundamental performance.
[+] antoinealb|6 years ago|reply
The author of this project presented it last year at CCC, here is the video: https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_...
[+] ksangeelee|6 years ago|reply
Thanks, that was interesting. If anyone is excited enough to try driving peripherals in userspace via hardware registers, I can recommend starting with a Raspberry Pi, since it has several well documented peripherals (UART, SPI, I2C, DMA, and of course lots of GPIO), and the techniques described in this talk are transferable.

A search for 'raspberry pi mmap' will yield a lot of good starting points.

[+] kerng|6 years ago|reply
Cool to see C# up there, close to C and ahead of Golang.

I haven't used C# much over the last year due to a job change, but it always felt like one of the most mature languages out there. Now I'm working in Go, and it's a bit frustrating in comparison.

[+] tylerl|6 years ago|reply
Go isn't designed to feel mature, it's designed to be boring and effective. It's designed to keep code complexity low even as the complexity of problems and solutions increases. It's designed to allow large teams of medium-skill programmers to consistently produce safe and effective solutions. The most precise description I've heard to date is: "Go is a get shit done language."
[+] chrisaycock|6 years ago|reply
A specific finding from this research is on the front page:

https://news.ycombinator.com/item?id=20944403

Rust was found to be slightly slower than C because of bounds checking, which the compiler keeps even in production builds.

[+] mlindner|6 years ago|reply
Except their answer is wrong, because Rust (LLVM, rather) does eliminate bounds checks. They're comparing GCC vs. LLVM here more than they are comparing C vs. Rust; they should have compiled their C code with LLVM. Their implementation is littered with uses of "unsafe", which makes it almost impossible for the compiler to eliminate the bounds checks.
[+] chvid|6 years ago|reply
So why the difference in "language" speeds?

Some of the results don't quite follow conventional expectations. For example, the Swift implementation is as slow as JavaScript; JavaScript is a lot faster than Python; Java is considerably slower than the usually very similar C#.

The implementation is fairly complex, so it is a bit hard to see what is going on. But it must be possible to pin the big performance differences implied by the two graphs on something?

[+] ygra|6 years ago|reply
Python is interpreted bytecode: for every small bytecode instruction there's a round trip through the Python interpreter loop to execute it. This is faster than parsing and interpreting at the same time, as shells often do, but it's still a lot slower than JIT compilers.

Now, a just-in-time (JIT) compiler transforms the code into machine code at runtime, usually from bytecode. Java, C#, and JavaScript all predominantly use this model these days. It takes a bit of work during runtime, and you cannot afford the more complicated optimizations that a C or C++ compiler would do, but it comes close (and for certain reasons is sometimes even better). That's the main reason why JavaScript is faster than Python. There's a Python JIT compiler, PyPy, that might close the gap, though. And for Python in particular there are also other options to improve speed somewhat; one of them involves converting the Python code to C. Not too idiomatic, usually, though.

As for Java and C#, that's a point where it can sometimes show that C# was designed as a high-level language that can drop down to low levels if needed. C# has pointers and the ability to control the memory layout of your data if you need it. This turns off a lot of the niceties and safeties the language usually offers (you also need the unsafe keyword, which has that name for a reason), but it can improve speed. Newer versions of C# have increasingly added other features that let you safely write code that performs predictably fast. But even value types and reified generics go a long way toward making things faster by default than being required to always use classes and the heap.

Java, on the other hand, has few of those features where the developer is offered low-level control. It has one major advantage, though: its JIT compiler is a lot more advanced and can do some crazy transformations and optimizations. One might argue that Java needs that much magic because you don't have much control at the language level to make things fast. So, as far as performance between C# and Java goes, this is pretty much the tradeoff between a complicated language and a complicated JIT compiler.

Whether a given benchmark shows Java being faster than C# depends a bit on how the code was written, but recently .NET has gotten a lot better as well, and popular multi-language benchmarks often show C# faster than Java.

[+] csande17|6 years ago|reply
I'd imagine Python is so slow in this benchmark because it doesn't have any kind of optimizing compiler. All the other languages are either compiled ahead of time or just-in-time compiled into more efficient machine code.

I wonder how PyPy would do on this benchmark...

[+] jsiepkes|6 years ago|reply
I find the performance of Java rather suspicious. It starts out fast for the smallest batch sizes but then kind of falls flat for the rest.
[+] AlEinstein|6 years ago|reply
Surprisingly good performance for the C# implementation!
[+] fgonzag|6 years ago|reply
Unsurprising if you've kept up with .NET. The new primitive types (spans) allow direct low-level manipulation of memory slices. A NIC driver, at its core, really only copies data to and from shared buffers, so it gets a tremendous benefit from this new type.

C# recently getting new low-level memory types definitely gave it an edge here; it does not reflect real-world scenarios very accurately.

[+] jcranmer|6 years ago|reply
For me, that was the line that surprised me the most. The .NET VM has had a reputation as being a worse variant of the JVM, but it seems that now the tables have turned.
[+] BuckRogers|6 years ago|reply
As someone on Team C# (I bet my career on it after careful consideration and comparison with every other option I had on the table), I had the exact same thought. But it's Microsoft; in my opinion they know software better than anyone around. They cover a lot of ground and fail sometimes, but overall I consistently have high expectations of them. I use their platform every day. Good work, Microsoft!
[+] molyss|6 years ago|reply
That's a very interesting experiment on many levels. I haven't taken the time to look at the paper yet, but I'm curious how you got your numbers for pps vs. Gbit/s in the README:

"full bidirectional load at 20 Gbit/s with 64 byte packets (29.76 Mpps)" sounds like 20 Gbit/s should be closer to 40 Mpps than to 30 Mpps. Did you hit CPU limits on the packet generator, or am I missing some packet header overhead?

Did you try packets bigger than 64 bytes? I'm curious how the various runtimes would handle that.

And how long did you run the benchmarks ? I couldn't really figure it out from the github or the paper. Mostly wondering if java and other Gc'd language showed improvement or degradation over time. I could see the JITs kicking in, but I could also see the GCs causing latency spikes.

[+] benou|6 years ago|reply
> am I missing some packet header overhead ?

Yes: Ethernet adds 20 bytes: an 8-byte preamble/start-of-frame delimiter plus a 12-byte interframe gap

=> the "on-the-wire" size is actually 84 bytes

=> 20 Gbit/s / 84 bytes = 29.76 Mpps

> Did you try packets bigger than 64 bytes? I'm curious how various runtimes would handle that.

In typical forwarding, packet size does not impact forwarding performance much until you hit some bandwidth limit (PCIe, DDR, and/or L3 cache), because you only touch the packet header (typically the first 64-byte cache line of the packet). The data transfer itself is done by the NIC's DMA.

[+] yaantc|6 years ago|reply
You're only considering the MAC-level size of 64 bytes, but there is also the physical-layer overhead, which pushes the effective size of a packet to 84 bytes (see [1]): a 7-byte preamble, a 1-byte start-of-frame delimiter, and a 12-byte inter-packet gap. Divide 20 Gbit/s by 84 bytes and you get the 29.76 Mpps.

[1] https://en.wikipedia.org/wiki/Ethernet_frame

[+] azhenley|6 years ago|reply
The fact that Go is slower than C# really amazes me! Not long ago I switched from C# to Go on a project for performance reasons, but maybe I need to go back.
[+] zeeboo|6 years ago|reply
It's only slower at the highest batch sizes. I'd say their throughput here is comparable, except the Go version has much better latencies (don't be confused by the first graph like I was: the green line at the top is actually JavaScript).
[+] apta|6 years ago|reply
What made you come to the conclusion that golang was faster than C#? The hype and claims we see in blogs that are not backed up by anything?

Both C# and Java are faster than golang.

[+] non-entity|6 years ago|reply
Not sure if this was a web project, but I imagine when you add an entire framework and web server, you may see less performance than with a small binary, regardless of the respective language speed.
[+] non-entity|6 years ago|reply
Is there a compelling reason to write high level user mode drivers like this over traditional kernel drivers? I remember finding this repo a few years back and being fascinated.
[+] steveklabnik|6 years ago|reply
From the paper: https://www.net.in.tum.de/fileadmin/bibtex/publications/pape...

> Drivers are written in C or restricted subsets of C++ on all production-grade server, desktop, and mobile operating systems. They account for 66% of the code in Linux, but 39 out of 40 security bugs related to memory safety found in Linux in 2017 are located in drivers. These bugs could have been prevented by using high-level languages for drivers.

[+] Shorel|6 years ago|reply
Rust has definitely earned my respect.

Someone add D lang to this test! I want to know!

[+] mister_hn|6 years ago|reply
It's missing C++.
[+] pjmlp|6 years ago|reply
While C++ is way better than using C, it doesn't forbid "writing C with a C++ compiler", which renders useless all the safety features it offers if one isn't allowed to tame the team via static analysis tooling.
[+] yc12340|6 years ago|reply
I am calling into question the validity of this project as a benchmark.

The author asserts that "it's virtually impossible to write allocation-free idiomatic Java code, so we still allocate... 20 bytes on average per forwarded packet". This sounds questionable: does that mean he actually performs a JVM memory allocation for _every_ packet?! Furthermore, the specifics of memory management look murky. One implementation uses "volatile" C-style writes [1] (simply storing data to memory). Another implementation of the same thing uses a full CPU memory barrier [2]. Which one is right?

In my opinion, significant inconsistencies between implementations render any comparison between them invalid. And when a whole cross-language test suite is written by one person, you can be sure they don't really excel in many of those languages.

This is why I like the Benchmarks Game: all benchmarks are submitted by users, so they are a lot closer to how decent real-world programmers would solve the problem. Still not perfect, but at least it counts as an attempt.

1: https://github.com/ixy-languages/ixy.java/blob/fcad50339e537...

2: https://github.com/ixy-languages/ixy.java/blob/fcad50339e537...

[+] emmericp|6 years ago|reply
Java reaches 52% of C's speed in the Benchmarks Game ("fastest measurement at the largest workload" data set, geometric mean); we reach 38%. Seems like our implementation is within a reasonable range for something that's usually not done in Java.

A full memory barrier is not required, but some languages only offer that; Go, for example, had the same problem. It's not a bottleneck because the write goes to MMIO PCIe space, which is super slow anyway (it waits for a whole PCIe round trip).

And no, it obviously wasn't written by only one person but a team of 10.

No, we are not saying that we allocate for every packet. We say that we allocate 20 bytes on average per packet.
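The barrier-vs-volatile point can be sketched in Go (plain slice as a stand-in for the mmapped BAR; the offset and value are illustrative). Go has no volatile, so sync/atomic is the simplest tool that guarantees the store happens and stays ordered, even though it is stronger than the volatile write a C driver would use; as noted above, the extra cost vanishes next to the PCIe round trip.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// writeReg32 stores a value into a mapped device register. The atomic
// store implies full ordering on most targets: stronger than C's
// volatile, but it can't be reordered or optimized away.
func writeReg32(mem []byte, off int, val uint32) {
	atomic.StoreUint32((*uint32)(unsafe.Pointer(&mem[off])), val)
}

func readReg32(mem []byte, off int) uint32 {
	return atomic.LoadUint32((*uint32)(unsafe.Pointer(&mem[off])))
}

func main() {
	mem := make([]byte, 4096) // stand-in for an mmapped PCIe BAR
	writeReg32(mem, 0x10, 0x4000000)
	fmt.Printf("%#x\n", readReg32(mem, 0x10)) // prints 0x4000000
}
```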

[+] masklinn|6 years ago|reply
> The author asserts, that "it's virtually impossible to write allocation-free idiomatic Java code, so we still allocate... 20 bytes on average per forwarded packet". This sounds questionable

The code is public; I'm sure they'd be happy to have your insight and a fix for this issue. It doesn't seem like they were happy about it either.