item 31592934

How fast are Linux pipes anyway?

698 points| rostayob | 3 years ago |mazzo.li | reply

200 comments

[+] BeeOnRope|3 years ago|reply
This is a well-written article with excellent explanations and I thoroughly enjoyed it.

However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel there is no reliable general purpose way to know when the pages are safe to reuse again.

This post (and the earlier FizzBuzz variant) tries to get around this by assuming the pages are available again after "pipe size" bytes have been written after the gift, _but this is not true in general_. For example, the read side may also use splice-like calls to move the pages to another pipe or IO queue in a zero-copy way, so the lifetime of the page can extend beyond the original pipe.

This will show up as race conditions and spontaneously changing data, where a downstream consumer sees the page suddenly change as it is overwritten by the original process.

The author of these splice methods, Jens Axboe, had proposed a mechanism which enabled you to determine when it was safe to reuse the page, but as far as I know nothing was ever merged. So the scenarios where you can use this are limited to those where you control both ends of the pipe and can be sure of the exact page lifetime.

---

[1] Specifically, using SPLICE_F_GIFT.

[+] rostayob|3 years ago|reply
(I am the author of the post)

I haven't digested this comment fully yet, but just to be clear, I am _not_ using SPLICE_F_GIFT (and I don't think the fizzbuzz program is either). However I think what you're saying makes sense in general, SPLICE_F_GIFT or not.

Are you sure this unsafety depends on SPLICE_F_GIFT?

Also, do you have a reference to the discussions regarding this (presumably on LKML)?

[+] robocat|3 years ago|reply
> However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel there is no reliable general purpose way to know when the pages are safe to reuse again. [snip] This will show up as race conditions and spontaneously changing data where a downstream consumer sees the page suddenly change as it it overwritten by the original process.

That sounds like a security issue: the ability of an upstream generator process to write into the memory of a downstream reader process (or, even worse, vice versa). I presume that the Linux kernel only lets this happen (zero copy) when the two processes are running as the same user?

[+] haberman|3 years ago|reply
What if the writer frees the memory entirely? Can you segv the reader? That would be quite a dangerous pattern.
[+] nice2meetu|3 years ago|reply
I once had to change my mental model for how fast some of these things were. I was using `seq` as an input for something else, and my thinking was along the lines that it is a small generator program running hot in the CPU and would be super quick. Specifically because it would only be writing things out to memory for the next program to consume, not reading anything in.

But that was way off: `seq` turned out to be ridiculously slow. I dug in a little and made a faster version of `seq`, which kind of got me what I wanted. But then I noticed at the end that the point was moot, because just piping the output to the next program was going to be the slow point anyway.

https://github.com/tverniquet/hseq
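The batching idea behind a faster `seq` can be sketched in a few lines. This is a hypothetical `fast_seq`, not the linked hseq code (which is Haskell); the point is that formatting thousands of numbers per write beats one `write(2)` per line:

```python
import io

def fast_seq(n, out, batch=4096):
    """Write 1..n, one number per line, to the binary stream `out`.

    Numbers are formatted in large batches so each underlying write
    carries many lines instead of one, which is where a naive
    per-line loop loses most of its time."""
    pending = io.BytesIO()
    for start in range(1, n + 1, batch):
        end = min(start + batch, n + 1)
        pending.write(("\n".join(map(str, range(start, end))) + "\n").encode())
        if pending.tell() >= 1 << 16:   # hand off roughly every 64 KiB
            out.write(pending.getvalue())
            pending = io.BytesIO()
    out.write(pending.getvalue())
```

From a script you would call it as `fast_seq(10_000_000, sys.stdout.buffer)`; even so, as the parent notes, the pipe itself can end up being the bottleneck.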

[+] freedomben|3 years ago|reply
I had a somewhat similar discovery once using GNU parallel. I was trying to generate as much web traffic as possible from a single machine to load test a service I was building, and I assumed that the network I/O would be the bottleneck by a long shot, not the overhead of spawning many processes. I was disappointed by the amount of traffic generated, so I rewrote it in Ruby using the parallel gem with threads (instead of processes), and got orders of magnitude more performance.
[+] spacedcowboy|3 years ago|reply
Ran the basic initial implementation on my Mac Studio and was pleasantly surprised to see

  @elysium pipetest % pipetest | pv > /dev/null
   102GiB 0:00:13 [8.00GiB/s] 

  @elysium ~ % pv < /dev/zero > /dev/null
   143GiB 0:00:04 [36.4GiB/s] 
Not a valid comparison between the two machines because I don't know what the original machine is, but macOS rarely comes out shining in this sort of comparison, and the simplistic approach here giving 8 GiB/s rather than the author's 3.5 GB/s was better than I'd expected, even given the machine I'm using.
[+] mhh__|3 years ago|reply
Given the machine as in a brand new Mac?
[+] herodoturtle|3 years ago|reply
This was a long but highly insightful read!

(And as an aside, the combination of that font with the hand-drawn diagrams is really cool)

[+] zabumafew|3 years ago|reply
Would definitely be curious to know the font name
[+] lazide|3 years ago|reply
The majority of this overhead (and the slow transfers) naively seem to be in the scripts/systems using the pipes.

I was worried when I saw that zfs send/receive used pipes, for instance, because of performance concerns - but using it in reality I had no problems pushing 800MB/s+. It seemed limited by IOPS on my local disk arrays, not by any limits in pipe performance.

[+] mg|3 years ago|reply
For some reason, this raised my curiosity how fast different languages write individual characters to a pipe:

PHP comes in at about 900KiB/s:

    php -r 'while (1) echo 1;' | pv > /dev/null
Python is about 50% faster at about 1.5MiB/s:

    python3 -c 'while (1): print (1, end="")' | pv > /dev/null
Javascript is slowest at around 200KiB/s:

    node -e 'while (1) process.stdout.write("1");' | pv > /dev/null
What's also interesting is that node crashes after about a minute:

    FATAL ERROR: Ineffective mark-compacts
    near heap limit Allocation failed -
    JavaScript heap out of memory
All results from within a Debian 10 docker container with the default repo versions of PHP, Python and Node.

Update:

Checking with strace shows that Python caches the output:

    strace python3 -c 'while (1): print (1, end="")' | pv > /dev/null
Outputs a series of:

    write(1, "11111111111111111111111111111111"..., 8193) = 8193
PHP and JS do not.

So the Python equivalent would be:

    python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
Which makes it comparable to the speed of JS.

Interesting that PHP is over 4x faster than Python and JS.
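The buffering behavior strace reveals can also be seen without a shell pipeline. A small sketch using `os.pipe` directly, wrapping the write end in an 8 KiB `BufferedWriter` the way CPython wraps a non-tty stdout:

```python
import io
import os

# Create a pipe; make the read end non-blocking so reads don't hang.
r, w = os.pipe()
os.set_blocking(r, False)

# Wrap the write end like CPython wraps stdout when it's a pipe.
out = io.BufferedWriter(io.FileIO(w, "w"), buffer_size=8192)

out.write(b"1")                  # sits in the user-space buffer
try:
    data = os.read(r, 64)
except BlockingIOError:
    data = b""                   # nothing has reached the pipe yet
print("before flush:", data)

out.flush()                      # one write(2) syscall for the whole buffer
print("after flush:", os.read(r, 64))
```

This is why `flush=True` slows `print` down so much: every character becomes its own syscall instead of one syscall per ~8 KiB.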

[+] capableweb|3 years ago|reply
> Javascript is slowest at around 200KiB/s:

I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s. Python gets 4.35MiB/s.

> What's also interesting is that node crashes after about a minute

I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.

The following code shouldn't crash, give it a try:

    node -e 'function write() {process.stdout.write("1"); process.nextTick(write)} write()' | pv > /dev/null
It's slower for me though, giving me 1.18MiB/s.

More examples with Babashka and Clojure:

    bb -e "(while true (print \"1\"))" | pv > /dev/null
513KiB/s

    clj -e "(while true (print \"1\"))" | pv > /dev/null
3.02MiB/s

    clj -e "(require '[clojure.java.io :refer [copy]]) (while true (copy \"1\" *out*))" | pv > /dev/null
3.53MiB/s

    clj -e "(while true (.println System/out \"1\"))" | pv > /dev/null
5.06MiB/s

Versions: PHP 8.1.6, Python 3.10.4, NodeJS v18.3.0, Babashka v0.8.1, Clojure 1.11.1.1105

[+] themulticaster|3 years ago|reply
If you ever need to write a random character to a pipe very fast, GNU coreutils has you covered with yes(1). It runs at about 6 GiB/s on my system:

  yes | pv > /dev/null
There's an article floating around [1] about how yes(1) is extremely optimized considering its original purpose. In case you're wondering, yes(1) is meant for commands that (repeatedly) ask whether to proceed, expecting a y/n input or something like that. Instead of repeatedly typing "y", you just run "yes | the_command".

Not sure about how yes(1) compares to the techniques presented in the linked post. Perhaps there's still room for improvement.

[1] Previous HN discussion: https://news.ycombinator.com/item?id=14542938
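The core of the optimization discussed in that thread is simple enough to sketch: pre-fill one large buffer with the repeated line, then emit it with large writes. A minimal Python rendition (bounded here, where the real tool loops forever):

```python
import os

def make_yes_buffer(line=b"y\n", size=1 << 16):
    """Replicate yes(1)'s trick: fill one ~64 KiB buffer with the
    repeated line so each write(2) moves kilobytes, not 2 bytes."""
    return line * (size // len(line))

def run_yes(fd, buf, writes):
    # The real yes(1) loops forever; bounded here for demonstration.
    for _ in range(writes):
        os.write(fd, buf)
```

GNU yes adds further tricks (and the FizzBuzz post goes further still with vmsplice), but the big buffer alone accounts for most of the gap between yes(1) and a per-line loop.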

[+] cle|3 years ago|reply
A major contributing factor is whether or not the language buffers output by default, and how big the buffer is. I don't think NodeJS buffers, whereas Python does. Here are some comparisons with Go (which does not buffer by default):

- Node (no buffering): 1.2 MiB/s

- Go (no buffering): 2.4 MiB/s

- Python (8 KiB buffer): 2.7 MiB/s

- Go (8 KiB buffer): 218 MiB/s

Go program:

    f := bufio.NewWriterSize(os.Stdout, 8192)
    for {
       f.WriteRune('1')
    }
[+] rascul|3 years ago|reply
I did the same test, but added a rust and bash version. My results:

Rust: 21.9MiB/s

Bash: 282KiB/s

PHP: 2.35MiB/s

Python: 2.30MiB/s

Node: 943KiB/s

In my case, node did not crash, even after about two minutes. I find it interesting that PHP and Python are comparable for me but not for you, but I'm sure there's a plethora of reasons to explain that. I'm not surprised rust is vastly faster and bash vastly slower, I just thought it interesting to compare since I use those languages a lot.

Rust:

  fn main() {
      loop {
          print!("1");
      }
  }
Bash (no discernible difference between echo and printf):

  while :; do printf "1"; done | pv > /dev/null
[+] abuckenheimer|3 years ago|reply
> python3 -c 'while (1): print (1, end="")' | pv > /dev/null

python actually buffers its writes with print only flushing to stdout occasionally, you may want to try:

    python3 -c 'while (1): print (1, end="", flush=True)' | pv > /dev/null
which I find goes much slower (550Kib/s)
[+] fasteo|3 years ago|reply
Luajit using print and io.write

  LuaJIT 2.1.0-beta3
Using print is about 17 MiB/s

  luajit -e "while true do print('x') end" | pv > /dev/null
Using io.write is about 111 MiB/s

  luajit -e "while true do io.write('x') end" | pv > /dev/null
[+] bfors|3 years ago|reply
Love the subtle "stonks" overlay on the first chart
[+] gigatexal|3 years ago|reply
Now this is the kind of content I come to HN for. Absolutely fascinating read.
[+] sandGorgon|3 years ago|reply
Android's flavor of Linux uses "binder" instead of pipes because of its security model. IMHO filesystem-based IPC mechanisms (notably pipes) can't be used because of the lack of a world-writable directory, though I may be wrong here.

Binder comes from Palm actually (OpenBinder)

[+] Matthias247|3 years ago|reply
Pipes don’t necessarily mean one has to use FS permissions. Eg a server could hand out anonymous pipes to authorized clients via fd passing on Unix domain sockets. The server can then implement an arbitrary permission check before doing this.
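The fd-passing scheme described here can be sketched with Python's `socket.send_fds`/`recv_fds` wrappers around SCM_RIGHTS (available since Python 3.9, Unix only); the server/client split below is illustrative, both ends run in one process:

```python
import os
import socket

# A "server" creates an anonymous pipe and hands its write end to an
# authorized "client" over a Unix domain socket (SCM_RIGHTS under the hood).
server_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
pipe_r, pipe_w = os.pipe()

# Server side: an arbitrary permission check would happen here,
# before the descriptor is ever handed over.
socket.send_fds(server_sock, [b"here is your pipe"], [pipe_w])

# Client side: receive the descriptor and write through it.
msg, fds, _flags, _addr = socket.recv_fds(client_sock, 1024, 1)
os.write(fds[0], b"hello via passed fd")

print(os.read(pipe_r, 64))
```

No filesystem path is involved at any point, which is the crux of the reply: the pipe's permissions are whatever policy the server enforced before passing the fd.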
[+] megous|3 years ago|reply
"lack of a world-writable directory"

What's that?

A lot of programs store sockets in /run which is typically implemented by `tmpfs`.

[+] marcodiego|3 years ago|reply
The history of binder is more involved and has its seeds in BeOS, IIRC.
[+] ianai|3 years ago|reply
I usually just use cat /dev/urandom > /dev/null to generate load. Not sure how this compares to their code.

Edit: it’s actually “yes” that I’ve used before for generating load. I remember reading somewhere that “yes” was optimized differently from the original Unix command as part of the Unix certification lawsuit(s).

Long night.

[+] yakubin|3 years ago|reply
On 5.10.0-14-amd64 "pv < /dev/urandom >/dev/null" reports 72.2MiB/s. "pv < /dev/zero >/dev/null" reports 16.5GiB/s. AMD Ryzen 7 2700X with 16GB of DDR4 3000MHz memory.

"tr '\0' 1 </dev/zero | pv >/dev/null" reports 1.38GiB/s.

"yes | pv >/dev/null" reports 7.26GiB/s.

So "/dev/urandom" may not be the best source when testing performance.

[+] mastax|3 years ago|reply
I'm glad huge pages make a big difference because I just spent several hours setting them up. Also everyone says to disable transparent_hugepage, so I set it to `madvise`, but I'm skeptical that any programs outside databases will actually use them.
[+] sylware|3 years ago|reply
yep, you want perf? Don't mutex then yield, do spin and check your cpu heat sink.

:)

[+] spacechild1|3 years ago|reply
Maybe a stupid question, but why aren't pipes simply implemented as a contiguous buffer in a shared memory segment + a futex?
[+] arkitaip|3 years ago|reply
The visual design is amazing.
[+] jagrsw|3 years ago|reply
Something maybe a bit related.

I just had 25Gb/s internet installed (https://www.init7.net/en/internet/fiber7/), and at those speeds Chrome and Firefox (which is Chrome-based) pretty much die when using speedtest.net at around 10-12Gbps.

The symptoms are that the whole tab freezes, and the shown speed drops from those 10-12Gbps to <1Gbps and the page starts updating itself only every second or so.

IIRC Chrome-based browsers use some form of IPC with a separate networking process, which actually handles networking. I wonder if the local speed limit for socketpair/pipe under Linux was reached and that's why I'm seeing this.

[+] reitanqild|3 years ago|reply
> and at those speeds Chrome and Firefox (which is Chrome-based)

AFAIK, Firefox is not Chrome-based anywhere.

On iOS it uses whatever iOS provides for webview - as does Chrome on iOS.

Firefox and Safari are now the only supported mainstream browsers that have their own rendering engines. Firefox is the only one that has its own rendering engine and is cross-platform. It is also open source.

[+] bayindirh|3 years ago|reply
Chrome fires up many processes and creates an IPC-based comm network between them to isolate stuff. It's somewhat abusing your OS to get what it wants in terms of isolation and whatnot.

(Which is similar to how K8s abuses iptables and makes it useless for other ends, forcing you to install a dedicated firewall in front of your ingress path, but let's not digress).

On the other hand, Firefox is neither chromium based, nor is a cousin of it. It's a completely different codebase, inherited from Netscape days and evolved up to this point.

As another test point, Firefox doesn't even blink at a symmetric gigabit connection going at full speed (my network is capped by my NIC, the pipe is way fatter).

[+] implying|3 years ago|reply
Firefox is not based on the chromium codebase, it is older.
[+] merightnow|3 years ago|reply
Unrelated question, what hardware do you use to setup your network for 25Gb/s? I've been looking at init7 for a while, but gave up and stayed with Salt after trying to find the right hardware for the job.
[+] jcims|3 years ago|reply
Speedtest does have a CLI as well, might be interesting to compare them.
[+] sph|3 years ago|reply
This makes me wonder... does anyone offer an iperf-based speedtest service on the Internet?
[+] pca006132|3 years ago|reply
Is it only affecting the browser or the entire system? It might be possible that the CPU is busy handling interrupts from the ethernet controller, although in general these controllers should use DMA and should not send interrupts frequently.
[+] Spooky23|3 years ago|reply
I ran into this with a VDI environment in a data center. We had initially delivered 10Gb Ethernet to the VMs, because why not.

Turned out Windows 7 or the NICs needed a lot of tuning to work well. There was a lot of freezing and other failures.

[+] jcranberry|3 years ago|reply
Sounds like a hard drive cache filling up.
[+] def-|3 years ago|reply
Firefox is only Chrome-based on iOS.
[+] jve|3 years ago|reply
Do you actually mean Gbit/s? 25GB/s would translate to 200Gbit/s ...
[+] alex_hirner|3 years ago|reply
Does an API similar to vmsplice exist for Windows?