This is a well-written article with excellent explanations and I thoroughly enjoyed it.
However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel, there is no reliable, general-purpose way to know when the pages are safe to reuse.
This post (and the earlier FizzBuzz variant) tries to get around this by assuming the pages are available again once "pipe size" bytes have been written after the gift, _but this is not true in general_. For example, the read side may also use splice-like calls to move the pages to another pipe or I/O queue in a zero-copy way, so the lifetime of a page can extend beyond the original pipe.
This will show up as race conditions and spontaneously changing data, where a downstream consumer sees a page suddenly change as it is overwritten by the original process.
The author of these splice methods, Jens Axboe, proposed a mechanism that would let you determine when it was safe to reuse a page, but as far as I know nothing was ever merged. So the scenarios where you can use this are limited to those where you control both ends of the pipe and can be sure of the exact page lifetime.
---
[1] Specifically, using SPLICE_F_GIFT.
I haven't digested this comment fully yet, but just to be clear, I am _not_ using SPLICE_F_GIFT (and I don't think the fizzbuzz program is either). However I think what you're saying makes sense in general, SPLICE_F_GIFT or not.
Are you sure this unsafety depends on SPLICE_F_GIFT?
Also, do you have a reference to the discussions regarding this (presumably on LKML)?
> However, none of the variants using vmsplice (i.e., all but the slowest) are safe. When you gift [1] pages to the kernel there is no reliable general purpose way to know when the pages are safe to reuse again. [snip] This will show up as race conditions and spontaneously changing data where a downstream consumer sees the page suddenly change as it is overwritten by the original process.
That sounds like a security issue: the ability of an upstream generator process to write into the memory of a downstream reader process, or, more perversely, vice versa, which would be even worse. I presume the Linux kernel only lets this happen (zero copy) when the two processes are running as the same user?
I once had to change my mental model for how fast some of these things are. I was using `seq` as input for something else, and my thinking was that it's a small generator program running hot in the CPU, so it would be super quick, specifically because it would only be writing things out to memory for the next program to consume, not reading anything in.
But that was way off, and `seq` turned out to be ridiculously slow. I dug in a little and made a faster version of `seq` (https://github.com/tverniquet/hseq), which kind of got me what I wanted. But then I noticed at the end that the point was moot anyway: just piping the output to the next program was going to be the slow point, so it didn't matter.
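To illustrate the kind of thing a faster seq can do, here is a small Python sketch (a hypothetical illustration of the batching idea, not the hseq implementation): the win comes from turning one tiny write per number into one large write per few thousand numbers.

```python
import io

# Hypothetical "fast seq" sketch: batch formatted lines into large
# writes instead of issuing one write per number.
def fast_seq(n, out, batch=4096):
    lines = []
    for i in range(1, n + 1):
        lines.append(b"%d\n" % i)
        if len(lines) == batch:
            out.write(b"".join(lines))  # one big write per `batch` lines
            lines.clear()
    if lines:
        out.write(b"".join(lines))  # flush the partial final batch

buf = io.BytesIO()
fast_seq(5, buf)
print(buf.getvalue())  # b'1\n2\n3\n4\n5\n'
```

In a real tool, `out` would be `sys.stdout.buffer`; a `BytesIO` is used here just to show the output.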
I had a somewhat similar discovery once using GNU parallel. I was trying to generate as much web traffic as possible from a single machine to load test a service I was building, and I assumed that the network I/O would be the bottleneck by a long shot, not the overhead of spawning many processes. I was disappointed by the amount of traffic generated, so I rewrote it in Ruby using the parallel gem with threads (instead of processes) and got orders of magnitude more performance.
It's not a valid comparison between the two machines because I don't know what the original machine is, but macOS rarely comes out shining in this sort of comparison, and the simplistic approach here giving 8 GB/s rather than the author's 3.5 GB/s was better than I'd expected, even given the machine I'm using.
The majority of this overhead (and the slow transfers) naively seems to be in the scripts/systems using the pipes.
I was worried when I saw that zfs send/receive used pipes, for instance, because of performance concerns, but using it in reality I had no problems pushing 800 MB/s+. It seemed limited by IOPS on my local disk arrays, not by any limits in pipe performance.
I get around 1.56MiB/s with that code. PHP gets 4.04MiB/s. Python gets 4.35MiB/s.
> What's also interesting is that node crashes after about a minute
I believe this is because `while(1)` runs so fast that there is no "idle" time for V8 to actually run GC. V8 is a strange beast, and this is just a guess of mine.
The following code shouldn't crash, give it a try:
If you ever need to write a random character to a pipe very fast, GNU coreutils has you covered with yes(1). It runs at about 6 GiB/s on my system:
yes | pv > /dev/null
There's an article floating around [1] about how yes(1) is extremely optimized considering its original purpose. In case you're wondering, yes(1) is meant for commands that (repeatedly) ask whether to proceed, expecting a y/n input or something like that. Instead of repeatedly typing "y", you just run "yes | the_command".
Not sure how yes(1) compares to the techniques presented in the linked post. Perhaps there's still room for improvement.
[1] Previous HN discussion: https://news.ycombinator.com/item?id=14542938
A major contributing factor is whether or not the language buffers output by default, and how big the buffer is. I don't think NodeJS buffers, whereas Python does. Here are some comparisons with Go (which does not buffer by default):
- Node (no buffering): 1.2 MiB/s
- Go (no buffering): 2.4 MiB/s
- Python (8 KiB buffer): 2.7 MiB/s
- Go (8 KiB buffer): 218 MiB/s
Go program:
f := bufio.NewWriterSize(os.Stdout, 8192)
for {
    f.WriteRune('1')
}
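The buffer-size effect above is easy to reproduce in any language. Here is a small Python sketch along the same lines (writing to /dev/null rather than a pipe, so the absolute numbers won't match the figures above):

```python
import os, time

def throughput(buffering, n=1 << 20):
    # Write n bytes one at a time through a file object with the given
    # buffer size (0 = every .write() is a real write(2) syscall).
    f = os.fdopen(os.open(os.devnull, os.O_WRONLY), "wb", buffering=buffering)
    start = time.perf_counter()
    for _ in range(n):
        f.write(b"1")
    f.close()  # flushes any buffered remainder
    return n / (time.perf_counter() - start)

unbuffered = throughput(0)       # ~one syscall per byte
buffered = throughput(8192)      # ~one syscall per 8192 bytes
print(f"unbuffered: {unbuffered / 1e6:.0f} MB/s, 8 KiB buffer: {buffered / 1e6:.0f} MB/s")
```

The gap is typically two orders of magnitude, which matches the Go numbers quoted above: almost all the cost is per-syscall, not per-byte.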
I did the same test, but added a rust and bash version. My results:
Rust: 21.9MiB/s
Bash: 282KiB/s
PHP: 2.35MiB/s
Python: 2.30MiB/s
Node: 943KiB/s
In my case, node did not crash after about two minutes. I find it interesting that PHP and Python are comparable for me but not for you, though I'm sure there's a plethora of reasons to explain that. I'm not surprised rust is vastly faster and bash vastly slower; I just thought it interesting to compare, since I use those languages a lot.
Rust:
fn main() {
    loop {
        print!("1");
    }
}
Bash (no discernible difference between echo and printf):
It looks like it is using the "Tufte" style, named after Edward Tufte, who is very famous for his writing on data visualization.
More examples: https://rstudio.github.io/tufte/
Android's flavor of Linux uses "binder" instead of pipes because of its security model. IMHO filesystem-based IPC mechanisms (notably pipes) can't be used because of the lack of a world-writable directory - I may be wrong here.
Pipes don’t necessarily mean one has to use FS permissions. E.g., a server could hand out anonymous pipes to authorized clients via fd passing over Unix domain sockets. The server can then implement an arbitrary permission check before doing so.
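That fd-passing scheme can be sketched in a few lines of Python (a minimal illustration, with `socketpair` standing in for a real server/client connection; `socket.send_fds`/`recv_fds` require Python 3.9+):

```python
import os, socket

# "Server" and "client" ends of a Unix domain connection.
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Server: create an anonymous pipe and hand the read end to the client.
# Any permission check on the client would happen before this send.
r, w = os.pipe()
socket.send_fds(server, [b"your pipe"], [r])

# Client: receive the fd over SCM_RIGHTS ancillary data.
msg, fds, flags, addr = socket.recv_fds(client, 1024, maxfds=1)

os.write(w, b"hello")
print(os.read(fds[0], 5))  # b'hello' - the passed fd is the pipe's read end
```

No filesystem path or FS permission is involved at any point: the pipe exists only as file descriptors.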
I usually just use cat /dev/urandom > /dev/null to generate load. Not sure how this compares to their code.
Edit: it’s actually “yes” that I’ve used before for generating load. I remember reading somewhere that “yes” was optimized differently from the original Unix command as part of the Unix certification lawsuit(s). Long night.
I'm glad huge pages make a big difference because I just spent several hours setting them up. Also everyone says to disable transparent_hugepage, so I set it to `madvise`, but I'm skeptical that any programs outside databases will actually use them.
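For reference, the `madvise` setting means a program has to opt in per mapping. A minimal Python sketch of that opt-in (MADV_HUGEPAGE is Linux-only, and only a hint to the kernel):

```python
import mmap

# Anonymous 4 MiB mapping: room for two 2 MiB huge pages.
length = 4 * 1024 * 1024
buf = mmap.mmap(-1, length)

# Opt this range into transparent huge pages (Linux-only; just a hint,
# the kernel may still back it with ordinary 4 KiB pages).
if hasattr(mmap, "MADV_HUGEPAGE"):
    buf.madvise(mmap.MADV_HUGEPAGE)

buf[:5] = b"hello"  # first touch faults the (ideally huge) pages in
print(bytes(buf[:5]))  # b'hello'
```

Allocators like jemalloc and some runtimes issue this madvise call on large arenas, so "programs outside databases" do sometimes benefit without knowing it.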
I just had 25Gb/s internet installed (https://www.init7.net/en/internet/fiber7/), and at those speeds Chrome and Firefox (which is Chrome-based) pretty much die when using speedtest.net at around 10-12Gbps.
The symptoms are that the whole tab freezes, and the shown speed drops from those 10-12Gbps to <1Gbps and the page starts updating itself only every second or so.
IIRC Chrome-based browsers use some form of IPC with a separate networking process, which actually handles networking. I wonder whether the local speed limit for a socketpair/pipe under Linux was reached, and that's why I'm seeing this.
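One way to sanity-check that guess is to measure what a plain local pipe tops out at. A rough Python probe (a sketch; fork-based, POSIX-only, and Python adds its own overhead, so treat the number as a floor):

```python
import os, time

def pipe_throughput(total=1 << 27, bufsize=1 << 16):
    r, w = os.pipe()
    chunk = b"\0" * bufsize
    pid = os.fork()
    if pid == 0:  # child: blast `total` bytes into the pipe
        os.close(r)
        sent = 0
        while sent < total:
            sent += os.write(w, chunk)
        os._exit(0)
    os.close(w)  # parent: drain the pipe and time it
    start = time.perf_counter()
    received = 0
    while True:
        data = os.read(r, bufsize)
        if not data:  # writer exited and the pipe is empty
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    os.close(r)
    os.waitpid(pid, 0)
    return received / elapsed  # bytes per second

print(f"{pipe_throughput() / 1e9:.2f} GB/s")
```

Even this naive version usually reports a few GB/s, i.e. in the same ballpark as 10-12 Gbps of network traffic, so an IPC hop being the bottleneck is at least plausible.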
> and at those speeds Chrome and Firefox (which is Chrome-based)
AFAIK, Firefox is not Chrome-based anywhere.
On iOS it uses whatever iOS provides for webview - as does Chrome on iOS.
Firefox and Safari are now the only supported mainstream browsers that have their own rendering engines. Firefox is the only one that has its own rendering engine and is cross-platform. It is also open source.
Chrome fires up many processes and creates an IPC-based comm network between them to isolate stuff. It somewhat abuses your OS to get what it wants in terms of isolation and whatnot.
(Which is similar to how K8S abuses iptables and makes it useless for other ends, forcing you to install a dedicated firewall in front of your ingress path, but let's not digress.)
On the other hand, Firefox is neither Chromium-based nor a cousin of it. It's a completely different codebase, inherited from the Netscape days and evolved up to this point.
As another data point, Firefox doesn't even blink at a symmetric gigabit connection going at full speed (my network is capped by my NIC; the pipe is way fatter).
Unrelated question, what hardware do you use to setup your network for 25Gb/s?
I've been looking at init7 for a while, but gave up and stayed with Salt after trying to find the right hardware for the job.
Is it only affecting the browser or the entire system? It might be possible that the CPU is busy handling interrupts from the ethernet controller, although in general these controllers should use DMA and should not send interrupts frequently.
herodoturtle | 3 years ago:
(And as an aside, the combination of that font with the hand-drawn diagrams is really cool)
mg | 3 years ago:
PHP comes in at about 900KiB/s. Python is about 50% faster, at about 1.5MiB/s. Javascript is slowest, at around 200KiB/s. What's also interesting is that node crashes after about a minute. All results are from within a Debian 10 docker container with the default repo versions of PHP, Python and Node.
Update: checking with strace shows that Python buffers its output; PHP and JS do not. The Python equivalent, flushing explicitly, is comparable to the speed of JS. Interesting that PHP is over 4x faster than Python and JS.
capableweb | 3 years ago:
It's slower for me though, giving me 1.18MiB/s.
More examples, with Babashka and Clojure: 513KiB/s, 3.02MiB/s, 3.53MiB/s, 5.06MiB/s.
Versions: PHP 8.1.6, Python 3.10.4, NodeJS v18.3.0, Babashka v0.8.1, Clojure 1.11.1.1105.
abuckenheimer | 3 years ago:
Python actually buffers its writes, with print only flushing to stdout occasionally. You may want to try flushing on every print, which I find goes much slower (550KiB/s).
sandGorgon | 3 years ago:
Binder comes from Palm actually (OpenBinder)
megous | 3 years ago:
What's that?
A lot of programs store sockets in /run which is typically implemented by `tmpfs`.
yakubin | 3 years ago:
"tr '\0' 1 </dev/zero | pv >/dev/null" reports 1.38GiB/s.
"yes | pv >/dev/null" reports 7.26GiB/s.
So "/dev/urandom" may not be the best source when testing performance.
Spooky23 | 3 years ago:
Turned out Windows 7 or the NICs needed a lot of tuning to work well. There was a lot of freezing and other fail.