I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned similar lessons: avoid global data, and amortize whatever synchronization you cannot avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive: stealing amounts to cutting in on work that would have been done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal work from another thread.
Running the benchmarks locally (zig is fastest, but that's probably expected)
zig (~1 week older than HEAD):
./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
rust nightly:
cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
go 1.16.3:
go run qsort.go
filling
shuffling
running
took 222.90356ms
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
The fact that the Rust tokio version (the one that uses tokio tasks instead of threads) is slow is to be expected. tokio tasks aren't appropriate for running quicksort; they will have overhead compared to a regular threadpool because of the I/O reactor, waker, and related machinery that never gets used here.
rayon or crossbeam are more appropriate since they are actually designed for general-purpose thread work. Using rayon's scopes will also get rid of the `unsafe` that's being (ab)used to create the fake `&'static mut` slices.
Though for some reason, the rayon version (rayon_qsort.rs) uses scopes, but still uses `unsafe` to create `&'static mut` slices... ( https://github.com/kprotty/zap/pull/3 )
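For reference, std's scoped threads (`std::thread::scope`, stabilized in Rust 1.63, after this thread) work like rayon's scopes in this respect: borrows are tied to the scope's lifetime, so no `&'static mut` transmute is needed. A minimal sketch (not the benchmark code; `partition` is a plain Lomuto partition, and the thread-per-recursion-level structure is for illustration only):

```rust
use std::thread;

// Lomuto partition: returns the pivot's final index.
fn partition(v: &mut [u32]) -> usize {
    let pivot = v.len() - 1;
    let mut i = 0;
    for j in 0..pivot {
        if v[j] <= v[pivot] {
            v.swap(i, j);
            i += 1;
        }
    }
    v.swap(i, pivot);
    i
}

// The scope ties the borrows of `lo`/`hi` to the spawned thread's lifetime,
// so no `unsafe` transmute to `&'static mut` is required.
// (A real implementation would sort small slices sequentially instead of
// spawning a thread at every recursion level.)
fn quicksort(v: &mut [u32]) {
    if v.len() <= 1 {
        return;
    }
    let mid = partition(v);
    let (lo, hi) = v.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(move || quicksort(lo));
        quicksort(&mut hi[1..]); // skip the pivot, sort the right half on this thread
    });
}
```

rayon's `scope`/`join` give the same borrow-checked guarantee while reusing pool threads instead of spawning.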
Erm, could somebody explain to me why I shouldn't read this as an argument in favor of Go, given that it is about as fast as the fully optimized versions of Zig and Rust?
At first glance, this looks great, almost 20% speedup!
But genuine question: this looks like how almost every new technology begins. Initially it's FAST because it keeps things as barebones as possible, and everything other than the specific module can be ignored.
But no feature works in isolation, so as more features get added – scheduling constraints, safety quirks, or features that cut across the language – that 20% can evaporate.
I did go through the blog, but how can one be sure that this 20% gap is a guaranteed design advantage, and not just a temporary deferral of everything else?
Does this already use `target-cpu=native`? Not sure if that would be apples-to-apples with the Zig implementation (which is what's important here), but I'd be surprised if it didn't yield some interesting wins.
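For anyone who wants to try it: the flag can be set per-invocation via `RUSTFLAGS="-C target-cpu=native" cargo run --release`, or persistently in a config fragment like the following (whether the benchmark repo already sets it is something I haven't checked):

```toml
# .cargo/config.toml — build for the host CPU's full instruction-set features
[build]
rustflags = ["-C", "target-cpu=native"]
```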
Rust is such a bloated, insane design-by-committee language. It's actually still surprising to me that corporations are defending it and using it. I sure won't.
It's also just so dang ugly to read. If Perl and C++ had a mutant offspring, I think it'd be Rust. And that's just not respectful to Perl!
I'm really happy that Zig is doing so well. I'm planning on contributing a little chunk of money to the ZSF whenever they get a 1.0 landed. I do think there are some lessons to learn from Rust about formal methods as foundational building blocks in a language. I'm eager to see Zig mature more around the memory-safety side of things.
To be very pedantic, the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, not even obstruction-free (and Vyukov never claimed it as such [1]): a producer halting between the atomic swap and the atomic store will prevent nodes enqueued by other producers from ever being dequeued by the consumer.
[1] edit: and in fact there is a note exactly on this on the linked Vyukov page.
I'm aware that it's technically considered blocking: a producer preempted between the swap and the store causes the consumer to see an unfinished link and report empty. I address this in the run queue section by noting that it's OK if empty is spuriously reported, since the thread waking/waiting mechanism takes care of rescheduling in that case.
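To make the window concrete, here's a minimal sketch of a Vyukov-style MPSC queue in Rust (illustrative, not Zig's actual implementation; single consumer, and node reclamation on queue drop is omitted):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    next: AtomicPtr<Node<T>>,
    value: Option<T>, // None for the stub node
}

pub struct MpscQueue<T> {
    head: *mut Node<T>,       // consumer-owned
    tail: AtomicPtr<Node<T>>, // producers swap this
}

impl<T> MpscQueue<T> {
    pub fn new() -> Self {
        let stub = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: None,
        }));
        MpscQueue { head: stub, tail: AtomicPtr::new(stub) }
    }

    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: Some(value),
        }));
        // Step 1: atomically claim the tail slot.
        let prev = self.tail.swap(node, Ordering::AcqRel);
        // <-- If this producer is preempted HERE, `prev.next` is still null,
        //     so the consumer cannot reach `node` or anything enqueued after
        //     it. This is the blocking window being discussed.
        // Step 2: publish the link.
        unsafe { (*prev).next.store(node, Ordering::Release) };
    }

    pub fn pop(&mut self) -> Option<T> {
        unsafe {
            let next = (*self.head).next.load(Ordering::Acquire);
            if next.is_null() {
                // Possibly a *spurious* empty: a producer may be stalled in
                // the window above with items already logically enqueued.
                return None;
            }
            drop(Box::from_raw(self.head)); // free the old stub/consumed node
            self.head = next;
            (*next).value.take() // the popped node becomes the new stub
        }
    }
}
```

Nodes still in the queue leak when the queue is dropped; a real implementation would walk and free them.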
Incredible post. Ticks all the boxes and exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.
Is this using the JVM/JIT, or using Graal to make a native-image binary and calling that? If the measurement includes startup time, you'd get better results with native-image by shaving that off.
I think .NET 6 latest preview would slightly beat out Java here as well.
This isn't exactly apples-to-apples since Zig/Rust/Go are considered "systems languages", but the JVM has GraalVM for compiling to native binaries/libraries and even importing/exporting C functions and structs. And .NET of course has .NET Native + "Dotnet Native Exports", including the LLVM experiment where it uses LLVM bitcode instead of RyuJIT.
So you can make the argument that writing a native binary, or a library which exported this benchmark function as a C-callable method with a C header, would be technically equivalent in each language.
The JVM and .NET ones would have a large size (several MB each) but would otherwise still fill this requirement.
It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take seconds instead of 100ms or less.
I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
I don't think it's the threadpool's fault that an application uses it incorrectly. Also, I think a lot of developers still assume that, on today's machines, spawning as many threads as there are cores is the optimal thread-pool size for every use case. I wouldn't call it bad software so much as software written for the CPUs of 5 years ago. And generally, I don't think this will be an easy problem to solve, given the variety of heterogeneous and homogeneous topologies in modern SoCs. In fact, I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your Threadripper, or to acknowledge the disparity between core clusters on the M1.
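As a baseline, clamping the pool size to hardware parallelism plus an application-chosen hard cap is easy with std alone; a sketch (the cap of 16 below is purely illustrative, and this still says nothing about heterogeneous core clusters):

```rust
use std::thread;

// Clamp the worker count to the hardware's advertised parallelism and an
// application-chosen upper bound, never going below one worker.
// (This only prevents the "128 threads on a Threadripper" failure mode;
// it does not model big.LITTLE-style core clusters at all.)
fn worker_count(cap: usize) -> usize {
    let hw = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    hw.min(cap).max(1)
}
```

Example: `worker_count(16)` caps a 128-thread machine at 16 workers while leaving an 8-thread laptop untouched.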
What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until some ring buffer is emptied.
Put another way, during a period of more push than pop some tasks may see a very long delay before being worked on?
I'd encourage you to try recording them yourself. The results can vary depending on your system: how many concurrent tasks it can run in parallel, whether scheduling resources are being used elsewhere, etc. The Zig code contains an example of using timers, and the spawning and joining is in the quickSort() function, so it should hopefully be easy to add the timing logic. I can answer questions about it if you hop on the IRC or Discord.
In regards to the overflow queue: yes, some pathological tasks may see long delays, but this is true for any mostly-FIFO or sharded queue scenario. Both Go and Tokio overflow from their local buffer into a shared lock-protected linked list (Tokio is a bit more eager in this regard), so they can suffer similar fates. They do have an optimization, which is to check the shared queue before the local buffer every few local scheduling ticks (every 61st or 64th task run, IIRC) to decrease the chance of local starvation. You could try adding that to Zig's thread pool after the timing logic and see if it helps tail latencies. I'm curious about the outcome either way, but I may not have time to work on it.
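The tick-based fairness check can be sketched like this (illustrative Rust, with VecDeques standing in for the real ring buffer and lock-protected overflow list; the interval of 61 mirrors Go's scheduler constant):

```rust
use std::collections::VecDeque;

const CHECK_INTERVAL: u64 = 61; // Go's scheduler checks its global queue every 61 ticks

struct Worker {
    tick: u64,
    local: VecDeque<u32>,  // stand-in for the bounded per-worker ring buffer
    shared: VecDeque<u32>, // stand-in for the shared overflow queue
}

impl Worker {
    fn next_task(&mut self) -> Option<u32> {
        self.tick += 1;
        // Every CHECK_INTERVAL ticks, poll the shared queue *first*, so tasks
        // that overflowed there are not starved by a steady local stream.
        if self.tick % CHECK_INTERVAL == 0 {
            if let Some(t) = self.shared.pop_front() {
                return Some(t);
            }
        }
        // Otherwise prefer the local buffer, falling back to the shared queue.
        self.local.pop_front().or_else(|| self.shared.pop_front())
    }
}
```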
Yes, as others have said, it's based on the same open source CMS.
I've created zig.news for people who want to write about Zig but don't necessarily want to maintain their own blog or do the work necessary to publicize it.
This should hopefully help more people write down their experiences and opinions on using Zig, and alleviate the problem of a "blogger aristocracy" that has a much stronger voice than anyone else.
A pet peeve of mine is calling something that uses atomic operations "lockless". It's very common, but once it clicks that atomic operations are just locks at the instruction level, it feels very wrong to call most such algorithms lockless. The same concerns about avoiding contention apply to atomic operations as to regular mutexes. A mutex lock in the uncontended case never enters the kernel and is just an atomic operation... The main strategy, whether using locks or atomics directly, is to avoid contention.
The only truly lockless algorithm that I'm aware of (on most CPU architectures) is RCU on the read side.
"Lock free" has a very precise meaning in computer science, and it's this: given some number of threads of execution, an algorithm is lock-free if, in some finite number of steps, at least one thread is guaranteed to make forward progress. When using mutexes, this is not guaranteed under any amount of contention: the scheduler is perfectly free to preempt a thread holding a lock, which means no threads make progress.
However, if you use a CAS loop (like this code does) to update some value (i.e. a while loop that repeatedly tries to execute a compare-and-swap), that loop will only fail to make progress if some other thread DID make progress. This means that such an algorithm is in fact lock-free: if thread X didn’t progress, it is necessarily because some other thread did. In addition, a scheduler preempting a thread at any point will not stop other threads making progress.
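For a concrete example, here is a lock-free counter increment built on exactly that kind of CAS loop; the retry path only runs when some other thread's CAS succeeded in between:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Lock-free increment via a CAS loop: the compare-exchange only fails when
// another thread changed the value, i.e. when some other thread made
// progress. Returns the new value.
fn increment(counter: &AtomicU64) -> u64 {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(prev) => return prev + 1,
            // Someone else progressed (or the weak CAS failed spuriously);
            // retry with the freshly observed value.
            Err(actual) => current = actual,
        }
    }
}
```

A preempted thread holding no lock cannot stall the others: any thread can always complete its own CAS against the latest value.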
Whether an algorithm is lock free or not is not a matter of opinion, it is a mathematical statement. The fact that it annoys you is… weird.
The former (Inversion of Control) is a general method and too unspecific to answer. I can only hint that one of the biggest caveats of async is brittle composability due to missing cancellation routines. This will eventually be addressed.
The paper describing my design is "Low-Synchronization, Mostly Lock-Free, Elastic Scheduling for Streaming Runtimes", https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli.... The source code for the product implementation is now open source. Most is in https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP... and https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....
Running on Linux (zig -Dc), rounded average of 3 runs.
8-thread Intel laptop machine:
32-thread AMD workstation:
Additionally, I haven't done a thorough review, but enabling the commented code yields a 110ms run on the faster machine.
You should do something like `go build . && time ./qsort`
Go (go1.17.1 darwin/amd64)
Zig (0.9.0-dev.959+f011f1393)
Java (openjdk version "17-ea")
MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3
https://youtu.be/D2c8BH8q6Yc
The tile size seems to be hard-coded to 128x128.
And I like what I am seeing. Hopefully all transitions between states have been tested.
Good work.
I briefly go over the idea behind zig.news in this 3-minute video: https://www.youtube.com/watch?v=i9pUuj6eiEg
Edit: also, "lock-free" doesn't actually have much to do with locks.
The latter (communicating sequential processes) is a formal language, but it looks like it is for IPC: https://arild.github.io/csp-presentation/#1
You are at the wrong level of abstraction.