I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned similar lessons: avoid global data, and amortize whatever synchronization you cannot avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive: stealing amounts to cutting in on work that would have been done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal work from another thread.
Running the benchmarks locally (zig is fastest, but that's probably expected)
zig (~1 week older than HEAD):
./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
rust nightly:
cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
go 1.16.3:
go run qsort.go
filling
shuffling
running
took 222.90356ms
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
The fact that the Rust tokio version (the one that uses tokio tasks instead of threads) is slow is to be expected. tokio tasks aren't appropriate for running quicksort; they will have overhead compared to a regular threadpool because of the I/O reactor, waker, and related machinery that never gets used here.
rayon or crossbeam are more appropriate since they are actually designed for general-purpose thread work. Using rayon's scopes will also get rid of the `unsafe` that's being (ab)used to create the fake `&'static mut` slices.
Though for some reason, the rayon version (rayon_qsort.rs) uses scopes, but still uses `unsafe` to create `&'static mut` slices... ( https://github.com/kprotty/zap/pull/3 )
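For reference, std's scoped threads (`std::thread::scope`, stabilized in Rust 1.63, after this thread) work like rayon's scopes in this respect: borrows are tied to the scope's lifetime, so no `&'static mut` transmute is needed. A minimal sketch (not the benchmark code; `partition` is a plain Lomuto partition, and the thread-per-recursion-level structure is for illustration only):

```rust
use std::thread;

// Lomuto partition: returns the pivot's final index.
fn partition(v: &mut [u32]) -> usize {
    let pivot = v.len() - 1;
    let mut i = 0;
    for j in 0..pivot {
        if v[j] <= v[pivot] {
            v.swap(i, j);
            i += 1;
        }
    }
    v.swap(i, pivot);
    i
}

// The scope ties the borrows of `lo`/`hi` to the spawned thread's lifetime,
// so no `unsafe` transmute to `&'static mut` is required.
// (A real implementation would sort small slices sequentially instead of
// spawning a thread at every recursion level.)
fn quicksort(v: &mut [u32]) {
    if v.len() <= 1 {
        return;
    }
    let mid = partition(v);
    let (lo, hi) = v.split_at_mut(mid);
    thread::scope(|s| {
        s.spawn(move || quicksort(lo));
        quicksort(&mut hi[1..]); // skip the pivot, sort the right half on this thread
    });
}
```

rayon's `scope`/`join` give the same borrow-checked guarantee while reusing pool threads instead of spawning.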
Erm, could somebody explain to me why I shouldn't read this as an argument in favor of Go, given that it is about as fast as the fully optimized versions of Zig and Rust?
At first glance, this looks great, almost 20% speedup!
But genuine question: this looks like how almost every new technology begins. Initially it's FAST because it keeps things as barebones as possible, and everything other than the specific module can be ignored.
But no feature works in isolation, so as more features get added – scheduling constraints, safety quirks, or features that cut across the language – that 20% can evaporate.
I did go through the blog, but how can one be sure that this 20% gap is a guaranteed design advantage, and not just a temporary deferral of everything else?
Does this already use `target-cpu=native`? Not sure if that would be apples-to-apples with the Zig implementation (which is what's important here), but I'd be surprised if it didn't yield some interesting wins.
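For anyone who wants to try it: the flag can be set per-invocation via `RUSTFLAGS="-C target-cpu=native" cargo run --release`, or persistently in a config fragment like the following (whether the benchmark repo already sets it is something I haven't checked):

```toml
# .cargo/config.toml — build for the host CPU's full instruction-set features
[build]
rustflags = ["-C", "target-cpu=native"]
```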
Rust is such a bloated, insane design-by-committee language. It's actually still surprising to me that corporations are defending it and using it. I sure won't.
It's also just so dang ugly to read. If Perl and C++ had a mutant offspring, I think it'd be Rust. And that's just not respectful to Perl!
I'm really happy that Zig is doing so well. I'm planning on contributing a little chunk of money to the ZSF whenever they get a 1.0 landed. I do think there are some lessons to learn from Rust about formal methods as foundational building blocks in a language. I'm eager to see Zig mature more around the memory-safety side of things.
To be very pedantic, the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, not even obstruction-free (and Vyukov never claimed it as such [1]): a producer halting between the atomic swap and the atomic store will prevent nodes enqueued by other producers from ever being dequeued by the consumer.
[1] edit: and in fact there is a note exactly on this on the linked Vyukov page.
I'm aware that it's technically considered blocking: a producer preempted between the swap and the store causes the consumer to see an unfinished link and report empty. I address this in the run queue section by noting that it's OK if empty is spuriously reported, since the thread waking/waiting mechanism takes care of rescheduling in that case.
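To make the window concrete, here's a minimal sketch of a Vyukov-style MPSC queue in Rust (illustrative, not Zig's actual implementation; single consumer, and node reclamation on queue drop is omitted):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    next: AtomicPtr<Node<T>>,
    value: Option<T>, // None for the stub node
}

pub struct MpscQueue<T> {
    head: *mut Node<T>,       // consumer-owned
    tail: AtomicPtr<Node<T>>, // producers swap this
}

impl<T> MpscQueue<T> {
    pub fn new() -> Self {
        let stub = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: None,
        }));
        MpscQueue { head: stub, tail: AtomicPtr::new(stub) }
    }

    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: Some(value),
        }));
        // Step 1: atomically claim the tail slot.
        let prev = self.tail.swap(node, Ordering::AcqRel);
        // <-- If this producer is preempted HERE, `prev.next` is still null,
        //     so the consumer cannot reach `node` or anything enqueued after
        //     it. This is the blocking window being discussed.
        // Step 2: publish the link.
        unsafe { (*prev).next.store(node, Ordering::Release) };
    }

    pub fn pop(&mut self) -> Option<T> {
        unsafe {
            let next = (*self.head).next.load(Ordering::Acquire);
            if next.is_null() {
                // Possibly a *spurious* empty: a producer may be stalled in
                // the window above with items already logically enqueued.
                return None;
            }
            drop(Box::from_raw(self.head)); // free the old stub/consumed node
            self.head = next;
            (*next).value.take() // the popped node becomes the new stub
        }
    }
}
```

Nodes still in the queue leak when the queue is dropped; a real implementation would walk and free them.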
Incredible post. Ticks all the boxes and exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.
Is this using the JVM/JIT, or using Graal to make a native-image binary and calling that? If the measurement includes startup time, you'd get better results with native-image by shaving that off.
I think .NET 6 latest preview would slightly beat out Java here as well.
This isn't exactly apples-to-apples since Zig/Rust/Go are considered "systems languages", but the JVM has GraalVM for compiling to native binaries/libraries and even importing/exporting C functions and structs. And .NET of course has .NET Native + "Dotnet Native Exports", including the LLVM experiment where it uses LLVM bitcode instead of RyuJIT.
So you can make the argument that writing a native binary, or a library which exported this benchmark function as a C-callable method with a C header, would be technically equivalent in each language.
The JVM and .NET ones would have a large size (several MB each) but would otherwise still fill this requirement.
It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take seconds instead of 100ms or less.
I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
I don't think it's the threadpool's fault that an application uses it incorrectly. Also, I think a lot of developers still assume that, on today's machines, spawning as many threads as there are cores is the optimal thread-pool size for every use case. I wouldn't call it bad software so much as software written for the CPUs of 5 years ago. And generally, I don't think this will be an easy problem to solve, given the variety of heterogeneous and homogeneous topologies in modern SoCs. In fact, I don't see this specific threadpool doing anything to optimize for the disparate core clusters of your Threadripper, or to acknowledge the disparity between core clusters on the M1.
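As a baseline, clamping the pool size to hardware parallelism plus an application-chosen hard cap is easy with std alone; a sketch (the cap of 16 below is purely illustrative, and this still says nothing about heterogeneous core clusters):

```rust
use std::thread;

// Clamp the worker count to the hardware's advertised parallelism and an
// application-chosen upper bound, never going below one worker.
// (This only prevents the "128 threads on a Threadripper" failure mode;
// it does not model big.LITTLE-style core clusters at all.)
fn worker_count(cap: usize) -> usize {
    let hw = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    hw.min(cap).max(1)
}
```

Example: `worker_count(16)` caps a 128-thread machine at 16 workers while leaving an 8-thread laptop untouched.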
What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until some ring buffer is emptied.
Put another way, during a period of more push than pop some tasks may see a very long delay before being worked on?
I'd encourage you to try recording them yourself. The results can vary depending on your system: how many concurrent tasks it can run in parallel, whether scheduling resources are being used elsewhere, etc. The Zig code contains an example of using timers, and the spawning and joining is in the quickSort() function, so it should hopefully be easy to add the timing logic. I can answer questions about it if you hop on the IRC or Discord.
In regards to the overflow queue: yes, some pathological tasks may see long delays, but this is true for any mostly-FIFO or sharded queue scenario. Both Go and Tokio overflow from their local buffer into a shared lock-protected linked list (Tokio is a bit more eager in this regard), so they can suffer similar fates. They do have an optimization, which is to check the shared queue before the local buffer every few local scheduling ticks (every 61st or 64th task run, IIRC) to decrease the chance of local starvation. You could try adding that to Zig's thread pool after the timing logic and see if it helps tail latencies. I'm curious about the outcome either way, but I may not have time to work on it.
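The tick-based fairness check can be sketched like this (illustrative Rust, with VecDeques standing in for the real ring buffer and lock-protected overflow list; the interval of 61 mirrors Go's scheduler constant):

```rust
use std::collections::VecDeque;

const CHECK_INTERVAL: u64 = 61; // Go's scheduler checks its global queue every 61 ticks

struct Worker {
    tick: u64,
    local: VecDeque<u32>,  // stand-in for the bounded per-worker ring buffer
    shared: VecDeque<u32>, // stand-in for the shared overflow queue
}

impl Worker {
    fn next_task(&mut self) -> Option<u32> {
        self.tick += 1;
        // Every CHECK_INTERVAL ticks, poll the shared queue *first*, so tasks
        // that overflowed there are not starved by a steady local stream.
        if self.tick % CHECK_INTERVAL == 0 {
            if let Some(t) = self.shared.pop_front() {
                return Some(t);
            }
        }
        // Otherwise prefer the local buffer, falling back to the shared queue.
        self.local.pop_front().or_else(|| self.shared.pop_front())
    }
}
```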
Yes, as others have said, it's based on the same open source CMS.
I've created zig.news for people who want to write about Zig but don't necessarily want to maintain their own blog or do the work necessary to publicize it.
This should hopefully help more people write down their experiences and opinions on using Zig, and alleviate the problem of a "blogger aristocracy" that has a much stronger voice than anyone else.
A pet peeve of mine is calling something that uses atomic operations "lockless". It's very common, but once it clicks that atomic operations are just locks at the instruction level, it feels very wrong to call most such algorithms lockless. The same concerns about avoiding contention apply to atomic operations as to regular mutexes. A mutex lock in the uncontended case never enters the kernel and is just an atomic operation... The main strategy, whether using locks or atomics directly, is to avoid contention.
The only truly lockless algorithm that I'm aware of (on most CPU architectures) is RCU on the read side.
"Lock free" has a very precise meaning in computer science, and it's this: given some number of threads of execution, an algorithm is lock-free if, in some finite number of steps, at least one thread is guaranteed to make forward progress. When using mutexes, this is not guaranteed under any amount of contention: the scheduler is perfectly free to preempt a thread holding a lock, which means no threads make progress.
However, if you use a CAS loop (like this code does) to update some value (i.e. a while loop that repeatedly tries to execute a compare-and-swap), that loop will only fail to make progress if some other thread DID make progress. This means that such an algorithm is in fact lock-free: if thread X didn’t progress, it is necessarily because some other thread did. In addition, a scheduler preempting a thread at any point will not stop other threads making progress.
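For a concrete example, here is a lock-free counter increment built on exactly that kind of CAS loop; the retry path only runs when some other thread's CAS succeeded in between:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Lock-free increment via a CAS loop: the compare-exchange only fails when
// another thread changed the value, i.e. when some other thread made
// progress. Returns the new value.
fn increment(counter: &AtomicU64) -> u64 {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(prev) => return prev + 1,
            // Someone else progressed (or the weak CAS failed spuriously);
            // retry with the freshly observed value.
            Err(actual) => current = actual,
        }
    }
}
```

A preempted thread holding no lock cannot stall the others: any thread can always complete its own CAS against the latest value.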
Whether an algorithm is lock free or not is not a matter of opinion, it is a mathematical statement. The fact that it annoys you is… weird.
The former (Inversion of Control) is a general method and too unspecific to answer. I can only hint that one of the biggest caveats of async is brittle composability due to missing cancellation routines. This will eventually be addressed.
The paper describing my design is "Low-Synchronization, Mostly Lock-Free, Elastic Scheduling for Streaming Runtimes", https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli.... The source code for the product implementation is now open source. Most is in https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP... and https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....
Running on Linux (zig -Dc), rounded average of 3 runs.
8-thread Intel laptop machine:
32-thread AMD workstation:
Additionally, I haven't done a thorough review, but enabling the commented code yields a 110ms run on the faster machine.
You should do something like `go build . && time ./qsort`
Go (go1.17.1 darwin/amd64)
Zig (0.9.0-dev.959+f011f1393)
Java (openjdk version "17-ea")
MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3
https://youtu.be/D2c8BH8q6Yc
The tile size seems to be hard-coded to 128x128.
And I like what I am seeing. Hopefully all transitions between states have been tested.
Good work.
I briefly go over the idea behind zig.news in this 3-minute video: https://www.youtube.com/watch?v=i9pUuj6eiEg
Edit: also, "lock-free" doesn't actually have much to do with locks.
The latter (communicating sequential processes) is a formal language, but it looks like it is for IPC: https://arild.github.io/csp-presentation/#1
You are at the wrong level of abstraction.