People (including me) always talk about how memory safety without GC is the most important feature of Rust, but just as important is how easy Rust makes fine-grained parallelism. In Rust, you can often start off writing sequential code and then bolt parallelism on after the fact and have it all "just work" with improved performance, thanks to libraries like Crossbeam and Rayon and the way the ownership/borrowing rules enforce data race freedom. That's really powerful.
And if it doesn't "just work" then there will be a clear error message pointing to the problem instead of nasal demons haunting you. That way you can debug and fix the problem much more easily.
At the end of the day, how can something that is safely concurrent also be lock-free? At the lowest level, what is the primitive that enforces the safety, if it's not a lock or mutex?
My brain starts to run in circles when I think of this scenario: two threads trying to write to one piece of data. To do so safely, they need to take turns. Therefore, one has to wait for the other to complete. Therefore, one must acquire a mutually exclusive... hold on, that's a mutex!
Can someone please clear this up for me?
A simple example would be concurrently writing into a shared queue. To make it really simple, let's assume that this queue's buffer can only be written once, and then the program has to exit.
If we have a buffer in this queue that can hold 20 messages, and an atomic integer that represents the current index, then we could have two (or however many) threads writing into it at the same time by just doing "index = myAtomic.fetch_add(1)", which will atomically add one to the index, then return the previous value. Atomics are supported at a hardware level, so they're generally pretty efficient, and there definitely is no lock like a Mutex involved here. In the end, both threads are able to write into shared memory without conflicting with each other. Using one or two more atomic integers, we could support having one or more readers reading off of the queue at the same time that we're writing to it, and we could get to the point where we're able to start reusing the buffer in a circular fashion.
Much of the foundational theoretical work in this area was done by Maurice Herlihy in the 1980s and 1990s. For example, https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf I briefly paid for an ACM membership just so I could read his original work. (Well, I paid so I could read the literature in transactional memory, but I mostly ended up reading papers authored by Herlihy, co-authored by Herlihy, or analyzing Herlihy's work.)
I've read one of these articles and bookmarked the other. They give some introduction to the problem space, with pictures. I'm no expert though and can't rate their accuracy; just know they helped me feel a little more confident about the basics.
https://preshing.com/20120612/an-introduction-to-lock-free-p...
http://lucteo.ro/2018/11/18/look-ma-no-locks/
FYI lock-free programming does not mean "free of locks" but rather "free of lockups", as in, there isn't any possible arrangement in which two threads could end up blocking each other indefinitely (as could happen with mutexes and deadlocks).
From [0]: "An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress (for some sensible definition of progress)."
[0] https://en.wikipedia.org/wiki/Non-blocking_algorithm
What are you calling the "lowest level"? Because at the CPU level there is no arbitrary-scope mutex. That's not a thing that exists at the lowest levels. The mutex/lock that exists in basically every language is actually built on top of atomics; those are your "building blocks," so to speak.
There is an x86 instruction prefix named 'lock', but it only applies to a single instruction and is considered in the family of being atomic rather than a mutex.
You can't completely remove locks. You can make them very small and hold them for a very short time using hardware support. Atomic increments and atomic compare-and-exchange are supported by modern CPUs.
On that base, you can build lock-free data structures.
There are some pretty good answers to this question posted, but the simple one is that there is, really, no such thing as a lock-free anything.
When people say "lock-free", they mean they rely on basic hardware primitives that end up doing pretty much what you would do with locks in your code, but you can't see it. Generally, the hardware can do it faster, but rarely as much faster as people think, or want. In particular, you will still have stalls -- 30 cycles, 100 cycles, sometimes longer -- whenever anything clashes, "lock-free" or no. Everything you do to cut contention at your level pays off just as it would if you were using regular locking. If you can cut your contention enough, lock-free ends up only a little faster (although it can still reduce latency spikes).
If the lock-free stuff is buried in a library well-maintained by somebody else who is terrifyingly competent, then use it. But it is very, very unlikely that maintaining your own will be worthwhile; the most likely outcome is extremely subtle bugs. And who cares how fast buggy code is? There are much easier ways to make wrong code fast.
Excellent! Crossbeam is for concurrent programming.
Crossbeam provides lock-free data structures (e.g. ArrayQueue), thread synchronization (e.g. ShardedLock), memory sharing (e.g. AtomicCell), and utilities (e.g. Backoff).
Thank you to the author and team! Next I am very much looking forward to a Crossbeam implementation of the LMAX Disruptor ring queue: https://lmax-exchange.github.io/disruptor/
Really looking forward to ConcurrentHashMap.
I've hacked around this issue in a contended hashmap by doing a Vec<RwLock<HashMap>> where you index into the Vec with the first bits of the key hash and then use the HashMap within it:
https://github.com/pedrocr/syncer/blob/master/src/rwhashes.r...
Worked fine but a full ConcurrentHashMap would be much nicer.
Lock-free structures have always felt like the "holy grail" of concurrent programming. I remember being blown away when I read through the paper on Ctries (which I'm assuming ConcurrentHashMap would be based on), and even more blown away by how well they performed.
I always assumed that Ctries basically necessitated a GC, but I am very happy to be wrong about this!
I doubt it. Go has good performance for most applications out of the box, but if you're hitting the limits of what `chan` can do, it's either about to get very ugly (which goes against everything that Go stands for) or you should be looking at something like Rust at least for the hot paths.
There are some similar lock free data structures in golang up on github such as "onering".
One challenge of writing this kind of code in Go is that goroutines are not preemptive, so you can't just port over an implementation that works in C.
Forgive me, I'm not that familiar with rust, but I assumed that borrow checking got rid of the notion of two threads sharing the same data structure and therefore got rid of the need for locks? What's going on with this library? Are locks often used in rust?
The borrow checker allows two threads to share the same data structure immutably without locks. If you need to mutate the data structure from multiple threads, you need some kind of synchronization, and the borrow checker enforces that this synchronization is present. Rust doesn't prevent you from using synchronized shared mutable state, though of course if you want to program without any shared mutable state that's a perfectly valid option too.
No, Rust has locks and shared data structures. What it does is enforce their usage. It will be a compile error if you try to modify the same data structure from multiple threads without a lock.
Rust enforces exclusive access to a variable before it allows mutation.
But there are plenty of data types, like smart pointers, where multiple copies of the original smart pointer point to the same underlying data.
In the eyes of the compiler, these are different variables, and can be manipulated independently.
When writing said smart pointer, however, Rust won't allow you to share data between copies unless you explicitly opt out of some of the compile-time checks.
When you do this, you need to manually ensure (thread) safety. And this is where locks and other concurrency primitives generally come in handy.
The borrow checker tracks two kinds of borrows (in addition to ownership), &T for a constant reference and &mut T for a mutable reference. But it's not really about constant/mutable, it's really about shared/unique or, if you prefer, reader/writer. Essentially the borrow checker is a compile-time reader-writer lock: multiple threads or multiple structures can have a read borrow at once, but only one can have a write borrow and you need to get rid of the read borrows first.
But there are common use cases where you need concurrent write access to the same structure from multiple threads. The way to implement this is "interior mutability," where you write a method that takes an &self - a shared reference - but can still mutate it.

A simple example of this is the std::sync::atomic types (https://doc.rust-lang.org/std/sync/atomic/). There's a method AtomicUsize::compare_and_swap that takes an &self, not an &mut self, along with the values for comparing and swapping. Because it executes as an atomic machine instruction (or the library otherwise figures out how to make it atomic), it's okay for compare_and_swap to take only an &self. There's no risk of data races/memory corruption from multiple threads calling it at once. You're dynamically enforcing safety: if the value has changed since you read it, the swap doesn't happen, and the returned value tells you so.

There's also a method AtomicUsize::get_mut(&mut self) -> &mut usize that takes a unique reference to the AtomicUsize and gives you a unique reference to the underlying integer. As long as you hold that unique borrow, you can do normal memory read/write operations on the integer, and the compiler will statically make sure that nobody else can take shared references to the same AtomicUsize.
You can always do what you want via message-passing between threads, but sometimes you want higher performance. (I guess that's why you're using threads in the first place; you can always write code in a single-threaded manner if you want.) std::sync::atomic and a few other standard library types help you with this; crossbeam really helps you with this. And they all integrate into the borrow checker pretty smoothly.
See also the Rust book's chapter on interior mutability https://doc.rust-lang.org/book/ch15-05-interior-mutability.h... and the "Too Many Lists" exercise https://cglab.ca/~abeinges/blah/too-many-lists/book/README.h... for more examples.
> the blog post titled Lock-freedom without garbage collection, which demonstrates that one doesn’t need a language with a tracing garbage collector to write fast lock-free programs. The secret sauce is a technique called epoch-based garbage collection, which is much different from traditional garbage collectors and is easily implemented as a library.
I'm not sure where there is a focus on lock free programming needing any kind of garbage collection. Garbage collection is just for heap allocation and there are already lock free heap allocators. Memory allocation isn't, in my experience, a major difficulty of lock free data structures.
The issue is memory reclamation in node-based data structures with multiple concurrent lock-free writers. With GC it is a non-issue. Otherwise you have to resort to reference counting (or GC by another name), hazard pointers, RCU or similar.