
An introduction to lockless algorithms (2021)

216 points| signa11 | 2 years ago |lwn.net

69 comments

[+] gsliepen|2 years ago|reply
"Lockless" sounds very positive, and it's tempting to think that everything should be made lockless. However, it's less great than it sounds.

Mutexes are very cheap in the uncontended case, and in the contended case they have some nice properties. You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex. Sometimes lockless algorithms are in fact slower than algorithms using locks, as things like atomic compare-and-exchange instructions can have a significant cost.

Another problem with lockless is that you can't turn every algorithm into a lockless one, it's actually only a handful of them that are practical to implement locklessly. Of course, the ones that are practical, like some queue and linked list algorithms, can give very nice benefits if used correctly.

[+] dist1ll|2 years ago|reply
To add to this, the superset of these ideas is referred to as "non-blocking", and lock-freedom is one flavor. There is also obstruction-freedom, starvation-freedom, wait-freedom, etc., all of which have their own trade-offs and complexities. If you're interested in some literature, McKenney's book[0] is considered a great practical introduction to these ideas. (also discussed before on HN [1][2])

In applications where you're willing to trade throughput for latency (like RTOS schedulers, drivers, audio processing), it can often make sense to go for a wait-free data structure.

[0] https://arxiv.org/pdf/1701.00854.pdf

[1] https://news.ycombinator.com/item?id=34859102

[2] https://news.ycombinator.com/item?id=26537298

[+] themulticaster|2 years ago|reply
I'd like to add some context as to why lockless algorithms are popular in the Linux kernel. Roughly speaking, there are two possible scenarios in which kernel code is executed: either in the context of a process (e.g. while handling a syscall) or while handling an interrupt [1]. Now, a mutex usually puts the current thread to sleep if it tries to lock an already-locked mutex. Even in kernel land, this is perfectly fine as long as the kernel is in process context. However, while handling an interrupt there is no current thread that can be put to sleep. Therefore you can't use a mutex for kernel data structures that will be accessed during interrupts, and that applies to a lot of data structures.

This is why some common locking advice does not really apply to kernel code. Another example would be "avoid using spinlocks, prefer a mutex" - spinlocks are widely used in the kernel since they're the most straightforward alternative to mutexes.

Not disagreeing with the parent post, just wanted to add some context why LWN likes to talk about lockless algorithms.

[1] Actually, the kernel differentiates between hardirqs and softirqs, but that's not important here.

[+] klabb3|2 years ago|reply
> Mutexes are very cheap in the uncontended case

It was a while ago that I was deep into this mess, so forgive any ignorance, but IIRC the thread-mutex dogma[1] has many pitfalls despite being so widely used. Primarily, mutexes are easy to misuse (deadlocks, holding a lock across a suspend point), and they have unpredictable performance because they reach so far into compiler, OS, and CPU territory (instruction reordering, cache-line invalidation, mode switches, etc.). Also, on Arm it's unclear whether mutexes are as cheap, because of the relaxed memory ordering(?). Finally, code with mutexes is hard to test exhaustively and prone to heisenbugs.

Now, many if not most of the above apply to anything with atomics, so lock-free/wait-free won't help either. There's a reason why a lot of concurrency work is roughly PhD-level on the theoretical side, as well as deeply coupled with the gritty realities of hardware/compilers/OS on the engineering side.

That said, I still think there's room for a slightly expanded concurrency toolbox for mortals. For instance, a well-implemented concurrent queue can be a significant improvement for many workflows, perhaps even with native OS support (io_uring style)? Another exciting example is concurrency permutation test frameworks[2] for atomics, which reorder operations to synthetically trigger rare logical race conditions. I've also personally had great experience with the Golang race detector. I hope we see some convergence on some of this stuff within a few years. Concurrency is still incredibly hard to get right.

[1]: I say this only because CS degrees have preached mutexes as the silver bullet for decades.

[2]: https://github.com/tokio-rs/loom

[+] danbruc|2 years ago|reply
> Another problem with lockless is that you can't turn every algorithm into a lockless one, it's actually only a handful of them that are practical to implement locklessly.

You can make every data structure lock-free, even wait-free, there are universal constructions [1]. The resulting performance may however be prohibitive.

[1] https://www.researchgate.net/publication/221343511_Impossibi...

[+] spacechild1|2 years ago|reply
You are totally right. Here's my rough personal guideline: prefer mutexes over lockfree code - unless you can't. For example, you cannot lock a mutex from the audio callback because it might put the thread to sleep and you would miss your deadline. (For anyone interested in this topic: http://www.rossbencina.com/code/real-time-audio-programming-...)
[+] loeg|2 years ago|reply
Mutexes are implemented with (among others) cmpxchg instructions under the hood. It's sort of a false dichotomy to divide the world into locks and cmpxchgs. It's also inconsistent to claim those instructions are slow but uncontended mutexes are fast. It is certainly true that you can write slower and harder-to-understand algorithms using atomic primitives instead of mutexes.
[+] mattpallissard|2 years ago|reply
> You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex.

I'm with you. If you aren't benchmarking your data structure changes, you probably shouldn't be messing around with lock-free code to begin with.

[+] corbet|2 years ago|reply
Even an uncontended mutex can create a lot of cache-line bouncing, making it less cheap than it seems.
[+] dataflow|2 years ago|reply
> Sometimes lockless algorithms are in fact slower than algorithms using locks, as things like atomic compare-and-exchange instructions can have a significant cost.

You'll need to expand on this, since mutex acquisition also requires similar atomic read-modify-write operations. I think what you may be trying to say is that lockless algorithms require more of these operations as they scale to more processors?

[+] bsder|2 years ago|reply
> You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex.

I would, in fact, give precisely the reverse advice for one reason:

Mutexes don't compose.

Mutexes are really good at putting subtle bugs into your code that are ridiculously difficult to figure out.

Lockless algorithms may be slower or cause performance issues, but they don't malfunction because you got the release order backwards.

[+] duped|2 years ago|reply
I think even this muddies the waters - oftentimes people think of performance in terms of metrics (latency, bandwidth, memory, etc). Lock-free algorithms are about performance guarantees.

Specifically, a lock-free algorithm is appropriate if it would be considered an error for one thread to stall and cause any other threads to stall.

An example of this would be a concurrent garbage collector. These are used when you do not want garbage collection to stop a program from executing. The way that the GC thread communicates with the rest of the program therefore must be lock-free.

That's not to say a stop-the-world GC could be faster: just that its lack of performance guarantees would be considered a bug.

[+] kerkeslager|2 years ago|reply
> You should only consider going for a lockless algorithm if you can prove it is better than just using a mutex.

This is true for all optimizations: optimizations should be chosen based on profiling, not based on reasoning about the program. Reasoning can give you ideas of what to profile, but modern programs, including the environments in which they run, are complex enough that you can never really be sure that the assumptions you're basing your reasoning on are true.

So I fully agree with this, but I have some questions about the rest of your post.

> Mutexes are very cheap in the uncontended case, and in the contended case they have some nice properties. Sometimes lockless algorithms are in fact slower than algorithms using locks, as things like atomic compare-and-exchange instructions can have a significant cost.

My (very limited) experience in this area is mostly with optimistic lock-free algorithms, so maybe there's something that I'm missing. But my understanding of optimistic lock-free algorithms is that they make use of the idea that the uncontended case is the most common case, and try to optimize that case heavily. On Intel processors, a compare-and-exchange operation should literally be a single instruction (CMPXCHG[1]). So I'm wondering how the uncontended case could be seen as a point in favor of mutexes?

> Another problem with lockless is that you can't turn every algorithm into a lockless one, it's actually only a handful of them that are practical to implement locklessly.

How could "you can't turn every algorithm into a lockless one" be true? It seems to me that you can "simulate" an atomic compare-and-swap using mutexes. Although I can't find a reference easily, I believe this is what GCC's __sync_X_compare_and_swap does on architectures which don't have a compare-and-exchange instruction. Likewise, you can implement a mutex fairly easily using a compare-and-swap operation.

Whether this is practical would of course depend on the situation: implementing a mutex in terms of compare-and-exchange, or vice versa, is likely not the optimal way to implement either. But if compare-and-exchange and mutexes can be implemented in terms of each other, isn't that a trivial proof that any mutex algorithm can be implemented as a lock-free algorithm and vice versa?

[1] https://www.felixcloutier.com/x86/cmpxchg

[+] dfox|2 years ago|reply
The problem with a typical OS-provided mutex is that it is almost always in some way distinctly non-cheap, either because of the particular implementation or because of limitations imposed by ABI/API compatibility (and in some cases both).
[+] liblfds-temp|2 years ago|reply
I am the author of liblfds, a portable, license-free, lock-free data structure library written in C.

I've been working on other projects for a number of years, but I'm finally back to liblfds and making progress toward the next release.

https://www.liblfds.org/slblog/2023-04.html

[+] loeg|2 years ago|reply
I might also point people at ConcurrencyKit.
[+] samsquire|2 years ago|reply
I like the code of this ringbuffer

https://www.linuxjournal.com/content/lock-free-multi-produce...

I liked this whitepaper https://www.cs.technion.ac.il/~erez/Papers/wfquque-ppopp.pdf

I am a beginner at this kind of thing, but I created an array of integers in which each thread owns one index. A thread writes to the array at its index to signal that it wants access to the critical section.

We scan the array forwards and backwards to see if any thread has a claim to the critical section.

I even TRIED to write a model checker https://github.com/samsquire/multithreaded-model-checker

This is inspired by the left-right concurrency control whitepaper, "Left-Right: A Concurrency Control Technique with Wait-Free Population Oblivious Reads".

[+] dist1ll|2 years ago|reply
The article in the first link is pretty bad, and the entire code is undefined behavior. First of all, using inline asm + volatile to emulate well-ordered release & acquire on x86 is unnecessary. That's exactly what atomics are supposed to provide for you.

> In this article, I concentrate on x86, as it is the most widespread architecture rather than write generic (but slower) code.

Any reasonable C++ compiler will generate simple loads and stores for atomic release/acquire operations on x86. There's no penalty for writing generic code.

[+] snakey|2 years ago|reply
I too am a beginner with concurrent programming—it's a difficult topic to get your head around at first. Some resources that I've found super helpful:

- Dmitry Vyukov's Lockless Algorithms: https://www.1024cores.net/home/lock-free-algorithms

- Jeff Preshing's blog (worth exploring all adjacent articles): https://preshing.com/20120612/an-introduction-to-lock-free-p...

- Bartosz Milewski: https://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-...

- Memory Barriers: a Hardware View for Software Hackers—P.McKenney: https://www.researchgate.net/publication/228824849_Memory_Ba...

I hope you find some of these useful. I've re-read Paul McKenney's paper every 2-3 months in an attempt to get this stuff to stick! :)

[+] ki_|2 years ago|reply
Ringbuffers are one of my favorite things to use. Thanks for the article.
[+] atorodius|2 years ago|reply
I am a noob here, so maybe this is obvious, but how exactly is a "lock" defined? The acquire/release semantics seem very similar to locks, i.e. I'm struggling to understand what is "lockless" about the primitives.
[+] haberman|2 years ago|reply
I only had time to skim, but the primitives described here (acquire, release, happens before) sound almost exactly like the model that was standardized in C++11 and C11.
[+] loeg|2 years ago|reply
This is not a coincidence. :-)
[+] DSingularity|2 years ago|reply
When “lockless” algorithms rely on CMPXCHG aren’t they just shifting the lock from software to hardware?