An Introduction to Lock-Free Programming

[+] haberman|14 years ago|reply

Nice article, though I think that any intro-level material on lock-free programming should always include a "don't try this at home for anything important" warning. Until you have some experience with this stuff you will almost certainly make mistakes, but these mistakes might only manifest themselves as crashes in extremely rare circumstances.

I wrote my first lock-free code in 2004 based on reading some papers by Maged Michael from IBM. I wrote a lock-free FIFO in PowerPC assembly, and was convinced it was safe and robust. When I emailed Maged about it, he pointed out that if a thread was suspended on one specific instruction and some specific memory was unmapped before it could run again, the program could crash. I was amazed; I had thought hard about this algorithm, but had completely missed that possibility.

Some other specific notes about the article:

> Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free.

The are actually several levels of lock-freedom defined in the literature: lock-freedom, wait-freedom, and obstruction-freedom. For more info see: http://en.wikipedia.org/wiki/Non-blocking_algorithm

> Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done.

One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem (http://en.wikipedia.org/wiki/ABA_problem). In practice this doesn't matter that much since x86 doesn't support LL/SC, but I just think it's an interesting factoid to know.

> For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but the x86/64 family of processors from Intel and AMD cannot.

(I've edited my reply here since my original assertion was incorrect). It's true that x86/64 won't reorder stores (see http://en.wikipedia.org/wiki/Memory_ordering for details) but it will reorder loads, so memory barriers are still required in some situations. However I believe that the atomic instructions ("lock cmpxchg", and "lock xadd") imply full barriers on x86.

[+] preshing|14 years ago|reply

I totally agree with you about the dangers of lock-free programming, and that every programmer will certainly make mistakes. I feel strongly enough about it that I actually wrote several paragraphs about the importance of stress testing lock-free algorithms, which I eventually decided to break off into a separate post... I might get around to publishing that, we'll see. We're not the first to feel this way :)

> The are actually several levels of lock-freedom defined in the literature: lock-freedom, wait-freedom, and obstruction-freedom. For more info see: http://en.wikipedia.org/wiki/Non-blocking_algorithm

It's true, and I had a hard time reconciling that Wikipedia page with how others had been describing lock-free programming, which is why I spent so much of the post on semantics.

> One benefit of load-linked/store-conditional (often abbreviated LL/SC) is that it avoids the ABA problem

Good point. Though I suspect if you access other cache lines between the LL & SC, it may invalidate the reservation.

> I don't think this is true about x86/64.

It is true about x86/64, at least for the most common, non-SSE instructions and normal, non-write-combined memory. I'll update the post to be more specific. See volume 3, section 8.2.2 of the Intel 64 and IA-32 Developer's Manuals: (http://www.intel.com/content/www/us/en/processors/architectu...)

- Reads are not reordered with other reads. - Writes are not reordered with older reads. - Writes to memory are not reordered with other writes (with a few exceptions for special instructions)

[+] scott_s|14 years ago|reply

To support your first point, I implemented Maged's lock-free allocator so we could compare against it (http://www.scott-a-s.com/files/michael.tar.gz), and I never got it working correctly on Itanium. I spent weeks trying to figure out what was wrong. I eventually concluded that I did not understand the Itanium architecture and the algorithms used in the allocator well enough to be confident I could fix it - and that it wasn't worth improving my understanding of either. (I asked Maged if he had any experience with his allocator on Itanium, and he did not.)

[+] tptacek|14 years ago|reply

The idea that made this click for me 10 years or so ago was that distributed systems express concurrent processes but can't reasonably use locks; model your threads/processes/whatever as if they were nodes in a distributed system, and use distributed system techniques to ensure things converge.

[+] Retric|14 years ago|reply

That's not exactly a shared memory model. A better example would be a static look-up table for fixed game state information and a bunch of AI worker threads.

With the correct protocol you can even write to that shared memory. For example a sudoku solver where each tread looks for contradictions and then overwrites the shared states with their findings. This works because there is no contradictions, dirty reads are ok, and there is no contention with all threads writing 0's and nobody writing 1's. You can go even further where any thread can flip the error flag if a contradiction shows up or solution flag if a solution is found, but again it works because nothing clears those flags.

PS: The article's author uses a less strict defintion of lock free where it's ok to use locks for individual atomic operations like counters. But, that still creates performance issues at scale.

[+] strlen|14 years ago|reply

Lock free algorithms still use primitives like memory barriers. A memory barrier in a shared memory system is expensive, but not nearly as difficult as a memory barrier in a distributed system.

Additionally, distributed systems not only experience failures of individual components, but they're often in a state where you can't reliably detect failures (e.g., tell a failed node apart from a node that's slow or a node that you can't temporarily talk to while other nodes can).

It's also possible to think of it as a continuum: for example, failure detection is simpler in a distributed system connected via Infiniband than in a system connected via commodity ethernet; locality matters in NUMA CPUs and so on. The idea that a commodity multi-core machine has some properties of distributed system is not without merit, but it isn't particularly useful or actionable as a working model.

[+] rdtsc|14 years ago|reply

There is also a niche research area of lock-free data structures and algorithms.

On a more practical side, check out http://www.liblfds.org/ -- it is a library of lock free data structures in C. (I am not the author). I have successfully used this library in some realtime projects.

[+] sbahra|14 years ago|reply

http://concurrencykit.org/ is also available, used in several production systems.

[+] sbahra|14 years ago|reply

A while back I linked to the following: http://concurrencykit.org/presentations/lockfree_introductio...

Please make sure to navigate all the way down, before navigating to the right. Press the space bar to see the full layout of the presentation.

[+] rossjudson|14 years ago|reply

Very good overview. I've built lock-free hash maps, and this article covers the areas you need to be aware of. I'd also suggest looking up false sharing, and also making sure that you really understand exactly how the memory barriers are operating.

[+] cheatercheater|14 years ago|reply

Make sure to check out pi calculus for a different model of concurrency without locks: http://mainisusuallyafunction.blogspot.com/2011/09/lambda-to...

20 comments