A lock-free ring-buffer with contiguous reservations (2019)

[+] jamesmunns|2 years ago|reply

Oh hey, one of the authors of this post here (James), happy to answer any questions.

This post has been discussed here a couple times, but AMA :)

edit, the most commented version of this post was the original:

https://news.ycombinator.com/item?id=20096946.

This is what I'm up to these days:

[+] agentultra|2 years ago|reply

Has there been a formal specification/model that we can check? I love circular buffers but I'm curious how we know that the design is correct with respect to the stated properties.

I did a bit of link diving to the various blog posts and sites but haven't been able to find it. Would be nice, if it exists, to have it front and centre.

[+] SergeAx|2 years ago|reply

Please excuse me if my question is dumb, I have little understanding of Rust. You declaring your implementation "lock-free", but are using Atomic* Rust primitives at the same time. Those are using CPU atomic instructions, if I googled correctly. Those, in their turn, are using CPU locking mechanism. Turns out that you just shifted locking from language to CPU, right?

[+] Fiahil|2 years ago|reply

I used the same approach while designing a lock-free bounded broadcast log (as in a "RWLock<Vec<T>>"; a MCMP, append-only Vec). It's quite easy to do because it's bounded. However, I could not find a way to make it both unbounded and efficient.

Any ideas ?

[+] cangeroo|2 years ago|reply

ELI5?

I see a lot of critique in the previous (2019) thread, but no summary in key points. What are the reasons that this is hard?

In multiprocessor systems, are memory writes not guaranteed to be sequential? e.g. can data be written out-of-order, after the synchronization bit is written?

Or is it more about the use case, that it is optimized for minimal number of instructions, e.g. avoiding additional memory reads? (e.g. by writing data to the buffer, instead writing storing pointers)?

Or is it that you're trying to use constant memory (irrespective of number of threads)?

Because to me, it seems like a trivial problem to solve, if you have sequential writes, store data separately from the queue, and may linearly scale memory with number of threads.

[+] el_pollo_diablo|2 years ago|reply

There is this snippet of code in the article:

  if buffer.len.saturating_sub(buffer.write.load()) >= write_len {
    // not shown: check `read` to make sure there's enough free room
    buffer.watermark.store(buffer.write.load() + write_len);
    buffer.write.store(buffer.write.load() + write_len);
  }

I would have to look at the implementation to know for sure, but that part looks incorrect to me. Suppose that the writer has placed a watermark strictly before the end of the buffer, wrapped around, and is about to write a new message; meanwhile, the reader has not reached the watermark yet (and therefore, has not wrapped around), but it has made enough progress to leave room for the writer's new message. In that case, we have write <= write + write_len < read <= watermark < len. As written, it would seem that the snippet above would incorrectly update watermark, whose value is still needed by the reader.

It seems to me that watermark need not be atomic anyway: it is owned by the writer when read <= write, and by the reader when write < read. Whoever wraps around implicitly passes the ownership to the other thread. More precisely:

* In the initial state, and until the writer reaches the end of the buffer, read <= write, and the writer owns watermark, whose value is irrelevant. In particular, the reader makes sure not to overtake write, and knows that it must not access watermark.

* When the writer needs to wrap around, it first sets the watermark (possibly at len, if the hole is empty), then updates write. With its usual release memory ordering constraint, the atomic store to write transfers to the reader the ownership of the written messages, but also of watermark.

* Eventually, the reader notices that write has wrapped around, and then uses watermark to know where to stop reading. Meanwhile, the writer may fill the beginning of the buffer, carefully not overtaking read - 1, and never touching watermark.

* When the reader finally wraps around, the atomic write to read transfers to the writer the ownership of the consumed message slots, but also of watermark, and we are back in the first phase.

IOW, watermark has the same owner as the last element of the array.

What do you think? What have I missed?

[+] iwasanewt|2 years ago|reply

Hi James! Does `https://onevariable.com/blog` have an rss/atom feed?

[+] loeg|2 years ago|reply

> The safe thing to do here is to always choose Ordering::SeqCst, "sequential consistency", which provides the strongest guarantees. ... This is often good enough in practice, and switching to a weaker Ordering is only necessary to squeeze out the last bit of performance.

If you're going to write lock-free algorithms using atomics, the least you can do is learn about ordering semantics and use the correct abstract semantics for your design's actual needs. It is much easier to do it at design time than to try to figure out if it is safe to relax SeqCst later. (This is one of the major flaws of C++ std::atomic's default SeqCst semantics.) If you aren't going to bother understanding ordering semantics, it is unlikely you can write a safe lock-free algorithm anyway. (It's really hard to do!)

[+] jamesmunns|2 years ago|reply

Back then, there weren't as good references for explaining atomic ordering, and the blog post had gotten long enough. Mentioning SeqCst was a bit of a cop out, though both Andrea and I didn't end up using SeqCst past the inital impl anyway.

Today I would have just linked to https://marabos.nl/atomics/, Mara does a much better job of explaining atomics than I could have then or now.

[+] jcelerier|2 years ago|reply

> It is much easier to do it at design time

Is it? I always worked the second way (starting from seq_cst and then when the core design matured enough and didn't change for a few months, trying to see what could actually be relaxed). I'd be very afraid that in the first case, you start with say relaxed semantics somewhere, them you change the design because the requirements changed, and now you have to go again through all the atomic operations to make sure the assumptions all still hold.

[+] derefr|2 years ago|reply

> If you aren't going to bother understanding ordering semantics, it is unlikely you can write a safe lock-free algorithm anyway.

I think the implicit suggestion here is that the target audience for this abstraction is actually two separate developers:

1. A developer who doesn’t know much about locking semantics, but can write the simple-but-slow case correctly.

2. Another developer, much later on — probably a senior SRE type — who can revisit this code and optimize it with full understanding of the constraints.

(This might also be the same person, years later, with more experience under their belt.)

The benefit of the way this library is designed, is that the second developer doesn’t have to completely rewrite everything just to optimize it. The naive dev’s code has already been forced into an “almost but not quite” lock-free mold by the design of the library’s API surface.

[+] haberman|2 years ago|reply

I've never actually seen a production lock-free algorithm that uses SeqCst. I have a hard time even imagining an algorithm where SeqCst is the right choice.

It seems like SeqCst was chosen as a default to reduce the surprises and gotchas around lock-free programming. But lock-free programming is inherently tricky; if you're trying to avoid surprises and gotchas you should probably use a mutex.

[+] ajross|2 years ago|reply

Completely agree, though the more iconoclastic corrolary that goes unspoken there is that putting The Final Word on memory ordering semantics into programming language standards was a terrible mistake.

Memory ordering is a hardware behavior. It needs to be specified at the hardware level, and hardware vendors have been very mixed on clarity. And more importantly, lockless algorithms (that rely on memory ordering control) are really, really hard, and demand clarity over all things. And instead we're crippling those poor programmers with nonsense like "sequentially consistent"[1] or trying to figure out what on earth "consume" means[2].

x86 does this pretty well with their comparatively simple model of serializing instructions. Traditional ARM ISAs did only a little worse by exposing the interface as read/write barrier instructions. Everyone else... meh.

But if you really want to do this (and you probably don't) do the analysis yourself at the level of ISA/architecture/hardware, cite your references, and be prepared to handle whatever portability works is needed on your own.

[1] A statement about desired final state, not hardware behavior!

[2] Nothing, on any hardware you will ever use. Don't ask.

[+] unknown|2 years ago|reply

[deleted]

[+] magnat|2 years ago|reply

> A typical serial port configuration is "115200 8N1", which means 115,200 baud (or raw bits on the wire per second), with no parity, and 1 stop bit. This means that for every data byte sent, there will be 8 data bits, 1 unused parity bit, and 1 stop bit, to signal the end of the byte, sent over the wire. This means that we will need 40 bits on the wire to receive a 4 data bytes.

8N1 means there is 1 start bit, 8 data bits and 1 stop bit (10 bits total), not 8 data bits, 1 unused parity bit and 1 stop bit (also 10 bits total).

[+] jamesmunns|2 years ago|reply

Yep, good catch! That's a whoops in my explanation. I don't work at FS any more, so I'm not sure I could PR that change, but you're definitely right :)

[+] rdtsc|2 years ago|reply

My favorite ring buffer structure is like the one described in https://www.gnuradio.org/blog/2017-01-05-buffers/

> It asks the operating system to give it memory-mappable memory, and map twice, “back to back”. Blocks then get called with pointers to the “earliest” position of a workload within this memory region – guaranteeing that they can access all memory they were offered in a linear fashion, as if they’d actually be dealing with hardware ring buffers.

It imposes limitations on hardware and OS support in a way, but I think it's pretty neat.

This also used by the kernel BPF ring buffer:

https://www.kernel.org/doc/html/latest/bpf/ringbuf.html

> ...data area is mapped twice contiguously back-to-back in the virtual memory. This allows to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory

[+] dist1ll|2 years ago|reply

Unfortunately that means the chunks are only contiguous in virtual memory. So it won't work with the DMA use case mentioned in the article, which requires contiguous physical addresses.

But it's still a nice trick, I like it when people get creative with HW features.

[+] gpderetta|2 years ago|reply

AKA the Magic Ring Buffer. It extremely convenient not having to deal with split messages, especially if you support variables size payloads.

[+] scottlamb|2 years ago|reply

This is a cool trick, and iirc there are a few Rust crates that implement it, including slice-deque.

...but I think there are a few significant downsides, even in userspace:

* The most obvious one: you have to write `unsafe`, non-portable code. The Linux, macOS, and Windows implementations are totally different from each other.

* The setup and teardown of each ring is a bit expensive (few system calls). For a regular allocation, the malloc implementation typically caches that for you. Here you have to do your own pooling if you might be frequently creating them.

* Using whole pages, and two mappings per ring, is wasteful in terms of not only RAM (often no big deal) but also TLB space (which often turns into significant CPU usage). If you just allocate 32 64 KiB rings from the standard allocator, on x86-64 you might be talking about a single 2 MiB huge page mapping. If you do this instead, you're talking about 1024 4 KiB page mappings.

[+] signa11|2 years ago|reply

ah the vm-trick ! have used it current (and previous) places of work to great effect.

[+] unknown|2 years ago|reply

[deleted]

[+] monocasa|2 years ago|reply

> Contended writes from multiple threads on the same memory location are a lot harder for the CPU's cache coherence protocol to handle

FWIW, those are the same location according to most cache coherency protocols, since cache coherency generally works on the cache line level. You'd want to split the two contexts to their own cache lines.

[+] stefanha|2 years ago|reply

Another cache optimization trick some ring-buffer implementations use is to keep a shadow copy of the read or write pointer to avoid frequently fetching the other context's cache line. The latest version of the read pointer is only needed when the writer catches up with their shadow copy and vice versa.

[+] dnedic|2 years ago|reply

Bipartite buffers are amazing and criminally underused. For those looking for C and C++ implementations you can check out my libraries: lfbb and lockfee: https://github.com/DNedic/lfbb, https://github.com/DNedic/lockfree (although lockfree contains more data structures as well)

[+] samsquire|2 years ago|reply

I tried to write a lock free ringbuffer with weak atomics, I haven't proved it right with TLA+ yet but I started writing a model in it. I use tagging to avoid the ABA problem.

they're all on https://github.com/samsquire/assembly, i tried to write multiple disruptor with multiple consumers, then one with multiple producers then one with multiple consumers AND multiple producers, inspired by LMAX Disruptor. (There's files for each of them and table in the repo. it's not proven yet!)

the contention on the same memory address (the read/write index) is the thing that seems difficult to address.

One thing I've learned about thread safety:

I think if you have thread-owned values then you can be thread safe with a simple semaphore, providing that you have unique, DISTINCT values for each thread.

If you have two threads that have this in a hot loop in parallel:

  // thread 0                        
  if buffer[x].available == 1:
    // do stuff
    buffer[x].available = 0           

  // thread 1
  if buffer[x].available == 0:
    // do stuff
    buffer[x].available = 1

Due to causality, no matter the interleaving, thread 0 owns the buffer[x].available and body of the if statement when it is 1 and thread 1 owns the body of the if statement buffer[x].available when it is 0.

The CMP is a cheap mutex with distinct valued memory locations.

Even though thread 1 is writing to buffer[x].available and thread 0 is writing to buffer[x].available it doesn't matter because the causality is mutually exclusive. There is no interleaving of buffer[x].available = x because of the if statement.

The buffer[x].available = 0 will never run while buffer[x].available is equal to 0 overwriting or causing a data race when setting buffer[x].available to 1. So the second line cannot happen in parallel.

I need to write a TLA model to assert its safety.

If you have more than 2 threads, then you need different tokens to provide admissability to the if statement.

Remember to use compiler memory barrier

   asm volatile ("" ::: "memory");

so you don't need volatile struct values.

[+] gpderetta|2 years ago|reply

The problem here is the

  [1] // Do stuff
  [2] Buffer[X]. available = Y

There is no explicit nor implicit ordering between 1 and 2, so the compiler or cpu can reorder them. You need a release barrier between the two.

Also while most CPUs preserve the control dependency, not all do (famously Alpha), and certainly not compilers. You would need a consume barrier, except that c++11 consume is only for data dependencies and unimplemented anyway.

Edit: with the correct barriers in place, you can prove correctness by similitude to two size 1 SPSC queues used to exchange a mutual exclusion token, with the added quirk that as the queues are never used at the same time, they can actually be physically colocated in memory.

[+] loeg|2 years ago|reply

> The buffer[x].available = 0 will never run while buffer[x].available is equal to 0 overwriting or causing a data race when setting buffer[x].available to 1.

In particular, because loads and stores of the same variable cannot be reordered out of program order. Once your algorithm involves other variables, you would (likely) need to be a little careful about loading/storing with acquire/release semantics to prevent reordering other accesses relative to this protocol.

> Remember to use compiler memory barrier

I would highly recommend using the language atomic types (and barriers if truly needed) instead of gcc inline assembly syntax.

[+] anonymousDan|2 years ago|reply

Regarding the contention, one thing that's important is to cache a local copy of the shared head and tail variables every time you access them. Then for subsequent operations you can first check the local cached copy to see if you can perform the read or write without needing to check the shared variables.

[+] samsquire|2 years ago|reply

When you check available, you might have to do it as a (__atomic_load_n(&sender->realend, __ATOMIC_SEQ_CST) and do __atomic_store_n when setting available.

rather than just a plain load.

[+] nyanpasu64|2 years ago|reply

>In Andrea's implementation of the lock-free ring-buffer, spsc-bip-buffer, some of the orderings are relaxed for performance. This has the downside that it can introduce subtle concurrency bugs that may only show up on some platform (ARM, for example): to be a bit more confident that everything's still fine, Andrea's has continous integation tests both on x86 and ARM.

It might be worth testing/forking the library to test on Loom (https://github.com/tokio-rs/loom), which can model atomic orderings and check for concurrency errors to some degree (though I last used it years ago). TSAN might be able to check for ordering errors in visited execution traces (though I haven't tried using it in the past).

[+] fritzo|2 years ago|reply

See also the Java LMAX Disruptor https://github.com/LMAX-Exchange/disruptor

I've built a similar lock-free ring buffer in C++11 https://github.com/posterior/loom/blob/master/doc/adapting.m...

[+] GenericCanadian|2 years ago|reply

I also wrote an LMAX Disruptor in Crystal: https://github.com/nolantait/disruptor.cr

Here is one in Ruby: https://github.com/ileitch/disruptor

Both languages are quite readable and I've used these to teach the concepts to beginners.

[+] bfrog|2 years ago|reply

I have to say after looking at various DMA hardware I much much prefer the scatter gather list type than the ring type of DMA.

The entire need of a bipbuffer allocation for DMA then goes way. You can have a simple pool of fixed sized blocks to throw at the DMA. Pretty easily done with a free list.

I do think the implementation here is cool though, and its nice to see some work in this area.

[+] 95014_refugee|2 years ago|reply

To make scatter/gather go fast, you either spend a lot of effort caching descriptor lists for pinned buffers (because you expect to see them often), or heavily optimising your VM's v2p translation machinery, or some combination of the two.

And then you wind up discovering that you the driver writer aren't actually trusted and so you need to insert at least one if not several IOMMUs between the peripheral and the memory(ies) that they may access, managed by another software component in a different address space.

Then someone asks you to make all of this work for clients in VMs.

At which point you start wondering why you didn't just allocate a physically contiguous buffer at startup and copy to/from your client buffers using the CPU...

Sorry for sounding triggered... 8)

[+] cultureswitch|2 years ago|reply

I was faced with the same constraints a couple of years ago and came up with an almost verbatim solution. It's an interesting problem where a lot of very subtle bugs can happen, it's good to know that I went with the same solution as people who put a lot more effort into making sure it is correct.

[+] piterrro|2 years ago|reply

This is great article! Very detailed and explains things on a low-level.

For a more high-level implementation, I just released yesterday a blog post about ring buffer in Golang: https://logdy.dev/blog/post/ring-buffer-in-golang

[+] ChrisMarshallNY|2 years ago|reply

That watermark is a simple, elegant idea.

I haven't really had the need for that kind of thing, in many years, but I like the idea.

[+] loeg|2 years ago|reply

Recently (2022) I designed a ring buffer for use as a 'flight recorder' type tracing system. (I.e., there is typically no reader; the writer needs to write over old records without blocking on any reader. If the reader is triggered, it flips a switch that disables the writer temporarily while the buffer contents are persisted.) In that design I subdivided the ring into several subbuffers (~8). Each subbuffer has its own equivalent of a watermark. That way, the valid portion of the ring always starts at the beginning of one of the subbuffers, and the writer could 'free' the next subbuffer worth of space trivially (without having to scan through old contents record by record). (Any write that did not fit in the current subbuffer was advanced to the start of the next one.)

[+] LAC-Tech|2 years ago|reply

TIL about BipBuffers. I've been struggling with a similar data structure, and to see it already has a name, and a better implementation than what I've been doing, is very welcome.

[+] liquid153|2 years ago|reply

Aren't lock free buffers usually just as expensive or more expensive to use as locks.

[+] loeg|2 years ago|reply

No -- for a lock-free design with a single producer and consumer, it's possible both are typically writing to independent regions of memory. With a lock, both have to write the same cache line to take and release the lock.

[+] zengid|2 years ago|reply

Not if your program needs to be realtime or near realtime safe. Locks are controlled by the OS typically, and can have non-deterministic latency.

[+] ww520|2 years ago|reply

Lock free has two advantages: the checking code can run in user mode and non-contested access is very cheap with just one instruction.

To do it correctly, lock needs to be done in the kernel thus obtaining a lock requires calling into the kernel which is more expensive.

I think you meant the memory barrier for syncing cache is just as expensive as the lock version, which is true.

115 comments