top | item 36489288

Optimizing a ring buffer for throughput (2021)

161 points | signa11 | 2 years ago | rigtorp.se | reply

56 comments

[+] jim90|2 years ago|reply
There is a cool trick using virtual memory allowing fast and gapless ring buffers.

See this page (about half way down) https://ruby0x1.github.io/machinery_blog_archive/post/virtua...

[+] crest|2 years ago|reply
It is a cool trick. And even without virtual memory, if the maximum read size has a low upper bound, just copying the first few bytes from the beginning of the buffer to the end can be worth it to avoid the complexity and overhead of indexing into the ring buffer modulo its size.

This clever ring buffer has hidden costs that don't show up in microbenchmarks testing only the push() and pop() operations on an initialized ring buffer. These costs can be very high and can impact other software as well.

The initialisation requires multiple system calls to set up the aliased memory mapping, and the buffer has to be a multiple of the page size. This can be very expensive on big systems, e.g. shooting down the TLBs on all other CPUs with interprocessor interrupts to invalidate any potentially conflicting mappings. At the other end, small systems often use variations of virtually indexed data caches and may be forced to set the aliased ring buffer pages as uncacheable unless you can get the cache coloring correct. And even for very small ring buffers you use at least two TLB entries. A "dumber" ring buffer implementation can share part of a large page with other data, massively reducing the TLB pressure.

Abusing the MMU to implement an otherwise missing ring addressing mode is tempting and can be worth it for some use-cases on several common platforms, but be careful and measure the impact on the whole system.

[+] defrost|2 years ago|reply
It sprang from old mainframe implementations - there's a 2002 (??) Virtual Ring Buffer in Phil Howard's LibH collection (2007 snapshots of 2002 copyright code based on circa 1980 implementations IIRC):

* https://web.archive.org/web/20140625200016/http://libh.slash...

VRB:- These functions implement a basic virtual ring buffer system which allows the caller to put data in, and take data out, of a ring buffer, and always access the data directly in the buffer contiguously, without the caller or the functions doing any data copying to accomplish it.

* The "Hero" code

http://libh.slashusr.org/source/vrb/src/lib/h/vrb_init.c

* Front & back end source directories:

https://web.archive.org/web/20100706121709/http://libh.slash...

[+] fulafel|2 years ago|reply
Are the memory semantics programmer-friendly on all (or most) platforms when using aliasing mappings like this? E.g. if two CPUs concurrently update the memory through different VM address aliases, are the observed semantics identical to accessing through only one address, or not?
[+] gpderetta|2 years ago|reply
AKA the Magic Ring Buffer.
[+] sebstefan|2 years ago|reply
I didn't know the details of the MESI protocol, so that prompted me to look it up, and NC State University has a 10 minute lecture on Youtube that makes it very easy to grasp - https://www.youtube.com/watch?v=wqF9O7odNKY

Apparently Intel uses a variant called MESIF and AMD went for MOESI (https://www.realworldtech.com/common-system-interface/5/, https://en.wikipedia.org/wiki/MESIF_protocol) (https://www.amd.com/system/files/TechDocs/24593.pdf, https://en.wikipedia.org/wiki/MOESI_protocol)

[+] samsquire|2 years ago|reply
Thanks for this.

I used this multiproducer multiconsumer ringbuffer

https://www.linuxjournal.com/content/lock-free-multi-produce...

I am inspired by the LMAX Disruptor. I like the idea of splitting all IO into two halves: submit and reply. At one point I was looking at slides from some company that used the LMAX Disruptor pattern for their IO and had high performance.

With my extremely naive Java benchmark to test locks, I get 35,871,639 locks per second on 12 threads with contention. I am trying to work out how to improve the performance of the multiproducer multiconsumer ringbuffer, which gets 10 million requests per second; it should be higher.

https://github.com/samsquire/multiversion-concurrency-contro...

I want to create a message queue that uses ringbuffers to communicate between threads efficiently and elegantly and handles the splitting of IO in half. I want to combine coroutines and threads together.

[+] bob1029|2 years ago|reply
LMAX Disruptor was very inspirational to me as well.

I've found a lot of problem spaces it fits into perfectly. For example, any sort of MPSC arrangement. Literally any. The most obvious being if you want to DIY your own database.

If performance is of interest to you, you may want to check out the latest variants of Disruptor on .NET. Figures nearing the 500 million per second range are reported here, and will likely be exceeded in future versions of .NET:

https://medium.com/@ocoanet/improving-net-disruptor-performa...

The thing .NET has to its advantage (not sure if this is true anymore) is how structs vs classes work in memory. This is why there is a speedup with ValueDisruptor<T>.

> how to improve the performance of the multiproducer multiconsumer

> multiconsumer

Fundamentally, I think multi-consumer is going to chop a whole order of magnitude off no matter how you slice it. Having just one consumer (aka a 'writer') is a foundational aspect of the performance games played in this arena.

[+] Const-me|2 years ago|reply
I usually avoid solving that problem. There’s a good workaround — batch items, and either transfer ownership or implement shared ownership, instead of copying.

For an example, look at how multimedia frameworks are designed: Media Foundation, gstreamer, ffmpeg, and to a lesser extent V4L2 and ASIO. For audio, they often handle samples coming in at 48kHz, and they are usually heavily multithreaded, yet they don’t have ring buffers for individual samples; their ring buffers hold batches of them.

[+] barbegal|2 years ago|reply
> I chose to allow any queue size as opposed to allowing only sizes that are a power-of-two. This means that at least one queue item is unused in order to disambiguate between the empty queue and full queue state.

Don't you still need an unused queue item even with a power-of-two size? Isn't the point of having a power-of-two size that you can calculate the next item index in the buffer using binary rollover to go back to 0, as opposed to requiring a comparison and a branch?

[+] gpderetta|2 years ago|reply
Normally I use monotonically increasing indices to avoid the one-element issue, as there is no wraparound, then use masking to map them to actual buffer positions. With non-power-of-two sizes the mapping becomes a more expensive operation, so you have to wrap on increment instead.
[+] crest|2 years ago|reply
If it's a power of two you can use a longer than required index. An empty buffer will have all index bits equal. A full buffer will have the higher bits different but the lower bits equal. This requires extracting only the lower bits from the index before using it, but depending on the instruction set and bit position this is either very cheap or completely free.
[+] JonChesterfield|2 years ago|reply
You can distinguish full and empty using x-y=0 vs x-y=N if the length N is a power of two and smaller than the integer type holding indices x, y. No unused element needed then.
[+] dnedic|2 years ago|reply
If you want more than an SPSC queue, I've written `lockfree`, a collection of SPSC and MPMC data structures along the same principles the author here used: https://github.com/DNedic/lockfree

The library is written in standard C++11 (but additional APIs for higher C++ versions have been added), uses no dynamic allocation, and is configurable, so it is both big metal and deeply embedded friendly.

[+] cafxx|2 years ago|reply
A point that AFAICT is not articulated in the article is why the two cached fields should be in their own dedicated cache line (e.g. why readIdxCached_ cannot share the cache line with writeIdx_).
[+] gpderetta|2 years ago|reply
They should be in the same cache line, as you suggest. The writer always touches both writeIdx_ and readIdxCached_, so it makes sense for them to be together. Splitting them won't affect synthetic benchmarks, but it just wastes space in a larger application.
[+] gpderetta|2 years ago|reply
Another simple trick is duplicating the buffer pointer field and buffer size field for the consumer and producer (instead of having a single vector field). Paradoxically this makes the queue object smaller, as it makes use of the padding otherwise wasted on cache alignment.
[+] xoranth|2 years ago|reply
> Another simple trick is duplicating the buffer pointer field and buffer size field for consumer and producer

Do you mean something like this?

    struct Inner {
        int* data;
        size_t size;
        size_t cachedIdx;
    };

    struct ringbuffer3 {
        // No vector<int> [...] here, saves a cache line
        alignas(64) std::atomic<size_t> readIdx_{0};
        alignas(64) Inner writeInner;
        alignas(64) std::atomic<size_t> writeIdx_{0};
        alignas(64) Inner readInner;

      // Constructor ...
    };
Am I understanding correctly?

(edit: formatting)

[+] rigtorp|2 years ago|reply
It might also be better for performance, since otherwise the two cores can RFO the buffer pointer cache lines from each other.
[+] jhallenworld|2 years ago|reply
There is a trick to get rid of the wrap-around test. Use modular arithmetic (best if ring size is a power of 2, then even when the integer wraps everything still works):

    next_write_idx = write_idx + 1;
    if (write_idx - read_idx == RING_SIZE) return full; /* all slots in use */
    ring[write_idx & (RING_SIZE - 1)] = data;
    write_idx = next_write_idx;
This eliminates one possible incorrectly predicted branch and allows you to use all slots of the ring. The original code always leaves one slot empty to avoid the ambiguity between empty and completely full (where in both cases, the indexes have the same value). This could matter for small rings.

The way to think of the indexes: they hold the number of items written to or read from the ring since reset.

I guess a similar optimization to the cached entries is to pop as many items as possible before returning:

    local_write_idx = write_idx;
    local_read_idx = read_idx;
    while (local_read_idx != local_write_idx)
       process(ring[local_read_idx++ & (RING_SIZE - 1)]);
    read_idx = local_read_idx;
    return;
This has other advantages: the ring read is a tight loop and you are repeatedly calling "process"- the working set (the cache and branch predictors) will be all set up for it after the first call.
[+] lovich|2 years ago|reply
Is there any sort of guide for a webdev to learn enough c++/c/assembly/hardware knowledge to understand this sort of blogpost deeply?

I am vaguely aware of some of the topics discussed here, like how when he mentions aligning the write and read indices to the cache line size he is referencing the L1 cache on the processor, how the L1 and L2 caches are different, and how every step up in memory on the processor is an order of magnitude difference in time to access. I know this about as well as I know that the stratosphere and troposphere are separate layers of the atmosphere: I know they exist, but I don't understand all the implications.

I'd like to learn more of this (relative to my own software layer) deep magic, but I don't know where to start.

[+] nickelpro|2 years ago|reply
You stated all the implications; there's not significantly more to cache coherence than what you discussed here.

Well, there's significantly more if you're trying to implement coherence as an architect, but for the software author that's pretty much it.

Cache lines have a size. If a line is shared between two cores, they keep needing to perform an eviction each time one of the cores takes exclusive control to write to that cache line. This is slow. The end.

If you want the definitive source for this stuff, the Intel Architecture SDM, Volume 3, Ch. 12 would be a place to start. Then the Intel Optimization Manual, Ch. 3.6/3.7, and the Ch. 2.x.y chapters for the micro-architecture, which describe the cache layout.

[+] FullyFunctional|2 years ago|reply
It's just amortizing for large transfers. For frequent near-empty cases it still has the shared-pointers problem. NVMe has a really clever way to avoid this, but it depends on the fact that there's [occasional] communication on the reverse stream (the SQ/CQ pair): the cache-line-sized entries in the submit queue have a "phase" bit in the last word. The reader can just read that to know when the entry has been updated (the polarity of the bit toggles each time we loop around, and you can just use the top+1 index bit if power-of-two sized).

The back pressure is handled with a credit scheme, and the producer gets an updated copy of the most recently known consumer read counter with every completion message back.

Using this scheme you can achieve the optimal performance with just a single cache line of traffic for each cache-line sized message.

Unfortunately I haven't found an SPSC Rust crate that does this, but https://github.com/polyfractal/bounded-spsc-queue [abandoned] comes close.

[+] bruce343434|2 years ago|reply
This seems to suffer from race conditions. Consider what happens when two threads push at the same time.
[+] nly|2 years ago|reply
The solution to your caching problem is more caching
[+] crest|2 years ago|reply
The solution is to avoid synchronous blocking operations like migrating cache lines more often than required. Another solution is to offer a batch produce()/consume() API to process multiple elements per invocation. A double-mapped ring buffer is also tempting because it avoids having to deal with modulo addressing.