
Benchmarks of Cache-Friendly Data Structures in C++

126 points | s3cur3 | 7 years ago | tylerayoung.com

66 comments

[+] usefulcat|7 years ago|reply
> Lookup in the ArrayMap is shockingly slow at large data sizes—again, I don’t have a good explanation for why this is.

Probably because it's doing a binary search over a large array? I wouldn't expect that to be particularly fast with large arrays, since large array + binary search == lots of cache misses.

There is a way (the name escapes me, unfortunately) to order the items of the array such that for N items, the item that would be at index N/2 in a sorted array is at index 0, then the items that would be at N/4 and 3N/4 are at indexes 1 and 2, and so on, which of course is much more cache-friendly when doing a binary search.
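A minimal sketch of that layout (the "Eytzinger" or BFS-heap order, as named downthread), hard-coded to int; this is an illustration, not a tuned implementation:

```cpp
#include <cstddef>
#include <vector>

// Recursively place the sorted input so that index 0 holds the median,
// indexes 1 and 2 hold the quartiles, and so on (1-based heap math).
void build_eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
                     std::size_t& next, std::size_t k = 1) {
    if (k <= sorted.size()) {
        build_eytzinger(sorted, out, next, 2 * k);      // left subtree
        out[k - 1] = sorted[next++];                    // this node
        build_eytzinger(sorted, out, next, 2 * k + 1);  // right subtree
    }
}

// Search descends k -> 2k / 2k+1, so the early probes cluster near the
// front of the array and share cache lines.
bool eytzinger_contains(const std::vector<int>& a, int x) {
    std::size_t k = 1;
    while (k <= a.size())
        k = 2 * k + (a[k - 1] < x);
    // Undo the trailing right turns to recover the lower-bound index.
    while (k & 1) k >>= 1;
    k >>= 1;
    return k != 0 && a[k - 1] == x;
}
```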

[+] Veedrac|7 years ago|reply
The issue is not just that it's ‘slow’, but that it's frequently slower even than std::map. My suspicion is that the issue is its use of power-of-2 sizes, which is a terrible idea for binary search because it causes cache line aliasing. This is discussed in a lot of detail by Paul Khuong.

https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathol...

> Even without that hurdle, the slowdown caused by aliasing between cache lines when executing binary searches on vectors of (nearly-)power-of-two sizes is alarming. The ratio of runtimes between the classic binary search and the offset quaternary search is on the order of two to ten, depending on the test case.

The approaches there are likely to give significant speedups.

[+] BeeOnRope|7 years ago|reply
The name is Eytzinger layout.
[+] s3cur3|7 years ago|reply
Right... this result is one of those examples of why you don't guess, you benchmark, because my intuition is that binary search should be "really fast." Clearly, though, the cache misses hurt a lot more than my intuition captures.
[+] saagarjha|7 years ago|reply
> And unlike with vectors, I don’t know that I’ve ever come across code that retained map iterators…

I've done this once. The thing I was trying to do was have multiple threads write to a std::unordered_map in such a way that each thread would write to its allocated "bucket" and nothing else (roughly, each thread would "own" map[thread_id] as a somewhat convoluted "thread-local storage", which would then be collected and operated on at some point in the future from the "master" thread). It turns out that the only way to actually do this in a standards-compliant way is to grab an iterator pointing to map.find(thread_id) before starting and use it for all subsequent modifications of the value, since using the subscript operator is apparently not guaranteed to be thread-safe for associative containers.

[+] gpderetta|7 years ago|reply
The subscript operator can modify the map if the key is not present, so it is not const and thus formally not thread-safe even if you are sure the key is always present. at() and find() are always const, so you can use those instead; no need to cache iterators.
[+] quietbritishjim|7 years ago|reply
If you didn't like caching the map iterator, you could have cached a pointer to the value object instead, i.e. Foo* cached_value = &mymap[this_thread_id]. Although caching the map iterator sounds perfectly fine to me.
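For what it's worth, a minimal sketch of that pointer-caching variant (names are made up): populate the map before handing out pointers, since rehashing a std::unordered_map invalidates iterators but not pointers or references to the mapped values:

```cpp
#include <unordered_map>

// Hypothetical worker: looks up its slot once, then writes through the
// cached pointer. As long as nothing erases the element, the pointer
// stays valid even if later inserts trigger a rehash.
long run_worker(std::unordered_map<int, long>& map, int id, int n) {
    long* slot = &map[id];   // looked up once, before the hot loop
    for (int i = 0; i < n; ++i)
        *slot += i;          // no further map lookups
    return *slot;
}
```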
[+] pjc50|7 years ago|reply
People like to talk about C++ giving complete control over memory layout, but there's one very important cache-relevant technique that I've not seen done transparently: "column based storage".

That is, if you have class foo { int x, y, z; } and make an array or vector of them, they will normally be laid out in that order. For locality, you might want to have all the Xs together, i.e., three separate arrays of X, Y and Z.
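A hand-rolled sketch of what that "column storage" looks like (the Foo/FooColumns names are illustrative):

```cpp
#include <numeric>
#include <vector>

// The same three fields, stored row-wise (AoS) vs column-wise (SoA).
struct Foo { int x, y, z; };                 // memory: x0 y0 z0 x1 y1 z1 ...

struct FooColumns {                          // memory: x0 x1 ... y0 y1 ...
    std::vector<int> x, y, z;
    void push_back(const Foo& f) {
        x.push_back(f.x); y.push_back(f.y); z.push_back(f.z);
    }
};

// A pass that only needs x now streams one contiguous array and pulls
// in a third of the bytes the AoS layout would.
long sum_x(const FooColumns& c) {
    return std::accumulate(c.x.begin(), c.x.end(), 0L);
}
```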

[+] daemin|7 years ago|reply
C++ does give you the power to lay out your data members however you would like. But in this case you've just said that you want those three members laid out sequentially in memory. If you want to use column storage because you need to apply a specific transformation to the data, then lay it out that way.

In the vast majority of cases having classes be automatically laid out in column based storage would be a detriment and not an advantage. You would actually be fighting against the cache in those cases.

For example I've seen rendering engines go from storing data in AoS to SoA, then back, then back again, then back again all depending on the hardware and tech stack available.

[+] zozbot123|7 years ago|reply
To be fair, the same is true of Rust. The issue is that with column-based storage you can't have pointers/references to individual structs within the vector. You need to use indexes.
[+] gumby|7 years ago|reply
You can, however, do this pretty easily in your class definition in a way that is transparent to callers. The "transparent to callers" part is common to any OO model, while the "pretty easily in your class definition" part is, I think, what most people mean by "complete control" -- you get a much higher degree of control than you do with, say, Java, though it will take effort beyond the default case.

I use this very approach in a code base I'm working on right now. Some object members are stored in the object and some are not, but they all look like class members to the callers.
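One possible shape for this (hypothetical names, not the poster's actual code): storage lives columnwise inside the container, while callers get a lightweight handle whose accessors read like plain member access:

```cpp
#include <cstddef>
#include <vector>

// The container owns the columns; callers never see them directly.
class Particles {
    std::vector<float> xs_, ys_;   // hot members, stored columnwise
public:
    struct Ref {                   // handle: an index, not a pointer
        Particles* p;
        std::size_t i;
        float& x() { return p->xs_[i]; }
        float& y() { return p->ys_[i]; }
    };
    Ref add(float x, float y) {
        xs_.push_back(x);
        ys_.push_back(y);
        return {this, xs_.size() - 1};
    }
    std::size_t size() const { return xs_.size(); }
};
```

Note that the handle stores an index rather than a pointer into the vectors, which is exactly the trade-off zozbot123 mentions upthread.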

[+] lorenzhs|7 years ago|reply
On the other hand, if you do random accesses to an index but need x, y, and z, then a struct of arrays as you propose will incur three cache misses instead of one for an array of structs. It really depends on the application which is preferable.
[+] rwbt|7 years ago|reply
std::deque is another useful 'hybrid' container that performs close to std::vector (or better, in some cases) when you need to grow the container without reallocating memory so many times. But the only good cache-friendly implementation seems to be clang's libc++ (memory is allocated in 4KiB blocks). MSVC's deque implementation is horrible (8-byte blocks) and GCC's is also not that great (512-byte blocks).
[+] svantana|7 years ago|reply
8 bytes??? Are you sure you don't mean 8 * sizeof(class)? That's just ridiculous.

Regardless, it would be nice if one could specify the block size as an argument. I suppose I could write my own allocator, but it's just too much hassle for such a simple thing as storage.

[+] LHxB|7 years ago|reply
Maybe this would look different on a logarithmic plot, but it seems to me that std::vector performs quite well, in particular for few elements. So why would I bother with SmallVector if plain std::vector is on par in its (supposed) prime discipline?
[+] shereadsthenews|7 years ago|reply
Vectors can be slow if you create and destroy them a lot, since they allocate. You can work around this to some extent by providing a custom allocator, but using something like SmallVector or absl::InlinedVector can be much faster when N is known.
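For readers unfamiliar with the trick, here is a stripped-down sketch of the small-buffer idea behind llvm::SmallVector and absl::InlinedVector, hard-coded to int and with copy/move support omitted for brevity:

```cpp
#include <cstddef>
#include <cstring>

// The first N elements live inside the object itself (typically on the
// stack); only growth past N touches the heap.
template <std::size_t N>
class SmallVec {
    int inline_buf_[N];
    int* data_ = inline_buf_;
    std::size_t size_ = 0, cap_ = N;
public:
    SmallVec() = default;
    SmallVec(const SmallVec&) = delete;        // sketch: not copyable
    ~SmallVec() { if (data_ != inline_buf_) delete[] data_; }
    void push_back(int v) {
        if (size_ == cap_) {                   // spill to the heap
            int* heap = new int[cap_ * 2];
            std::memcpy(heap, data_, size_ * sizeof(int));
            if (data_ != inline_buf_) delete[] data_;
            data_ = heap;
            cap_ *= 2;
        }
        data_[size_++] = v;
    }
    int& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }
    bool spilled() const { return data_ != inline_buf_; }
};
```

If a SmallVec<8> never grows past 8 elements, constructing and destroying it costs no allocator calls at all, which is where the wins in tight loops come from.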
[+] s3cur3|7 years ago|reply
I'm in games, and we really do have lots of vectors of maybe 8 or so elements. The small wins add up across the entire codebase.
[+] orbifold|7 years ago|reply
Because you are developing LLVM or a similar performance critical project with lots of small vectors.
[+] wmu|7 years ago|reply
I'd also love to see a comparison with B-trees (https://code.google.com/archive/p/cpp-btree/). In tests I did a few years ago, this implementation was faster than std::map, and my guess is that the reason is better use of the cache.
[+] jstimpfle|7 years ago|reply
std::map is usually implemented as a red-black tree, which is basically a B-tree with branching factor 2. This is less cache-friendly than higher branching factors, so it's entirely expected to be slower than B-trees. On the upside, it offers stable pointers, which higher branching factors cannot.
[+] s3cur3|7 years ago|reply
Hm, I'll look into it... :)
[+] jstimpfle|7 years ago|reply
> llvm::SmallVector [...] this class preallocates some amount of data (the actual size of which is configurable via a template parameter) locally on the stack, and only performs a (slow!) heap allocation if the container grows beyond that size. Because malloc is slow, and traversing a pointer to get to your data is slow, the SSO heap storage is a double-win.

But you need to "traverse" a pointer for a stack-allocated array just as well! So it's mainly that in some cases this class helps forgo a malloc, which I am not sure is that much of a win, especially given that this implementation is another class that requires more (and more complicated!) object code...

What I think might be better in many situations is preallocating the dynamic array so that it doesn't need to be allocated each time the function is called. The preallocated array can be a function parameter (or object member, for OOP weenies :>), or simply a global variable.
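A sketch of that reuse pattern (split_words() is a made-up example): the caller owns the output vector, and clear() keeps its capacity, so once the vector reaches its steady-state size the function stops allocating (for the vector itself, at least):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits on single spaces into a caller-provided buffer; clear() resets
// the size but keeps the previously grown capacity.
void split_words(const std::string& s, std::vector<std::string>& out) {
    out.clear();                          // no deallocation here
    std::size_t start = 0;
    while (start < s.size()) {
        std::size_t end = s.find(' ', start);
        if (end == std::string::npos) end = s.size();
        if (end > start) out.emplace_back(s, start, end - start);
        start = end + 1;
    }
}
```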

[+] jcelerier|7 years ago|reply
> But you need to "traverse" a pointer for a stack-allocated array just as well!

By the time you traverse this pointer, its pointed-to memory location is almost certainly already in your L1 cpu cache since you are accessing this pointer from its parent struct, while for std::vector it can be in any random place in your memory.

In my own code, using SmallVector judiciously makes a drastic performance difference and lets me forgo an immense number of memory allocations.

[+] forrestthewoods|7 years ago|reply
> But you need to "traverse" a pointer for a stack-allocated array just as well!

No. Iterating a small stack array is much faster than iterating a heap array.

Instead of "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".

[+] kccqzy|7 years ago|reply
How does DenseMap compare to, say hashbrown (in Rust) or SwissTable (in C++) by Google?
[+] s3cur3|7 years ago|reply
I don't have the know-how to write a Rust comparison (I'd happily accept a pull request from someone who does!), but I can take a to-do to add SwissTable benchmarks. :)