> Lookup in the ArrayMap is shockingly slow at large data sizes—again, I don’t have a good explanation for why this is.
Probably because it's doing a binary search over a large array? I wouldn't expect that to be particularly fast with large arrays, since large array + binary search == lots of cache misses.
There is a way (the name escapes me, unfortunately) to order the items of the array such that, for N items, the item that would be at index N/2 in a sorted array is at index 0, the items that would be at N/4 and 3N/4 are at indexes 1 and 2, and so on. That ordering is of course much more cache-friendly when doing a binary search.
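For what it's worth, the ordering described here is usually called the Eytzinger (or BFS) layout. A minimal sketch, with function names of my own invention — assigning the sorted values along an in-order walk of the implicit binary heap produces exactly the order described:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the Eytzinger layout: the median of the sorted input lands at
// index 0, the two quartile elements at indexes 1 and 2, and so on.
void build_eytzinger(const std::vector<int>& sorted, std::vector<int>& out,
                     std::size_t& next, std::size_t k = 0) {
    if (k >= out.size()) return;
    build_eytzinger(sorted, out, next, 2 * k + 1);  // left subtree first
    out[k] = sorted[next++];                        // then this node
    build_eytzinger(sorted, out, next, 2 * k + 2);  // then right subtree
}

// Search walks root -> child (k -> 2k+1 or 2k+2), so the first several
// levels of the comparison tree all sit in the same few cache lines.
bool eytzinger_contains(const std::vector<int>& a, int key) {
    std::size_t k = 0;
    while (k < a.size()) {
        if (a[k] == key) return true;
        k = key < a[k] ? 2 * k + 1 : 2 * k + 2;
    }
    return false;
}
```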
The issue is not just that it's 'slow', but that it's frequently slower even than std::map. My suspicion is that the culprit is its use of power-of-two sizes, which is a terrible idea for binary search because it causes cache-line aliasing. This is discussed in a lot of detail by Paul Khuong: https://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathol...
> Even without that hurdle, the slowdown caused by aliasing between cache lines when executing binary searches on vectors of (nearly-)power-of-two sizes is alarming. The ratio of runtimes between the classic binary search and the offset quaternary search is on the order of two to ten, depending on the test case.
The approaches there are likely to give significant speedups.
Right... this result is one of those examples of why you don't guess, you benchmark, because my intuition is that binary search should be "really fast." Clearly though the cache misses hurt a lot more than my intuitions capture.
> And unlike with vectors, I don’t know that I’ve ever come across code that retained map iterators…
I've done this once. The thing I was trying to do was have multiple threads write to a std::unordered_map in such a way that each thread would write to its allocated "bucket" and nothing else (roughly, each thread would "own" map[thread_id] as a somewhat convoluted "thread-local storage", which would then be collected and operated on at some point in the future from the "master" thread). It turns out that the only way to actually do this in a standards-compliant way is to grab an iterator pointing to map.find(thread_id) prior to starting and use it for subsequent modifications of the value, since using the subscript operator is apparently not guaranteed to be thread-safe for associative containers.
The subscript operator can modify the map if the key is not present, so it is not const and thus formally not thread-safe, even if you are sure the key is always present. at() and find() are const, so you can use those instead; no need to cache iterators.
If you didn't like caching the map iterator, you could have cached the pointer to the value object instead i.e. Foo* cached_value = &mymap[this_thread_id]. Although caching the map iterator sounds perfectly fine to me.
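A minimal sketch of that pattern (names are mine): the master thread pre-populates one slot per worker, each worker caches a pointer to its own value object before starting, and no thread ever calls into the map while the workers run:

```cpp
#include <thread>
#include <unordered_map>
#include <vector>

// All structural mutation of the map happens up front on the master
// thread; workers only ever write through their cached pointers.
std::unordered_map<int, long> run_per_thread_slots(int n_threads, int n_incr) {
    std::unordered_map<int, long> slots;
    for (int t = 0; t < n_threads; ++t) slots[t] = 0;  // all inserts up front

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        long* slot = &slots.at(t);  // cached pointer, taken before starting
        workers.emplace_back([slot, n_incr] {
            for (int i = 0; i < n_incr; ++i) *slot += 1;  // map never touched
        });
    }
    for (auto& w : workers) w.join();
    return slots;
}
```

Pointer stability is guaranteed here because unordered_map never moves its nodes, even on rehash.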
People like to talk about C++ giving complete control over memory layout, but there's one very important cache-relevant technique that I've not seen done transparently: "column based storage".
That is, if you have class foo { int x, y, z; } and make an array or vector of them, they will normally be laid out in that order. For locality, you might want all the Xs together, i.e. three separate arrays of X, Y, and Z.

https://en.wikipedia.org/wiki/AOS_and_SOA — I don't know of any language that transparently supports it other than Jai, which isn't available yet: https://github.com/BSVino/JaiPrimer/blob/master/JaiPrimer.md...
C++ does give you the power to lay out your data members how you would like. But in this case you've just said that you want those three members laid out sequentially in memory. If you want column storage because you need to apply a specific transformation to the data, then lay it out that way.
In the vast majority of cases having classes be automatically laid out in column based storage would be a detriment and not an advantage. You would actually be fighting against the cache in those cases.
For example I've seen rendering engines go from storing data in AoS to SoA, then back, then back again, then back again all depending on the hardware and tech stack available.
To be fair, the same is true of Rust. The issue is that with column-based storage you can't have pointers/references to individual structs within the vector. You need to use indexes.
You can, however, do this pretty easily in your class definition in a way that is transparent to callers. The "transparent to callers" part is pretty common to any OO model, while the "pretty easily in your class definition" part is, I think, what most people mean when they say "complete control" -- you get a much higher degree of control than you do with, say, Java, though it takes effort beyond the default case.
I use this very approach in a code base I'm working on right now. Some object members are stored in the object and some are not, but they all look like class members to the callers.
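A minimal sketch of the transparent-to-callers idea, for the foo{x,y,z} example above (class and member names are mine): callers get a struct-like handle, but each member lives in its own contiguous column:

```cpp
#include <cstddef>
#include <vector>

// SoA container whose indexing operator hands back a proxy that reads
// and writes the underlying columns, so call sites look like AoS code.
class FooSoA {
    std::vector<int> xs_, ys_, zs_;  // one contiguous array per member
public:
    // Proxy that looks like a reference to a single foo.
    struct Ref {
        int& x;
        int& y;
        int& z;
    };
    void push_back(int x, int y, int z) {
        xs_.push_back(x);
        ys_.push_back(y);
        zs_.push_back(z);
    }
    Ref operator[](std::size_t i) { return {xs_[i], ys_[i], zs_[i]}; }
    std::size_t size() const { return xs_.size(); }
    // Direct column access for cache-friendly sweeps over one member:
    const std::vector<int>& xs() const { return xs_; }
};
```

A loop that only needs X can then iterate xs() and touch a third of the memory an AoS layout would.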
On the other hand, if you do random accesses to an index but need x, y, and z, then a struct of arrays as you propose will incur three cache misses instead of one for an array of structs. It really depends on the application which is preferable.
std::deque is another useful 'hybrid' container: it performs close to std::vector, or better in some cases, when you need to grow the container without reallocating memory so many times. But the only cache-friendly implementation seems to be clang's libc++ (memory is allocated in 4 KiB blocks). MSVC's deque implementation is horrible (8-byte blocks), and GCC's is also not that great (512-byte blocks).
8 bytes??? Are you sure you don't mean 8 * sizeof(class)? That's just ridiculous.
Regardless it would be nice if one could specify the block size as an argument. I suppose I could write my own allocator, but it's just too much hassle for such a simple thing as storage.
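Whatever the block size, deque's block structure has one user-visible upside worth remembering: growth at the ends never relocates existing elements, so references (though not iterators) stay valid. A minimal sketch, with a function name of my own:

```cpp
#include <deque>

// Unlike std::vector, std::deque appends fixed-size blocks as it grows,
// so push_back never moves elements that are already stored.
bool deque_references_stable(int n) {
    std::deque<int> d;
    d.push_back(42);
    int& first = d.front();  // reference into the first block
    for (int i = 0; i < n; ++i) d.push_back(i);  // allocates many new blocks
    return &first == &d.front() && first == 42;
}
```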
Maybe this would look different in a logarithmic plot, but it seems to me that std::vector performs quite well, in particular for few elements. So why would I bother with SmallVector if plain std::vector is on par in its (supposed) prime discipline?
Vectors can be slow if you create and destroy them a lot, since they allocate. You can work around that to some extent by providing a custom allocator, but using something like llvm::SmallVector or absl::InlinedVector can be much faster when N is known.
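The core of the small-buffer trick is tiny; here is a stripped-down sketch of the idea behind llvm::SmallVector / absl::InlinedVector (names mine, trivially-copyable elements assumed, copying omitted): the first N elements live inside the object itself, and the heap is touched only when the container outgrows them:

```cpp
#include <cstddef>

// Minimal small-buffer vector: no allocation until size exceeds N.
template <typename T, std::size_t N>
class SmallVec {
    T inline_buf_[N];       // inline storage inside the object
    T* data_ = inline_buf_;
    std::size_t size_ = 0, cap_ = N;
public:
    SmallVec() = default;
    SmallVec(const SmallVec&) = delete;             // copying omitted for brevity
    SmallVec& operator=(const SmallVec&) = delete;
    ~SmallVec() { if (data_ != inline_buf_) delete[] data_; }
    void push_back(const T& v) {
        if (size_ == cap_) {                        // the slow path: spill to heap
            T* bigger = new T[cap_ * 2];
            for (std::size_t i = 0; i < size_; ++i) bigger[i] = data_[i];
            if (data_ != inline_buf_) delete[] data_;
            data_ = bigger;
            cap_ *= 2;
        }
        data_[size_++] = v;
    }
    T& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return data_ != inline_buf_; }
};
```

A function that builds and discards a SmallVec<int, 16> on every call never touches malloc as long as it stays within 16 elements.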
I'd also love to see a comparison with B-trees (https://code.google.com/archive/p/cpp-btree/). In tests I did a few years ago, this implementation was faster than std::map, and my guess is that the reason is better use of the cache.
std::map is usually implemented as a red-black tree, which is basically a B-tree with branching factor 2. This is less cache-friendly than higher branching factors, so it's entirely expected to be slower than B-trees. On the upside, it offers stable pointers, which higher branching factors cannot offer.
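That "stable pointers" upside is easy to demonstrate: std::map is node-based, so inserting more keys never moves existing elements, whereas a B-tree shuffles entries between nodes as it splits. A small sketch (function name mine):

```cpp
#include <map>

// A pointer into a std::map node survives any number of later inserts,
// no matter how much rebalancing they trigger.
bool map_pointers_stable(int n) {
    std::map<int, int> m;
    m[1] = 10;
    int* p = &m[1];                        // pointer into the first node
    for (int i = 2; i < n; ++i) m[i] = i;  // lots of rebalancing inserts
    return p == &m[1] && *p == 10;
}
```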
> llvm::SmallVector [...] this class preallocates some amount of data (the actual size of which is configurable via a template parameter) locally on the stack, and only performs a (slow!) heap allocation if the container grows beyond that size. Because malloc is slow, and traversing a pointer to get to your data is slow, the SSO heap storage is a double-win.
But you need to "traverse" a pointer for a stack-allocated array just as well! So it's mainly that in some cases this class helps forgo a malloc, which I am not sure is that much of a win, especially given that this implementation is another class that requires more (and more complicated!) object code...
What I think might be better in many situations is preallocating the dynamic array so that it doesn't need to be allocated each time the function is called. The preallocated array can be a function parameter (or an object member, for OOP weenies :>), or simply a global variable.
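The buffer-as-parameter version of that idea can be sketched like this (names mine): the caller owns the scratch buffer, so repeated calls recycle one allocation instead of constructing a fresh vector each time:

```cpp
#include <vector>

// clear() resets size but keeps capacity, so after the buffer has grown
// once, subsequent calls typically perform no allocation at all.
void collect_evens(const std::vector<int>& in, std::vector<int>& scratch) {
    scratch.clear();  // cheap: capacity is retained
    for (int v : in) {
        if (v % 2 == 0) scratch.push_back(v);
    }
}
```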
> But you need to "traverse" a pointer for a stack-allocated array just as well!
By the time you traverse this pointer, the pointed-to memory location is almost certainly already in your L1 CPU cache, since you are accessing the pointer from its parent struct, while a std::vector's buffer can be in any random place in memory.
In my own code, using SmallVector judiciously makes a drastic performance difference and lets me forgo an immense number of memory allocations.
> But you need to "traverse" a pointer for a stack-allocated array just as well!
No. Iterating a small stack array is much faster than iterating a heap array.
By "traverse" the author means you have to follow the pointer out to main memory to read the data. Another term they might have used is "fetch".
I don't have the know-how to write a Rust comparison (I'd happily accept a pull request from someone who does!), but I can take a to-do to add SwissTable benchmarks. :)