top | item 17914571

Benchmarking C++ Sets

54 points| cpphacker | 7 years ago |constantsmatter.com | reply

52 comments

[+] xenadu02|7 years ago|reply
Always measure!

In a realistic application (with cache pressure) you may find an array performs even better because the CPU can trivially predict and pipeline the entire operation vs jumping around to random locations in memory.

Another approach is a B-tree with lots of nodes stored at the leaves. This can get you the array's linear performance for small datasets, while preserving other big-O benefits of sets/trees (eg insertion time).

As the size of your dataset grows, there are performance cliffs at the L1, L2, and L3 cache sizes. For very large datasets, creating data structures with predictable access patterns often trumps Big-O entirely. If following a linked list thrashes the cache, it turns out the constant factor can and does dominate.

So again, always measure with your application on your intended hardware. Be sure you measure with realistic loads and access patterns.
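That "always measure" advice is easy to act on with a few lines of `std::chrono`. A minimal sketch of such a harness (function names are my own invention, not from the article), timing a linear vector scan against `std::set` lookups over the same probes:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>
#include <set>
#include <utility>
#include <vector>

// Count how many probes are present, via linear scan of an unsorted vector.
std::size_t hits_vector(const std::vector<std::uint64_t>& flat,
                        const std::vector<std::uint64_t>& probes) {
    std::size_t hits = 0;
    for (auto p : probes)
        hits += std::find(flat.begin(), flat.end(), p) != flat.end();
    return hits;
}

// Same probes, answered by the red-black tree.
std::size_t hits_set(const std::set<std::uint64_t>& tree,
                     const std::vector<std::uint64_t>& probes) {
    std::size_t hits = 0;
    for (auto p : probes) hits += tree.count(p);
    return hits;
}

// Returns {vector_seconds, set_seconds} for n elements and n_probes lookups.
std::pair<double, double> bench(std::size_t n, std::size_t n_probes) {
    std::mt19937_64 rng(42);
    std::vector<std::uint64_t> keys(n);
    for (auto& k : keys) k = rng();
    std::set<std::uint64_t> tree(keys.begin(), keys.end());

    // Mix hits and misses so the comparison isn't all early-out failures.
    std::vector<std::uint64_t> probes(n_probes);
    for (std::size_t i = 0; i < n_probes; ++i)
        probes[i] = (i % 2) ? keys[i % n] : rng();

    auto t0 = std::chrono::steady_clock::now();
    volatile std::size_t v = hits_vector(keys, probes);
    auto t1 = std::chrono::steady_clock::now();
    volatile std::size_t s = hits_set(tree, probes);
    auto t2 = std::chrono::steady_clock::now();
    (void)v; (void)s;
    return {std::chrono::duration<double>(t1 - t0).count(),
            std::chrono::duration<double>(t2 - t1).count()};
}
```

Sweeping `n` across the L1/L2/L3 boundaries mentioned above is exactly where the crossover should show up, though the numbers will be specific to your hardware.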

[+] AstralStorm|7 years ago|reply
Vector is supposed to essentially optimize to a direct array, except on insert or append, when it may resize.

std::array is expected to have almost identical performance characteristics.

[+] curyous|7 years ago|reply
I wish more people understood this.
[+] forrestthewoods|7 years ago|reply
These results don't make any sense and I don't believe them.

The author is generating random strings of length 64. They then generate N random strings and search for them. The probability of finding a match is effectively zero.

> Interestingly, up to about 75 elements, the vector actually outperforms both sets.

No way. The unordered_set will perform one hash, one lookup, and return false. That is not going to be slower than SEVENTY FIVE string comparisons.

Looking at the code I _think_ that randString() for each of the N search strings is included in the timing code. That, uh, is not good? Line 29: https://github.com/akshaynanavati/benchmarks/blob/master/ben...

I don't know how Google Benchmark works. So I wrote my own simple test. I could not find any combination of string length and num searches where a vector was faster than an unordered_set.

https://pastebin.com/raw/mfBmGa5B

Example results:

String length: 8 Num Elements: 75 Num Searches: 10000 Gen Time: 1.94252 Vec Time: 4.12721 Set Time: 0.144987 Num Found: 0
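The methodological point here generalizes: keep string generation entirely outside the timed region. A sketch of a standalone check in that spirit (my own code, not the linked pastebin):

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Random lowercase string of the given length.
std::string rand_string(std::mt19937& rng, std::size_t len) {
    std::uniform_int_distribution<int> d('a', 'z');
    std::string s(len, '\0');
    for (auto& c : s) c = static_cast<char>(d(rng));
    return s;
}

// Membership counts over identical probes, so the two containers
// can be timed on exactly the same work.
std::size_t hits_vec(const std::vector<std::string>& elems,
                     const std::vector<std::string>& probes) {
    std::size_t hits = 0;
    for (const auto& p : probes)
        hits += std::find(elems.begin(), elems.end(), p) != elems.end();
    return hits;
}

std::size_t hits_set(const std::unordered_set<std::string>& set,
                     const std::vector<std::string>& probes) {
    std::size_t hits = 0;
    for (const auto& p : probes) hits += set.count(p);
    return hits;
}
```

The crucial detail is that all `rand_string` calls happen before the clock starts; if generation lands inside the timed loop, you are mostly benchmarking the RNG, not the container.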

[+] AstralStorm|7 years ago|reply
No it won't, not for strings this small. A 64-byte string likely fits in a cache line, so you're doing a cache-line comparison, which may even be a single micro-op on the CPU. Alternatively the compiler can use SSE or AVX to vector-compare the strings, which is still really fast.

I suspect the author made critical mistakes too.

If you are actually hitting a match or do many operations with a warmed cache, vector may actually win big time even compared to unordered_set, as there may be no pointer chasing involved (since all keys and values are in cache). The vector scan essentially boils down to a memcmp for a sufficiently smart compiler. Computing the hash may or may not be faster than this.

[+] chillee|7 years ago|reply
Somewhat interestingly, C++ STL unordered_map is so much slower than it needs to be: https://www.youtube.com/watch?v=fHNmRkzxHWs&t=47m13s

Using one of the policy-based data structures instantly gets you about an order of magnitude of speedup (for accesses).

[+] asdfasgasdgasdg|7 years ago|reply
Kinda makes you wonder why those decisions (to provide bucket iteration, bucket iterators) were made. It's hard to picture a use for such a feature that makes it worthy of adding constraints to a standard container. I would love it if someone could provide some insight into the decision process that led to this state of affairs.
[+] fulafel|7 years ago|reply
Anyone have a text link or tl;dr about the issue?
[+] shin_lao|7 years ago|reply
I have a lot of issues with this blog post:

- Compiler, c++ library, compilation options, and platform are missing

- The author doesn't benchmark vector and unordered_map insertions with reserve()

- If I read correctly, the author benchmarks up to 32,768 items. This is way too low. There should be tests with millions of entries.

- Random string/numbers isn't a realistic scenario and very favourable to unordered_map. I would use data from real life examples, e.g. list of cities, list of people, etc.

- Ordered arrays are missing from the benchmark (e.g. boost::container::flat_set), it's a very powerful data structure. Why just do a sorted vector at the end?

- Vector outperforms unordered_set probably because of the hashing function used.

[+] Jyaif|7 years ago|reply
Even if they had done all that, some person on the internet would still have had "a lot of issues with the post" in order to show off how smart they are.

You have to draw the line somewhere, and this post is very interesting as it is.

[+] IshKebab|7 years ago|reply
None of those things are necessary to convey the point he is trying to make.
[+] ebikelaw|7 years ago|reply
It is very rare that I have seen a use of std::set that seemed justified, although it does happen. People just use it because it's easier to type than unordered_set!

One way to speed up set's usually appalling performance is to give it a custom allocator so that instead of N calls to allocate tiny tree nodes all over the heap, you get all your tree nodes on an arena. This can dramatically speed up short-lived sets (for example you are trying to build the unique set of some non-unique input).
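Since C++17 the polymorphic-allocator machinery makes this easy: back a `std::pmr::set` with a `monotonic_buffer_resource` so every tree node comes out of one arena. A minimal sketch (the 64 KiB buffer size is an arbitrary guess; size it for your workload):

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <set>
#include <vector>

// Build the unique set of a non-unique input, allocating every tree
// node from a stack buffer instead of the general-purpose heap.
std::size_t unique_count(const std::vector<int>& input) {
    std::array<std::byte, 64 * 1024> buffer;  // arena backing store
    std::pmr::monotonic_buffer_resource arena(
        buffer.data(), buffer.size());        // falls back to the heap if full
    std::pmr::set<int> uniq(&arena);
    for (int v : input) uniq.insert(v);
    return uniq.size();
    // Note: node "deallocation" is a no-op with a monotonic resource;
    // the memory is reclaimed all at once when the arena is destroyed.
}
```

This matches the short-lived-set use case above: one bump-pointer arena replaces N small mallocs, and teardown is essentially free.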

[+] dagenix|7 years ago|reply
> In one setup, I used std::strings of length 64 and in the other I used 64 bit numbers. The main motivation behind measuring both numbers and strings was that comparing 64 bit numbers takes a constant number of clock cycles

Pedantic: comparing 64 character strings is O(1).

[+] mattkrause|7 years ago|reply
An ironic complaint, given the domain name…
[+] AstralStorm|7 years ago|reply
But not a constant number of clock cycles unless the strings fit in a cache line. Which they may. Use bigger strings if that is the case.
[+] kryptiskt|7 years ago|reply
But not a constant number of clock cycles. A comparison that fails on the first bytes can bail out early.
[+] ridiculous_fish|7 years ago|reply
For longish strings (length 64 qualifies), binary_search is missing an important optimization: it ends up walking the same prefix over and over.

A better algorithm is called "radix quicksort with a three-way partition." Rather than the classical qsort < and >= partition, you maintain three regions of <, =, and >. Crucially, at each step, you sort based only on a single character: the whole range via the first char, then the = region via the second char, then that = region via the third char, etc. This allows for super-fast comparisons without re-walking prefixes.

All of these data structures (std::binary_search, std::set, and std::unordered_set) are quite sub-optimal. The design of unordered_set, with its separately allocated bucket nodes, is especially distressing.

For a much more efficient take, see LLVM's DenseSet which does NOT separately allocate buckets. This burdens the client with finding two values to represent empty and tombstone entries, but if perf is a concern that's much better than additional mallocs.

[+] iainmerrick|7 years ago|reply
It seems like many commenters here (the ones who aren’t questioning the methodology) are drawing the lesson that you should use vector rather than set.

From these charts, isn’t unordered_set clearly the best all-rounder? Which is exactly as I’d expect. Am I missing something?

Sure, a simple vector might be a bit faster for 10 or 100 or 1000 elements, but after a certain point performance is awful.

[+] adzm|7 years ago|reply
definitely check out flat_set too! most of the time using a sorted vector as the backing store for these kinds of things ends up being a net win, especially for smaller sets, but the size at which it starts to become a hindrance may surprise you.
[+] de_watcher|7 years ago|reply
Flat sets are great.

By writing my own implementation (to avoid a boost dependency) I kinda understood why they aren't standard: iterators have different invalidation rules, there are some special sorted-vector-derived useful properties, the map/multimap keys can't be const as in the other standard associative containers, etc.

[+] BenFrantzDale|7 years ago|reply
Very interesting. This makes it sound like there should be a way to control `std::lower_bound` to switch to linear probing once it's narrowed the range down enough. (In the specific example, "enough" is about 125 elements.)
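That switch is easy to prototype on top of the standard algorithms. A sketch, with a made-up default threshold that would need tuning per element type and hardware:

```cpp
#include <algorithm>
#include <cstddef>

// Binary search that hands off to a linear scan once the candidate
// range is small enough that branch-free sequential probing wins.
// Same contract as std::lower_bound on a sorted range.
template <typename It, typename T>
It hybrid_lower_bound(It first, It last, const T& value,
                      std::ptrdiff_t threshold = 32) {
    while (last - first > threshold) {
        It mid = first + (last - first) / 2;
        if (*mid < value) first = mid + 1;
        else              last = mid;
    }
    // Linear phase: first element not less than value, or last if none.
    return std::find_if(first, last,
                        [&](const T& x) { return !(x < value); });
}
```

The invariant is the usual one (the answer always lies in `[first, last]`), so the linear tail returns exactly what `std::lower_bound` would; only the constant factor changes.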

Likewise for a hash table with separate chaining: wouldn't one way to beat `std::unordered_set` be to keep each chain as a `std::vector` that is kept sorted, and then use `std::lower_bound` (or the above better version of it) to probe the table? Those per-bucket vectors are O(1) in size, so lookup and insertion are still O(1) on average. Worst case goes to O(n log n), but that doesn't seem too bad.

Related: Is there a widely-used "frozen set" C++ data structure? I'm picturing something that is constructed from a sequence but that internally is a sorted `std::vector<T>` that provides a set-like API but that uses `std::lower_bound` to find data. It's easy enough to write, but it seems like it should be in the standard or at least Boost.
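boost::container::flat_set (mentioned upthread) is essentially this, and C++23 later standardized std::flat_set. A minimal hand-rolled version, for illustration (the `FrozenSet` name is made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Build-once, query-many set: a sorted, deduplicated vector probed
// with std::lower_bound. No insertion or erasure after construction.
template <typename T>
class FrozenSet {
    std::vector<T> data_;
public:
    explicit FrozenSet(std::vector<T> items) : data_(std::move(items)) {
        std::sort(data_.begin(), data_.end());
        data_.erase(std::unique(data_.begin(), data_.end()), data_.end());
    }
    bool contains(const T& v) const {
        auto it = std::lower_bound(data_.begin(), data_.end(), v);
        return it != data_.end() && *it == v;
    }
    std::size_t size() const { return data_.size(); }
};
```

Because the storage is one contiguous allocation, lookups are cache-friendly and the whole structure is trivially copyable to share across threads after construction.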

[+] yoklov|7 years ago|reply
Implementing an unordered map/set that way loses iterator/pointer stability, which is guaranteed by the current implementation, unfortunately.
[+] jlarocco|7 years ago|reply
The article should say which C++ compiler and standard library they're using. Each implementation is different, and these results may not be true for all of them.
[+] Jyaif|7 years ago|reply
Great! I'd love to know what percentage of the world's std::set could be replaced by vectors, and how many megawatts would be saved.
[+] halayli|7 years ago|reply
This is completely useless if you don't mention what STL library version you're using.

The C++ standard defines sets' interface, not their implementation.

[+] kccqzy|7 years ago|reply
It's actually not hard to beat the performance of standard set and map containers in specific scenarios. For example in one application I have a treap implementation that's much faster than the standard ordered map, but it is much less featureful (doesn't support deletion for example).
[+] petermcneeley|7 years ago|reply
Are these benchmarks done with cold caches and multithreading? If not then we are not seeing the whole picture here.
[+] mackal|7 years ago|reply
They use the Google Benchmark library. (I would read into it to answer your actual questions)
[+] firethief|7 years ago|reply
In the first table:

    set @ 4096: 10.80
    set @ 32768: 1.31
uh?
[+] wmu|7 years ago|reply
Comparing lookup time of unsorted vector and sorted set is unfair. The vector should be sorted and then we might compare plain binary search against BST performance.
[+] recentdarkness|7 years ago|reply
It is not unfair, and there is a comparison with a sorted vector as well. And the author correctly indicated the reason why the sorted vector performs better than the unsorted vector: cache friendliness is key. And it shows that for small element counts an array or vector will perform better. It simply depends on your needs.

I would be interested in comparing a stack-based array of integers to the sets as well.

[+] iainmerrick|7 years ago|reply
The article includes binary search of sorted vectors.