
SIMD within a register: How I doubled hash table lookup performance

193 points | axeluser | 7 months ago | maltsev.space

40 comments

curiouscoding|7 months ago

Some questions/remarks:

- in the initial `Contains` code snippet (and all early-break variants), most performance is probably lost by branch misses on which element returns true. Probably much better is something like `table[0] == fp | table[1] == fp | table[2] == fp | table[3] == fp` (which might actually get optimized to int/SIMD anyway; who knows).

- What is shown in the table with benchmarks? I assume 'operations' is the number of lookups, but how large is the array? Memory latency is probably one of the big effects on overall runtime. It's probably also more useful to show time per lookup than a mean 'total time'.

- Also, how do you run the benchmarks? If you make a fixed array of queries and query those many times, the same cachelines will be hit and you won't measure memory latency. Also, in this case the branch predictor might learn all branches, so it's best to do each query only once in your benchmark.

- You don't really comment on the differences between benches with different numbers of 'operations'. Are there any takeaways from this? (Otherwise just don't show them?)

- Maybe you could try with some `u8x8` SIMD instructions as well? Although the bitmasking is also cute as-is :)

axeluser|7 months ago

Hi, thank you for your questions: 1. The "one-line" boolean condition still has jumps in the JIT-generated asm, at least to short-circuit and exit as soon as possible. I guess the C# compiler is far too conservative for such optimisations.

2 and 3. The "operations" are both the array's size (actually the nearest power of 2) and the number of lookup operations. I first thought that latency should be equal for both cases, but my benchmarks may be too naive - they test lookup by sequential memory access, i.e., iterating buckets from 0 to n, and iterating from the first to the last element in each bucket. I'll try to experiment with a more shuffled workload in my free time. Thank you for the question; I've not thought about it deeply. Also, counting mispredictions is a good idea; I'll add this metric to my benchmarks to get a complete picture.

4. I didn't pay much attention to that because the trend was still the same, and even the ratio was similar for different counts of lookups per underlying memory layout. There was just some minor deviation, but I ignored it. Maybe I'll run more granular tests with hardware counters to catch some insights.

5. SIMD, in my opinion, would be an overhead there because for a single lookup in my configuration, I'm using a single 32-bit value, and Cuckoo Filter will touch at max 64 bits (2 lookups). However, I was planning to experiment with batching of lookups; I just haven't reached it yet.

curiouscoding|7 months ago

Another remark:

If you benchmark it as something like `for q in queries { contains(q); }`, especially the branchless variants are probably executed in parallel by the CPU, and you are measuring throughput instead of latency. That may or may not be relevant depending on the application.

mizmar|7 months ago

Something similar is used in swiss tables - metadata bucket entries are a 1-bit occupancy marker and the 7 MSB bits of the hash (don't remember how tombstones are represented). The metadata table is scanned first, the upper part of the hash should filter out most colliding entries, and the "false positives" lead to a probe in the entry table and a key comparison (possibly optimized with a full-hash comparison).

Buckets use 16 bytes because sse2 and arm neon SIMD are basically guaranteed.

I was shocked to read how swiss tables - proclaiming to be the fastest hash tables - work. It's just open-addressing linear probing with no technique to deal with collisions and clustering. Plus the initial version rounded hash%capacity to bucket size, thus using 4 fewer bits of hash, leading to even more collisions and clustering. Yet the super fast probe apparently made it not an issue? Mind-boggling. (A later version allowed scanning from an arbitrary position by mirroring the first bucket as the last.)
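The metadata scan described above can be sketched with SSE2 intrinsics in C — a minimal illustration of the technique, not abseil's actual implementation:

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Scan a 16-byte metadata group for a hash fragment h2 in one shot.
   Returns a bitmask with bit i set where group[i] == h2; any set bit
   is a candidate that still needs a key comparison in the entry
   table, since 7-8 bits of hash admit false positives. */
static unsigned match_group(const uint8_t group[16], uint8_t h2) {
    __m128i g = _mm_loadu_si128((const __m128i *)group);
    __m128i m = _mm_cmpeq_epi8(g, _mm_set1_epi8((char)h2));
    return (unsigned)_mm_movemask_epi8(m);
}
```

Sixteen candidate slots are rejected or confirmed in three instructions, which is why a plain linear probe over such groups stays fast even with some clustering.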

jandrewrogers|7 months ago

A simple linear probe is very efficient if the hash quality is high enough. More complex hash tables have the useful property that their performance is less sensitive to lower hash quality.

Historically, the computational cost of excellent hash quality was too high, so the overall cost of a hash table was lower using a worse hash and more complex table design. Today, it is possible to generate high quality hashes with minimal computational cost, so complex hash table designs are less useful. Modern hash table performance can largely be reduced to the average number of cache lines (both data and instruction) touched by the operations.
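One example of "high quality at minimal cost" is the splitmix64 finalizer - a handful of shifts and multiplies that turns even a weak integer hash (identity included) into a well-mixed one:

```c
#include <stdint.h>

/* splitmix64 finalizer: every input bit affects every output bit,
   at the cost of two multiplies and three xor-shifts. */
static uint64_t mix64(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}
```

A mixer like this is cheap enough to run on every key, which is what makes the simple-table-plus-good-hash combination viable.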

Sesse__|7 months ago

The absl tables (the canonical Swisstables) are good, but not the fastest. Of course, everything will depend on your hardware, compiler and usage patterns, but if you look at an all-round benchmark like https://martin.ankerl.com/2022/08/27/hashmap-bench-01/, they're about in the middle of the pack even when picking the best-working hash functions (and really bad if you happen to have a bad one). This mirrors my own experience.

Of course, absl tables will have been improved since 2022, but there are many competitors that have improved as well. And you could argue “well, who ever copies or iterates over a map”, but even if you take away those categories, they're not a clear leader really anywhere else either.

tialaramex|7 months ago

The Swiss Table is not presently the fastest known design for many purposes - other open-addressed schemes take that honour today - and yes, nobody who wants a fast modern hash table tries to mitigate collisions. Just use a decent hash and the collisions are rare enough to have negligible effect, so long as your scheme degrades rather than failing.

The C++ approach is focused around mitigating the price of collision. Accept you lost, and now try to make that not sting too badly. This made some sense because their provided hash for the integers is typically literally the identity function; collision is thus very frequent and predictable. But "Probably we lost, let's mitigate that" isn't a route to success, and it only made sense when memory fetches were cheap and ALU operations were expensive. Forty years ago that was a sane choice, thirty years ago when C++ standardization was in progress it was already looking dodgy, fifteen years ago when they actually standardized this feature it was already very stupid.

Open addressed strategies begin by assuming you probably won. Your first main memory fetch is probably enough to know if this key is present, and if it is the next fetch probably gets the key and value.

The std::unordered_map strategy will only tell you if the key was not present on its first fetch some small fraction of the time; on every other occasion you must do a second fetch to even discover whether the key might be present, and several more fetches are needed on average to get a final key + value or certainty that it's not found.

dietr1ch|7 months ago

I think it just exploits the fact that reading a few contiguous cache lines at random isn't a whole lot more expensive than reading a single random byte.

After I started experimenting with my priority queues, I found myself never wanting to use binary heaps, and instead used a wider one with an arity that covers at least a whole cache line, if not multiple ones. Wider nodes meant fewer jumps, and jumping between nodes was more expensive (time, cache misses) than the work done at each node.
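The "arity that covers a cache line" idea comes down to the index arithmetic of a d-ary heap - a minimal sketch, assuming 4-byte keys so d = 16 children fill one 64-byte line:

```c
#include <stddef.h>

/* d-ary heap index math: with D = 16 four-byte keys, one node's
   children occupy exactly one 64-byte cache line, so descending a
   level costs roughly one line fetch regardless of arity. */
#define D 16
static size_t first_child(size_t i) { return i * D + 1; }
static size_t parent(size_t i)      { return (i - 1) / D; }
```

The heap gets shallower (log base D instead of log base 2), trading a bit more comparison work per node - done on data already in cache - for far fewer cache-missing jumps.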

adgjlsfhk1|7 months ago

tombstones use the same bit as empty (empty and deleted don't have hash codes, so there's room).

the key realization behind swissdict is that all the complications people add are only necessary if you have a bad hash function, so you can just use a good one and be happier.

linear probing also has the significant advantage that it allows removing tombstones whenever the next entry is empty

teo_zero|7 months ago

> Later version allowed to scan from arbitrary position by mirroring first bucket as last.

I don't think this would help. The real issue with an arbitrary position is that you can't load 16 bytes into a 128-bit SIMD register if the memory is not aligned. The solution I found is to unroll the first iteration and mask out the results found before the initial offset.

Rietty|7 months ago

> Yet the super fast probe apparently made it not an issue?

Could you explain how that makes it a non-issue? It seems counter-intuitive to me that it solves the problem by just probing faster?

Const-me|7 months ago

Great idea.

BTW, when I implement similar logic in C#, I place expressions like `((xored - 0x01010101U) & ~xored & 0x80808080U)` inside `unchecked` blocks, because when I compile my C#, I set the `<CheckForOverflowUnderflow>true</CheckForOverflowUnderflow>` compiler option.

On modern computers the performance impact of that option is negligible. I’ve encountered several cases where OverflowException produced by integer arithmetic exposed bugs which were trivial to fix, but would have been insanely hard to detect and debug without that compiler option.
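The expression quoted above is the classic SWAR "find a matching byte" trick; here it is in C (where arithmetic wraps by default, so no `unchecked` equivalent is needed), with a hypothetical `match_bytes` name:

```c
#include <stdint.h>

/* SWAR byte match in a 32-bit word: XOR against a broadcast of fp
   turns matching bytes into zero bytes, then the "has zero byte"
   trick sets the high bit of each zero byte. The subtraction is
   meant to wrap, which is exactly why the C# version needs an
   unchecked block under CheckForOverflowUnderflow. */
static uint32_t match_bytes(uint32_t word, uint8_t fp) {
    uint32_t xored = word ^ (0x01010101U * fp);  /* broadcast fp */
    return (xored - 0x01010101U) & ~xored & 0x80808080U;
}
```

The result is nonzero exactly when some byte of `word` equals `fp`, with the high bit of each matching byte set - four comparisons for the price of a handful of scalar ops.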

qcnguy|7 months ago

That's cool. What sort of bugs are being found with the overflow checks that wouldn't also be caught by array range checks?

warangal|7 months ago

Nice write-up! It is about speeding up the underlying hash table for a `cuckoo filter`, where false positives are allowed - unlike a data structure like a dictionary, which guarantees a false-positive rate of 0.

Whenever I need to use a dictionary/hash table in a hot loop where the maximum possible number of iterations is known, I use a custom hash table initialized with that value. A standard library implementation may store the `hash` along with the `key` and `value` in case it needs to `resize`; in my case no resize is possible, so I don't store the `hash`, which leads to a speedup of about 15-20% due to fewer cache misses!
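A minimal sketch of that idea in C - hypothetical names throughout, with a fixed power-of-two capacity chosen from the known maximum, so there is never a resize and thus no reason to cache the hash in each slot:

```c
#include <stdint.h>
#include <string.h>

#define CAP 64  /* power of two >= the known maximum item count */

/* Each slot stores only key and value: no cached hash, because the
   table never rehashes. Sketch assumes the table is never full. */
typedef struct { uint32_t key; uint32_t val; uint8_t used; } Slot;
typedef struct { Slot slots[CAP]; } FixedMap;

static uint32_t hash32(uint32_t x) {  /* cheap integer mixer */
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

static void fm_put(FixedMap *m, uint32_t key, uint32_t val) {
    for (uint32_t i = hash32(key); ; i++) {      /* linear probe */
        Slot *s = &m->slots[i & (CAP - 1)];
        if (!s->used || s->key == key) {
            s->key = key; s->val = val; s->used = 1;
            return;
        }
    }
}

static int fm_get(const FixedMap *m, uint32_t key, uint32_t *out) {
    for (uint32_t i = hash32(key); ; i++) {
        const Slot *s = &m->slots[i & (CAP - 1)];
        if (!s->used) return 0;                  /* empty: not found */
        if (s->key == key) { *out = s->val; return 1; }
    }
}
```

Dropping the stored hash shrinks each slot, so more entries fit per cache line - which is where the claimed 15-20% comes from.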

Also, `cuckoo filters` seem to use a `robinhood hashing`-like strategy; used in a hot loop, wouldn't that lead to a higher insert cost?

axeluser|7 months ago

Thank you! You are right. Cuckoo Filters use the open-addressing technique to deal with collisions, which may result in a long eviction chain on inserts, or even failure if the load factor is high or with just bad luck. This is one major drawback you should consider before using them. https://maltsev.space/blog/010-cuckoo-filters#fingerprints-a...

jadenPete|7 months ago

This is an off-topic and low-quality comment, but the animated dots on the background of this website made me think I had dust on my device for a second.

axeluser|7 months ago

Yeah, there's such an issue. I'll fix that later.

alwahi|7 months ago

okay wait, i thought hash table lookup was already O(1)...

drysine|7 months ago

O(1) only means that the lookup time doesn't depend on the number of elements in the table. It can be 10 nanoseconds or 10 days.

ot|7 months ago

Now it's O(0.5)