Lessons from Hash Table Merging

exDM69|1 month ago

Maybe this would be a suitable application for "Fibonacci hashing" [0][1], which is a trick to assign a hash table bucket from a hash value. Instead of just taking the modulo with the hash table size, it first multiplies the hash with a constant value 2^64/phi where phi is the golden ratio, and then takes the modulo.

There may be better constants than 2^64/phi, perhaps some large prime number with roughly equal number of one and zero bits could also work.

This will prevent bucket collisions on hash table resizing that may lead to "accidentally quadratic" behavior [2], while not requiring rehashing with a different salt.

I didn't do detailed analysis on whether it helps on hash table merging too, but I think it would.

[0] https://probablydance.com/2018/06/16/fibonacci-hashing-the-o... [1] https://news.ycombinator.com/item?id=43677122 [2] https://accidentallyquadratic.tumblr.com/post/153545455987/r...

SkiFire13|1 month ago

> This will prevent bucket collisions on hash table resizing

Fibonacci hashing is really adding another multiplicative hashing step followed by dropping the bottom bits using a shift operation instead of the top bits using an and operation. Since it still works by dropping bits, items that were near before the resize will still be near after the resize and it won't really change anything.

willvarfar|1 month ago

Kudos, neat digging and writeup that makes us think :)

If you merge linear probed tables by iterating in sorted hash order then you are matching the storage order and can congest particular parts of the table and cause the linear probing worse case behaviour.

By changing the iteration order, or salting the hash, you can avoid this.

Of course chained hash tables don't suffer from this particular problem.

My quick thought is that hash tables ought keep an internal salt hidden away. This seems good to avoid 'attacks' as well as speeding up merging etc. The only downside I can think of is that the creation of the table needs to fetch a random salt that might not be quick, although that can alleviated by allowing it to be set externally in the table creation so people who don't care can set it to 0 or whatever. What am I missing?

kzrdude|1 month ago

Having a per-table key for the hash function is what siphash authors propose and what many do to combat dos attacks right? For example Rust's default HashMap.

The keys are hidden/secret to the system external to the application.

SkiFire13|1 month ago

> I evaluated the following hash table libraries, all based on linear probing.

> Abseil

> Rust standard

> hashbrown

These hash tables are not based on plain linear probing, they use something that's essentially quadratic probing done in chunks. Not sure about the others but they might be doing something similar.

attractivechaos|1 month ago

These three and boost are all based on swiss tables. They are indeed more robust than plain linear probing. khashl is the only one here using basic linear probing. Without salting, its curve is through the roof, much worse than swiss tables.

AlotOfReading|1 month ago

This easier to solve if you use a permutation instead of hash functions.

Let h0 be the larger table, and h1 the smaller. N = len(h0), M = len(h1). Pretend the elements of the tables are sequentially indexed. Element[0] is h0[0], Element[N] is h1[0], etc.

     h0' = Resize(h0, (N+M)*capacity_factor)

    for x in 0...(range):
      y = permute(x, 0, (N+M)*capacity_factor)
      if(y >= N) move_to(h0'[y], element[x])

One allocation and you move the minimum number of elements needed to eliminate primary clustering. Elements in h0 that aren't moved would presumably remain correctly indexed. You have to move the remaining elements of h1 as well, but that cluttered things.

Any randomish permutation works, from basic ones up to cryptographic. If your permutation only works on certain powers of two, iterate it until the result is in range.

tialaramex|1 month ago

In Rust, don't do this, it's more work and it'll tend to be slower, often much slower.

HashMap implements Extend, so just h0.extend(h1) and you're done, the people who made your HashMap type are much better equipped to optimize this common operation.

In a new enough C++ in theory you might find the same functionality supported, but Quality of Implementation tends to be pretty frightful.

OskarS|1 month ago

> HashMap implements Extend, so just h0.extend(h1) and you're done, the people who made your HashMap type are much better equipped to optimize this common operation.

Are you sure? I'm not very used to reading Rust stdlib, but this seems to be the implementation of the default HashMap extend [1]. It just calls self.base.extend. self.base seems to be hashbrown::hash_map, and this is the source for it's extend [2]. In other words, does exactly the same thing, just iterates through hash map and inserts it.

Maybe I'm misreading something going through the online docs, or Rust does the "random seed" thing that abseil does, but just blinding assuming something doesn't happen "because Rust" is a bit silly.

[1]: https://doc.rust-lang.org/src/std/collections/hash/map.rs.ht...

[2]: https://docs.rs/hashbrown/latest/src/hashbrown/map.rs.html#4...

khuey|1 month ago

From skimming the source code it looks like the merge operation here adds the values for duplicated keys rather than replacing the first value with the second value so using HashMaps's Extend impl won't work.

jesse__|1 month ago

> the people who made your HashMap type are much better equipped to optimize ...

Who's to say I'm not the one making the hashtable? There are plenty of real-world reasons the standard library hashtable may be either inaccessible or unsuitable.

Furthermore, the idea that "oh, honey, it's too hard, smart people did it for you" is insufferable and needs to die. When I'm the one making something, I have dramatically more information about the problem I'm trying to solve than the author of a hashtable library, and am therefore much better equipped to make design decision tradeoffs.

Please stop perpetuating the idea that 'just use a library' is unilaterally the best option. Sometimes, it's not.

wehateclusters|1 month ago

[deleted]

oleggromov|1 month ago

There's a typo with 'ULL' string suffixes in the hexadecimal numbers in the first code example.

rurban|1 month ago

No, 0xd6e8feb86659fd93ULL is a valid unsigned long long number. With stdint.h you'd get portable suffix macros, which would help on non-Windows, but they do look worse.

21 comments