item 24747036

nulptr | 5 years ago

I have a question about how you generated collisions.

If each seed is a 32-bit int, and each hash is 60 bits, then you'll have to store 92 bits for each (seed, hash) pair.

With 2^32 (~4.3 billion) (seed, hash) pairs, you'd need 2^32 * 92 bits ~= 50 gigabytes.

But you mention only needing 8-12GB. Is my calculation wrong, or have I missed something?
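A quick sanity check of that arithmetic (sizes in decimal gigabytes):

```python
# 2^32 (seed, hash) pairs at 32 + 60 = 92 bits each
pairs = 2**32
total_bytes = pairs * 92 // 8
print(round(total_bytes / 1e9, 1))  # 49.4, i.e. ~50 GB
```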

Thorrez | 5 years ago

Since the hash is 60 bits, a collision is expected after about 2^30 tries due to the birthday paradox. So by your math this would be about 12.3GB. That ignores hash table overhead, though, which I'd expect to be at least 2x. Maybe they just got lucky.
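The two numbers combine like this (a quick check, assuming a uniform 60-bit hash):

```python
import math

# Birthday bound: for an n-bit hash, a collision is expected after
# roughly sqrt(2^n) = 2^(n/2) tries.
n = 60
expected_tries = math.isqrt(2**n)   # exactly 2^30 here
bits_per_pair = 32 + 60             # (seed, hash) pair from the parent comment
gb = expected_tries * bits_per_pair / 8 / 1e9
print(expected_tries == 2**30, round(gb, 1))  # True 12.3
```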

nulptr | 5 years ago

I think I found a way to do it with ~8GB. As you mentioned, you only need about 2^30 tries due to the birthday paradox.

But instead of using a hash table to store (seed, hash) pairs, we can store a list of just the hashes and chain them, so each hash seeds the next one: next_hash = hash("example.com/" + string(int64(cur_hash))). Now we only store 60-bit hash values, which takes 2^30 * 60 bits = 8GB.

The only remaining problem is detecting whether a hash already exists in the list, which can be done with a space-efficient probabilistic set such as a Bloom filter (when the Bloom filter says the hash may already exist, we do a linear scan over the list to find where it is, and we have a possible collision).
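A minimal sketch of this chained-list-plus-Bloom-filter idea. The `h60` function (a truncated SHA-256) and the Bloom filter parameters are illustrative stand-ins, not the hash or code from the article:

```python
import hashlib

def h60(s):
    """Stand-in 60-bit hash: truncated SHA-256 (not the article's hash)."""
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big") >> 4

class Bloom:
    """Tiny Bloom filter: m bits, k probe positions per element."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, x):
        for i in range(self.k):
            # derive k bit positions by re-hashing x with a salt
            d = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, x):
        for p in self._positions(x):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, x):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(x))

def find_collision(max_tries):
    chain = []                        # only the 60-bit hashes, in chain order
    bloom = Bloom(m=16 * max_tries, k=4)
    cur = h60("example.com/0")
    for _ in range(max_tries):
        if bloom.might_contain(cur):
            if cur in chain:          # linear scan confirms the Bloom hit
                return chain.index(cur), len(chain)
        bloom.add(cur)
        chain.append(cur)
        cur = h60("example.com/" + str(cur))  # next hash seeded by current one
    return None                       # no collision within the budget
```

A real run would need ~2^30 iterations with a Bloom filter sized for ~2^30 entries, and the list packed at 60 bits per entry is the ~8GB figure above.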

bluesign | 5 years ago

You can also truncate the hash to 30 bits and use that to address a bucket. 12GB then gives you about 3x32 bits per bucket, in which you can store your candidates. Then compare the full hash against the candidates.
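A scaled-down sketch of this bucketing idea (2^20 buckets here for illustration; the comment proposes 2^30 buckets of three 32-bit slots, i.e. ~12GB, where the 30 remaining hash bits fit in a 32-bit slot):

```python
# Top B bits of the 60-bit hash address a bucket; the remaining 60-B bits
# are stored in that bucket as a candidate.
B = 20          # demo size; the comment proposes B = 30 (2^30 buckets)
SLOTS = 3       # ~12 GB / 2^30 buckets ~= three 32-bit slots per bucket

buckets = [[] for _ in range(1 << B)]

def insert(h):
    """Insert a 60-bit hash; return the colliding full hash if one is found."""
    idx = h >> (60 - B)                   # bucket address
    rest = h & ((1 << (60 - B)) - 1)      # candidate bits kept in the bucket
    for cand in buckets[idx]:
        if cand == rest:                  # bucket bits + candidate == full hash
            return (idx << (60 - B)) | cand
    if len(buckets[idx]) < SLOTS:         # if the bucket is full, drop the entry
        buckets[idx].append(rest)
    return None
```

Dropping candidates from full buckets loses a few entries, which slightly lowers the chance of catching a collision but keeps the memory bound fixed.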