(no title)
nulptr | 5 years ago
If each seed is a 32-bit int, and each hash is 60 bits, then you'll have to store 92 bits for each (seed, hash) pair.
With 2^32 (~4.3 billion) (seed, hash) pairs, you'd need 2^32 * 92 bits ~= 50 gigabytes.
But you mention only needing 8-12GB. Is my calculation wrong, or have I missed something?
Thorrez|5 years ago
nulptr|5 years ago
But, instead of using a hash table to store (seed, hash) pairs, we can just use a list of only hashes. To generate the next hash we use current hash. i.e. next_hash = hash("example.com/" + string(int64(cur_hash))). So now, we just store 60-bit hash values which takes up 2^30 * 60 bits = 8gigs.
The only other problem is detecting if a hash already exists, which I think can be done using a space efficient probabilistic hashset like a bloom filter (and when the bloom filter says the hash already exists, we do a linear scan to find where it exists and we have a possible collision).
manulp|5 years ago
bluesign|5 years ago
unknown|5 years ago
[deleted]