(no title)
c0deb0t | 1 year ago
What I'm wondering is why the Kraken2 probabilistic hash table doesn't work. It uses 32 bits per element in an open addressing hash table. For 1 billion k-mers and 19 bits for the value, 32 - 19 = 13 bits of the key hash can be stored alongside the value, helping disambiguate hash collisions. If the load factor is 1.25x, then that's 4 * 10^9 * 1.25 = 5GB total, better than ~7GB. Also, this is only one cache miss (+ linear probing that can be SIMD accelerated) per lookup.
boyd|1 year ago
Yes, exactly.
> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.
I just skimmed the paper again (has been a while since a close reading), but my refreshed understanding is:
* Like the B-field, there are also false positives.
* When multiple hashed keys (k-mers) collide in the Kraken2 hash table, it has to store a "reduced" value for those key-value pairs. While there's an elegant solution for this issue for the problem of taxonomic classification (lowest common ancestor), it still results in a loss of specificity. There's a similar issue with "indeterminate" results in the B-field, but this rate can be reduced to ~0 with secondary arrays.
* The original Kraken2 paper describes using 17 bits for taxonomic IDs (~131K unique values). I don't know how many tax IDs current Kraken2 DB builds use offhand, but the error rate climbs significantly as you use additional bits for the value vs. key (e.g., to represent >=2^20 values, see Fig S4). I don't have a good sense for the performance and other engineering tradeoffs of just extending the hash code >32 bits. I also don't know what the data structure overhead is beyond those >32 bits/pair.
So, for a metagenomics classifier specifically, some subtle tradeoffs but honestly database quality and the classification algorithm likely matters a lot more than the marginal FP rates with either data structure -- we just happen to have come to this solution.
For other applications, my sense is a B-field is generally going to be much more flexible (e.g., supporting arbitrary keys vs. a specific fixed-length encoding) but of course it depends on the specifics.