top | item 43887790

(no title)

vlmutolo | 10 months ago

A simple extension of the Bloom filter called “block Bloom filters” fixes this. The idea is that the first hash is the index of a small constant-size block of your array, and the rest of the indices are within that block.

So a single query to the filter should only have one or two cache misses, depending on the size of your block. Or even if your block is larger than a cache line, you can probably issue all the loads at once and only pay for the latency of one memory access.

The downside of doing this is slightly more space usage relative to simple Bloom filters. I’d almost always reach for block Bloom filters, though, once the filter becomes a significant fraction of cache size.

I implemented block bloom filters for fairly large (~GB) arrays and saw about 35ns performance. They’re excellent data structures, pretty much as fast as you can get for approximate membership tests (though other filters have better space-time tradeoffs).

discuss

No comments yet.