vlmutolo | 8 days ago

This article is a little confusing. I think this is a roundabout way to invent the blocked bloom filter with k=2 bits inserted per element.

It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented block bloom filters from there.

Overall, I think block bloom filters should be the default most people reach for. They completely solve the cache locality issues (single cache miss per element lookup), and they cost only about 10–15% extra space to do it. I had a simple implementation running at something like 20ns per query with maybe k=9. It would be about 9x that for naive Bloom filters.

There’s some discussion in the article about using a single hash to come up with various indexing locations, but it’s simpler to just think of block bloom filters as:

1. Hash-0 gets you the block index

2. Hash-1 through hash-k get you the bits inside the block

If your implementation slices up a single hash to divide it into multiple smaller hashes, that’s fine.
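A minimal sketch of that two-step view, slicing one wide hash into the block index and the in-block bit positions. The 512-bit block size, the 128-bit BLAKE2b stand-in hash, and the names `BlockedBloom` and `hash128` are assumptions for illustration, not anything taken from the article:

```python
import hashlib

BLOCK_BITS = 512   # one 64-byte cache line per block (assumed size)
INDEX_BITS = 9     # log2(BLOCK_BITS): bits needed to address one bit in a block


def hash128(key: bytes) -> int:
    """Stand-in 128-bit hash; any well-mixed hash of this width would do."""
    return int.from_bytes(hashlib.blake2b(key, digest_size=16).digest(), "little")


class BlockedBloom:
    def __init__(self, num_blocks: int, k: int = 8):
        # 32 bits pick the block, then k slices of 9 bits pick bits inside it.
        assert 32 + INDEX_BITS * k <= 128, "not enough hash bits for this k"
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks  # each int stands in for one 512-bit block

    def _block_and_mask(self, key: bytes):
        h = hash128(key)
        block = (h & 0xFFFFFFFF) % self.num_blocks   # "hash-0": block index
        h >>= 32
        mask = 0
        for _ in range(self.k):                      # "hash-1" .. "hash-k"
            mask |= 1 << (h & (BLOCK_BITS - 1))
            h >>= INDEX_BITS
        return block, mask

    def add(self, key: bytes) -> None:
        block, mask = self._block_and_mask(key)
        self.blocks[block] |= mask

    def might_contain(self, key: bytes) -> bool:
        block, mask = self._block_and_mask(key)
        return (self.blocks[block] & mask) == mask
```

Every lookup touches exactly one entry of `blocks`, which is the single-cache-miss property; a real implementation would use a flat array of 64-byte blocks rather than Python integers.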

sakras|8 days ago

Yeah, I kind of think the authors didn't conduct a thorough-enough literature review here. There are well-known relations between the number of hash functions you use and the FPR, cache-blocking and register-blocking are classic techniques (Cache-, Hash-, and Space-Efficient Bloom Filters by Putze et al.), and there are even ways of generating patterns from only a single hash function that work well (shamelessly shilling my own blogpost on the topic: https://save-buffer.github.io/bloom_filter.html)
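For reference, the textbook approximation relating k to the false positive rate of a standard (non-blocked) Bloom filter can be sketched as below. The function names are made up for this example, and the formula is the usual approximation, not something taken from the article or the linked post:

```python
import math


def bloom_fpr(bits_per_key: float, k: int) -> float:
    """Approximate FPR of a standard Bloom filter: (1 - e^(-k/c))^k, c = m/n."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def optimal_k(bits_per_key: float) -> int:
    """k = c * ln 2 minimizes the approximation above (rounded to an integer)."""
    return max(1, round(bits_per_key * math.log(2)))
```

At 10 bits per key this gives an optimal k of 7 and an FPR a bit under 1%, while k=2 at the same size lands at roughly 3%.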

I also find the use of atomics to build the filter confusing here. If you're doing a join, you're presumably doing a batch of hashes, so it'd be much more efficient to partition your Bloom filter, lock the partitions, and do a bulk insertion.
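A rough sketch of the partition, lock, and bulk-insert idea, assuming the join already hands over a batch of 64-bit hashes. The partition count, the word layout, the two bits per key, and the `PartitionedBloom` name are all hypothetical; Python only shows the shape here, not the performance:

```python
import threading
from collections import defaultdict

NUM_PARTITIONS = 16
WORDS_PER_PARTITION = 1024   # 64-bit words per partition (assumed size)


class PartitionedBloom:
    def __init__(self):
        self.parts = [[0] * WORDS_PER_PARTITION for _ in range(NUM_PARTITIONS)]
        self.locks = [threading.Lock() for _ in range(NUM_PARTITIONS)]

    @staticmethod
    def _locate(h: int):
        """Split a 64-bit hash into (partition, word index, 2-bit mask)."""
        part = h % NUM_PARTITIONS
        word = (h >> 8) % WORDS_PER_PARTITION
        mask = (1 << ((h >> 32) & 63)) | (1 << ((h >> 40) & 63))
        return part, word, mask

    def bulk_insert(self, hashes) -> None:
        # Pass 1: bucket the batch by partition, without touching the filter.
        by_part = defaultdict(list)
        for h in hashes:
            part, word, mask = self._locate(h)
            by_part[part].append((word, mask))
        # Pass 2: take each partition's lock once and apply its whole batch.
        for part, items in by_part.items():
            with self.locks[part]:
                words = self.parts[part]
                for word, mask in items:
                    words[word] |= mask

    def might_contain(self, h: int) -> bool:
        part, word, mask = self._locate(h)
        return (self.parts[part][word] & mask) == mask
```

Each thread pays one lock acquisition per partition per batch instead of one atomic operation per bit.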

thomasmg|8 days ago

Your blogpost is great! Except for one detail: you have used modulo n. If n is not known at compile time, multiply+shift is much faster [1]. Division and modulo (remainder) are slow, except on Apple silicon (I don't know what they did there). BTW for blocked Bloom filters, there are some SIMD variants that seem to be simpler than yours [2] (maybe I'm wrong, I didn't look at the details, it just seems yours uses more code). I implemented a register-based one in Java here [3].
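The multiply+shift reduction from [1] can be sketched as follows. `fast_range32` is a name picked for this example; in C or Java the same expression would be a 64-bit multiply of two 32-bit values followed by a shift:

```python
def fast_range32(hash32: int, n: int) -> int:
    """Map a 32-bit hash to [0, n) with a multiply and a shift, no division."""
    return ((hash32 & 0xFFFFFFFF) * n) >> 32
```

Unlike `hash % n`, the result is driven by the high bits of the hash, so the hash needs to be well mixed in its upper bits.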

Bulk insertion: yes, if there are many keys, bulk insertion is faster. For xor filters, I used radix sort before insertion [4] (I should have documented the code better), but for fuse filters and blocked Bloom filters it might not be worth it, unless the filter is huge.

[1] https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-... [2] https://github.com/FastFilter/fastfilter_cpp/blob/master/src... [3] https://github.com/FastFilter/fastfilter_java/blob/master/fa... [4] https://github.com/FastFilter/fastfilter_cpp/blob/master/src...

vlmutolo|7 days ago

Very interesting blog post. I’d never seen that method for quickly computing the patterns. I thought I had done a lot of research on bloom filters, too!

holoskii|5 days ago

Post author here. Yes, you are correct. I made this code change 3 years ago when I was a junior dev, and I was not familiar with blocked bloom filters at the time. Looking back, it’s cool to see that I accidentally reinvented a basic blocked bloom.

I was also limited by the constraints of legacy code. This project was not a complete rewrite, but just an idea: "can we use more information from that 32-bit hash that we receive without regressing any perf". We didn't have time for deep research or a rewrite, so I just wanted to show the result of this small exercise on how we can make things run better without rewriting the world.

Sesse__|8 days ago

> Overall, I think block bloom filters should be the default most people reach for.

I think this depends on how big your filters are. Most people think of Bloom filters as having to have hundreds of thousands of elements, but I frequently find them useful all the way down to 32 bits (!). (E.g., there are papers showing chained hash tables where each bucket has a co-sited tiny Bloom filter to check if it's worth probing the chain.) In the “no man's land” in-between with a couple ten thousand buckets, the blocking seems to be mostly negative; it only makes sense as long as you actually keep missing the cache.
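To make the 32-bit case concrete, here is a toy chained hash table where each bucket carries its own tiny filter. It is an illustrative sketch, not the design from any particular paper; `FilteredChainTable`, the splitmix64-style `mix64` mixer, and the choice of two bits per key are all assumptions:

```python
MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """splitmix64-style finalizer, used here as a stand-in hash for int keys."""
    x &= MASK64
    x ^= x >> 30
    x = (x * 0xBF58476D1CE4E5B9) & MASK64
    x ^= x >> 27
    x = (x * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


class FilteredChainTable:
    """Toy chained table (no update or delete handling)."""

    def __init__(self, num_buckets: int = 64):
        self.chains = [[] for _ in range(num_buckets)]
        self.filters = [0] * num_buckets   # one 32-bit Bloom filter per bucket
        self.chain_walks = 0               # instrumentation for the example

    def _locate(self, key: int):
        h = mix64(key)
        bucket = h % len(self.chains)
        # Two 5-bit slices from the upper half choose two of the 32 filter bits.
        mask = (1 << ((h >> 32) & 31)) | (1 << ((h >> 37) & 31))
        return bucket, mask

    def put(self, key: int, value) -> None:
        bucket, mask = self._locate(key)
        self.filters[bucket] |= mask
        self.chains[bucket].append((key, value))

    def get(self, key: int):
        bucket, mask = self._locate(key)
        if (self.filters[bucket] & mask) != mask:
            return None                    # definitely absent: skip the chain
        self.chain_walks += 1
        for k, v in self.chains[bucket]:
            if k == key:
                return v
        return None
```

The filter sits next to the bucket head, so a miss can often be answered without walking the chain at all.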

vanderZwan|7 days ago

Are you talking about Cuckoo++ tables, perhaps? If not can you point me to the hash table you had in mind? Always fun to learn of a new approach.

https://github.com/technicolor-research/cuckoopp

vlmutolo|7 days ago

Yeah, I agree with this. I think there are open addressing hash tables like Swiss Table that do something similar. IIRC, they have buckets with a portion at the beginning with lossy “fingerprints” of items, which kind of serve a similar purpose as a bloom filter.

hinkley|7 days ago

Bloom filters are useful for sharding so it stands to reason that a hash table implemented with shards would benefit.

hinkley|7 days ago

The problem is that Bloom isn’t close to the theoretical space bound for the idea it implements, and if you add 15% on top, it starts becoming attractive to switch to a filter that gets a tighter bound on the space complexity.
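A back-of-the-envelope version of that argument, using the standard results that an approximate membership structure needs at least log2(1/ε) bits per key and that a Bloom filter at its optimal k needs that divided by ln 2. The 15% figure is the blocked overhead quoted upthread, and the function names are made up for this sketch:

```python
import math


def lower_bound_bits(fpr: float) -> float:
    """Information-theoretic minimum bits per key for a given FPR."""
    return math.log2(1.0 / fpr)


def bloom_bits(fpr: float) -> float:
    """Bits per key for a standard Bloom filter at its optimal k."""
    return math.log2(1.0 / fpr) / math.log(2)


def blocked_bloom_bits(fpr: float, overhead: float = 0.15) -> float:
    """Same, with an assumed extra space overhead for blocking."""
    return bloom_bits(fpr) * (1.0 + overhead)
```

For a 1% FPR this works out to about 6.6 bits per key at the bound, about 9.6 for a standard Bloom filter, and about 11 with the assumed 15% on top, so roughly 1.44x and 1.66x the bound respectively.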

SolarNet|7 days ago

Cause it was written by AI. The entire mid section is classic AI slop writing. Repeating the same points and numbers over and over, repackaging the same idea with "key takeaway" and shit. The voice of the author is heavily AI coded there.