Understanding Bloom filters by building one

[+] hoppla|3 years ago|reply

Last time I used bloom filters, was when I had MANY BIG files containing dns queries and responses. Instead of putting the data in a database and create indexes (project would not the defend the cost of having a database), I created a bloom filter per file. Whenever I needed to search for a query or response, I would quickly know which files I needed to look in.

I think I used this module https://github.com/remram44/python-bloom-filter

[+] neoneye2|3 years ago|reply

In my spare time project I use a bloom filter, here: https://github.com/loda-lang/loda-rust/blob/develop/rust_pro...

I use 4 bloomfilters for rejecting programs that aren't present in the oeis database. First 10 terms are computed, and reject if the terms aren't present in the 1st bloomfilter. Next 10 terms more are computed, and rejected if the 20 terms aren't present in the 2nd bloomfilter. All the way up to 40 terms. If a mined program has possible 40 correct terms, there is a chance that the program may compute the entire oeis integer sequence correct.

Shameless self promotion: If you have cpu resources to spare, then please contribute to LODA, an open source project for mining oeis integer sequences. https://loda-lang.org/

[+] sorokod|3 years ago|reply

Google Guava has industrial strength implementation documented here[1] with sources here [2].

Tests and hints on how to drive this: [3]

[1] https://guava.dev/releases/snapshot-jre/api/docs/com/google/...

[2] https://github.com/google/guava/blob/master/guava/src/com/go...

[3] https://github.com/google/guava/blob/master/guava-tests/test...

[+] ashdnazg|3 years ago|reply

I've used Bloom filters in a Battlecode [1] AI. We had a shared array where reads and writes are expensive, so a bloom filter was used in order to not waste time on writing something that's already written there.

[1] https://battlecode.org/

[+] celeritascelery|3 years ago|reply

I have wondered if bloom filters could be used as a cache on hash tables. It really comes down to if the bloom filter hashes can be computed significantly faster then doing the hashmap lookup (which includes hashing and probing).

[+] bo1024|3 years ago|reply

1. Remember that a bloom filter only answers yes/no questions. So it can't cache hash table values, but it can cache set membership (is X in the hash table?).

2. Rather than speed per se, a better reason is space. You have a small bloom filter that lives close to the query. The query asks "is this entry in the hash table?" and get a local answer from the bloom filter, either "no" or "maybe yes". The answer is fast not because the hashing is fast, but because the small size allows it to be local and fast. Only for the "maybe yes" answers do you look it up in the hash table which is large and lives far away. This saves a lot of time if most answers are "no".

[+] vaibhavsagar|3 years ago|reply

Doing a hashmap lookup is already very much like using a bloom filter: for any addressing scheme, if an element exists at a particular location, then either the key exists or it's a false positive. If no element exists, then there's no way the key exists. Adding an extra bloom filter would then come down to trading off space usage against a different probability of false positives, in addition to an extra lookup, and I think one would have to contrive an extremely unlikely scenario for it to make sense.

[+] aaronblohowiak|3 years ago|reply

that is a very common application of them, afaiu. though, the size and locality of the hashtable plays a significant role here. i may be taking a too generalized view of what you mean by "hash table"

[+] fxtentacle|3 years ago|reply

Yes. I forgot where, but I saw one game engine have bitmaps that where exactly 64 bytes as a super quick pre-check that fits into one CPU cache line => much faster than RAM access.

[+] newbye3|3 years ago|reply

My first impression is that a better algorithm is obtained with one vector for each hash function, but now I am lazy to do the math.

[+] smegsicle|3 years ago|reply

it would be a bit more precise of course, but would use three times the memory at any hash size...

[+] TrumpRapedWomen|3 years ago|reply

[deleted]

15 comments