
Scalable Bloom Filters (2007) [pdf]

126 points|setra|8 years ago|citeseerx.ist.psu.edu

34 comments

[+] signa11|8 years ago|reply
this is a kind of tangential comment/rant:

but to me it seems that research papers _must_ be, for lack of a better term, runnable. i, and hopefully others as well, would like to replicate all these wonderful results that are advertised in these papers. without that, they are all just advertisements of scholarship rather than scholarship itself. a set of instructions + environment which generated these figures would be very welcome.

<end-rant>

on the subject of bloom filters, have a look at this one: https://www3.cs.stonybrook.edu/~ppandey/files/p775-pandey.pd... (A General-Purpose Counting Filter: Making Every Bit Count)

[+] opencoff|8 years ago|reply
I used this paper to build a scalable bloom filter for use in an ad-tech stack. Its performance was better than DJB's CDB.

https://github.com/opencoff/portable-lib

The bloom filter code is in src/bloom.c; the header file is in inc/utils/bloom.h.

I implemented serialization/deserialization of the bloom filters as well (src/bloom_marshal.c).

The tests are in test/t_bloom.c.

[+] esturk|8 years ago|reply
Can someone help me understand the query part?

It says that a query is done on each BF, even the ones that were added after the initial storage. So suppose we have only 2 iterations. The first BF uses k0 hash functions, and the 2nd (iteration) BF now uses k1 hash functions.

So naturally, an item is stored using the k0 hash functions. But in order to query, I run it against the k1 hash functions, which is a larger set. If any one of the k1-k0 extra hash functions returns 0, won't that be a false negative?

[+] cmurphycode|8 years ago|reply
Yes, you are right - you have to query each filter.

To keep the contract of the bloom filter, a "no" can only come if ALL filters return no. So if one filter returns 0 and another returns 1, the answer is maybe (i.e. a yes with some false positive probability). If we instead answered no, it would be a false negative, as you stated, so we can't do that. This paper doesn't defeat this property of bloom filters.

The cool insight in this paper is that the way you choose each new filter's size gives a relatively nice tradeoff between "wasted" size and a target for the effective false positive ratio, even in the face of growth. Each added filter increases the probability of false positives, but depending on how you pick the sizes, you can do better than simply allocating another bloom filter of the same size, forever and ever.

(When I say "wasted" size, I mean the extra bits you need to hit a certain false positive ratio, compared to a filter sized properly from the get-go. In essence, you pay some overhead once you reach a certain size. In exchange, you do not need to have guessed the size correctly or allocated all that memory from the get-go.)
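The query rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual construction: the class names, the double-hashing scheme, and the growth policy (double the bits, add a hash) are my own assumptions.

```python
import hashlib

class PlainBloom:
    """Fixed-size Bloom filter using double hashing (a common trick,
    assumed here for simplicity; not necessarily what the paper uses)."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)        # one byte per bit slot, for clarity

    def _positions(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

class ScalableBloom:
    """Chain of filters: insert into the newest, query ALL of them.
    The answer is "maybe present" iff ANY filter says yes, so an older
    filter can never produce a false negative."""
    def __init__(self, m0=1024, k0=4):
        self.filters = [PlainBloom(m0, k0)]
        self.count, self.capacity = 0, m0 // 10

    def add(self, item):
        if self.count >= self.capacity:     # grow: append a larger filter
            last = self.filters[-1]
            self.filters.append(PlainBloom(last.m * 2, last.k + 1))
            self.count, self.capacity = 0, (last.m * 2) // 10
        self.filters[-1].add(item)
        self.count += 1

    def __contains__(self, item):
        return any(item in f for f in self.filters)
```

Note the any() in the query: a "no" requires every filter in the chain to say no, which is exactly the contract discussed above.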

[+] bhouston|8 years ago|reply
Hacker News loves upvoting articles about bloom filters (and Bayesian probability - it's on the front page again this evening). Personally I've never found a use for either of them in practice.
[+] susam|8 years ago|reply
We used bloom filters at RSA Security to build space-efficient indexing.

Problem: For a network security product, we stored security data in a distributed, hierarchical, time-series, key-value database. Creating a full inverted index to improve the performance of our SQL-like query engine would have increased storage space requirements, forcing customers to buy more storage hardware.

Solution: Bloom filters helped us perform fast probabilistic lookups to check whether a specific value for a specific key has a high likelihood of being in a particular database block, while requiring less storage. At query time, if the bloom filter says that the value does not exist in a specific block, we safely skip that block, thus reducing search time. If the bloom filter says that the value might exist in a specific block, then we parse that block to see if the value indeed exists. With 4 hash functions (k = 4), 10007 bits per bloom filter (m = 10007), and a new bloom filter for every 1000 values (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-k * n / m))^k = 0.0118). In practice, over a period of 5 years, we found that the actual false positive rate varied between 1.13% and 1.29%.

Summary: Overall, bloom filters led to a 30-fold increase in query performance while requiring only 1/25th the space a full inverted index would take.
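The 1.18% figure above can be reproduced with a one-liner, using the standard Bloom filter false-positive approximation p ≈ (1 - e^(-kn/m))^k:

```python
import math

# parameters from the comment above: 4 hash functions,
# 10007 bits per filter, a new filter every 1000 values
k, m, n = 4, 10007, 1000

p = (1 - math.exp(-k * n / m)) ** k
print(f"theoretical false-positive rate: {p:.4f}")  # ~0.0118, i.e. ~1.18%
```

The observed 1.13%-1.29% range sits right around this estimate, which is what you'd hope for from the approximation.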

[+] traek|8 years ago|reply
Bloom filters are popular mostly because they’re clever, but they can also be useful.

E.g. Google Chrome's safe browsing feature used Bloom filters to check whether a site might be flagged for malware, and only if necessary queried Google to check whether it is actually flagged. It would have been impractical to download the full safe browsing dataset to every client, and also impractical to query Google for every single site visited. Bloom filters were a good solution. (Note: they've since moved to a custom data structure[1], which is also interesting.)

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=71832

[+] biesnecker|8 years ago|reply
I’ve used them in production several times, most recently when I needed to quickly check whether a set of numbers was likely all IDs of a particular kind of object, where the ID space is 64-bit ints and the population is in the hundreds of millions. We didn’t need perfection, just a bounded error rate and predictable memory consumption, and a pre-loaded BF was perfect in this role.
[+] cryptonector|8 years ago|reply
They have one use: as a fast check for whether you should bother with a slower check. Bloom filters are basically a sort of pre-cache.
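That "fast check before a slow check" pattern is simple enough to sketch. This is a generic illustration, assuming any filter object supporting `in` and any slow store with a `get` method; the names are placeholders:

```python
def lookup(key, bloom, slow_store):
    # A Bloom filter has no false negatives, so "not in filter" is
    # definitive and we can skip the expensive lookup entirely.
    if key not in bloom:
        return None
    # "in filter" only means *maybe*: fall through to the real check,
    # which settles the occasional false positive.
    return slow_store.get(key)
```

The win is that the common negative case never touches the slow path; the rare false positive just costs one wasted slow lookup.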
[+] foo101|8 years ago|reply
> Hacker news loves upvoting articles about bloom filters (and Bayesian probability - its on the front page again this evening.) Personally I've never found a use for either of them in practice.

Perhaps that speaks about the type of work you are involved in rather than about the Hacker News community?

[+] dsacco|8 years ago|reply
> Personally I've never found a use for either of them in practice.

One of my servers batch-downloads approximately 550GB of data per day. I use a bloom filter to deduplicate the data significantly faster than a naive Unix sort or awk '!X[$0]++'. It also has superior asymptotic complexity.

[+] btown|8 years ago|reply
Extremely useful for anyone doing real-time analytics or filtering. Answering "has this person visited this page before" or "how many distinct people have visited this page/website" can be done exactly with map-reduce or streaming pipelines, but if you want an immediate answer with low latency and can tolerate false positives/bounded inaccuracy, you can simply keep a Bloom filter or HyperLogLog in memory. For example, see the top comment here: https://news.ycombinator.com/item?id=12124722
[+] 3131s|8 years ago|reply
I use them for natural language processing. There are cases where the performance increase is enormous (as in I wouldn't be able to afford the computation without them) and the downside of false positives is or can be made irrelevant.
[+] peteretep|8 years ago|reply
And in fact anything that sounds sort of computer-sciencey.
[+] loeg|8 years ago|reply
I thought I might have a use for them in practice, but it turns out they are only useful for sparse datasets: if your dataset can be fully populated, a bitmap is more efficient (for any reasonable degree of accuracy).
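That trade-off can be made concrete: a bitmap costs one bit per *possible* key, while a Bloom filter costs roughly 1.44 * log2(1/p) bits per *stored* key. A rough comparison, assuming a 1% target false-positive rate and a 32-bit key space (numbers chosen for illustration only):

```python
import math

def bitmap_bits(universe_size):
    # exact membership: one bit per possible key
    return universe_size

def bloom_bits(n_items, p=0.01):
    # optimal Bloom sizing: m = -n * ln(p) / (ln 2)^2, about 9.6 bits
    # per stored item at p = 1%
    return math.ceil(-n_items * math.log(p) / math.log(2) ** 2)

universe = 2 ** 32        # e.g. every possible 32-bit key
sparse = 1_000_000        # only 1M keys actually present

# sparse population: the Bloom filter is far smaller than the bitmap
assert bloom_bits(sparse) < bitmap_bits(universe)
# fully populated: the bitmap is smaller (and exact, to boot)
assert bitmap_bits(universe) < bloom_bits(universe)
```

So the crossover is exactly the sparsity argument above: once the stored set approaches the size of the key space, the per-item Bloom overhead exceeds the flat one-bit-per-key cost of a bitmap.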