but to me it seems that research papers _must_ be, for lack of a better term, runnable. i, and hopefully others as well, would like to replicate all these wonderful results that are advertised in these papers. without that, they are all just advertisements of scholarship rather than scholarship itself. a set of instructions + an environment which generated these figures would be very welcome.
It says that a query is done on each BF, even on the ones that were added after the initial storage. So suppose we have only 2 iterations. In the first BF there are k0 hash functions, and in the 2nd (iteration) BF there are now k1 hash functions.
So naturally, an item is stored using the k0 hash functions. But in order to query, I run against the k1 hash functions, which is a larger set. If any one of the k1 − k0 extra hash functions returns 0, won't that be a false negative?
Yes, you are right - you have to query each filter.
To keep the contract of the bloom filter, a "no" can only come if ALL filters return no. So if one of the filters returns 0 and the other returns 1, the answer is maybe (i.e. a yes with some false-positive probability). If we instead answered no, it would be a false negative, as you stated, so we can't do that. This paper doesn't defeat this property of bloom filters.
The cool insight in this paper is that the way you choose each new filter's size allows a relatively nice tradeoff between "wasted" size and a target for the effective false-positive ratio, even in the face of growth. We are increasing the probability of false positives, but depending on how you pick the sizes, you can do better than simply allocating another bloom filter of the same size, forever and ever.
(When I say "wasted" size, I mean the extra bits you need to get a certain false positive ratio, when you compare it to a properly sized filter from the get-go. In essence, you're paying some overhead when you get to a certain size. In exchange, you do not need to have guessed the size correctly / allocated all that memory from the get-go.)
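The "query every filter, answer no only if ALL say no" rule above can be sketched in Python. This is a minimal illustration: the class names, hashing scheme, and the growth rule for m and k are assumptions for the sketch, not the paper's actual construction.

```python
import hashlib

class BloomFilter:
    """A single fixed-size filter; k hash functions derived from SHA-256."""
    def __init__(self, m, k):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, item):
        # derive k independent positions by salting the hash input with i
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

class ScalableBloomFilter:
    """Grows by appending filters; each item is stored in exactly one filter."""
    def __init__(self, m0=1024, k0=4, per_filter=100):
        self.filters = [BloomFilter(m0, k0)]
        self.per_filter, self.count = per_filter, 0

    def add(self, item):
        if self.count and self.count % self.per_filter == 0:
            prev = self.filters[-1]
            # illustrative growth rule: double m, bump k (the paper's actual
            # sizing rule is what achieves the target error bound)
            self.filters.append(BloomFilter(prev.m * 2, prev.k + 1))
        self.filters[-1].add(item)
        self.count += 1

    def __contains__(self, item):
        # "no" only if EVERY filter says no; any "maybe" wins,
        # so an item stored in an older filter can never be a false negative
        return any(item in f for f in self.filters)
```

Each filter is only ever queried with its own hash functions, which is why the extra hash functions of a newer filter cannot produce a false negative for an item stored in an older one.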
Hacker News loves upvoting articles about bloom filters (and Bayesian probability - it's on the front page again this evening). Personally I've never found a use for either of them in practice.
We used bloom filters at RSA Security to develop space-efficient indexing.
Problem: For a network security product, we stored security data in a distributed, hierarchical, time-series, key-value database. Creating a full inverted index to improve the performance of our SQL-like querying engine would have increased storage space requirements, forcing customers to buy more storage hardware.
Solution: Bloom filters let us perform fast probabilistic lookups to check whether a specific value for a specific key has a high likelihood of being in a particular database block, while requiring less storage. At query time, if the bloom filter says that the value does not exist in a specific block, we safely skip that block, reducing search time. If the bloom filter says that the value might exist in a specific block, we parse that block to see whether the value indeed exists. With 4 hash functions (k = 4), 10007 bits per bloom filter (m = 10007), and a new bloom filter for every 1000 values (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 − e^(−kn/m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
Summary: Overall, bloom filters led to a 30-fold increase in query performance while requiring only 1/25th of the space a full inverted index would take.
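The 1.18% figure above follows from the standard bloom-filter formula; a quick check of the quoted parameters (the function name is just for illustration):

```python
import math

def bloom_fp_rate(k, n, m):
    """Theoretical false-positive rate of a bloom filter:
    (1 - e^(-k*n/m))^k, with k hash functions, n items, m bits."""
    return (1 - math.exp(-k * n / m)) ** k

# the parameters quoted above: k = 4 hash functions,
# n = 1000 values per filter, m = 10007 bits per filter
print(round(bloom_fp_rate(4, 1000, 10007), 4))  # → 0.0118
```

The observed 1.13%–1.29% range brackets the theoretical value nicely, since n fluctuates around 1000 per block in practice.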
Bloom filters are popular mostly because they’re clever, but they can also be useful.
E.g. Google Chrome’s safe browsing feature used Bloom filters to check whether a site might be flagged for malware, and only if necessary queried Google to check whether it was actually flagged. It would have been impractical to download the full safe browsing dataset to every client, and also impractical to query Google for every single site visited. Bloom filters were a good solution (note: they’ve since moved to a custom data structure [1], which is also interesting).
I’ve used them in production several times, most recently when I needed to quickly check whether a set of numbers were likely all IDs of a particular kind of object, where the ID space is 64-bit ints and the population is in the hundreds of millions. We didn’t need perfection, just a bounded error rate and predictable memory consumption, and a pre-loaded BF was perfect in this role.
> Hacker News loves upvoting articles about bloom filters (and Bayesian probability - it's on the front page again this evening). Personally I've never found a use for either of them in practice.
Perhaps that speaks about the type of work you are involved in rather than about the Hacker News community?
> Personally I've never found a use for either of them in practice.
One of my servers batch-downloads approximately 550GB of data per day. I use a bloom filter to deduplicate the file significantly faster than a naive Unix sort or awk '!X[$0]++'. It also has superior asymptotic complexity.
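A streaming dedup along those lines might look like the sketch below. The sizing formulas are the standard ones for an optimal bloom filter; the class and its parameters are illustrative, not the commenter's actual code. Note the tradeoff versus sort/awk: a false positive here silently drops a unique line as a "duplicate".

```python
import hashlib
import math

class BloomDedup:
    """Streaming deduplicator: constant memory regardless of input size,
    unlike sort (O(n log n) time) or awk's in-memory hash table (O(n) space).
    False positives mean a small fraction of unique lines may be dropped."""
    def __init__(self, expected_items, fp_rate=0.001):
        # standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m/n) * ln 2
        self.m = int(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray(self.m)

    def seen(self, line):
        """Mark line as seen; return True if it was (probably) seen before."""
        dup = True
        for i in range(self.k):
            h = hashlib.blake2b(line.encode(), salt=i.to_bytes(8, "big")).digest()
            p = int.from_bytes(h[:8], "big") % self.m
            if not self.bits[p]:
                dup = False       # at least one unset bit: definitely new
                self.bits[p] = 1  # record it while we're here
        return dup
```

Usage would be a plain filter loop, e.g. `for line in sys.stdin: if not dedup.seen(line): sys.stdout.write(line)`.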
Extremely useful for anyone doing real-time analytics or filtering. Answering "has this person visited this page before" or "how many distinct people have visited this page/website" can be done exactly with map-reduce or streaming pipelines, but if you want an immediate answer with low latency and can tolerate false positives/bounded inaccuracy, you can simply keep a Bloom filter or HyperLogLog in memory. For example, see the top comment here: https://news.ycombinator.com/item?id=12124722
I use them for natural language processing. There are cases where the performance increase is enormous (as in I wouldn't be able to afford the computation without them) and the downside of false positives is or can be made irrelevant.
I thought I might have a use for them in practice, but it turns out they are only useful for sparse datasets — if your dataset can be fully populated, a bitmap is more efficient (for any reasonable degree of accuracy).
[+] [-] signa11|8 years ago|reply
on the subject of bloom filters, have a look at this one: https://www3.cs.stonybrook.edu/~ppandey/files/p775-pandey.pd... (A General-Purpose Counting Filter: Making Every Bit Count)
[+] [-] mceachen|8 years ago|reply
[+] [-] opencoff|8 years ago|reply
https://github.com/opencoff/portable-lib
The bloom filter code is in src/bloom.c; the header file is in inc/utils/bloom.h.
I implemented a serialization/deserialization of the bloom filters as well (src/bloom_marshal.c).
The tests are in test/t_bloom.c.
[+] [-] esturk|8 years ago|reply
[+] [-] cmurphycode|8 years ago|reply
[+] [-] bistro17|8 years ago|reply
[+] [-] stochastic_monk|8 years ago|reply
The lack of a year label implicitly suggests that a link is new, e.g. on Hacker News. Please add a tag with the appropriate year.
[+] [-] grzm|8 years ago|reply
https://www.sciencedirect.com/science/article/pii/S002001900...
[+] [-] howitworks|8 years ago|reply
[+] [-] dang|8 years ago|reply
[+] [-] bhouston|8 years ago|reply
[+] [-] susam|8 years ago|reply
[+] [-] hatred|8 years ago|reply
Check out their usage in Cassandra: http://cassandra.apache.org/doc/latest/operating/bloom_filte...
There should be many more you can find via quick googling.
[+] [-] traek|8 years ago|reply
[1] https://bugs.chromium.org/p/chromium/issues/detail?id=71832
[+] [-] biesnecker|8 years ago|reply
[+] [-] cryptonector|8 years ago|reply
[+] [-] 0xCMP|8 years ago|reply
https://chrome.google.com/webstore/detail/hacker-news-filter...
[+] [-] foo101|8 years ago|reply
[+] [-] dsacco|8 years ago|reply
[+] [-] btown|8 years ago|reply
[+] [-] 3131s|8 years ago|reply
[+] [-] jxub|8 years ago|reply
[+] [-] unknown|8 years ago|reply
[deleted]
[+] [-] unknown|8 years ago|reply
[deleted]
[+] [-] peteretep|8 years ago|reply
[+] [-] loeg|8 years ago|reply
[+] [-] unknown|8 years ago|reply
[deleted]