bdupras's comments
bdupras | 8 years ago | on: Probabilistic Filters By Example
Re multisets: yes, cuckoo filters and counting Blooms are bounded - cuckoos by entries per bucket x 2 (the two candidate buckets), and counting Blooms by the bits per counter. A cuckoo's counting capability is more a side effect of the design; it's really only interesting in that you get limited counting in roughly the same bit space as a non-counting Bloom. I suppose you could add counting bits to a cuckoo to support higher bounds, in a similar manner to counting Blooms.
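To make the counter-bounding point concrete, here's a minimal sketch of a counting Bloom filter with saturating counters. The parameters (m, k, 4-bit counters) and the hash-slicing scheme are my own illustrative assumptions, not from the tutorial:

```python
# Sketch of a counting Bloom filter with bounded (saturating) counters.
# m, k, counter width, and the hash derivation are illustrative choices.
import hashlib

class CountingBloom:
    def __init__(self, m=1024, k=4, counter_bits=4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1  # e.g. 15 for 4-bit counters
        self.counters = [0] * m

    def _indexes(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        # Derive k indexes by slicing one digest; simplistic, for the sketch.
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            if self.counters[i] < self.max_count:  # saturate, never overflow
                self.counters[i] += 1

    def remove(self, item):
        # Contract: caller must know `item` was previously inserted;
        # otherwise this can create false negatives.
        for i in self._indexes(item):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def maybe_contains(self, item):
        return all(self.counters[i] > 0 for i in self._indexes(item))
```

The counter width is exactly the bound mentioned above: once a counter saturates at 15, further duplicates are no longer tracked.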
I'm curious to hear more on your point about K-hashing in Bloom filters - would you expand on your statement about only needing a K-bit hashing function? I'm happy to update the tutorial text to be clearer and more exact.
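One common reading of that point is that a Bloom filter doesn't need K independent hash functions, only enough hash bits to derive K indexes - e.g. via double hashing (the Kirsch-Mitzenmacher technique). A hedged sketch, with the digest-slicing details as my assumptions:

```python
# Double hashing: derive k Bloom indexes from two base hash values
# taken from a single digest, instead of computing k independent hashes.
import hashlib

def bloom_indexes(item: str, k: int, m: int):
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    # Forcing h2 odd keeps it coprime with a power-of-two m, so the
    # stride never degenerates to zero.
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]
```

This is why the number of index computations (K) matters less than the number of underlying hash invocations (here: one).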
bdupras | 8 years ago | on: Probabilistic Filters By Example
This is a downside of cuckoos (and counting Blooms) - even when the filter has reasonable capacity remaining, an insertion can be denied because previous entries have saturated its candidate slots.
The UX on this demo could be better - e.g. not inserting into the Bloom unless the cuckoo insert succeeds (to keep the two from diverging). A message indicating an unsuccessful insertion would also be nice.
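The saturation behavior described above might look like this sketch. Bucket size, fingerprint width, MAX_KICKS, and the undo-log rollback (so a rejected insert leaves the table untouched) are my assumptions, not the demo's actual implementation:

```python
# Sketch of cuckoo-filter insertion that rejects on saturation without
# corrupting the table: evictions are logged and undone on failure.
import hashlib, random

BUCKET_SIZE = 4
MAX_KICKS = 500

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class CuckooSketch:
    def __init__(self, num_buckets=1024):   # power of two, for the XOR trick
        self.n = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _fp(self, item: str) -> int:
        return (_h(item.encode()) & 0xFF) or 1   # 8-bit, non-zero fingerprint

    def _alt(self, i: int, fp: int) -> int:
        # Alternate index depends only on the fingerprint (partial-key cuckoo).
        return (i ^ _h(bytes([fp]))) % self.n

    def insert(self, item: str) -> bool:
        fp = self._fp(item)
        i1 = _h(item.encode()) % self.n
        for i in (i1, self._alt(i1, fp)):
            if len(self.buckets[i]) < BUCKET_SIZE:
                self.buckets[i].append(fp)
                return True
        # Both candidate buckets full: evict entries, logging each swap.
        undo, i, cur = [], i1, fp
        for _ in range(MAX_KICKS):
            j = random.randrange(BUCKET_SIZE)
            undo.append((i, j, self.buckets[i][j]))
            cur, self.buckets[i][j] = self.buckets[i][j], cur
            i = self._alt(i, cur)
            if len(self.buckets[i]) < BUCKET_SIZE:
                self.buckets[i].append(cur)
                return True
        for i, j, old in reversed(undo):     # saturated: roll back and reject
            self.buckets[i][j] = old
        return False

    def query(self, item: str) -> bool:
        fp = self._fp(item)
        i1 = _h(item.encode()) % self.n
        return any(fp in self.buckets[i] for i in (i1, self._alt(i1, fp)))
```

A deliberately tiny filter makes the failure mode easy to see: once both candidate buckets and the kick chain are saturated, `insert` returns False, and everything previously inserted is still queryable.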
bdupras | 8 years ago | on: Probabilistic Filters By Example
As for collisions, only a limited number of colliding entries can be inserted into a cuckoo - after which a subsequent insertion will knowingly fail. So say Alice and Bob hash to the same values, and you've inserted two Alices and one Bob. When you delete one Alice, it doesn't matter which entry is removed - two entries still remain that hash to Alice and Bob.
Perhaps the nuance you mention is that the alternate bucket index for an entry is calculated only from the fingerprint, not from the entry data. This has the effect that even when multiple entries share fingerprints, any one can be removed. (Again, _only_ if the system requesting the delete positively knows that the entry was previously successfully inserted.) The other colliding entry will still be found on query and will generate a "maybe" response.
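A toy model of that nuance, with fingerprints and bucket indexes passed in directly (a contrivance for illustration; a real filter derives them from the entry):

```python
# Toy model: entries that collide on fingerprint AND bucket pair are
# indistinguishable inside the filter, so a delete just removes one
# matching fingerprint; remaining copies keep answering "maybe".
class TinyCuckoo:
    def __init__(self, n=8):                 # n a power of two
        self.n = n
        self.buckets = [[] for _ in range(n)]

    def _slots(self, fp, i1):
        return i1, (i1 ^ fp) % self.n        # alt index from fingerprint only

    def insert(self, fp, i1):
        self.buckets[i1].append(fp)          # unbounded buckets, toy only

    def delete(self, fp, i1):
        for i in self._slots(fp, i1):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

    def query(self, fp, i1):
        return any(fp in self.buckets[i] for i in self._slots(fp, i1))

t = TinyCuckoo()
# "Alice" and "Bob" collide: same fingerprint 0x5, same primary bucket 3.
t.insert(0x5, 3)                             # Alice
t.insert(0x5, 3)                             # Alice again
t.insert(0x5, 3)                             # Bob
t.delete(0x5, 3)                             # delete one Alice: any copy works
assert t.query(0x5, 3)                       # Alice and Bob still "maybe"
```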
bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters
This is both a feature and a limitation. In our use case, our filter grows predictably, and every few months a rejected insertion triggers a resize/rebuild from source data. Because we could tolerate a full rebuild, we chose for now not to implement the technique of growing the filter by adding successively larger filter segments with lower fpp guarantees, though that is an option.
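The segment-growth technique mentioned above (in the spirit of scalable Bloom filters) could be sketched roughly like this. The `SetSegment` stand-in, the sizing ratio, and the API are my assumptions; a real implementation would also tighten each new segment's fpp so the compound false-positive rate stays bounded:

```python
# Sketch of growing a filter by appending successively larger segments.
# A rejected insert into the newest segment triggers a new, bigger one.
class SetSegment:
    """Stand-in for a real probabilistic filter segment."""
    def __init__(self, capacity):
        self.capacity, self.items = capacity, set()

    def insert(self, item):
        if len(self.items) >= self.capacity:
            return False                     # segment saturated: reject
        self.items.add(item)
        return True

    def query(self, item):
        return item in self.items

class ScalableFilter:
    def __init__(self, make_segment, initial_capacity=1024, growth=2):
        self.make_segment = make_segment     # capacity -> filter-like object
        self.capacity = initial_capacity
        self.growth = growth
        self.segments = [make_segment(initial_capacity)]

    def add(self, item):
        if self.segments[-1].insert(item):
            return True
        self.capacity *= self.growth         # grow instead of full rebuild
        self.segments.append(self.make_segment(self.capacity))
        return self.segments[-1].insert(item)

    def maybe_contains(self, item):
        return any(seg.query(item) for seg in self.segments)
```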
bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters
The constant factors here matter in practice: hashing 7 times is significantly more expensive than hashing once or twice.
For the original commenter, give me a suggestion if you don't mind. How would you expect to see that info depicted in a comparison chart so that it's honest and obvious?
bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters
As for false negatives: these can occur in both cuckoo and counting Bloom filters if you remove a value that was never added and that value would have produced a false positive on query. The contract for removal is that you must know the value is currently in the filter. I'll clarify that in the text.
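A deliberately contrived demonstration of that hazard, with hand-picked counter indexes standing in for hash collisions (the names and overlap are mine, purely for illustration):

```python
# Toy false-negative demo: "carol" was never added, but her counters
# overlap those of "alice" + "bob" (a would-be false positive).
# Removing "carol" anyway silently deletes evidence of "alice".
counters = [0] * 8
idx = {"alice": [1, 2], "bob": [2, 3], "carol": [1, 3]}  # contrived overlap

def add(item):
    for i in idx[item]:
        counters[i] += 1

def remove(item):          # caller is trusted to have added `item` before
    for i in idx[item]:
        counters[i] -= 1

def maybe_contains(item):
    return all(counters[i] > 0 for i in idx[item])

add("alice")
add("bob")
assert maybe_contains("carol")      # false positive: never added
remove("carol")                     # violates the removal contract...
assert not maybe_contains("alice")  # ...creating a false negative
```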
bdupras | 9 years ago | on: Probabilistic Filters By Example: Cuckoo Filter and Bloom Filters
In some cases, like when there's a high likelihood of duplicate entries, this can represent a catastrophic failure mode. This is a good point, and when I get around to updating the tutorial, I'll make note of this.
Regarding the failure mode: an insertion can be knowingly rejected without mutating the filter, so the filter remains useful for subsequent insertions and all reads, retaining all its guarantees. The only loss is the rejection of that one item - in our case we used it as a signal to rebuild the filter at a larger size.