
The Bloom Clock

216 points | g0xA52A2A | 6 years ago | arxiv.org

43 comments

[+] irwt|6 years ago|reply
Hi everybody. I'm the author of the paper. Let me know if you have any questions or comments.
[+] vermorel|6 years ago|reply
Interesting perspective! Do you have any insight into whether it would work - or could be modified to work - under adversarial behavior? I.e. assuming that a fraction of the nodes are byzantine.
[+] showdead|6 years ago|reply
The paper talks about dynamic networks where nodes are being added or removed. If m and k are fixed, does it set an effective upper bound on the number of nodes? How do you address the possibility of multiple nodes having colliding hash values?
[+] akashakya|6 years ago|reply
> one can increment a single value outside the bloom filter to keep the values in the bloom filter low. Example: Instead of storing bloom filter [4,3,3,5,7,4,3,3,5], one could store it as (3)[1,0,0,2,4,1,0,0,2].

CMIIW, this only works if it's guaranteed that the reset (the transition from high values to low) is propagated to all the nodes, which is not generally the case in a distributed system. For example, if a node is unreachable for a long time and every other node is reset except this one, then it's possible to end up in a state where we can have false negatives.
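A minimal sketch of the quoted offset representation (function names are hypothetical; this only shows the encoding, not the distributed reset problem raised above):

```python
def compress(counters):
    """Store the minimum counter as a shared base and keep only deltas."""
    base = min(counters)
    return base, [c - base for c in counters]

def decompress(base, deltas):
    """Recover the original counting bloom filter."""
    return [base + d for d in deltas]

original = [4, 3, 3, 5, 7, 4, 3, 3, 5]
base, deltas = compress(original)
print(base, deltas)                        # 3 [1, 0, 0, 2, 4, 1, 0, 0, 2]
assert decompress(base, deltas) == original
```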

[+] karlmcguire|6 years ago|reply
Thanks for writing this.

I'd love to hear your input on a possible application for this. I'm not sure if you're aware, but Counting Bloom Filters are useful as space efficient counters in LFU cache admission/eviction [1].

I'm wondering if bloom clocks would be a good data structure for space-efficient LRU admission/eviction. Full LRU uses a linked list and sampled LRU just uses random sampling, but my intuition is that bloom clocks might perform similarly to, or better than, sampled LRU...

The difficult part will be dealing with concurrency. I suppose it'd be easy to treat threads as "nodes", but not all languages have the luxury of thread-local storage (Go), so I'll cross that bridge when I get there.

Anyways, I'm going to test it out later - thought you might be interested in my thoughts on a possible use.

[1]: https://arxiv.org/abs/1512.00727
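For context, a toy version of the counting-bloom-filter-as-frequency-counter idea the comment refers to (sizes and hashing scheme are illustrative, not from [1]):

```python
import hashlib

class CountingBloomFilter:
    """Toy counting bloom filter: m counters, k hash functions."""
    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indices(self, item):
        # Derive k indices by salting one hash function with the index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def increment(self, item):
        for idx in self._indices(item):
            self.counters[idx] += 1

    def estimate(self, item):
        # Min over the item's counters: may overcount, never undercounts.
        return min(self.counters[idx] for idx in self._indices(item))

cbf = CountingBloomFilter()
for _ in range(5):
    cbf.increment("hot-key")
cbf.increment("cold-key")
print(cbf.estimate("hot-key"))   # at least 5
```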

[+] pjscott|6 years ago|reply
This is such an elegant idea that I literally laughed with joy for a few seconds when I read the abstract. You made my morning.
[+] remcob|6 years ago|reply
> Each time a node has an internal event, it hashes that event with k hash functions and increments its bloom filter.

Isn't this effectively the same as generating k random indices? In other words, would the bloom filter work equally well if we pick k random indices to update?

In regular bloom filters this won't work because we want identical elements to collide, but here that doesn't appear to be a concern. I can see how we might want identical nodes to collide (i.e. have the same node use the same indices over and over), but that is not what is going on.
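A sketch of the increment step being discussed (parameters are hypothetical). Since every internal event is unique, hashing it yields effectively uniform indices, which is the commenter's point; the alternative would replace `event_indices` with k draws from a per-node RNG:

```python
import hashlib

M, K = 16, 2  # illustrative bloom filter width and hash count

def event_indices(event_id):
    """The paper's scheme: derive K indices by hashing the event."""
    return [int(hashlib.sha256(f"{i}:{event_id}".encode()).hexdigest(), 16) % M
            for i in range(K)]

clock = [0] * M
for event in range(100):
    # Each unique event increments K positions (possibly colliding).
    for idx in event_indices(f"node-a:{event}"):
        clock[idx] += 1

print(sum(clock))  # 200: every event adds exactly K increments
```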

[+] madmax96|6 years ago|reply
Hi!

My concern is:

>Each time a node has an internal event, it hashes that event with k hash functions and increments its bloom filter. It then sends that bloom filter to all other nodes.

I would suspect that the cost of broadcasting the bloom filter to the network would generally outweigh the cost of serializing a vector clock and sending it to a single node. It seems that this step in the algorithm is critical to this approach.

Do you have thoughts on this problem?
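A back-of-envelope comparison of per-message payload sizes behind this question (assuming 4-byte counters; the numbers are illustrative, not from the paper - the concern is that the bloom clock message, though fixed-size, is broadcast to all nodes rather than sent to one):

```python
def vector_clock_bytes(n_nodes):
    # One counter per node in the system: grows with membership.
    return 4 * n_nodes

def bloom_clock_bytes(m):
    # m counters, independent of how many nodes exist.
    return 4 * m

print(vector_clock_bytes(10_000))  # 40000 bytes per message
print(bloom_clock_bytes(128))      # 512 bytes per message
```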

[+] 0xdeadbeefbabe|6 years ago|reply
Thanks for sharing the paper.

I have a general question: is it best to update the clock after the event or just before, or does it matter?

How often do the nodes communicate?

[+] api|6 years ago|reply
I'd love a summary "hackers' explanation" if you have time, or a link to some good sample code.
[+] JD557|6 years ago|reply
I think I understand the advantages of using a vector clock instead of a timestamp, and the advantages of a bloom clock vs vector clocks.

But since a bloom clock can sometimes return an incorrect order of events (false positives), wouldn't using a timestamp be a simpler solution in most cases that can tolerate some unordered events?

[+] irwt|6 years ago|reply
Author of the paper here. Imagine it like this: if your bloom filter overlaps heavily with an incoming bloom filter, what you can do is go over your logs, find the bloom filter closest by increments to the incoming one, and compare those two. If the events are not comparable, i.e. concurrent, you will know it immediately. If they are comparable, your false positive rate will now be extremely low (because you picked the bloom filter in your logs that is closest to the one you received).
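The comparison step the author describes reduces to an elementwise partial-order check, which can be sketched as follows (values are made up). A dominance check that fails is definitive (the events are concurrent); one that succeeds may still be a false positive:

```python
def le(a, b):
    """True iff every counter in clock a is <= the matching counter in b."""
    return all(x <= y for x, y in zip(a, b))

a = [1, 0, 2, 1]
b = [2, 1, 2, 1]
c = [0, 2, 1, 1]

print(le(a, b))              # True: a may have happened before b (false positives possible)
print(le(a, c) or le(c, a))  # False: neither dominates, so a and c are definitely concurrent
```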
[+] jashmenn|6 years ago|reply
It's important to keep in mind that the reason we use vector clocks in the first place is that timestamps can also return an incorrect order of events, because of clock drift, delivery times, etc.
[+] magnamerc|6 years ago|reply
I feel like this could be useful in DAG data structures.
[+] filmfact|6 years ago|reply
Isn’t this a hyperloglog?
[+] smudgymcscmudge|6 years ago|reply
I’ve only read the abstract, but I think this does ordering where hyperloglog does presence.
[+] mfatica|6 years ago|reply
Hyperloglog is an algorithm for approximating distinct elements in a set. This is focused on keeping time.
[+] jonititan|6 years ago|reply
This is an intriguing paper. It seems highly related to proof-of-work algorithms in blockchain systems: working out which events happened, and when.
[+] heavenlyblue|6 years ago|reply
>> This seems highly related to proof of work algorithms in blockchain systems. Working out which events happened and at what time.

This is not a byzantine algorithm so it's got nothing to do with proof of work.

[+] tgtweak|6 years ago|reply
Or distributed databases that are vulnerable to clock skew.