top | item 40410380

(no title)

gwillen | 1 year ago

Merging like that doesn't work -- it will tend to overestimate the number of distinct elements.

This is fairly easy to see, if you consider a stream with some N distinct elements, with the same elements in both the first and second halves of the stream. Then, supposing that p is 0.5, the first instance will result in a set with about N/2 of the elements, and the second instance will also. But they won't be the same set; on average their overlap will be about N/4. So when you combine them, you will have about 3N/4 elements in the resulting set, but with p still 0.5, so you will estimate 3N/2 instead of N for the final answer.

I have a thought about how to fix this, but the error bounds end up very large, so I don't know that it's viable.

discuss

order

No comments yet.