item 40393919

reichstein | 1 year ago

The way I'd describe it is:

- You have a buffer. You initially try to fit every unique word in it.
- If the buffer gets full, you know you can't fit all the unique words. So you decide to keep only a fraction, _p1_, of them: run through the buffer and keep each value with probability _p1_.
- Keep adding new words, again only with probability _p1_.
- If the buffer gets full again, _p1_ was too large, so you choose a lower fraction, _p2_, retain existing elements only with probability _p2_/_p1_, and continue adding new words with probability _p2_.
- Every time the buffer gets full, you lower the fraction of words you try to retain.

The choice of _pn_ = (1/2)^n is just convenient for a computer: each keep/drop decision then needs only whole random bits.
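The steps above can be sketched roughly as follows (the function name is mine, and this version assumes all incoming words are distinct; handling repeats is the question that comes next):

```python
import random

def estimate_distinct(words, buffer_size):
    # Sketch of the buffer-halving scheme described above, assuming
    # the incoming words are already distinct. p starts at 1 and is
    # halved each time the buffer fills, so p_n = (1/2)^n.
    p = 1.0
    buf = set()
    for w in words:
        if random.random() < p:
            buf.add(w)
        while len(buf) >= buffer_size:
            # Keep each retained word with probability 1/2; note that
            # random.getrandbits(1) is exactly one whole random bit.
            buf = {x for x in buf if random.getrandbits(1)}
            p /= 2
    # Each distinct word survived with probability p, so scale up.
    return round(len(buf) / p)
```

If the buffer is large enough that it never fills, p stays 1 and the count is exact; otherwise the scaled buffer size is an unbiased estimate.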

What I _don't_ get is how this is correct for counting unique words. If I have a text of 16384 unique words, then I can accept that each will be in the final list with the same probability. But if I take that list and repeat the last word 30000 times, then it becomes overwhelmingly plausible that that word is in the final list, even though I haven't changed the number of unique words. Maybe there is some statistical property that evens things out, but I couldn't see it from the article.


reichstein | 1 year ago

And I got it. When the algorithm sees a word that is already in the list, it first discards the copy already in the list, then adds the word again with the same probability as any other word. This ensures that only the last occurrence of each word can survive into the final list: the final occurrences of all words end up in the final list with the same probability, and every earlier occurrence is removed no later than when the next occurrence is seen.
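Concretely, the discard-then-readd step is a one-line change to the halving loop, and it makes repeats harmless (a sketch with my own naming, not the article's code):

```python
import random

def count_distinct(words, buffer_size):
    # On every occurrence of a word, drop any copy already in the
    # buffer, then re-admit the word with the current probability p.
    # Only the *last* occurrence of a word can survive, so repeating
    # a word any number of times does not bias the estimate.
    p = 1.0
    buf = set()
    for w in words:
        buf.discard(w)               # an earlier occurrence never survives
        if random.random() < p:
            buf.add(w)
        while len(buf) >= buffer_size:
            buf = {x for x in buf if random.getrandbits(1)}
            p /= 2
    return round(len(buf) / p)
```

This directly answers the 30000-repeats objection above: the repeated word's chance of being in the final list is the same as any other word's, because only its final occurrence counts.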

If the input is known to be large, there is no reason to start by adding every element. The algorithm can treat the first round like any other and start out with a _p0_ smaller than 1.
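Under that reading, the only change is the starting probability. A minimal variant sketch (the default _p0_ = 0.25 and the names are my own choices, not from the article):

```python
import random

def estimate_distinct_large(words, buffer_size, p0=0.25):
    # Variant for inputs known to be large: treat the first round like
    # any other by starting with a retention probability p0 < 1.
    p = p0
    buf = set()
    for w in words:
        buf.discard(w)               # only the last occurrence can survive
        if random.random() < p:
            buf.add(w)
        while len(buf) >= buffer_size:
            buf = {x for x in buf if random.getrandbits(1)}
            p /= 2
    return round(len(buf) / p)
```

The estimator is unchanged: whatever p ends up being, the buffer size divided by p estimates the distinct count.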