(no title)
reichstein | 1 year ago
- You have a buffer. You initially try to fit every unique word in it.
- If the buffer gets full, you know that you can't fit all the unique words. So you decide to keep only a fraction, _p1_, of them: run through the buffer, keeping each value with probability _p1_.
- Keep adding new words, again only with probability _p1_.
- If the buffer gets full again, _p1_ was too large, so you choose a lower fraction, _p2_, retain existing elements only with probability _p2_/_p1_, and continue adding new words with probability _p2_.
- Every time the buffer gets full, you lower the fraction of words you try to retain.
The choice of _pn_ = (1/2)^n is simply convenient for a computer: each halving only requires single random bits at each step.
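The steps above can be sketched in a few lines of Python. This is a minimal, hedged sketch, not the article's exact code: the function name `estimate_distinct` and the `capacity` parameter are mine, and the discard-then-reinsert step on repeated words is how the published CVM algorithm handles duplicates, a detail the summary above glosses over.

```python
import random

def estimate_distinct(words, capacity):
    """Sketch of the buffer-based distinct-count algorithm described above.

    Keeps a sampled subset of the words seen so far. Whenever the buffer
    fills up, the retention probability p is halved and each element is
    kept with probability 1/2. The estimate is len(buffer) / p.
    """
    p = 1.0          # p_0 = 1: initially try to keep every unique word
    buffer = set()
    for w in words:
        # On a repeat, drop the old copy and re-sample: w's membership
        # probability stays exactly p no matter how often it appears.
        buffer.discard(w)
        if random.random() < p:
            buffer.add(w)
        while len(buffer) >= capacity:
            # Buffer full: thin it, keeping each value with probability 1/2,
            # and halve p for all future insertions.
            buffer = {x for x in buffer if random.random() < 0.5}
            p /= 2
    return len(buffer) / p
```

While the buffer never fills, p stays 1 and the result is the exact count of unique words; only once capacity is exceeded does it become a randomized estimate.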
What I _don't_ get is how this is correct for counting unique words. If I have a text of 16384 unique words, then I can accept that each will be in the final list with the same probability. But if I take that text and repeat the last word 30000 times, then it becomes overwhelmingly likely that that word is in the final list, even though I haven't changed the number of unique words. Maybe there is some statistical property that evens things out, but I couldn't see it from the article.
reichstein | 1 year ago
If the input is known to be large, there is no reason to start by adding every element: the algorithm can treat the first round like any other and start out with a _p0_ smaller than 1.
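Choosing that starting _p0_ can be a one-liner. A rough sketch, under my own assumptions: the helper name `initial_p` and the idea of deriving it from a caller-supplied size guess are hypothetical, not from the article.

```python
import math

def initial_p(expected_distinct, capacity):
    """Pick a starting retention probability p0 = (1/2)**k so that the
    expected sample size, expected_distinct * p0, fits in the buffer.

    A rough guess is fine: if p0 turns out too large, the algorithm just
    halves it again when the buffer fills, exactly as in later rounds.
    """
    if expected_distinct <= capacity:
        return 1.0
    k = math.ceil(math.log2(expected_distinct / capacity))
    return 0.5 ** k
```

Sticking to powers of 1/2 keeps the "whole random bits" property of the later rounds.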