top | item 11420050

(no title)

Vitaly | 10 years ago

Your solution requires space to hold all items from the pool

discuss

order

acqq|10 years ago

I believe it would be easy to modify it to have only k chosen indexes in memory, and for k significantly smaller than n and n an order of 2 to 64 it would still be much faster than iterating through all n (which is the first algorithm presented here:

https://en.wikipedia.org/wiki/Reservoir_sampling )

acqq|10 years ago

However note that in the Vitter's paper all discussed algorithms are O(k) as per previous notation, or O(n) using paper's notation where n elements are selected from the set of N:

http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf

In short, it's worth reading Vitter's paper, the original post is just a post containing the C code, but the discussion of the algorithm and the comparison with the alternatives is in the paper.

(Edit: corrected the notation of O() as robrenaud suggested)