I believe it would be easy to modify it to have only k chosen indexes in memory, and for k significantly smaller than n and n an order of 2 to 64 it would still be much faster than iterating through all n (which is the first algorithm presented here:
However note that in the Vitter's paper all discussed algorithms are O(k) as per previous notation, or O(n) using paper's notation where n elements are selected from the set of N:
In short, it's worth reading Vitter's paper, the original post is just a post containing the C code, but the discussion of the algorithm and the comparison with the alternatives is in the paper.
(Edit: corrected the notation of O() as robrenaud suggested)
acqq|10 years ago
https://en.wikipedia.org/wiki/Reservoir_sampling )
acqq|10 years ago
http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf
In short, it's worth reading Vitter's paper, the original post is just a post containing the C code, but the discussion of the algorithm and the comparison with the alternatives is in the paper.
(Edit: corrected the notation of O() as robrenaud suggested)
unknown|10 years ago
[deleted]