[edit: i was going to delete this, but since you replied i'll leave it - it does appear (too?) in the camel book, on p 246 of my copy, but like you say, it's for a single line. hadn't opened that book in years, took me some time to find it...]
(so it presumably consumes the entire stream before giving any results; any alternative i can think of would not be "really random" unless you knew the length of the stream in advance).
the magic of reservoir sampling is the memory footprint is the size of the output while being completely random. In this particular implementation the order of the output is slightly biased but each row has an equal chance of being in the output.
Yep, though a feature request I've put in is to respond to a ctl-c by producing the results from the stream so far... that way if it's taking a while on a large file you can interrupt and still get something useful.
Maybe I'm overlooking something obvious but wouldn't piping through awk 'rand()<.1{print}' give a decent streaming random sample of roughly 10% of input?
I wonder if the implementation of shuf would handle very large input efficiently? Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage. But I don't know how shuf works.
neilk|13 years ago
https://github.com/neilk/misc/blob/master/randline
I've had this script (or versions of it) around for more than a decade. I didn't know the technique had a name.
andrewcooke|13 years ago
[edit: i was going to delete this, but since you replied i'll leave it - it does appear (too?) in the camel book, on p 246 of my copy, but like you say, it's for a single line. hadn't opened that book in years, took me some time to find it...]
snoble|13 years ago
andrewcooke|13 years ago
(so it presumably consumes the entire stream before giving any results; any alternative i can think of would not be "really random" unless you knew the length of the stream in advance).
snoble|13 years ago
avibryant|13 years ago
cgs1019|13 years ago
nullc|13 years ago
bo1024|13 years ago
nburger|13 years ago
patrick_grant|13 years ago
taltman1|13 years ago
snoble|13 years ago