Dimsum is like head or tail, but pulls randomly from the entire file or stream

neilk|13 years ago

This doesn't come with an elaborate test suite, but it does pretty much everything that does in a few dozen lines of Perl.

https://github.com/neilk/misc/blob/master/randline

I've had this script (or versions of it) around for more than a decade. I didn't know the technique had a name.

andrewcooke|13 years ago

it was an example in the original camel book.

[edit: i was going to delete this, but since you replied i'll leave it - it does appear (too?) in the camel book, on p 246 of my copy, but like you say, it's for a single line. hadn't opened that book in years, took me some time to find it...]

snoble|13 years ago

the code isn't special. but it's easy for anyone to install and use

andrewcooke|13 years ago

uses reservoir sampling -http://en.wikipedia.org/wiki/Reservoir_sampling

(so it presumably consumes the entire stream before giving any results; any alternative i can think of would not be "really random" unless you knew the length of the stream in advance).

snoble|13 years ago

the magic of reservoir sampling is the memory footprint is the size of the output while being completely random. In this particular implementation the order of the output is slightly biased but each row has an equal chance of being in the output.

avibryant|13 years ago

Yep, though a feature request I've put in is to respond to a ctl-c by producing the results from the stream so far... that way if it's taking a while on a large file you can interrupt and still get something useful.

cgs1019|13 years ago

Maybe I'm overlooking something obvious but wouldn't piping through awk 'rand()<.1{print}' give a decent streaming random sample of roughly 10% of input?

nullc|13 years ago

uhhh. You mean like shuf -n NNNN ?

bo1024|13 years ago

I wonder if the implementation of shuf would handle very large input efficiently? Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage. But I don't know how shuf works.

nburger|13 years ago

or "sort -R"...

patrick_grant|13 years ago

I really don't like how this behaves for populating the array initially, and how it behaves for small inputs...

  $ seq 15 | dimsum  -n 10 
  14
  12
  3
  4
  5
  6
  7
  8
  9
  10

taltman1|13 years ago

Looks like a valid sample to me. Are you bothered by the ordering of the sample members? Then I'd continue the pipeline to include a call to shuf.

snoble|13 years ago

yup, the order is biased. definitely worth a couple of minutes to fix that

29 comments