People interested in this might also want to check out simmer, which provides a simple unix stdin/stdout interface for this class of bounded summaries of unbounded streams. Quantiles are actually a weak point for simmer right now (I have a TODO item to add Q-Digest), but it has a bunch of other useful sketches implemented, via the Algebird library which I also hack on. Would love to get feedback & patches. https://github.com/avibryant/simmer
Hey, this is very close to something I've been needing recently (and in Go, nonetheless).
Is there any way to get a similar thing for a sliding window of a stream? For example, to be able to report (estimated) 90th percentile latencies for server requests in the last 5 minutes, hour, and day.
This is a really awesome problem! I tackled this for my work at TempoDB and ended up going with the Q-Digest algorithm although I took a good look at CKMS. Really cool to see this implements merging streams, I remember reading that CKMS was more difficult to merge streams than Q-Digest.
Distributing the computation was quite easy. I emailed the authors of the paper and they gave me a quick answer, which is what I implemented and tested.
I'm interested in adding more implementations of the problem to perks, like Q-Digest, along with other streaming data problems.
I don't really understand what this does, I didn't find a "for dummies" section on the site - can anyone give a real-world use case and (ideally) a comparison with some other system that does the same thing?
Basically, when you have more data than memory and time to sort them in order to find the percentile you're looking for, you need to employ an algorithm that trades rank selection accuracy for lower memory and CPU costs. This package does that.
avibryant|13 years ago
kingfishr|13 years ago
Is there any way to get a similar thing for a sliding window of a stream? For example, to be able to report (estimated) 90th percentile latencies for server requests in the last 5 minutes, hour, and day.
bmizerany|13 years ago
bmizerany|13 years ago
azmenthe|13 years ago
If anyone is interested this was my write up for algorithm selection: http://blog.tempo-db.com/post/42318820124/estimating-percent...
bmizerany|13 years ago
Distributing the computation was quite easy. I emailed the authors of the paper and they gave me a quick answer, which is what I implemented and tested.
I'm interested in adding more implementations of the problem to perks, like Q-Digest, along with other streaming data problems.
vosper|13 years ago
bmizerany|13 years ago
Basically, when you have more data than memory and time to sort them in order to find the percentile you're looking for, you need to employ an algorithm that trades rank selection accuracy for lower memory and CPU costs. This package does that.