bqe | 4 years ago

A while ago I wrote a Python library called LiveStats[1] that computes any percentile for any amount of data using a fixed amount of memory per percentile. It uses an algorithm I found in an old paper[2] called P^2, which fits a piecewise-parabolic (quadratic) curve to approximate each percentile.

The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
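The "correct" interview answer described above amounts to reservoir sampling: keep a fixed-size uniform random sample of the stream and report its median. A minimal sketch of that idea (the function name and buffer size are my own, not LiveStats code):

```python
import random
import statistics

def approx_median(stream, buffer_size=1000):
    """Approximate the median of a stream with fixed memory:
    keep a uniform random sample via reservoir sampling, then
    return the median of the sample."""
    buffer = []
    for i, x in enumerate(stream):
        if len(buffer) < buffer_size:
            buffer.append(x)
        else:
            # Replace a random slot with probability buffer_size/(i+1),
            # which keeps the buffer a uniform sample of the stream so far.
            j = random.randrange(i + 1)
            if j < buffer_size:
                buffer[j] = x
    return statistics.median(buffer)
```

The result is only approximate; its accuracy improves with the buffer size, not the stream size.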

[1]: https://github.com/cxxr/LiveStats

[2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf

cyral|4 years ago

There are some newer data structures that take this to the next level, such as T-Digest[1], which remains extremely accurate even when determining percentiles at the very tail end (like 99.999%).

[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest

djk447|4 years ago

NB: Post author here.

Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used UddSketch[1], which provides relative error guarantees, which is pretty nifty. We thought the two offered different enough tradeoffs that we wanted to implement both.

[1]: https://arxiv.org/abs/1908.10693

breuleux|4 years ago

That's pretty neat! Can these be used to efficiently compute rolling percentiles (over windows of the data), or just incremental?

convolvatron|4 years ago

I think the new ones started with Greenwald-Khanna. But I definitely agree: P^2 can be a little silly and misleading. In particular, it is really poor at finding those little modes on the tail that correspond to interesting system behaviours.

tantalor|4 years ago

> Find the median ... randomly evict items

So, not find, but approximate. That's a different thing.

rendaw|4 years ago

> without sorting it... have a fixed size sorted buffer

(that you sort yourself)

djk447|4 years ago

NB: Post author here.

Thanks for sharing! I hadn't heard of that algorithm. We've seen a number of others out there and chose a couple that we knew about or that users requested. (And we are open to more requests! If folks want other ones, go to https://github.com/timescale/timescaledb-toolkit and open an issue!)

Cd00d|4 years ago

Did the candidate get an offer? Genuinely curious.

I once had a basic screening call fail because the expected answer was (from my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.

williamdclt|4 years ago

Whether they got it or not probably isn't useful information. Having a good/brilliant answer probably isn't the only point of the question, this probably wasn't the only question of the interview, and this probably wasn't the only interview.

bo1024|4 years ago

> if we estimate the median and move it a small amount for each new data point, it would be pretty close.

Yeah, this is gradient descent on the absolute loss.

eru|4 years ago

> The question was basically, "Find the median of a huge data set without sorting it," [...]

You can use eg QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.

They don't sort the data, but they do need linear amount of storage.
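For illustration, a compact (if allocation-heavy) quickselect sketch: partition around a random pivot and recurse only into the side containing the target rank, which avoids a full sort but still needs linear storage.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed) without fully
    sorting the data."""
    pivot = random.choice(items)
    lows = [x for x in items if x < pivot]
    pivots = [x for x in items if x == pivot]
    highs = [x for x in items if x > pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

def median(items):
    # For an odd-length list this is the exact middle order statistic.
    return quickselect(list(items), len(items) // 2)
```

Average running time is linear; median-of-medians makes the worst case linear too, at the cost of a larger constant.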

Izikiel43|4 years ago

> The question was basically, "Find the median of a huge data set without sorting it,"

Isn't this done using a min heap and a max heap in conjunction?

ilikebits|4 years ago

The real constraint here is probably "find the median of a huge data set without holding the entire data set in memory".

tantalor|4 years ago

No, for two reasons:

1. Huge data set: heaps require storing the whole set, but "huge" means "more than you can store".

2. Without sorting it: heaps are basically semi-sorted, so you are breaking the rules.

hnfong|4 years ago

The two heap method helps you maintain the current median given an incoming stream of “online” data, and technically is not “sorting”.

The original problem is probably more accurately described as “what if you have too much data to sort” though.
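A sketch of the two-heap method for reference (Python's heapq is a min-heap, so the lower half is stored negated to act as a max-heap):

```python
import heapq

class RunningMedian:
    """Maintain the median of a stream: a max-heap of the lower half
    (stored negated) and a min-heap of the upper half, rebalanced so
    their sizes differ by at most one."""
    def __init__(self):
        self.lo = []  # max-heap via negation: top is largest of lower half
        self.hi = []  # min-heap: top is smallest of upper half

    def add(self, x):
        if not self.lo or x <= -self.lo[0]:
            heapq.heappush(self.lo, -x)
        else:
            heapq.heappush(self.hi, x)
        # Rebalance so len(lo) is len(hi) or len(hi) + 1.
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2
```

Note this still holds every element, which is exactly why it fails the "too much data to store" version of the problem.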

5faulker|4 years ago

It's funny that this is often left out of data structures & algorithms classes.

ZephyrBlu|4 years ago

How does this get you the median?