bqe | 4 years ago
The reason I made this was an old Amazon interview question. The question was basically, "Find the median of a huge data set without sorting it," and the "correct" answer was to have a fixed size sorted buffer and randomly evict items from it and then use the median of the buffer. However, a candidate I was interviewing had a really brilliant insight: if we estimate the median and move it a small amount for each new data point, it would be pretty close. I ended up doing some research on this and found P^2, which is a more sophisticated version of that insight.
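That candidate's insight can be sketched in a few lines of Python. This is a hypothetical minimal version of the "nudge the estimate" idea (not the P² algorithm itself, which additionally adapts marker positions): keep one running estimate and move it a small fixed step toward each new data point.

```python
import random

def streaming_median(stream, step=0.01):
    """Estimate the median of a stream in O(1) memory by nudging
    a running estimate a fixed step toward each new data point."""
    est = None
    for x in stream:
        if est is None:
            est = x          # seed with the first observation
        elif x > est:
            est += step      # too many points above: move up
        elif x < est:
            est -= step      # too many points below: move down
    return est

random.seed(42)
data = [random.gauss(100, 10) for _ in range(100_000)]
print(streaming_median(data))  # lands near the true median of ~100
```

The step size trades convergence speed against steady-state jitter; P² is essentially a much more careful version of this same idea.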
cyral|4 years ago
[1]: https://arxiv.org/pdf/1902.04023.pdf / https://github.com/tdunning/t-digest
djk447|4 years ago
Yeah, that was one of the reasons we chose it as one of the ones to implement; it seemed like a really interesting tradeoff. We also used uddsketch[1], which provides relative error guarantees, which is pretty nifty. We thought they provided different enough tradeoffs that we wanted to implement both.
[1]: https://arxiv.org/abs/1908.10693
jeremysalwen|4 years ago
riskneutral|4 years ago
[1] http://www.numericalexpert.com/articles/single_pass_stat/
breuleux|4 years ago
fotta|4 years ago
[0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-icde.pd...
convolvatron|4 years ago
tantalor|4 years ago
So, not find, but approximate. That's a different thing.
rendaw|4 years ago
(that you sort yourself)
djk447|4 years ago
Thanks for sharing! Hadn't heard of that algorithm, have seen a number of other ones out there, we chose a couple that we knew about / were requested by users. (And we are open to more user requests if folks want to use other ones! https://github.com/timescale/timescaledb-toolkit and open an issue!)
Cd00d|4 years ago
I had a basic screening call fail once because the expected answer was (in my perspective) more naive than my answer. I'd love it if generating curiosity were an interview +1.
williamdclt|4 years ago
bo1024|4 years ago
Yeah, this is gradient descent on the absolute loss.
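To spell that out: the expected absolute loss $E|\theta - X|$ is minimized when $\theta$ is the median, and its gradient with respect to $\theta$ is just $\mathrm{sign}(\theta - x)$. So stochastic gradient descent with learning rate $\eta$ gives

```latex
\theta_{t+1} = \theta_t - \eta \,\mathrm{sign}(\theta_t - x_t)
```

which is exactly the "move the estimate a small fixed amount toward each new data point" heuristic from the parent comment.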
eru|4 years ago
You can use e.g. QuickSelect (https://en.wikipedia.org/wiki/Quickselect) or Median of Medians.
They don't sort the data, but they do need a linear amount of storage.
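For reference, QuickSelect can be sketched in a few lines. This is a simple (not in-place) variant: pick a random pivot, partition, and recurse into only the side that contains the k-th element, giving expected O(n) time but, as noted, linear storage.

```python
import random

def quickselect(items, k):
    """Return the k-th smallest element (0-indexed) of items
    without fully sorting. Expected O(n) time, O(n) space."""
    pivot = random.choice(items)
    lows = [x for x in items if x < pivot]
    pivots = [x for x in items if x == pivot]
    highs = [x for x in items if x > pivot]
    if k < len(lows):
        return quickselect(lows, k)
    elif k < len(lows) + len(pivots):
        return pivot
    else:
        return quickselect(highs, k - len(lows) - len(pivots))

data = [9, 1, 7, 3, 5]
print(quickselect(data, len(data) // 2))  # → 5, the median
```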
Izikiel43|4 years ago
Isn't this done using a min heap and a max heap in conjunction?
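For context, the two-heap trick the parent is referring to looks like this: a max-heap holds the lower half (stored negated, since Python's `heapq` is a min-heap) and a min-heap holds the upper half. It gives an exact running median, but note it stores every element, which is the objection the replies raise.

```python
import heapq

class RunningMedian:
    """Exact running median via two balanced heaps.
    Stores all elements, so it needs O(n) memory."""
    def __init__(self):
        self.lo = []  # max-heap of lower half (values negated)
        self.hi = []  # min-heap of upper half

    def add(self, x):
        # Push through lo, move its max to hi, then rebalance.
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

rm = RunningMedian()
for x in [5, 2, 8, 1, 9]:
    rm.add(x)
print(rm.median())  # → 5
```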
ilikebits|4 years ago
tantalor|4 years ago
1. huge data set: heaps require storing the whole set, but "huge" means "more than you can store"
2. without sorting it: heaps are basically semi-sorted, so you are breaking the rules
hnfong|4 years ago
The original problem is probably more accurately described as “what if you have too much data to sort” though.
5faulker|4 years ago
ZephyrBlu|4 years ago