Maybe you'd know, but why would one choose not to sort favoring larger counts and drop the bottom half when full? It may be obvious to others, but I'm curious.
The guarantees would not hold, I'm pretty sure ;) Maybe one of the authors could chip in, but my hunch is that with that you could actually introduce arbitrarily large errors. The beauty of this algorithm really is its simplicity. Of course, simple is... not always easy. This absolute masterpiece by Knuth should demonstrate this quite well:
It's an absolutely trivial algorithm. Its average-case analysis is ridiculously hard. Hence why I think this whole Ordo obsession needs to be refined -- worst-case complexity often has little to do with real-world behavior.
Worst case complexity matters when the input data can be manipulated by someone malicious, who can then intentionally engineer the degenerate worst case to happen - as we have seen historically in e.g. denial of service attacks exploiting common hash table implementations with bad worst case complexity.
You want every distinct item to have the same chance at the end. So when items repeat you need to reduce (not increase) the odds of keeping any given occurrence.
Let’s prove it by contradiction:
Let's say you keep the larger counts and drop the smaller ones every single round. Then you have lost the probabilistic guarantee of 1/2^k that the authors show, because the most frequent words will be the most likely to survive in subsequent rounds as well. This is the intuition; the math might be more illuminating.
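To make the intuition concrete, here is a minimal sketch of the kind of buffer-and-halve estimator the thread is discussing (modeled on the CVM distinct-elements algorithm; the function name and parameters are my own, not from the paper). The key point is the halving step: every buffered item survives with probability 1/2, independent of how often it has appeared, which is exactly the property that biasing toward larger counts would destroy.

```python
import random

def cvm_estimate(stream, capacity, seed=None):
    """Estimate the number of distinct items in `stream` using a
    buffer holding at most `capacity` items (CVM-style sketch)."""
    rng = random.Random(seed)
    p = 1.0        # current sampling probability (halved each round)
    buf = set()
    for x in stream:
        buf.discard(x)            # an item's fate rests only on its latest coin flips
        if rng.random() < p:
            buf.add(x)
        if len(buf) == capacity:
            # Halve *uniformly*: each item survives with probability 1/2,
            # regardless of its frequency in the stream. Preferring
            # high-count items here is what would break the 1/2^k guarantee.
            buf = {y for y in buf if rng.random() < 0.5}
            p /= 2
    return len(buf) / p
```

For example, `cvm_estimate([1, 2, 3, 2, 1], capacity=100)` returns exactly 3, since the buffer never fills and p stays 1; with a small capacity the result is an unbiased estimate rather than an exact count.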
zero_k|1 year ago
https://www.sciencedirect.com/science/article/pii/0022000078...
PeterisP|1 year ago
lokar|1 year ago
throwaway14356|1 year ago
that would seem simpler to me.
edit: oh, but then you would need to keep the results, which defeats the purpose
sufiyan|1 year ago