Maybe you'd know, but why would one choose not to sort favoring larger counts and drop the bottom half when full? It may be obvious to others, but I'm curious.
The guarantees would not hold, I'm pretty sure ;) Maybe one of the authors could chip in, but my hunch is that with that you could actually introduce arbitrarily large errors. The beauty of this algorithm really is its simplicity. Of course, simple is... not always easy. This absolute masterpiece by Knuth should demonstrate this quite well:
It's an absolutely trivial algorithm. Its average-case analysis is ridiculously hard. Hence why I think this whole Ordo obsession needs to be refined -- worst-case complexity often has little to do with real-world behavior.
Worst case complexity matters when the input data can be manipulated by someone malicious, who can then intentionally engineer the degenerate worst case to happen - as we have seen historically in e.g. denial of service attacks exploiting common hash table implementations with bad worst case complexity.
You want every distinct item to have the same chance at the end. So when items repeat you need to reduce (not increase) the odds of keeping any given occurrence.
Let’s prove it by contradiction:
Let's say you keep the larger counts and drop the smaller ones every single round. Then you have lost the probabilistic guarantee of 1/2^k that the authors show, because the most frequent words will be the most likely to survive in subsequent rounds as well. This is the intuition; the math might be more illuminating.
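To make the intuition concrete, here is a minimal sketch of the kind of buffer-and-halve estimator the thread is discussing (modeled on the CVM distinct-elements algorithm; the function name and parameters are my own, not from the paper). The key point is the halving step: every buffered item survives with probability 1/2, independent of how often it has appeared, which is exactly the property that biasing toward larger counts would destroy.

```python
import random

def cvm_estimate(stream, capacity, seed=None):
    """Estimate the number of distinct items in `stream` using a
    buffer holding at most `capacity` items (CVM-style sketch)."""
    rng = random.Random(seed)
    p = 1.0        # current sampling probability (halved each round)
    buf = set()
    for x in stream:
        buf.discard(x)            # an item's fate rests only on its latest coin flips
        if rng.random() < p:
            buf.add(x)
        if len(buf) == capacity:
            # Halve *uniformly*: each item survives with probability 1/2,
            # regardless of its frequency in the stream. Preferring
            # high-count items here is what would break the 1/2^k guarantee.
            buf = {y for y in buf if rng.random() < 0.5}
            p /= 2
    return len(buf) / p
```

For example, `cvm_estimate([1, 2, 3, 2, 1], capacity=100)` returns exactly 3, since the buffer never fills and p stays 1; with a small capacity the result is an unbiased estimate rather than an exact count.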
zero_k|1 year ago
https://www.sciencedirect.com/science/article/pii/0022000078...
PeterisP|1 year ago
lokar|1 year ago
throwaway14356|1 year ago
that would seem simpler to me.
edit: oh, but then you would need to keep the results, which defeats the purpose
sufiyan|1 year ago