top | item 43062080


tim-peters | 1 year ago

I want to thank you for your analysis! I'm the "timsort" guy, and I'm asked to look at all sorts of things. The devil is in the details, and I've given up bothering to look unless a paper spells out sufficient details up front. Else it's a pit I'd rather not get sucked into.

As general things, timsort was aimed at exploiting all kinds of pre-existing order, not random lists. The only goal for the latter was to be within reach of CPython's previous highly tuned samplesort implementation (like quicksort on steroids, with much less data movement than mergesorts, but more comparisons).

As you say, for randomly ordered input it gets remarkably close to the information-theoretic lower bound on # of compares. So that's not it.

"Cache effects", maybe, but that's a pit to dig into. I'll note that while the newer "powersort" merge strategy is elegant and provably near-optimal by some relevant measures, in practice it doesn't really appear to run any faster (although cases can be _contrived_ that make it - or the older method! - run faster).

Comparison specialization (to ints) could very well account for it. CPython's general comparison machinery is _very_ expensive, even for what turn out to be native machine ints (which CPython cannot know at compile-time: everything is deduced at runtime).

CPython has since grown new machinery to do a pre-pass over the list, and do a form of cheaper comparison specialization for what turn out to be homogeneous lists of suitable types (including "small enough" ints). That alone gives enough speedup to make some of the original choices sub-optimal. For example, binary insertion sort for short runs is no longer best then. That was aimed at minimizing worst-case # of compares, but as comparisons get cheaper that has less value.
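The idea behind the pre-pass is easy to sketch in plain Python (the real check lives in C in listobject.c and also special-cases things like "small enough" ints; this is only an illustration):

```python
def homogeneity_prepass(lst):
    """One O(n) scan, illustrating the idea behind CPython's pre-pass:
    if every element has exactly the first element's type, a cheaper
    specialized comparison can safely be used for the entire sort,
    skipping the general rich-comparison machinery on every compare."""
    if not lst:
        return True
    first_type = type(lst[0])
    return all(type(x) is first_type for x in lst)
```

The scan costs n type checks once, instead of paying full dynamic dispatch on every one of the ~n log n comparisons.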

There are also surprises in everything. For example, the worst case for binary insertion sort on most machines was _not_ reverse-ordered input, even though that requires the most data movement. Instead the worst case was randomly ordered data. Why? Branch prediction. In randomly ordered data, each branch is unpredictable. In reverse-ordered data, "move to the left" is always the test outcome.
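The branch in question is easy to point at in a plain-Python sketch of binary insertion sort (not CPython's C version, just an illustration):

```python
def binary_insertion_sort(a):
    """Sort a in place by inserting each element into the sorted prefix,
    using binary search to find the insertion point."""
    for i in range(1, len(a)):
        x = a[i]
        lo, hi = 0, i
        while lo < hi:
            mid = (lo + hi) // 2
            # On reverse-ordered input this comparison goes the same way
            # every time, so the hardware predicts it perfectly; on
            # random input it's a coin flip, and mispredictions dominate.
            if x < a[mid]:
                hi = mid
            else:
                lo = mid + 1
        # shift the tail right and drop x into place
        a[lo + 1:i + 1] = a[lo:i]
        a[lo] = x
    return a
```

The data movement on reverse-ordered input is maximal, but it's a predictable bulk shift; the unpredictable part is the comparison, and that's where random input loses.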

Another generality is that Python's sort cares more about speed for shorter lists than huge ones. "Big data" problems are more likely to use extensions geared to such problems, like numpy for giant arrays of floats. In those contexts more suitable _algorithms_ exist, like radix sorts, or multi-key quicksorts for lists of long strings with long shared prefixes.
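For instance, an LSD radix sort does no element comparisons at all, which is why it can win big on giant homogeneous arrays. A rough sketch for non-negative ints (illustrative only, not any particular library's implementation):

```python
def lsd_radix_sort(nums, base=256):
    """Least-significant-digit radix sort for non-negative ints: bucket
    by one byte per pass, low byte first.  Each pass is stable, so after
    the last pass the whole list is sorted -- with zero comparisons."""
    if not nums:
        return []
    out = list(nums)
    maxv = max(out)
    shift = 0
    while maxv >> shift:
        buckets = [[] for _ in range(base)]
        for x in out:
            buckets[(x >> shift) & (base - 1)].append(x)
        # concatenating buckets in order preserves earlier passes' order
        out = [x for b in buckets for x in b]
        shift += 8
    return out
```

The flip side is that it's tied to the key's representation (fixed-width ints here), which is exactly why it lives in specialized libraries rather than a general-purpose list.sort().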

Python's niche is more in, e.g., web services, where a great many sorts of shorter lists are common.

Bottom line: there is no one "best" sorting algorithm. I keep an eye out for newer developments, but haven't seen anything better yet for what Python is aiming at. I was, e.g., very impressed by pdqsort - but it's not a stable sort, and that alone makes it a non-starter for general Python use.

discuss

order

mlochbaum | 1 year ago

Whoa, hey Tim! Very much agree on no "best" sorting algorithm. The generic-comparison case is one I keep idly considering, although more focused on long lists than short ones. There's been a lot of improvement lately in sorting small datatypes (arguably kicked off by fluxsort), but really, who has so many 4-byte ints lying around that sorting them is a bottleneck? A point Orson Peters made when presenting glidesort[0] was that lists with many duplicate values are common even in generic data, as in ordering customers by city, and quicksort seems to be the best way to deal with such a list. However glidesort/driftsort is still largely oriented towards smaller types, and its mergesort-quicksort hybridization can pretty easily make bad choices, as I described at [1] and especially [2] (slightly surprised I didn't mention binary insertion sort in that first section). So this approach still feels immature to me.

[0] https://archive.fosdem.org/2023/schedule/event/rust_glidesor...

[1] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

[2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...
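The reason quicksort handles low-diversity data so well is three-way partitioning: elements equal to the pivot are gathered in the middle in one pass and never touched again. A minimal Dutch-national-flag sketch (illustrative; fluxsort and glidesort use block-based partitioning schemes, not this swap loop):

```python
def partition3(a, lo, hi, pivot):
    """Partition a[lo:hi+1] in place into < pivot, == pivot, > pivot.
    Returns (i, k) such that a[i:k+1] holds exactly the pivot-equal
    elements.  A run of duplicates collapses into the middle region in
    one linear pass, so equal keys never recurse further."""
    i, j, k = lo, lo, hi
    while j <= k:
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
            j += 1
        elif a[j] > pivot:
            a[j], a[k] = a[k], a[j]
            k -= 1
        else:
            j += 1
    return i, k
```

With K distinct values, each level of recursion retires an entire value class, so the sort bottoms out after about K partitions regardless of list length.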

tim-peters | 1 year ago

You're much more up to date on the details of recent developments than I am. It's faded into a background interest (although a persistent one) for me.

One thing that wasn't clear to me about JesseSort: in what way(s) are Rainbows believed to be an improvement over the simpler scheme used by the older "patience sorting"? They both maintain a collection of sorted sublists merged at the end, both use binary search to find the right sublist to add "the next" array element to, and both excel at "low diversity" inputs (if there are K distinct values, patience sorting will build at most K sublists).
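For reference, classic patience sorting can be sketched like this in plain Python (JesseSort's Rainbow structure is whatever its paper defines; this is only the older baseline being compared against):

```python
import heapq
from bisect import bisect_left

def patience_sort(seq):
    """Pile phase: each element lands on the leftmost pile whose top is
    >= it, found by binary search over the pile tops (which stay sorted
    throughout).  With K distinct values at most K piles are built,
    since a new pile starts only when an element exceeds every top.
    Merge phase: each pile is non-increasing, so merging the reversed
    piles yields sorted output.  (Built this naive way, elements with
    equal keys can come back reordered -- the stability question.)"""
    tops, piles = [], []
    for x in seq:
        i = bisect_left(tops, x)
        if i == len(piles):
            piles.append([x])
            tops.append(x)
        else:
            piles[i].append(x)
            tops[i] = x
    return list(heapq.merge(*(p[::-1] for p in piles)))
```

The number of piles equals the length of a longest strictly increasing subsequence of the input, which is why pre-sorted and low-diversity inputs are where it shines.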

As I recall, full-blown patience sorting isn't "naturally stable". Searching through the JesseSort paper, I didn't find it addressed, and the details were too fuzzy to guess offhand. While this is in no way a principled objection, it just _seemed_ "too complicated" to me. Then again, my intuition for such things has served me well before ;-)