top | item 43071598


tim-peters | 1 year ago

I want to elaborate on timing cautions: a sort that specializes to 4-byte machine ints is doing something that can't be expressed directly in CPython. Its lists are heterogeneous, types aren't known until runtime, and there is no native (to Python) concept of "4-byte int". All CPython ints are stored in "bigint" (unbounded) format, and even zero requires 32 bytes of storage in the format (overheads for a type pointer, refcount, 16-byte heap alignment padding). That format isn't 2's complement either (although it creates the semantic _illusion_ of being such): it's sign+magnitude.
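The boxed-storage overhead is easy to see from Python itself. A minimal illustration (exact byte counts vary by CPython version and platform, but the order of magnitude is the point):

```python
import sys

# Every CPython int is a full heap object in sign+magnitude "bigint" format,
# so even zero carries object-header overhead far beyond 4 bytes.
print(sys.getsizeof(0))        # object header + digit storage
print(sys.getsizeof(1 << 64))  # grows as more 30-bit "digits" are needed
print(sys.getsizeof(-5) == sys.getsizeof(5))  # sign+magnitude: same size either sign
```

Note that a negative int is the same size as its positive counterpart, which is the sign+magnitude (rather than 2's-complement) representation showing through.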

The general object comparison machinery is very expensive at runtime, essentially repeated from scratch on every compare. To mitigate this, CPython does a prepass over the list (O(n)) to see whether the list is in fact homogeneous, and, if so, of a type that can use a more specialized comparison function. Not free, but pays off hugely if you are in fact sorting a list of "small enough" machine ints. But it doesn't change the storage format: on every compare, cycles are still consumed to transform CPython's sign+magnitude bigint format to native 2's-complement for native machine code to compare directly.
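The shape of that prepass can be sketched in a few lines. This is not CPython's actual code (the real check lives in listobject.c and installs a specialized C compare function); the function name and return strings here are purely illustrative:

```python
def presort_check(lst):
    """O(n) scan mimicking CPython's pre-sort homogeneity check (a sketch,
    not the real implementation): decide whether a specialized compare
    can be used, or whether every compare must go through the general
    rich-comparison machinery."""
    if not lst:
        return "empty"
    first_type = type(lst[0])
    if any(type(x) is not first_type for x in lst):
        return "heterogeneous: full rich-comparison machinery per compare"
    if first_type is int:
        return "int: specialized machine-int compare fast path"
    if first_type is str:
        return "str: specialized string compare"
    return "homogeneous: skip per-compare type dispatch"
```

The scan costs O(n) up front, but on a homogeneous list it eliminates per-compare type dispatch across all O(n log n) compares, which is where the payoff comes from.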

I expect this accounts for most of the observed "speedup" over timsort. Despite semi-heroic efforts to cut the cost, comparisons of native ints in CPython will always be significantly more expensive than code specialized to compare native ints directly. At least until (if ever) CPython learns how to "unbox" ints. The PyPy implementation of Python already does, and sorting lists of native machine ints runs about twice as fast under PyPy. It's not the sorting algorithm that matters to this, it's the relative cost of machine-int compares, and almost certainly too the cache pressure of using many more bytes to store small ints than necessary in "boxed" format.
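The relative cost of compares can be illustrated from pure Python (the numbers vary by machine and interpreter version, and the `Boxed` class here is a made-up stand-in, not anything from CPython's internals): sorting ints that hit the specialized compare path versus objects whose every compare goes through `__lt__` and the general dispatch machinery.

```python
import random
import timeit

class Boxed:
    """Hypothetical wrapper: forces every compare through the general
    rich-comparison machinery instead of the specialized int path."""
    __slots__ = ("v",)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        return self.v < other.v

data = [random.randrange(1 << 30) for _ in range(5_000)]
objs = [Boxed(v) for v in data]

ints_time = timeit.timeit(lambda: sorted(data), number=10)
objs_time = timeit.timeit(lambda: sorted(objs), number=10)
print(f"ints: {ints_time:.3f}s  boxed objects: {objs_time:.3f}s")
```

Same algorithm (timsort) in both cases; only the per-compare cost differs, which is the point being made about specialized machine-int sorts.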

As mentioned elsewhere, JesseSort reminds me most of the old and simpler "patience sorting", with which it has a lot in common. Both keep a collection of sorted sublists (to be merged at the end), both use binary search to pick a collection member to which to add "the next" input, both excel at "low diversity" inputs, both (on average) end up with O(sqrt(n)) sublists on randomly ordered inputs, and neither is "naturally" stable.
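For reference, patience sorting fits in a few lines: binary-search the pile tops to place each new element, then k-way merge the piles at the end. A sketch (using stdlib `bisect` and `heapq.merge`):

```python
import bisect
from heapq import merge

def patience_sort(seq):
    """Patience sorting: deal each element onto the leftmost pile whose
    top is >= it (pile tops stay ascending, enabling binary search),
    else start a new pile. Each pile is descending in deal order, so
    read in reverse it's ascending; a k-way merge finishes the sort.
    On random input, ~O(sqrt(n)) piles arise on average; not stable."""
    piles = []  # each pile: elements in deal order, smallest (top) last
    tops = []   # tops[i] == piles[i][-1]; kept ascending across piles
    for x in seq:
        i = bisect.bisect_left(tops, x)  # leftmost pile with top >= x
        if i == len(piles):
            piles.append([x])
            tops.append(x)
        else:
            piles[i].append(x)
            tops[i] = x
    return list(merge(*(reversed(p) for p in piles)))
```

Low-diversity inputs shine here because equal elements stack onto the same pile, keeping the pile count (and hence merge work) small.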

As I noted in listsort.txt at the time, while timsort gets a lot of value out of repeated elements ("low diversity"), it's just not the right approach to exploit that as much as possible. JesseSort and pdqsort (and, by extension, glidesort) will always do better on that. OTOH, timsort gets value out of many kinds of partial ordering, which "reveal themselves" dynamically via one run in a merge "winning" consistently.
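How timsort exploits pre-existing order can be made concrete by counting compares. The `CountingInt` wrapper below is an illustration aid (not part of any library); it counts how many `<` calls `sorted()` makes:

```python
class CountingInt:
    """Wrapper that counts how many times sorted() compares elements."""
    count = 0
    __slots__ = ("v",)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        CountingInt.count += 1
        return self.v < other.v

def compares_to_sort(values):
    CountingInt.count = 0
    sorted(CountingInt(v) for v in values)
    return CountingInt.count

# An already-ascending input is a single "run": timsort detects it with
# n-1 compares and never needs to merge, so partial order pays off directly.
# A shuffled input needs on the order of n log n compares instead.
print(compares_to_sort(range(1000)))
```

An already-sorted input of n elements costs exactly n-1 compares (one run, no merging), while a shuffled one costs many times more; that gap is the "partial ordering revealing itself" at work.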
