(no title)
curiouscoding | 1 year ago
Also, (radix) sorting is usually very memory bound, and we probably need to sort into around 16^2 = 256 buckets to get sufficient reuse of the higher layers, but I don't have numbers on what a round of radix sort costs. (My guess is on the order of 1 ns per query? Maybe I'll find time to investigate and add it to the post.)
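For concreteness, here is a minimal sketch (my own, not from the post) of what one such round could look like, assuming 64-bit keys and an out-of-place pass with 256 buckets:

```cpp
#include <cstdint>
#include <vector>

// One LSD radix pass: stably scatter `src` into `dst` by the byte at
// bit offset `shift`, using 256 buckets. Each pass streams the whole
// array twice (count + scatter), which is why it is memory bound.
void radix_pass(const std::vector<uint64_t>& src, std::vector<uint64_t>& dst,
                unsigned shift) {
    size_t count[256] = {0};
    for (uint64_t x : src) count[(x >> shift) & 0xFF]++;
    size_t offset[256];
    size_t sum = 0;
    for (int b = 0; b < 256; b++) { offset[b] = sum; sum += count[b]; }
    for (uint64_t x : src) dst[offset[(x >> shift) & 0xFF]++] = x;
}
// Full sort of 8-byte keys = 8 such passes:
//   for (unsigned s = 0; s < 64; s += 8) { radix_pass(a, b, s); a.swap(b); }
```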
Bulat_Ziganshin | 1 year ago
1. For a merge sort of N elements with R-way merging, you have to perform log(N)/log(R) passes.
2. For sample sort, C*log(N)/log(R) passes, where C depends on the distribution of your data; there are no strict guarantees.
3. For a radix sort of K-byte elements, exactly K passes (indeed, 256-way is optimal according to my tests), which is usually more than either of the previous counts (see the worked numbers below).
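As a worked example: for N = 2^32 eight-byte keys and R = 256, merge sort needs log(2^32)/log(256) = 32/8 = 4 passes, while radix sort needs K = 8 passes, one per byte.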
While merge sort looks preferable since it uses the fewest passes, and a fixed number of them, the merging code itself is less efficient: there are not many independent CPU operations, which makes it slower than sample sort and especially radix sort for small inputs.
So it seems that the best strategy combines two different levels: for larger blocks you absolutely need to minimize the number of passes over memory and employ multithreading, while for smaller blocks you need to minimize the number of CPU operations and increase ILP.
Today, two well-recognized algorithms are IPS4O for larger blocks and Google's SIMD QuickSort for smaller ones.
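To make the two-level idea concrete, a minimal sketch of my own (a stand-in for IPS4O-style partitioning and a SIMD base case, not their actual code): one 256-way MSB partition pass over the top byte to cut the data into small blocks, then a CPU-efficient comparison sort per block.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Two-level strategy (sketch): one 256-way MSB partition pass to cut the
// data into cache-friendly blocks (few passes over memory), then a
// comparison sort per block. Real implementations would use IPS4O-style
// in-place partitioning and a SIMD sort as the base case.
void hybrid_sort(std::vector<uint64_t>& a) {
    std::vector<uint64_t> tmp(a.size());
    size_t count[256] = {0};
    for (uint64_t x : a) count[x >> 56]++;           // histogram on top byte
    size_t offset[257] = {0};                        // bucket boundaries
    for (int b = 0; b < 256; b++) offset[b + 1] = offset[b] + count[b];
    size_t pos[256];
    std::copy(offset, offset + 256, pos);
    for (uint64_t x : a) tmp[pos[x >> 56]++] = x;    // scatter into buckets
    for (int b = 0; b < 256; b++)                    // sort each small block
        std::sort(tmp.begin() + offset[b], tmp.begin() + offset[b + 1]);
    a.swap(tmp);
}
```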