top | item 42566708

(no title)

Ah yes, I did consider sorting the queries at some point but I guess I then forgot about it again. If the queries are random and much less than the input size, probably most queries will hit different cache lines in the final layer, and I suspect benefits will be limited at maybe 2x best case since the last layer is where the bottleneck is.

Also (radix) sorting is very memory bound usually, and we probably need to sort in at 16^2=256 buckets or so to get sufficient reusing of the higher layers, but I don't have the numbers of what a round of radix sort takes. (My guess is order 1ns per query? Maybe I'll find time to investigate and add it to the post.)

discuss

Bulat_Ziganshin|1 year ago

high-performance sorting algos do either merging or partitioning. I.e., you merge R input streams into one, or split one input stream into R (for quick, radix and sample sort).

1. For merge sort of N elements, you have to perform log(N)/log(R) passes

2. For sample sort - C*log(N)/log(R), where С depends on the distribution of your data, there are no strict guarantees

3. For radix sort of K-byte elements exactly K passes (indeed, 256 ways is optimal according to my tests), which is usually larger than the previous value

While Mergesort looks preferable since it uses the least and fixed amount of passes, the merging code itself is less efficient - there are not much independent CPU operations, making it slower than samplesort and especially radix sort for small inputs.

So, it seems that the best strategy combines two different levels - for larger blocks you absolutely need to minimize the amount of passes [over memory] and employ multithreading, while for smaller blocks we need to minimize the amount of CPU operations and increase ILP.

Today two well-recognized algos are IPS4O for larger blocks and Google's SIMD QuickSort for smaller ones.