His xor-laden branchless min/max code in the fastest version so far, sort6_sorting_network_v4, is some seriously janky shit. It's making it harder for the compiler to generate good platform-specific machine code. That initial impression was confirmed after looking at the x64 assembly code generated by gcc -O3. It was awful. I rewrote his min/max macros to use plain old ternary operators. The generated assembly code looked a lot better, and my code beats his by 2-2.5x on my late 2008 MacBook Pro:
$ gcc -O3 sortnet -o sortnet && ./sortnet
sort6_sorting_network_v4: time = 988 cycles
sort6_sorting_network_per: time = 390 cycles
That win is without even trying. Just from eyeballing those cycle counts, I'd expect you could wring a lot more speed out of this. My coworker Fabian (@rygorous) immediately suggested using the old difference trick for SORT2(x,y) rather than calculating min and max independently. That brings it down further by about 50% in my benchmark. Fabian estimates that's still about 2.5 times off from optimal: "Best bit-magic solution is ~5 cycle dep chain, stupid simple mov / cmp / cmovg / cmovg boils down to 2."
His benchmarking harness also leaves a lot to be desired. But that's another issue. He also says that this is eventually intended for use on a GPU, but he's ranking the implementations by performance on a deeply pipelined out-of-order CPU. Modern GPU cores have some minimal scoreboarding but none of them do the full dance of register renaming and decoupled execution units.
If you do the naive implementation:
int temp = x;
x = y;
y = temp;
Then a sufficiently smart compiler can decide to assign, say, %eax to x and %ebx to y, and then just rename its notion of registers, and after the swap just begin using %ebx for x and %eax for y. A swap with no copies at all!
(It won't do this in every case, and it depends on context, but such an optimization is possible... but not with xor swaps.)
To be a little more speculative . . . if it is on GPU, is there some way of using rendering operations? The 'hidden-surface problem' in graphics has been described as basically a sorting problem: so represent each number as a triangle Z . . . well, maybe it would not be very fast, but you could do a few million at once!
Seven years ago, maybe. But today's GPUs do it easily in their programmable pipeline (shaders).
And where conditionals are costly, you can use tricks like:
float temp = min(x, y);
x = max(x, y);
y = temp;
instead of the conditional swapping operations, to make the whole network sorting algorithm deterministic. (Edit: notice that min() and max() here are hardware operations.)
The premise is interesting, but why compare results running on general-purpose CPUs when the goal is to run on a GPU? At this level of optimization, comparing two vastly different architectures and compilers would make any results pretty irrelevant.
[+] [-] psykotic|14 years ago|reply
His benchmarking harness also leaves a lot to be desired. But that's another issue. He also says that this is eventually intended for use on a GPU, but he's ranking the implementations by performance on a deeply pipelined out-of-order CPU. Modern GPU cores have some minimal scoreboarding but none of them do the full dance of register renaming and decoupled execution units.
[+] [-] 5hoom|14 years ago|reply
I'd never never heard of sorting networks or oblivious sorting before, and it's good to see some answers other than "Quicksort!"
[+] [-] giardini|14 years ago|reply
[+] [-] skimbrel|14 years ago|reply
[+] [-] rpearl|14 years ago|reply
Then a sufficiently smart compiler can decide to assign, say, %eax to x and %ebx to y, and then just rename its notion of registers, and after the swap just begin using %ebx for x and %eax for y. A swap with no copies at all!
(It won't do this in every case, and it depends on context, but such an optimization is possible... but not with xor swaps.)
[+] [-] onemoreact|14 years ago|reply
[+] [-] hxa7241|14 years ago|reply
[+] [-] pyrtsa|14 years ago|reply
And where conditionals are costly, you can use tricks like:
instead of the conditional swapping operations, to make the whole network sorting algorithm deterministic. (Edit: notice that min() and max() here are hardware operations.)[+] [-] drv|14 years ago|reply
[+] [-] mrcapers|14 years ago|reply
[+] [-] route66|14 years ago|reply
[+] [-] pmiller2|14 years ago|reply