That is very strange. I can not reproduce your results in C++, using your code.
On my machine (Threadripper 2950x, 64 bit Windows 10, GCC 10.1.0 from MSYS2 MinGW64) my algorithm performs best and your branchless version ends up slower than your branchy one:
Maybe AMD or Windows doesn't like your branchless code or perhaps GCC is generating inferior code for AMD/Windows, as at least on Intel/Linux your Lomuto branchless code can beat the branchy code. But pdqsort (which uses branchless Hoare partitioning) consistently beats both.
nightcracker|5 years ago
That is very strange. I can not reproduce your results in C++, using your code.
On my machine (Threadripper 2950x, 64 bit Windows 10, GCC 10.1.0 from MSYS2 MinGW64) my algorithm performs best and your branchless version ends up slower than your branchy one:
On an Intel university machine (Xeon E5-2667 v2 @ 3.3GHz, 64 bit Ubuntu 16.04 LTS, GCC 5.4.0) I get: Maybe AMD or Windows doesn't like your branchless code or perhaps GCC is generating inferior code for AMD/Windows, as at least on Intel/Linux your Lomuto branchless code can beat the branchy code. But pdqsort (which uses branchless Hoare partitioning) consistently beats both.I didn't change the benchmark, I only added pdqsort to yours, my full changes are the simple inclusion of 1 header and 3 lines of code: https://github.com/orlp/lomuto/commit/29381176f49e3588f6882c...
In order for any interested readers to reproduce, simply clone https://github.com/orlp/lomuto and run the above commands.