top | item 23217109

(no title)

andralex | 5 years ago

Thanks! I tried that too, but it's slower than Lomuto. Forgot to mention in the article.

discuss

> I tried that too, but it's slower than Lomuto.

That is very strange. I can not reproduce your results in C++, using your code.

On my machine (Threadripper 2950x, 64 bit Windows 10, GCC 10.1.0 from MSYS2 MinGW64) my algorithm performs best and your branchless version ends up slower than your branchy one:

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && a 10000000
    min_milliseconds=828.1250
    median_milliseconds=890.6250

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && a 10000000
    min_milliseconds=671.8750
    median_milliseconds=671.8750

    $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && a 10000000
    min_milliseconds=343.7500
    median_milliseconds=343.7500

On an Intel university machine (Xeon E5-2667 v2 @ 3.3GHz, 64 bit Ubuntu 16.04 LTS, GCC 5.4.0) I get:

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO lomuto.cpp && ./a.out 10000000
    min_milliseconds=514.8051
    median_milliseconds=536.1984

    $ g++ -std=c++17 -O3 -DNDEBUG -DLOMUTO_BRANCHY lomuto.cpp && ./a.out 10000000
    min_milliseconds=688.2471
    median_milliseconds=698.9603

    $ g++ -std=c++17 -O3 -DNDEBUG -DPDQSORT lomuto.cpp && ./a.out 10000000
    min_milliseconds=340.1935
    median_milliseconds=344.7432

Maybe AMD or Windows doesn't like your branchless code or perhaps GCC is generating inferior code for AMD/Windows, as at least on Intel/Linux your Lomuto branchless code can beat the branchy code. But pdqsort (which uses branchless Hoare partitioning) consistently beats both.

I didn't change the benchmark, I only added pdqsort to yours, my full changes are the simple inclusion of 1 header and 3 lines of code: https://github.com/orlp/lomuto/commit/29381176f49e3588f6882c...

In order for any interested readers to reproduce, simply clone https://github.com/orlp/lomuto and run the above commands.