tikhonj | 14 years ago:

Couldn't you replicate most of C++'s optimizations by hand in C? Because that's what I always took to be the real reason C is fast--not because pretty code is fast, but because you can make the code as ugly as you want (hello intrinsics!) to optimize it.

For example, couldn't you just inline the relevant code manually in C? Sure, it would be ugly, but it would be fast. Besides, in an absolute sense pretty much any C or C++ code is going to be ugly; if you want pretty code and a compiler that valiantly tries to optimize it, use something like Haskell :P.
This example highlights the reasons to avoid function pointers when possible.
veyron | 14 years ago:

An apples-to-apples comparison would use a handwritten sort routine for C instead of a library function: std::sort is closer to writing your own sort and putting in the function calls than to calling an external function.
cygx | 14 years ago:

To level the playing field, I took the glibc quicksort implementation from http://sourceware.org/git/?p=glibc.git;a=blob_plain;f=stdlib... (which is actually a hybrid algorithm), modified it to use a comparison function with signature _Bool (const void *, const void *) as we're only interested in whether an element is less than another, and included the source code verbatim (which is essentially what including <algorithm> does).

The C++ code still outperformed the C code, but the ratio dropped from ~2.8 to ~1.4. A 40% performance gain is nothing to sneeze at, but it's far less impressive than the sensational 3x from the headline.

The remaining difference is probably due to the choice of algorithm: the GNU STL uses introsort. It's also a hybrid based on quicksort, but it falls back to heapsort when recursion runs too deep, which guards against quicksort's quadratic worst case.
twoodfin | 14 years ago:

One way to limit branching is to try to make applications of each operation as temporally contiguous as possible. That is, rather than take each data item and apply each operation to it, structure the operations so they can each be applied to the entire data set, one after the other. The branch predictor is happy because you're predictably looping over a simpler piece of code. The I-cache is happy because you're looping over a smaller piece of code.

Optimizing for the second-level memory cache means making your data access patterns predictable (the above suggestion helps here, too) and keeping your data compact to get as much use out of each cache line as possible: Do you really need a 64-bit size_t to store an index into an array that will never be larger than a few thousand elements?
hanbam | 14 years ago:

I split the code into two binaries: "code" (qsort) and "code2" (std::sort()) and then ran both under a profiler (based on Intel's performance counters).

It seems that qsort simply executes an order of magnitude more instructions than std::sort for the same result. On the other hand, the std::sort() code, even though it's faster, has more branch mispredictions.

Here [1] are the results if you want to have a look.
radiospiel | 14 years ago:

Limit branching: code like "a < b ? 1 : b < a ? -1 : 0" (yes, this is more or less my comparison function) is bad, because the CPU has to predict 2 comparisons, and on random data each prediction is probably right only ~50% of the time. If a prediction fails, the queue of already-decoded instructions gets flushed, and the CPU's execution units have to wait for new instructions. Code which does not branch usually runs faster.

Therefore (a - b) would be a better comparison function, if you can guarantee it does not overflow.
Well, every once in a while someone comes out with "C++ is slow because of vtables". Which is false. If those guys would stop posting this, I would stop writing what is true (but probably not too relevant for most scenarios, yes).
When writing profiling code (https://github.com/vstefan/Instrum/), I use clock_gettime(CLOCK_MONOTONIC) as it measures elapsed time from an arbitrary reference point.
[1] https://gist.github.com/b97a6395c9a20c3d4cea