tikhonj | 14 years ago:

Couldn't you replicate most of C++'s optimizations by hand in C? Because that's what I always took to be the real reason C is fast--not because pretty code is fast, but because you can make the code as ugly as you want (hello intrinsics!) to optimize it.

For example, couldn't you just inline the relevant code manually in C? Sure, it would be ugly, but it would be fast. Besides, in an absolute sense pretty much any C or C++ code is going to be ugly; if you want pretty code and a compiler that valiantly tries to optimize it, use something like Haskell :P.
This example highlights the reasons to avoid function pointers when possible.
veyron | 14 years ago:

An apples-to-apples comparison would use a handwritten sort routine for C instead of a library function: std::sort is closer to writing your own sort and putting in the function calls than to calling an external function.
cygx | 14 years ago:

To level the playing field, I took the glibc quicksort implementation from http://sourceware.org/git/?p=glibc.git;a=blob_plain;f=stdlib... (which is actually a hybrid algorithm), modified it to use a comparison function with signature _Bool (const void *, const void *) as we're only interested in whether an element is less than another, and included the source code verbatim (which is essentially what including <algorithm> does).

The C++ code still outperformed the C code, but the ratio dropped from ~2.8 to ~1.4. A 40% performance gain is nothing to sneeze at, but it's far less impressive than the sensational 3x from the headline.

The remaining difference is probably due to the choice of algorithm: the GNU STL uses introsort. It's also a hybrid based on quicksort, but it falls back to heapsort when recursion runs too deep, which guards against quicksort's quadratic worst case.
twoodfin | 14 years ago:

One way to limit branching is to try to make applications of each operation as temporally contiguous as possible. That is, rather than take each data item and apply each operation to it, structure the operations so they can each be applied to the entire data set, one after the other. The branch predictor is happy because you're predictably looping over a simpler piece of code. The I-cache is happy because you're looping over a smaller piece of code.

Optimizing for the second-level memory cache means making your data access patterns predictable (the above suggestion helps here, too) and keeping your data compact to get as much use out of each cache line as possible: Do you really need a 64-bit size_t to store an index into an array that will never be larger than a few thousand elements?
hanbam | 14 years ago:

I split the code into two binaries: "code" (qsort) and "code2" (std::sort()) and then ran both under a profiler (based on Intel's performance counters).

It seems that qsort simply executes an order of magnitude more instructions than std::sort for the same result. On the other hand, the std::sort() code, even though it's faster, has more branch mispredictions.

Here [1] are the results if you want to have a look.
radiospiel | 14 years ago:

Limit branching: code like "a < b ? 1 : b < a ? -1 : 0" (yes, this is more or less my comparison function) is bad, because the CPU has to predict 2 comparisons, and on random data each prediction is probably right only ~50% of the time. If a prediction fails, the queue of already-decoded instructions gets flushed, and the CPU's execution units have to wait for new instructions. Code which does not branch usually runs faster.

Therefore (a - b) would be a better comparison function, if you can guarantee it does not overflow.
Well, every once in a while someone comes out with "C++ is slow because of vtables". Which is false. If those guys would stop posting this, I would stop writing what is true (but probably not too relevant for most scenarios, yes).
When writing profiling code (https://github.com/vstefan/Instrum/), I use clock_gettime(CLOCK_MONOTONIC) as it measures elapsed time from an arbitrary reference point.
[1] https://gist.github.com/b97a6395c9a20c3d4cea