A couple of weeks ago I'd never heard of Peter Cordes. Now the linked article is the third time I've seen his work. He's doing a fine job of fixing Stackoverflow's low-level optimization knowledge. Not so long ago all I seemed to find there was people saying something like, "well, you shouldn't optimize that anyway", or, "modern computers are very complex, don't even try to understand what's happening".
> "modern computers are very complex, don't even try to understand what's happening"
One thing to keep in mind is that the CPU is no longer the CPU you think it is. While the x86 instruction set used to directly control the processor, today it's just a compatibility shim. Under the covers, the CPU is converting your assembly into its own microcode and it's that microcode which is doing the real work. For example, look at some of the block diagrams in section 2 of this Intel manual:
https://software.intel.com/sites/default/files/managed/a4/60...
Combine that with all kinds of other voodoo they've added--fancy caches, branch prediction, and so forth--and the "don't even try" maxim starts to make sense. I'd amend that to say "don't even try unless you're just curious."
Personally, I think assembly is a great learning tool, e.g. for understanding how pointers work, how the stack works vs. the heap, how function calls work, and so forth. I was exposed to Motorola 68K assembly in college and it completely changed my world--lots of C++ things I considered voodoo suddenly made total sense. But I've never used assembly in production code and probably never will.
If I ever see Knuth's quote "premature optimization is the root of all evil" in response to a question again, I think I'll puke. Not only is it hard for outsiders to know what's premature and what isn't, but sometimes it's nice to make a habit of doing things the faster way when you have two choices that are otherwise indistinguishable. For example I try to use ++x instead of x++, even though 99% of the time it makes no difference.
> Not so long ago all I seemed to find there was people saying something like, "well, you shouldn't optimize that anyway
I find this "Why would you even do that?" attitude in StackOverflow to be very irritating. And it's not just in low-level optimization questions. StackOverflow should be a place for questions and answers, not a collection of lessons on corporate best practices where curiosity and experimentation is discouraged.
I live my life by that adage
TL;DR: > If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code.
Once (maybe 25 years ago?) I came across a book on assembly language programming for the Macintosh.
The authors wrote a circle-filling graphic routine which internally calculated the integer square root in assembly language, drawing the circle using the y = sqrt(r * r - x * x) formula!
What is more, the accompanying description of the function in the book boasted about how it draws a big circle in a small amount of time ("only" a quarter of a second, or some eternity of that order) because of the blazing speed of assembly language!
How could the authors have used, say, MacPaint, and not been aware that circles and ellipses can be drawn instantaneously on the same hardware: fast enough for drag-and-drop interactive resizing?
Bresenham's line algorithms and the adaptation of the general principle to circles and arcs are absolute gems. I've used those over and over in the first two decades of my career and I never ceased to be impressed with the elegance and speed.
Surprising that people writing a book 25 years ago would not have been aware of this work.
https://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm
https://en.wikipedia.org/wiki/Midpoint_circle_algorithm
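For anyone who hasn't seen it, the midpoint circle algorithm needs only integer adds and compares per plotted point -- a minimal sketch in C (the `plot` target here is an illustrative in-memory grid, not any real graphics API):

```c
#define W 16
static char grid[W][W];           /* toy framebuffer, purely for illustration */

static void plot(int x, int y) { grid[y][x] = 1; }

/* Midpoint circle algorithm: walk one octant with an integer error
   term and mirror each point into the other seven octants. No
   square roots, no divisions. */
void draw_circle(int cx, int cy, int r) {
    int x = r, y = 0;
    int err = 1 - r;              /* decision variable */
    while (x >= y) {
        plot(cx + x, cy + y); plot(cx - x, cy + y);
        plot(cx + x, cy - y); plot(cx - x, cy - y);
        plot(cx + y, cy + x); plot(cx - y, cy + x);
        plot(cx + y, cy - x); plot(cx - y, cy - x);
        y++;
        if (err < 0) {
            err += 2 * y + 1;
        } else {
            x--;
            err += 2 * (y - x) + 1;
        }
    }
}
```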
I needed to create a circle drawing algorithm for some purposes a long ways back. I was in middle school at the time.
While speed wasn't a huge issue, consistency was. Simply using sin/cos to draw the circle works, but it can lead to jagged and inconsistently placed pixels if your degree step isn't perfect. Even with my middle school level math, I realized I could just iterate over the bounding box for the circle and just use r^2 as a threshold on x^2+y^2 to determine whether a pixel should be on or off - no square root required, since it's totally redundant.
It has the bonuses of using only integer math (consistency!) and only requires slight changes to yield filled or unfilled circles. It's also almost as fast as drawing a simple filled rectangle!
The idea that the author could miss such a simple algorithm baffles me.
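The bounding-box threshold test described above, sketched in C (the `img` grid is just a stand-in for a real pixel buffer):

```c
#define DIM 32
static char img[DIM][DIM];        /* stand-in pixel buffer */

/* Filled circle: test every pixel in the bounding box against
   dx*dx + dy*dy <= r*r. Integer-only; the square root really is
   redundant. For an unfilled outline, additionally require
   dx*dx + dy*dy > (r - 1) * (r - 1). */
void fill_circle(int cx, int cy, int r) {
    for (int dy = -r; dy <= r; dy++)
        for (int dx = -r; dx <= r; dx++)
            if (dx * dx + dy * dy <= r * r)
                img[cy + dy][cx + dx] = 1;
}
```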
tl;dr -- the asm author used DIV to divide by a constant 2
More fundamentally: it's theoretically possible to at least match compiled code performance with assembly, because you could just write the code the compiler generates.
BUT, it requires a LOT of experience.
Modern compilers "know" a lot of optimizations (e.g. integer mult by fixed constant --> shifts, adds, and subtracts). Avoiding pipeline stalls requires a lot of tedious register bookkeeping, and modern processors have very complicated execution models.
It's almost always better to start with a compiler-generated critical section and see if there are possible hand optimizations.
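As a concrete example of the multiply-by-constant rewrite mentioned above: a compiler will turn `x * 10` into shifts and adds (often a pair of LEAs on x86). Written out by hand, the identity looks like this:

```c
#include <stdint.h>

/* x * 10 == 8*x + 2*x == (x << 3) + (x << 1). The wrapping
   behaviour of unsigned arithmetic makes the identity hold for
   every uint32_t input. */
uint32_t mul10(uint32_t x) {
    return (x << 3) + (x << 1);
}
```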
In University computer architecture courses, we were challenged to write a quicksort routine in assembly. We were also asked to compare the assembly that we authored with assembly that was compiled from C++ (after we authored our own solutions, of course).
It was an amazing crash-course on just how good compilers have become at optimizing. Not a single student could hand craft assembly that was faster than the compiler output. The teacher of the course was able to generate assembly that was slightly faster, and he stated that in order to do so, he had to greatly exploit his in-depth knowledge of the processor's pipeline system. That was roughly year 2000, and I'm sure compilers have only become better at their job since then.
All in all, excellent learning experience. I've since encountered several instances where developers assert superior assembly skills, and by default I'm silently skeptical of their claims.
In a prior job I worked as a C programmer, and once I tried to rewrite a few heavily used utility functions in assembly.
I measured, but it did not run any faster. Damnit, this is assembly, I said, it has to be faster. So I looked at the assembly code generated by the compiler: Turned out it was pretty much identical to the code I had written. At that point, I felt a brief surge of pride (because I was as clever as the compiler) and then disappointment (because I was not more clever than the compiler), and I figured trying to be smarter than the compiler was a waste of my time.
> It's almost always better to start with a compiler-generated critical section and see if there are possible hand optimizations.
Yeap. But once you do, it's almost always easy to get some efficiency gains, because you understand what the code intends, and the compiler does not.
Surprisingly it doesn't take much experience to do this: just a profiler, and a willingness to try different things until you find something that works.
Compelled to chime in here and say that you can just as easily write bad HLL that the compiler can't do much with in the first place.
Also back in 2016, I participated in a pseudo-challenge on the fasm board [1] and it is trivial to optimise both the C/C++ and the hand-written assembly. IMO comparing how good a compiler is at optimising is akin to asking how good it is at figuring out your intentions (and they all suck at that).
1: https://board.flatassembler.net/topic.php?t=19103
The issue is that division by a constant such as 2, while highly optimisable, must already be embodied in the algorithm at a higher level.
Essentially you must have a piece of code where that constant is just a parameter to the algorithm that suddenly when set to a value of 2 makes it that much more efficient.
Now here's the bummer: if you had already planned for it, then writing custom assembly for it is not going to make the algorithm that much faster. But when you didn't - you're rewriting the whole of the algorithm anyway.
Nice post. It really surprises me that someone who writes assembly for fun doesn't know that dividing by constants can be done faster, nor that dividing/multiplying by powers of two can be done even faster.
On the topic of hand-optimizing or not: There always seem to be two camps. The first say: "Never bother with assembly. Let the compiler do optimizing. If anything, try to write code that the compiler can optimize.".
The second camp always states something like "It is not that hard to outperform the compiler.".
I guess the complicated pipeline models are mostly the reason for the complexity. I'm not sure how much compilers do to avoid pipeline stalls though. For example, how detailed the execution model that they use is. It shouldn't be too complicated, since most compilers (as far as I know) don't differentiate between different processors, while different processors have a different execution model.
A very polite way of saying, "why are you even using assembly, when you don't understand assembly?"
The OP clearly does understand assembly enough to start doing Project Euler type problems, which is a good way to learn basic programming in any language. They got a solution in assembler, which is more than many people here would be able to do, I suspect.
And they're looking to expand their knowledge by asking on Stack Overflow about something they don't understand.
So why do you think they should be met with rudeness and hostility?
tl;dr version--the author's hand-written assembly was poor.
I guess the more interesting takeaway is "Just because it's assembly doesn't mean it's good assembly."
For fun I ported the C++ to Python and Cython without any kind of mathematical or programmatic optimizations. C++ was 0.5 seconds, then Python was 5.0 seconds. Cython, which was the same exact code as Python except sprinkled with "cdef long" to declare C types, was just 0.7 seconds.
General comment and not aimed at this specific instance:
Just because you are writing in assembler, does not mean it is going to run faster than the same code in a compiled language. There have been decades of research and who knows how many man-years of effort that have gone into producing efficient compiled code from C, C++, Fortran etc.
Your assembly skills have to be of quite a decent order to beat a modern compiler.
BTW: The answer to the question on Stack Overflow by Peter Cordes is a must-read. Brilliant.
Quite decent is perhaps a bit pessimistic. Yes, you need to be fairly fluent if you want to write hand-optimised vector code. On the other hand, you only need to know a little assembly to be able to write C code that compiles better; that compilers are frequently not stupid doesn't mean the level below has nothing to teach.
I personally find it a shame that there are so many juicy instructions that compilers have no hope of ever using effectively. How often is a compiler smart enough to solve a problem with PDEP/PEXT, for example? Those functions are versatile as heck, but you need to plan for them if you want them to show up.
Start with compiler output, add intrinsics where you can see help is needed, benchmark, repeat.
Pathologically-slow ASM is pretty rare from modern compilers in my experience.
There's a key corollary: you also either have to be doing this as a one-off or committed to maintaining it over time as hardware changes and those cool tricks become moot or even de-optimizations.
Apologies if this is somewhat off-topic for the thread, but I suspect this will be a fun puzzle for fans of low-level optimization. The theme is "optimized fizzbuzz".
The classic fizzbuzz will use %3 and %5 operations to test divisibility. As we know from the same source as OP, these are horrifically slow. In addition, the usual approach to fizzbuzz has an annoying duplication, either of the strings or of the predicates.
So, the challenge is, write an optimized fizzbuzz with the following properties: the state for the divisibility testing is a function with a period of 15, which can be calculated in 2 C operations. There are 3 tests for printing, each of the form 'if (...) printf("...");' where each if test is one C operation.
Good luck and have fun!
One common trick is to handle the %3 and %5 with down-counters that you reset to 3 or 5 when they reach zero. Or better, unroll the loop by 3 so you only need one down-counter.
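The down-counter version described above might look like this in C -- no div or mod anywhere in the loop (the function writes into a caller buffer with `sprintf` purely so the output is easy to check; the name is illustrative):

```c
#include <stdio.h>

/* FizzBuzz with down-counters: decrement two counters each
   iteration and reset them when they hit zero, instead of
   computing i % 3 and i % 5. Returns the number of bytes written. */
int fizzbuzz_counters(unsigned upTo, char *out) {
    unsigned c3 = 3, c5 = 5;
    char *p = out;
    for (unsigned i = 1; i <= upTo; i++) {
        c3--; c5--;
        if (c3 == 0 && c5 == 0) p += sprintf(p, "FizzBuzz\n");
        else if (c3 == 0)       p += sprintf(p, "Fizz\n");
        else if (c5 == 0)       p += sprintf(p, "Buzz\n");
        else                    p += sprintf(p, "%u\n", i);
        if (c3 == 0) c3 = 3;    /* reset at zero, as described */
        if (c5 == 0) c5 = 5;
    }
    return (int)(p - out);
}
```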
(Also, no, `%5` isn't "horrifically" slow when it's a compile-time-constant modulus: you or the compiler can do it with a multiply and shift for the integer division, and then x%5 = x - (x/5 * 5). https://godbolt.org/g/3HwBrF. It still sucks though.)
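Written out, the multiply-and-shift for unsigned 32-bit x/5 uses the magic constant ceil(2^34 / 5) = 0xCCCCCCCD, and it is exact for every input:

```c
#include <stdint.h>

/* Compiler-style division by the constant 5: a 64-bit multiply by
   ceil(2^34 / 5) followed by a 34-bit right shift. Exact for all
   uint32_t inputs because 5 * 0xCCCCCCCD - 2^34 == 1, an error
   small enough that it never reaches the quotient. */
uint32_t div5(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 34);
}

uint32_t mod5(uint32_t x) {
    return x - div5(x) * 5;   /* x % 5 == x - (x / 5) * 5 */
}
```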
I wrote an x86 assembly FizzBuzz for fun a while ago, intended to be an example of how to optimize (https://stackoverflow.com/a/37494090/224132). Some parts of it kind of suck, though. I made some parts efficient (like appending "buzz\n" to a buffer with a mov [rdx], r13 / add rdx, 5), but left in too many conditional branches and a function call instead of inlining everything when unrolling.
I'm glad so many people like my post that the OP linked. It was fun to write. :)
I decided to reject your conditions and replace them with my own, for no apparent reason. Mine has some of the redundancy you wish to eliminate, but the loop body is completely branchless, aside from the call to printf, of course, which we're apparently ignoring.
#include <stdio.h>

/* Each entry is a length-prefix byte followed by a NUL-terminated
   printf format string. The prefix tells the loop how far to step
   to reach the next entry. */
#define NUM "\x06" "%zu\n\0"
#define FIZZ "\x07" "Fizz\n\0"
#define BUZZ "\x07" "Buzz\n\0"

const char *x =
    NUM
    NUM
    FIZZ
    NUM
    BUZZ
    FIZZ
    NUM
    NUM
    FIZZ
    BUZZ
    NUM
    FIZZ
    NUM
    NUM
    /* The 14 entries above total 90 bytes; 0xA6 is -90 as a signed
       char, so the final step wraps x back to the start of the
       table (this relies on char being signed). */
    "\xa6" "FizzBuzz\n";

void fizzbuzz(size_t upTo) {
    for (size_t i = 1; i <= upTo; i++) {
        printf(x + 1, i);   /* skip the length byte, print the entry */
        x += *x;            /* step to the next entry (or wrap) */
    }
}

int main(int argc, char **argv) {
    fizzbuzz(100);
}
I know it is not the point of the question, but that problem would benefit greatly from memoization. Calculate it recursively and memoize the result of every step. With all the neat trickery that they are doing with assembly they could easily go sub-10ms.
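Assuming the problem in question is the classic Project Euler longest-Collatz-chain one (which the divide-by-two discussion suggests), the memoization might look like this sketch:

```c
#include <stdint.h>

/* Length of the Collatz chain starting at n, counting both n and
   the final 1. Results for n < LIMIT are cached so shared chain
   suffixes are only ever computed once. */
#define LIMIT 1000000
static uint16_t cache[LIMIT];     /* 0 means "not computed yet" */

unsigned collatz_len(uint64_t n) {
    if (n == 1) return 1;
    if (n < LIMIT && cache[n]) return cache[n];
    unsigned len = 1 + collatz_len((n & 1) ? 3 * n + 1 : n / 2);
    if (n < LIMIT) cache[n] = (uint16_t)len;
    return len;
}
```

Scanning every start below a million with this cache touches each value essentially once, which is where an estimate like sub-10ms becomes plausible.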
I whipped together a short poc in chezscheme, and it clocks in at about 50ms on my 4 yo laptop.
> If you think a 64-bit DIV instruction is a good way to divide by two, then no wonder the compiler's asm output beat your hand-written code...
Compilers employ multitudes of optimizations that will go overlooked in hand-written ASM unless you, as the author, are very knowledgeable. End of story.
When I started programming on an Apple II+ assembly was important. Today there are likely only a few people in the world who truly understand what any particular CPU family is actually doing sufficiently to beat the compiler in some cases, and they probably are the ones writing the optimizer. But 6502 was fun to code for and the tricks were mighty clever but you could understand them.
> Today there are likely only a few people in the world who truly understand what any particular CPU family is actually doing sufficiently to beat the compiler
I'm not sure that's true. There are hundreds of compilers in the world, being maintained by thousands of developers. Then there are all the JITs. And the people who make the standard library implementations. And performance critical stuff in game engines, OS kernels, hardware drivers, high-frequency trading. Then there's the embedded space. And then there's Fabrice Bellard.
My previous employer, Cambridge Silicon Radio (one of hundreds of similar companies nobody's heard of) had dozens of people on the staff that worked on this kind of thing. I have friends at ARM, Broadcom, Samsung and Raspberry Pi that mess around with processor designs for a living. This is just my little experience of the industry. There are armies of these people.
I had the same kind of fun with the 8080/Z80. My favorite was the conditional function return instruction; it was only 1 byte instead of 3 bytes for a conditional jump.
I can't do it therefore it must be impossible!
https://news.ycombinator.com/newsguidelines.html
We detached this subthread from https://news.ycombinator.com/item?id=15072499 and marked it off-topic.
I'm a C++ developer in an embedded system where speed counts. I still live by that adage. I know that if I find a particular tiny area of code that is a bottleneck I can drop to low-level assembly and after a year of work beat the compiler by a tiny amount on ONE of the THREE CPUs we support - but it would take me an entire year of nothing else.
I know of thousands of places where our code does not use the best algorithm (they are not bottlenecks). I know of hundreds of real bottlenecks that I think could be fixed with algorithm work, but the bottleneck isn't large enough to be worth putting effort into. (this decision is subject to change)
Standing on the shoulders of those who came before so that we may focus on solving real problems ;)
I tease. I'm also a pretty big Python/JS developer in the robotics industry and I definitely live, to an extent, by that adage. Intentionally so, as I only have so much time in a day and optimization would be a sub-optimal use of company resources. Admittedly I have a comfortable baseline understanding of how computers work, but I do mentally map certain things as black boxes.