top | item 1540567

Will it optimize?

357 points | mark_h | 15 years ago | ridiculousfish.com | reply

59 comments

[+] stcredzero|15 years ago|reply

    > And lastly, this isn’t meant to be a prescriptive post,
    > but we all know why micro-optimizing is usually a 
    > mistake: it wastes your time, it’s easy to screw up 
    > (see question 5), and it *typically* produces no 
    > measurable speedup.

I emphasized the *typically* because this is a key idea: you are fallible -- deal with it!

I visited a local hacker-space with a friend of mine to talk about a project of ours. A bunch of those guys are from NASA, and they know quite a bit about how to engineer for reliability. Since our project involved a lot of soldering, he suggested that we estimate our average solder joint failure rate and use that as the number of expected failures to look for. That's what he does in his personal projects.

I exclaimed that was brilliant, then was a little embarrassed. Of course! Duh! I know I'm not perfect, and I'm smart enough to have deduced that I have something like an average failure rate for solder joints. So why have I been proceeding for years and years as if I could do everything perfectly if I just tried hard enough?

Every new function you write is a risk. Minimize your risks. Maximize the cost/benefit. Every routine you optimize is a risk. Minimize your risks. Maximize cost/benefit.

One's problems as a programmer are more likely the result of excess hubris and not insufficient smarts.

EDIT: Fear is as big a problem as hubris. Walk the middle path. You are your own optimization problem, and the optimal solution is almost never a no-brainer shove of a lever all the way in some direction.

[+] flogic|15 years ago|reply
That's awesome. I'm always impressed by the reliability work that comes out of NASA. I don't regret a single penny of my tax dollars that goes to them, for this sort of thing alone.
[+] wooster|15 years ago|reply
ridiculousfish is the smartest programmer I've ever met. I hired him at Apple years ago, and while I could regale you with tales of how awesome he is, Apple would probably sue me.

That said, if you've ever played World of Warcraft on the Mac, you should worship him as a god.

:-)

[+] mark_h|15 years ago|reply
If you ever find a way to regale us without being sued, I am all ears :)

His blog is that perfect combination of really meaty articles that are published infrequently enough to really be savoured.

[+] wheaties|15 years ago|reply
Sounds like the kind of guy I want to work with on a daily basis, even if he spent half the time yelling at me for doing something stupid. After a few months I'd be a much better programmer.
[+] jakevoytko|15 years ago|reply
I avoid these types of optimizations, say, 97% of the time.

This reminded me of Linus' blog post on optimizing SHA1 routines in Git [0]. FTA: "Some people seem to think that C is a real programming language, but they are sadly mistaken. It really is about writing almost-portable assembly language, and it turns out that getting good results from SHA1 really is mostly about trying to fight the compiler's tendency to try to be clever."

[0] http://torvalds-family.blogspot.com/2009/08/programming.html

[+] grogers|15 years ago|reply
There is a self-tuning linear algebra library called ATLAS. When building, it probes various things about your computer: L1/L2 cache sizes, whether you have fused multiply-add, etc. It uses these to output very highly tuned matrix kernels in standard C.

Ironically, when compiling this C code, you have to use plain -O; higher optimization levels make the compiler try to be too smart for its own good, and the code actually performs worse.

[+] stcredzero|15 years ago|reply
A long, long time ago, I recall that someone did a study of code generated by C compilers and found that, on average, one C statement compiled to about 2.5 lines of assembly (for CISC).
[+] igorgue|15 years ago|reply
SHA1 is still pretty damn fast in Git; I wasn't able to get anywhere close to its speed with my own implementation.
[+] grogers|15 years ago|reply
"2. Loop-invariant strlen()

GCC does this optimization, because strlen is a "built-in function:" one that gcc recognizes and optimizes specially. Disabling built-ins with -fno-builtin defeats this optimization."

No, it does it because strlen is declared to be a pure function, and GCC is smart enough to realize you are not modifying anything in s. -fno-builtin does not defeat the optimization: the generated code still calls the strlen function (at least with my gcc), but the call is hoisted out of the loop.

[+] astrange|15 years ago|reply
strlen() is not declared pure in the OS X headers.
[+] gfodor|15 years ago|reply
If this is true it makes me happy because this is surely less of a hack.
[+] praptak|15 years ago|reply
GCC surprised me once. I had put some effort into replacing something like (x << 12) | (x >> 20) with the appropriate roll instruction at the assembly level. Only after seeing that my "optimized" code ran no faster than plain GCC output did I look at the GCC output, and of course the roll was there.

I believe it was GCC 3.*, so even older than the one mentioned in the article.

[+] smcl|15 years ago|reply
Ha, I really like this guy. I started reading another one of his blog entries, a description of how compilers often optimise division by a constant: instead of issuing an expensive divide, the compiler multiplies by a "magic" number and then right-shifts the result. After hitting lots of (for me, anyway) complex mathematical proofs, my brain tuned out slightly until I saw

"Did you skip ahead? It's OK. Here's the summary"

I like to think he deliberately did that for hungover-yet-still-interested devs like myself.

[+] sprout|15 years ago|reply
Well, I never read proofs on the first pass. I mean, I read each expression, but I don't bother trying to assimilate them into a coherent thought until I've finished any associated plain-English text.

And even then I'll sometimes wait for a second pass to comprehend it properly.

[+] viraptor|15 years ago|reply
Question number 2 really surprised me. What if the layout of memory is:

    (s)(s+1)(s+2)...(result)
If `s` has no null bytes before `result`, then `strlen(s)` will terminate at `result`. However, you change `result` while iterating over `s`, so the byte at `result` is no longer null and the `strlen()` result changes. Or is there some rule in the C standard that says local variables are not in the same memory range as external pointers?
[+] ramchip|15 years ago|reply
Attempting to read s[i] when i is outside the bounds of the array is undefined behaviour, so the compiler is free to ignore this case. See example at http://en.wikipedia.org/wiki/Undefined_behavior#Examples_in_...

    Certain pointer operations may result in undefined behavior:

    int main()
    {
        int arr[4] = {0, 1, 2, 3};
        int *p = arr + 5; // undefined behavior
    }
In short, you can't assume that `result` will be placed after the array. And what if `result` is optimized into a register? That actually sounds quite likely to me, given the tight loop.
[+] notaddicted|15 years ago|reply
Since "result" is on the stack, it will only be allocated at the time of the function call. Also note that the stack grows downwards in memory. To get the memory layout you've shown, you'd have to be summing up the bytes of unallocated memory between the heap and the top of the stack. This is probably covered somewhere in the standard, but I don't know the details personally.
[+] InclinedPlane|15 years ago|reply
If you've managed to embed an integer into your char array, and you are using that integer as an accumulator, you're probably "doing it wrong" -- very wrong. More to the point, you've got bigger problems than gcc's optimization specifics.

As for this particular example, result is allocated on the stack and is guaranteed to not lie within s, so the gcc optimization is perfectly safe.

[+] drblast|15 years ago|reply
Interesting article.

By any chance, does this sort of information exist at all for Java bytecode compilers? Whenever I write Java code I try to do things the Java way, but it usually makes me cringe to think that something like accessing a boolean field requires a function call.

I console myself by thinking that this all gets optimized away by the JIT compiler, but I'd really like to know for sure.

[+] tumult|15 years ago|reply
Got me on the first and last ones. Cool post!

I think the fact that I got the first one wrong made me err too far into thinking GCC was smarter than I had first assumed. And it is, but only for certain things.

[+] afhof|15 years ago|reply
Concerning #2:

Does 'const' mean that the function is not allowed to change the contents of the string, or does it mean that the contents of the string will never change? Suppose the 'sum' function were running in one thread while, in another thread, the string passed in as 's' was being continuously written to. While this may not make much sense to do, and the output would not have a reliable value, this optimization would be the wrong thing to do.

[+] okmjuhb|15 years ago|reply
const only means that the function is not allowed to change memory through that reference. It says nothing about aliases to the same memory.

But the "what if another thread changes the string" issue is a red herring. The question to keep in mind when optimizing in a multi-threaded context is "does the output of the compiler-generated code correspond to at least one possible interleaving of threads?" Since this code doesn't do any synchronization, it's possible that the whole function executes without being interrupted by another thread, so this is a sound optimization.

[+] albertzeyer|15 years ago|reply
`x+x` is faster than `x*2`? Really? Why? And how about `x<<1`?
[+] mian2zi3|15 years ago|reply
In a modern CPU, instruction selection is much more complicated than "oh, addl (or sll) is faster than umul"; it depends on the micro-architecture/pipeline, surrounding code, etc. Addition usually takes one cycle, while multiplication might take several (although, with pipelining, you can generally execute one multiply per cycle per multiplier). But if all the adders are busy and the result isn't needed for a few cycles, it might be fastest to use the multiplier. (I'm thinking of a statically scheduled pipeline; things get harder when the CPU does dynamic instruction reordering.)
[+] _delirium|15 years ago|reply
Addition and bitshifting are easier to implement in hardware than multiplication is, with less complicated circuits that can return results in fewer clock cycles (though the difference is much less than it used to be).
[+] rquirk|15 years ago|reply
gcc 4.4 spitting out ARM code does that: it gives you `mov r0, r0, asl #1`, and for Thumb `lsl r0, r0, #1`, when using -O2 or -Os.
[+] mian2zi3|15 years ago|reply
Nice post. This is how the Nullstone compiler optimization benchmark suite is organized: two functionally equivalent versions of a C program, one cleaner but suboptimal, the other optimized by hand. The goal is for the generated code to perform the same in either case.
[+] nitrogen|15 years ago|reply
Wouldn't it be possible to optimize floating point multiplication by 2.0f (or any power of two) to an addition or subtraction on the exponent?
[+] astrange|15 years ago|reply
Only on architectures with such instructions. x86 doesn't have anything like that (well, maybe in x87), and I don't know of any others that do.
[+] chipsy|15 years ago|reply
I got almost all of it wrong. I'm not very familiar with gcc's workings and usually just figure that the compiler is going to be dumb and fail at everything but the simplest optimizations. But then, when I actually go to do the tedious and source-obfuscating micro stuff I usually check my assumptions first - you kind of have to anyway, once you go that deep.
[+] JoachimSchipper|15 years ago|reply
The point of most of the optimizations mentioned in the article is not to make programs run faster; e.g. any competent programmer can hoist a constant function call out of a for loop himself, but knowing that the compiler will do this allows you to write cleaner code.

(This does not apply to e.g. auto-vectorization, which replaces 'x[0] = y[0] * z[0]; x[1] = y[1] * z[1]; ...' with a single (SSE?) instruction that calculates all of that at once. Such instructions are much faster, and using them would require dropping down to assembly without a good optimizer.)

[+] dkarl|15 years ago|reply
    > when I actually go to do the tedious and source-obfuscating micro
    > stuff I usually check my assumptions first - you kind of have to
    > anyway, once you go that deep.

Absolutely. Fun to read, made me smarter, but half of it could be different in any other version of gcc.

And frankly, if these are your problems, you're doing pretty well. I worked on a product that had some numerical parts, which were sometimes the limiting factor in performance, though usually not. So one day I was talking to a guy who wrote the numerical code about why regression testing was difficult and subtle, and it was revealed that our numerical results could differ from build to build. I.e., a single build of the product always produced the same results, but if you changed the floating point code, the new build might produce different (though equally accurate) results, even if semantically the program was supposed to perform the same C math operations on the same data.

Which makes perfect sense if you're using the old stack-based x87-style floating point instructions. We couldn't be... not in the 21st century on Opterons... check Makefile and gcc man page... biggest facepalm of my career.

We were able to announce a nice speedup on some workloads in the next release.

[+] robryan|15 years ago|reply
Great article. I only really have a basic knowledge of C from uni, but it was easy to follow along and learn something.
[+] pquerna|15 years ago|reply
great post with a nice quiz format!
[+] gwern|15 years ago|reply
I wish more posts used quizzes; they make it that much harder to lie to yourself.
[+] Aegean|15 years ago|reply
He could have added the switch-statement optimization; that's my favorite. GCC 4 can optimize switch statements with sequential keys (e.g. 1, 2, 3, 4, ...) into a jump table. This was, AFAIK, introduced in GCC 4.
[+] ramchip|15 years ago|reply
But he did: "The switch statement was optimized to a jump table, while the if statements became a long sequence of compares."
[+] igorgue|15 years ago|reply
#2 was a mistake I made once, and a co-worker screamed at me that I was a crappy programmer because of it. Now that I know it's optimized, I feel like he owes me an apology -- or maybe not :-)
[+] yuvi|15 years ago|reply
Were you using gcc?

I wouldn't be surprised if, e.g., MSVC didn't perform that optimization, since it doesn't have the pure function attribute.