> And lastly, this isn’t meant to be a prescriptive post,
> but we all know why micro-optimizing is usually a
> mistake: it wastes your time, it’s easy to screw up
> (see question 5), and it *typically* produces no
> measurable speedup.
I emphasized the *typically* because this is a key idea: you are fallible -- deal with it!
I visited a local hacker-space with a friend of mine to talk about a project of ours. A bunch of those guys are from NASA, and they know quite a bit about how to engineer for reliability. Since our project involved a lot of soldering, he suggested that we estimate our average solder joint failure rate and use that as the number of expected failures to look for. That's what he does in his personal projects.
I exclaimed that was brilliant, then was a little embarrassed. Of course! Duh! I know I'm not perfect, and I'm smart enough to have deduced that I have something like an average failure rate for solder joints. So why have I been proceeding for years and years as if I could do everything perfectly if I just tried hard enough?
Every new function you write is a risk. Minimize your risks. Maximize the cost/benefit. Every routine you optimize is a risk. Minimize your risks. Maximize cost/benefit.
One's problems as a programmer are more likely the result of excess hubris than of insufficient smarts.
EDIT: Fear is as big a problem as hubris. Walk the middle path. You are your own optimization problem, and the optimal solution is almost never a no-brainer shove of a lever all the way in some direction.
That's awesome. I'm always impressed by the reliability work that comes out of NASA. I don't regret a single penny of my tax dollars which goes to them just for this sort of thing alone.
ridiculousfish is the smartest programmer I've ever met. I hired him at Apple years ago, and while I could regale you with tales of how awesome he is, Apple would probably sue me.
That said, if you've ever played World of Warcraft on the Mac, you should worship him as a god.
Sounds like the kind of guy I want to work with on a daily basis, even if he spent half the time yelling at me for doing something stupid. After a few months I'd be such a better programmer.
This dude is pretty much my dream cofounder. Math background, aesthetically intuitive, intellectually curious, specifically interested in psychology, and clearly a talented programmer.
I avoid these types of optimizations, say, 97% of the time
This reminded me of Linus' blog post on optimizing SHA1 routines in Git [0]. FTA: "Some people seem to think that C is a real programming language, but they are sadly mistaken. It really is about writing almost-portable assembly language, and it turns out that getting good results from SHA1 really is mostly about trying to fight the compilers tendency to try to be clever."
There is a self-tuning linear algebra library called ATLAS. When building, it probes various things about your computer, such as L1/L2 cache sizes and whether you have a fused multiply/add, and uses these to output very highly tuned matrix kernels in standard C.
Ironically, when compiling this C code you have to use plain -O; higher optimization levels make the compiler try to be too smart for its own good, and the result actually performs worse.
"Improving the Performance of the Secure Hash Algorithm" (by transforming the algorithm mathematically and then using hand-written assembler with SIMD instructions):
A long, long time ago, I recall someone did a study of code generated by C compilers and found that, on average, one C statement = 2.5 lines of assembly (for CISC).
"GCC does this optimization, because strlen is a 'built-in function': one that gcc recognizes and optimizes specially. Disabling built-ins with -fno-builtin defeats this optimization."
No, it does it because strlen is declared to be a pure function, and it is smart enough to realize you are not modifying anything in s. -fno-builtin does not change the generated code; it still calls the strlen function (at least on my gcc), but the call is hoisted out of the loop.
GCC surprised me once. I had put some effort into replacing something like (x << 12) | (x >> 20) with the appropriate roll instruction at the assembly level. Only after seeing that my "optimized" code ran no faster than plain GCC did I look at the GCC output, and of course the roll was already there.
I believe it was GCC 3.*, so even older than the one mentioned in the article.
Ha, I really like this guy. I started reading another one of his blog entries (a description of how compilers often optimise division by a constant: instead of issuing an expensive divide, they multiply by a "magic" number and then right-shift the result). After hitting lots of complex (for me, anyway) mathematical proofs, my brain tuned out slightly until I saw
"Did you skip ahead? It's OK. Here's the summary"
I like to think he deliberately did that for hungover-yet-still-interested devs like myself.
Well, I never read proofs on the first pass. I mean, I read each expression, but I don't bother trying to assimilate them into a coherent thought until I've finished any associated plain-English text.
And even that I'll sometimes wait to properly comprehend for a second pass.
Question number 2 really surprised me. What if the layout of memory is:
(s)(s+1)(s+2)...(result)
If `s` has no null bytes before `result`, then `strlen(s)` will terminate at `result`. However, if you change `result` while iterating over `s`, then the byte at `result` may become non-null and the `strlen()` result will change. Or is there some rule in the C standard that says local variables are not in the same memory range as external pointers?
Certain pointer operations may result in undefined behavior:
    int main()
    {
        int arr[4] = {0, 1, 2, 3};
        int *p = arr + 5; // undefined behavior
    }
In short, you can't assume that result will be placed after the array. What if result is optimized to a register? It actually sounds quite likely to me given the tight loop.
Since "result" is on the stack, it will only be allocated at the time of the function call. Also note that the stack grows downwards in memory, so to get the memory layout you've shown you'd have to be summing the bytes in the unallocated memory between the heap and the top of the stack. Probably this is spelled out somewhere in the standard, but I don't know it offhand.
If you've managed to embed an integer into your char array and you are using that integer as an accumulator, you're probably "doing it wrong", very wrong. More to the point, you've got bigger problems than gcc's optimization specifics.
As for this particular example, result is allocated on the stack and is guaranteed to not lie within s, so the gcc optimization is perfectly safe.
By any chance, does this sort of information exist at all for Java bytecode compilers? Whenever I write Java code I try to do things the Java way, but it usually makes me cringe to think that something as simple as accessing a boolean field requires a function call.
I console myself by thinking that this all gets optimized away by the JIT compiler, but I'd really like to know for sure.
I think the fact that I got the first one wrong made me err too far toward thinking GCC was smarter than I had first assumed. And it is, but only for certain things.
Does 'const' mean that the function is not allowed to change the contents of the string, or does it mean that the contents of the string will never change? Suppose the 'sum' function was running in one thread, while in another the string that was passed in as 's' is being continuously written to. While this may not make much sense to do, and the output would not have a reliable value, this optimization would be the wrong thing to do.
const only means that the function is not allowed to change memory through that reference. It says nothing about aliases to the same memory.
But the "what if another thread changes the string" issue is a red herring. The question to keep in mind when optimizing in a multi-threaded context is "does the output of the compiler-generated code correspond to at least one possible interleaving of threads?" Since this code doesn't do any synchronization, it's possible that the whole function executes without being interrupted by another thread, so this is a sound optimization.
In a modern CPU, instruction selection is much more complicated than "oh, addl (or sll) is faster than umul," and depends on the micro-architecture/pipeline, surrounding code, etc. Addition usually takes one cycle, while multiplication might take several (although, with pipelining, you can generally execute one multiply per cycle per multiplier). But if all the adders are busy and the result isn't needed for a few cycles, it might be fastest to use the multiplier. (I'm thinking of a statically scheduled pipeline; things get harder when the CPU does dynamic instruction reordering.)
Addition and bitshifting are easier to implement in hardware than multiplication is, with less complicated circuits that can return results in fewer clock cycles (though the difference is much less than it used to be).
Nice post. This is how the Nullstone compiler optimization benchmark suite is organized: two functionally equivalent versions of a C program, one cleaner/suboptimal, the other optimized by hand. The goal is to have the generated code perform the same in either case.
I got almost all of it wrong. I'm not very familiar with gcc's workings and usually just figure that the compiler is going to be dumb and fail at everything but the simplest optimizations. But then, when I actually go to do the tedious and source-obfuscating micro stuff I usually check my assumptions first - you kind of have to anyway, once you go that deep.
The point of most of the optimizations mentioned in the article is not to make programs run faster; e.g. any competent programmer can hoist a constant function call out of a for loop himself, but knowing that the compiler will do this allows you to write cleaner code.
(This does not apply to e.g. auto-vectorization, which replaces 'x[0] = y[0] * z[0]; x[1] = y[1] * z[1]; ...' with a single SSE instruction that calculates all of it at once. Such instructions are much faster and would otherwise require dropping down to assembly.)
> when I actually go to do the tedious and source-obfuscating micro stuff I usually check my assumptions first - you kind of have to anyway, once you go that deep.
Absolutely. Fun to read, made me smarter, but half of it could be different in any other version of gcc.
And frankly, if these are your problems, you're doing pretty well. I worked on a product that had some numerical parts, which were sometimes the limiting factor in performance, though usually not. So one day I was talking to a guy who wrote the numerical code about why regression testing was difficult and subtle, and it was revealed that our numerical results could differ from build to build. I.e., a single build of the product always produced the same results, but if you changed the floating point code, the new build might produce different (though equally accurate) results, even if semantically the program was supposed to perform the same C math operations on the same data.
Which makes perfect sense if you're using the old stack-based x87-style floating point instructions. We couldn't be... not in the 21st century on Opterons... check Makefile and gcc man page... biggest facepalm of my career.
We were able to announce a nice speedup on some workloads in the next release.
He could have added the switch statement optimization; that's my favorite. GCC can optimize switch statements with sequential keys (e.g. 1, 2, 3, 4, ...) into a jump table. AFAIK this was introduced in GCC 4.
#2 was a mistake I once made, and a co-worker screamed at me that I was a crappy programmer for it. Now that I know it's optimized, I feel like he owes me an apology... or maybe not :-)
His blog is that perfect combination of really meaty articles that are published infrequently enough to really be savoured.
[0] http://torvalds-family.blogspot.com/2009/08/programming.html
http://software.intel.com/en-us/articles/improving-the-perfo...
I wouldn't be surprised if e.g. MSVC didn't perform that optimization since it doesn't have the pure function attribute.