So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.
In my experience, loop unrolling should basically never be done except in extremely degenerate cases; I remember not long ago someone I know who also optimises Asm remarking "it should've died along with the RISC fad". The original goal was to reduce per-iteration overhead associated with checking for end-of-loop, but any superscalar/OoO/speculative processor can "execute past" those instructions anyway; all that unrolling will do is bloat the code and work against caching. Memory bandwidth is often the bottleneck, not the core.
> In my experience, loop unrolling should basically never be done except in extremely degenerate cases
Not true. Like many such optimizations, loop unrolling can be useful because it makes downstream loads constant.
For example:
    float identity[4][4];
    for (unsigned y = 0; y < 4; y++)
        for (unsigned x = 0; x < 4; x++)
            identity[y][x] = y == x ? 1 : 0;
... do some matrix math ...
In this case, the compiler probably wants to unroll the loops so that it can straightforwardly forward the constant matrix entries directly to the matrix arithmetic. It'll likely be able to eliminate lots of operations that way.
(You might ask "who would write this code?" As Schemers say: "macros do.")
In addition to the optimisations already mentioned, loop unrolling also typically enables vectorisation in compilers. You might argue that for vectorisation it is not strictly necessary to have the relevant operations next to each other in a contiguous instruction stream, but it makes the vectorisation pass a lot nicer and simpler (if it can be called that to begin with).
In most cases on modern systems, small loops should remain as compact as possible, to stay in the uop cache. The "for" loop overhead instructions (the inc, cmp, and jmp) effectively execute in parallel with the real work. Modern systems are highly out-of-order, and the for-loop overhead is virtually nil.
Did you read TFA? The author did that (though using GCC), and the reason the optimizer does what you see is undefined behavior due to signed integer overflow.
With all the optimisations implemented in compilers today, it is surprising to see an opportunity like this missed. Put differently: compiler writers labour over optimisations that gain 0.1% performance in some special cases, while others that could gain 20% go unimplemented.
Why? Is this optimisation particularly difficult to implement? Or is it just missed low-hanging fruit? It sure looks easy (like: rearrange expressions to keep the expression tree shallow and left-branching to avoid stack operations).
Compiler developers have tons of benchmarks which they run. I’d bet that this is as simple as not being significant in their test suite, with a good chance that it’s both not as simple as it might seem or that there are impacts on more complicated code which is in their benchmark suite or a big customer’s app.
The truth is that the HotSpot compiler is pretty old at this point and never really implemented a lot of good, robust, and thorough optimizations (I've read the source every year or two).
It does some stuff and hopes for the best.
This is why there is a real commercial JVM market with Azul.
It's possible that they're working in the frame of mind that there aren't any low-hanging fruit left after so many years of compiler optimizations and forget to even try.
It is a good answer, but my favorite by far is an answer about branch prediction to explain why processing a sorted array is faster than unsorted: https://stackoverflow.com/q/11227809/938695
I thought at first this was because integer squaring is potentially faster than general integer multiplication and the compiler wasn't seeing the square operation in the second case, but that's not the explanation here.
Ideally, yes: the compiler should generate the fastest possible code for the same math expression.
However, the compiler (specifically its optimization step) is not magic and sometimes produces suboptimal code. Back when C was young (the 1970s and 1980s), this was frequently the case, so dropping into assembly to hand-code performance-critical sections was just what people did to get software to run smoothly.
Thankfully this is largely no longer the case, however it does still happen.
In Java's case, the JVM runs on top of multiple different architectures which makes optimization even more complicated.
Low-level instruction generation and optimization is just one topic under the umbrella of compiler design, which is a huge (and fascinating!) discipline to get into.
Because it's more work for the compiler to reduce the expression to something canonical (and it might even be impossible).
Also, what good would it bring? What if the canonical expression triggers the slow path? Then you would have no means to change it into the fast version.
Further, in the case of floating point operations, operation order matters for rounding. And with integer operations, the actual form used can be important for preventing overflow (of intermediate results).
Is overflow UB, so the compiler can choose to ignore the fact that 2*(i*i) could overflow differently from 2*i*i?
I’m not sure it does overflow differently, but I would expect overflow to behave consistently as written and not depend on optimization. Is that not the case?
How? Java is compiled to bytecode; you don't know the architecture of the system the code is going to run on. It's one of the reasons javac implements only the simplest optimizations (constant folding and the like).
Tried what specifically? This particular example, or something similar where the compiler generates code with different speeds for seemingly equivalent code?
userbinator|7 years ago
pcwalton|7 years ago
See LLVM's heuristics: http://llvm.org/doxygen/LoopUnrollPass_8cpp.html#ad7c38776d7...
tomsmeding|7 years ago
nonsince|7 years ago
dragontamer|7 years ago
nwmcsween|7 years ago
astn|7 years ago
https://godbolt.org/z/aKtko5
jepler|7 years ago
cryptonector|7 years ago
utopcell|7 years ago
jepler|7 years ago
ychen306|7 years ago
beeforpork|7 years ago
acdha|7 years ago
DannyBee|7 years ago
yifanl|7 years ago
unknown|7 years ago
[deleted]
techopoly|7 years ago
azhenley|7 years ago
falcor84|7 years ago
https://codegolf.stackexchange.com/questions/11880/build-a-w...
Veedrac|7 years ago
dreamcompiler|7 years ago
garmaine|7 years ago
ww520|7 years ago
jcdavis|7 years ago
unknown|7 years ago
[deleted]
podsnap|7 years ago
bnegreve|7 years ago
The compiler should detect that the two expressions are strictly equivalent and generate whatever code it believes is the fastest.
Any idea why it is this way?
gnuvince|7 years ago
Here's another example using floating point numbers:
The output of this program is `500041.16 500012.16`, a difference of 29 for a program that computes the same result (unless I made a mistake).
fragmede|7 years ago
amelius|7 years ago
crb002|7 years ago
pjmlp|7 years ago
https://www.youtube.com/watch?v=_cFwDnKvgfw
There are also other tools like JITWatch.
https://github.com/AdoptOpenJDK/jitwatch/wiki/Videos-and-Sli...
https://vimeo.com/181925278
alkonaut|7 years ago
BeeOnRope|7 years ago
Koshkin|7 years ago
microcolonel|7 years ago
qwerty456127|7 years ago
isbvhodnvemrwvn|7 years ago
polskibus|7 years ago
pjmlp|7 years ago
You have the old JIT, replaced by RyuJIT on .NET 4.6 and .NET Core.
Then .NET Native, which does AOT compilation via the same backend as Visual C++.
Followed by Mono's JIT/AOT implementation.
Windows/Windows Phone 8.x used a Bartok derived compiler for the MDIL format.
Same applies to Java though, as the answer only goes through what Hotspot does, but there are many other JIT/AOT compilers for Java as well.
openloop|7 years ago
[deleted]
idiot2|7 years ago
[deleted]
networkimprov|7 years ago
saagarjha|7 years ago
pmarreck|7 years ago
JohnL4|7 years ago
(My cup of bitterness doth overflow.)