Don't have a compiler handy (or for that matter anything more recent than a Core 2 Duo), but the thing to do would be to replace the call instruction with a same-length sequence of NOPs to see if it's an instruction alignment issue. He mentions making sure the loop header is aligned, but there are other alignment issues in play.
For example, in Intel processors with a u-op cache, a 32-byte fetch group that generates more than 18 u-ops has to go through the x86 decoder and cannot stream from cache.[1] The call instruction, being a pretty big instruction with that 32-bit immediate displacement, might force the loop onto two 32-byte lines, reducing the u-op count in each line below 18 and allowing the loop to stream from the u-op cache instead of going through the decoders.
In general, if adding instructions makes a loop go faster, it's some sort of alignment (maybe code layout is a better term?) issue. Intel CPUs have limits on things like the number of branch predictions per fetch group, number of decoded instructions per fetch group, etc. Some x86 CPUs can only track 3 branches in a 16-byte group, while it might be possible to encode 4 in unusually dense code. In such cases, the extra branch just won't be tracked by the predictor. Seemingly superfluous code can force some of the branches into a different 16-byte group, improving overall performance. (This can't be the case here, because there's just one branch, but it can pop up in branch-heavy code).
EDIT: Someone in the comments makes a good point that in any case, both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above.
> In general, if adding instructions makes a loop go faster, it's some sort of alignment issue.
Not necessarily. In the early days of the P6 core, I found a simple loop which would take either 4.0 or 5.33 cycles depending on the internal CPU state -- exactly the same code, but running with an iteration count of 10^6 would take either 4000000 cycles or 5333333 cycles.
I had the good fortune to talk to a member of the team which designed the P6 core, and we concluded that this was due to a quirk of the out-of-order scheduler: In order to reduce circuit depths, the mechanism for finding and dispatching runnable uops wouldn't always dispatch the earliest available uops, and so -- depending on how the incoming uops were aligned within an internal buffer -- the P6 core would sometimes take a stream of instructions which could be executed in-order without pipeline stalls and run them out-of-order with less ILP.
In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.
My first thought was uop cache as well. The fact that pre-Sandy Bridge processors have the intuitive performance behavior makes me more certain it's related.
One interesting experiment to try would be to realign the loop to 0..31 mod 32, and see how/if the behavior changes.
> both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above
Both loops should fit, but I think the same limits for micro-op density apply. The Optimization Manual says the Loop Stream Detector requires that "All micro-ops are also resident in the Decoded ICache". I took this to mean if something can't be cached by the Decoded ICache, it can't be streamed by the LSD.
I think it's because the branch target is a memory access, so the dcache causes the execution pipe to stall on the load. In the loop with a call, the branch target is the call instead, and there is enough time to pull from dcache before the load data is needed. I suspect that the i7 can't pull data from dcache immediately after the jne instruction, so you get a hiccup. Try adding a second nop into tightloop as the target for jne.
The trace cache on that chip likely stores a u-op trace that includes the target of the call instruction, so the memory subsystem shouldn't even need to fetch the callee separately: its u-ops come straight back from the trace cache, in line as part of the instruction sequence.
I've run into something similar on a different benchmark, where inserting some 'nops' into the preamble of a function actually sped it up by as much as 14% because it made the function align better with a memory boundary, so the CPU could access it faster. Benchmarks, especially ones that don't control for the cases where 'luck'/alignment/register use/etc. can influence the outcome, are terrible testcases for exploring behaviour.
- tightloop() does nothing useful and is not realistic assembly code, so modern CPU optimisations aren't targeted for it
- loop_with_extra_call() does nothing useful either, but might look like a more realistic instruction sequence - the call could modify 'counter', making it look like the kind of sequence the CPU optimisations are targeted at.
End result: the CPU is better designed for handling cases like loop_with_extra_call(). In the real world this makes realistic code faster, and this benchmark is just a weird quirk.
Disclaimer: I am not qualified enough to say I know what I'm talking about.
Boy. I know more or less informed speculation is the best we can do, but it would sure be fun if detailed information about the CPU were public so such things could actually be tracked down.
I agree, but asking this of Intel is the equivalent of asking Google to bare the details of their search ranking algorithm so that we could understand some odd search result. The clever tricks and optimizations they do in the cores are their crown jewels, just as the guts of clever, highly-tuned algorithms are for many software companies. It just so happens that many people have a high-level knowledge of modern out-of-order CPUs (just as many people have read Google's PageRank paper) and so can speculate about what might be going on inside.
The real question is whether any of these compiler heroics are actually any good or serve any purpose other than killing time and curiosity. No one knows how "optimized" NOP'ped code will fare on post-i7.
A modern Intel CPU is the most complex proprietary trade secret in the world. That's all Intel wants us to know.
I know this may be silly to suggest, but isn't this just a processor pipelining problem? It seems extremely likely that the loop with the extra call just lets the processor carry on better than the tight loop, where it has to forcibly reload and store the variable every iteration (to be compliant with volatile).
My assembly is a little rusty but I would just assume that the processor is blocking on memory and the waits just add up.
I don't think this should be the case on the type of out-of-order superscalar pipeline this core has, as far as I'm aware of it. If an instruction is stalled waiting for memory, it should sit in a reservation station until its data comes in, and that should take the same amount of time in either case. (In addition, with trivial accesses like this, it is highly likely a prefetcher would predict the access and have it fetched from cache in time. Though perhaps prefetch behavior works better in the longer loop?)
My intuition is that this is more likely to be behavior related to the trace cache and rayiner's comment points out several good suggestions on what might be happening here.
On the other hand, given the likely difference in how these instructions are represented in the trace cache, there may also be interactions between the order and grouping of instructions dispatched from the trace cache, which could change how things occupy slots in the reservation stations and, in turn, the amount of instruction-level parallelism in the loop. It doesn't seem likely, but it may be the case.
rayiner | 12 years ago
[1] See: http://www.realworldtech.com/sandy-bridge/4
cperciva | 12 years ago
pbsd | 12 years ago
nkurz | 12 years ago
stingraycharles | 12 years ago
mgraczyk | 12 years ago
mgraczyk | 12 years ago
djcapelis | 12 years ago
xorblurb | 12 years ago
songgao | 12 years ago
rayiner | 12 years ago
thelucky41 | 12 years ago
mjn | 12 years ago
mason55 | 12 years ago
songgao | 12 years ago
abcd_f | 12 years ago
throwaway0094 | 12 years ago
AshleysBrain | 12 years ago
wreegab | 12 years ago
Guvante | 12 years ago
comex | 12 years ago
anon_cownerd | 12 years ago
kelas | 12 years ago
https://code.google.com/p/mao/wiki/NOPIN
xemdetia | 12 years ago
djcapelis | 12 years ago
s_kanev | 12 years ago