Don't have a compiler handy (or for that matter anything more recent than a Core 2 Duo), but the thing to do would be to replace the call instruction with a same-length sequence of NOPs to see if it's an instruction alignment issue. He mentions making sure the loop header is aligned, but there are other alignment issues in play.
For example, in Intel processors with a u-op cache, a 32-byte fetch group that generates more than 18 u-ops has to go through the x86 decoder and cannot stream from cache.[1] The call instruction, being a pretty big instruction with that 32-bit immediate displacement, might force the loop onto two 32-byte lines, reducing the u-op count in each line below 18 and allowing the loop to stream from the u-op cache instead of going through the decoders.
In general, if adding instructions makes a loop go faster, it's some sort of alignment (maybe code layout is a better term?) issue. Intel CPUs have limits on things like the number of branch predictions per fetch group, number of decoded instructions per fetch group, etc. Some x86 CPUs can only track 3 branches in a 16-byte group, while it might be possible to encode 4 in unusually dense code. In such cases, the extra branch just won't be tracked by the predictor. Seemingly superfluous code can force some of the branches into a different 16-byte group, improving overall performance. (This can't be the case here, because there's just one branch, but it can pop up in branch-heavy code).
EDIT: Someone in the comments makes a good point that in any case, both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above.
> In general, if adding instructions makes a loop go faster, it's some sort of alignment issue.
Not necessarily. In the early days of the P6 core, I found a simple loop which would take either 4.0 or 5.33 cycles depending on the internal CPU state -- exactly the same code, but running with an iteration count of 10^6 would take either 4000000 cycles or 5333333 cycles.
I had the good fortune to talk to a member of the team which designed the P6 core, and we concluded that this was due to a quirk of the out-of-order scheduler: In order to reduce circuit depths, the mechanism for finding and dispatching runnable uops wouldn't always dispatch the earliest available uops, and so -- depending on how the incoming uops were aligned within an internal buffer -- the P6 core would sometimes take a stream of instructions which could be executed in-order without pipeline stalls and run them out-of-order with less ILP.
In short, OOO cores are weird and horribly complicated and completely untrustworthy where performance is concerned.
My first thought was uop cache as well. The fact that pre-Sandy Bridge processors have the intuitive performance behavior makes me more certain it's related.
One interesting experiment to try would be to realign the loop to 0..31 mod 32, and see how/if the behavior changes.
> both loops should fit in the loop buffer, which is after the u-op cache, and so wouldn't suffer from the 18-uop limit mentioned above
Both loops should fit, but I think the same limits for micro-op density apply. The Optimization Manual says the Loop Stream Detector requires that "All micro-ops are also resident in the Decoded ICache". I took this to mean if something can't be cached by the Decoded ICache, it can't be streamed by the LSD.
I think it's because the branch target is a memory access, so the dcache causes the execution pipe to stall on the load. In the loop with a call, the branch target is the call instead, and there is enough time to pull from dcache before the load data is needed. I suspect that the i7 can't pull data from dcache immediately after the jne instruction, so you get a hiccup. Try adding a second nop into tightloop as the target for jne.
The trace cache on that chip likely stores a u-op trace that includes the target of the call instruction, so the memory subsystem shouldn't even need to fetch the callee separately: its u-ops come straight back from the trace cache, in line as part of the instruction sequence.
I've run into something similar on a different benchmark, where inserting some 'nops' into the preamble of a function actually sped it up by as much as 14% because it made the function align better with a memory boundary, so the CPU could access it faster. Benchmarks, especially ones that don't control for the cases where 'luck'/alignment/register use/etc. can influence the outcome, are terrible testcases for exploring behaviour.
- tightloop() does nothing useful and is not realistic assembly code, so modern CPU optimisations aren't targeted for it
- loop_with_extra_call() does nothing useful either, but might look like a more realistic instruction sequence - the call could modify 'counter', making it look like the kind of sequence the CPU optimisations are targeted at.
End result: the CPU is better designed for handling cases like loop_with_extra_call(). In the real world this makes realistic code faster, and this benchmark is just a weird quirk.
Disclaimer: I am not qualified enough to say I know what I'm talking about.
Boy. I know more or less informed speculation is the best we can do, but it would sure be fun if detailed information about the CPU were public so such things could actually be tracked down.
I agree, but asking this of Intel is the equivalent of asking Google to bare the details of their search ranking algorithm so that we could understand some odd search result. The clever tricks and optimizations they do in the cores are their crown jewels, just as the guts of clever, highly-tuned algorithms are for many software companies. It just so happens that many people have a high-level knowledge of modern out-of-order CPUs (just as many people have read Google's PageRank paper) and so can speculate about what might be going on inside.
The real question is whether any of these compiler heroics are actually any good or serve any purpose other than killing time and curiosity. No one knows how "optimized" NOP'ped code will fare on post-i7.
A modern Intel CPU is the most complex proprietary trade secret in the world. That's all Intel wants us to know.
I know this may be silly to suggest, but isn't this just a processor pipelining problem? It seems extremely likely that the loop with the extra call just lets the processor carry on better than the tight loop, where it has to forcibly reload and store the variable every iteration (to be compliant with volatile).
My assembly is a little rusty but I would just assume that the processor is blocking on memory and the waits just add up.
I don't think this should be the case on the type of out-of-order superscalar pipeline this core has, as far as I'm aware of it. If an instruction is stalled waiting for memory, it should sit in a reservation station until its data comes in, and that should take the same amount of time in either case. (In addition, with trivial accesses like this, it is highly likely a prefetcher would predict the access and have it fetched from cache in time. Though perhaps prefetch behavior works better in the longer loop?)
My intuition is that this is more likely to be behavior related to the trace cache and rayiner's comment points out several good suggestions on what might be happening here.
On the other hand, given the likely difference in how these instructions are represented in the trace cache, there may also be interactions between the order and grouping of instructions dispatched from the trace cache, which could change how things occupy slots in the reservation stations and, in turn, the amount of instruction-level parallelism in the loop. It doesn't seem likely, but it may be the case.
rayiner | 12 years ago
[1] See: http://www.realworldtech.com/sandy-bridge/4
cperciva | 12 years ago
pbsd | 12 years ago
nkurz | 12 years ago
stingraycharles | 12 years ago
mgraczyk | 12 years ago
mgraczyk | 12 years ago
djcapelis | 12 years ago
xorblurb | 12 years ago
songgao | 12 years ago
rayiner | 12 years ago
thelucky41 | 12 years ago
mjn | 12 years ago
mason55 | 12 years ago
songgao | 12 years ago
abcd_f | 12 years ago
throwaway0094 | 12 years ago
AshleysBrain | 12 years ago
wreegab | 12 years ago
Guvante | 12 years ago
comex | 12 years ago
anon_cownerd | 12 years ago
kelas | 12 years ago
https://code.google.com/p/mao/wiki/NOPIN
xemdetia | 12 years ago
djcapelis | 12 years ago
s_kanev | 12 years ago