The call shifts the loop from being backend-limited (mostly uops waiting on memory before they can retire) to being frontend-limited. The key event to look at here is `IDQ_UOPS_NOT_DELIVERED.CORE`, which counts issue slots in which the frontend failed to deliver the full 4 uops per cycle it is capable of. In the tight loop the frontend almost always delivers at full width; in the call loop it rarely does.
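To check this yourself, the frontend-delivery event can be read directly with perf (`./loop` below is a placeholder binary name, and the event spellings are the Skylake ones reported by `perf list` — they can vary slightly across kernel versions):

```shell
# Compare slots where the frontend delivered nothing against total issue
# bandwidth; a large idq_uops_not_delivered.core count means frontend-bound.
perf stat -e cycles,uops_issued.any,idq_uops_not_delivered.core ./loop
```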
CALL, RET, and JNE all share the same execution port (6 in Skylake), so it seems plausible that the added pressure in this port prevents speculative execution from continuing with the loop at the same rate as the tight loop. If you look at the execution port breakdowns of each loop, port 6 dominates in the call loop, whereas the tight loop is bottlenecked at port 4 (the port where stores go).
Because the frontend delivers fewer uops per cycle, pressure on the backend is eased. But this is a delicate balance: add another call and the loop becomes much slower than the tight loop.
You can get a similar effect by replacing `__asm__("call foo")` with another instruction that consumes the same amount of port 6.
I think you are on to something, but I don't think it's the final answer. Your explanation sounds plausible, but it doesn't really explain how greater pressure on port 6 speeds things up.
I'm working with a slightly modified loop that counts down rather than up, so it needs slightly fewer instructions. I've also switched to an "add to memory" to reduce the instruction count further. None of these changes seems to directly affect the speed of the loop, but here's the fastest loop I have:
    .p2align 3
    1:
        addq    %rax, counter(%rip)
        dec     %rax
        jmp     2f
    2:
        jne     1b
        rep ret
On Skylake, it executes in 0.497 s. If I remove the "jump to the next instruction", it slows to 0.646 s. This agrees with your explanation. But here's the part I don't understand, and that may undercut your theory: if I change the first line to be ".p2align 4" (that is, if I increase the minimum alignment from 8 to 16) the speed is the same whether or not I have the extra jump. Even more confusingly, it's always the slower speed!
Here's what port usage looks like for the two different cases (both with the same instructions, just different alignments):
From the first line we see that both Fast and Slow have the same number of instructions executed (which makes sense, since both are executing the same instructions), but down below we see very different numbers of µops dispatched! I'm not sure what to make of this, but I think this shows that it is not (solely?) a front end issue.
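For anyone reproducing this: the per-port dispatch counts come from events like the following (`./loop` is a placeholder binary name; these are the Skylake event spellings, which vary by kernel version):

```shell
# Per-port uop dispatch counts; port 4 takes stores, port 6 takes
# branches (and simple ALU ops).
perf stat -e uops_dispatched_port.port_0,uops_dispatched_port.port_1 \
          -e uops_dispatched_port.port_4,uops_dispatched_port.port_5 \
          -e uops_dispatched_port.port_6,uops_dispatched_port.port_7 ./loop
```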
First, the former appears to have at least one unaligned arithmetic:

> 400538: mov 0x200b01(%rip),%rdx # 601040 <counter>

...while the latter's equivalent instruction is 4-byte aligned:

> 40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>

So, I would argue that's the biggest source of _speedup_ in the second case. However, I'm really interested in whether that's true, since I don't see a memory fence; the memory should be in L0 cache for both cases, and I have trouble believing that an unaligned access can be so much slower with the data in cache.

As for the `callq` to `repz retq`, I would venture a guess that the CPU is able to identify that there are no data dependencies there and the data is never even stored; I'd argue that it probably never even gets executed, because the instruction should fit in the instruction cache and branch prediction cache and all. Arguably. Like I said, I'm not an expert.

I'd say run it through Intel's code analyzer tool.

https://software.intel.com/en-us/articles/intel-architecture...

Tangential video worth watching:

https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be...

Edit: actually, thinking about it, it's not unaligned access, it's unaligned math. I don't think that should affect performance at all? Fun.
I'm sorry, but like the other comment at the bottom, your guesses are so far from reality that they are hard to respond to. IACA is great for what it does, but it's a static analyzer and knows nothing about alignment. L0 doesn't even exist on modern Intel processors. Memory fences would change things, but aren't part of the problem as stated. And your guess that "it probably never even gets executed because the instruction should fit in instruction cache and branch prediction cache and all" just doesn't have any bearing on the way processors work.
Your disclaimer does indicate that you have the self-awareness that you are not an expert, but the fact that you are trying to make an argument would normally indicate that you think you understand what's happening to some extent. Rather than just guessing, I think you'd benefit from trying some things out and seeing what the results are. Play with perf, it's fun!
Educated guess: The processor is trying to prefetch instructions. This loop is much tighter than most code that would typically be written, so an incrementing loop causes a branch misprediction. The processor is still loading instructions, so when it goes to find out what to do next, it takes a cache miss and burns some time trying to figure out its next instruction. However, a function call is very slow (even to a nop function), and it could delay the processor just long enough for the prefetch to complete.
You are welcome to defend your guess with measurements, but your explanation comes across as "not even wrong". Incrementing the loop does not cause a branch prediction error (why would it?), the loop is small enough that it comes out of the loop cache (so no cache misses), and function calls are not slow (as the example itself shows). Worse, as far as I can tell, your explanation doesn't even explain why adding a call would make the loop run faster.
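The loop-cache claim is itself checkable with perf: Skylake exposes counters for how many uops came from the loop stream detector (LSD) versus the uop cache (DSB) and the legacy decoders (MITE). Again, `./loop` is a placeholder binary name:

```shell
# If lsd.uops dominates, the loop body is being replayed from the LSD and
# instruction fetch (and i-cache misses) are out of the picture entirely.
perf stat -e lsd.uops,idq.dsb_uops,idq.mite_uops ./loop
```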
2. I'm... guessing that gcc/clang are too dumb to be taught this and get the balance right.
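For what it's worth, gcc does expose knobs for this kind of alignment, even if it won't pick the magic value for you. The flags below are real gcc options; `bench.c` is a placeholder file name:

```shell
# Force 32-byte alignment of loop tops and function entry points.
gcc -O2 -falign-loops=32 -falign-functions=32 bench.c -o bench
```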
acqq | 8 years ago
https://news.ycombinator.com/item?id=6842338
It seems that the solution is not the top-voted comment but the one from mgraczyk.
[+] [-] inetknght|8 years ago|reply
First, the former appears to have at least one unaligned arithmetic:
> 400538: mov 0x200b01(%rip),%rdx # 601040 <counter>
...while the latter's equivalent instruction is 4-byte aligned:
> 40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>
So, I would argue that's the biggest source of _speedup_ in the second case. However, I'm really interested in whether that's true since I don't see a memory fence; so the memory should be in L0 cache for both cases; I have trouble believing that an unaligned access can be so much slower with the data in cache.
As for the `callq` to `repz retq`, I would venture a guess that the CPU's able to identify that there are no data dependencies there and the data's never even stored; I'd argue that it probably never even gets executed because the instruction should fit in instruction cache and branch prediction cache and all. Arguably. Like I said, I'm not an expert.
I'd say run it through Intel's code analyzer tool.
https://software.intel.com/en-us/articles/intel-architecture...
Tangential video worth watching:
https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be...
Edit: actually, thinking about it, it's not unaligned access, it's unaligned math. I don't think that should affect performance at all? Fun.