The call shifts the loop from being backend-limited (mostly uops waiting on memory before they can retire) to being frontend-limited. The key event to look at here is `IDQ_UOPS_NOT_DELIVERED.CORE`, which counts issue slots in which the frontend failed to deliver the full 4 uops per cycle it is capable of. In the tight loop the frontend almost always delivers at full width; in the call loop it rarely does.
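To check this yourself, the frontend-delivery event can be read directly with perf (`./loop` below is a placeholder binary name, and the event spellings are the Skylake ones reported by `perf list` — they can vary slightly across kernel versions):

```shell
# Compare slots where the frontend delivered nothing against total issue
# bandwidth; a large idq_uops_not_delivered.core count means frontend-bound.
perf stat -e cycles,uops_issued.any,idq_uops_not_delivered.core ./loop
```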
CALL, RET, and JNE all share the same execution port (6 in Skylake), so it seems plausible that the added pressure in this port prevents speculative execution from continuing with the loop at the same rate as the tight loop. If you look at the execution port breakdowns of each loop, port 6 dominates in the call loop, whereas the tight loop is bottlenecked at port 4 (the port where stores go).
Because the frontend delivers fewer uops per cycle, pressure on the backend is eased. But this is a delicate balance: add another call and the loop becomes much slower than the tight loop.
You can get a similar effect by replacing `__asm__("call foo")` with another instruction that consumes the same amount of port 6.
I think you are on to something, but I don't think it's the final answer. Your explanation sounds plausible, but it doesn't really explain how greater pressure on port 6 speeds things up.
I'm working with a slightly modified loop that counts down rather than up, so it needs slightly fewer instructions. I've also switched to an "add to memory" to reduce the instruction count further. None of these changes seems to directly affect the speed of the loop, but here's the fastest loop I have:
    .p2align 3
    1:
        addq    %rax, counter(%rip)
        dec     %rax
        jmp     2f
    2:
        jne     1b
        rep ret
On Skylake, it executes in 0.497 s. If I remove the "jump to the next instruction", it slows to 0.646 s. This agrees with your explanation. But here's the part I don't understand, and that may undercut your theory: if I change the first line to be ".p2align 4" (that is, if I increase the minimum alignment from 8 to 16) the speed is the same whether or not I have the extra jump. Even more confusingly, it's always the slower speed!
Here's what port usage looks like for the two different cases (both with the same instructions, just different alignments):
From the first line we see that both Fast and Slow have the same number of instructions executed (which makes sense, since both are executing the same instructions), but down below we see very different numbers of µops dispatched! I'm not sure what to make of this, but I think this shows that it is not (solely?) a front end issue.
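For anyone reproducing this: the per-port dispatch counts come from events like the following (`./loop` is a placeholder binary name; these are the Skylake event spellings, which vary by kernel version):

```shell
# Per-port uop dispatch counts; port 4 takes stores, port 6 takes
# branches (and simple ALU ops).
perf stat -e uops_dispatched_port.port_0,uops_dispatched_port.port_1 \
          -e uops_dispatched_port.port_4,uops_dispatched_port.port_5 \
          -e uops_dispatched_port.port_6,uops_dispatched_port.port_7 ./loop
```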
First, the former appears to have at least one unaligned arithmetic:

> 400538: mov 0x200b01(%rip),%rdx # 601040 <counter>

...while the latter's equivalent instruction is 4-byte aligned:

> 40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>

So, I would argue that's the biggest source of _speedup_ in the second case. However, I'm really interested in whether that's true, since I don't see a memory fence; the memory should be in L0 cache for both cases, and I have trouble believing that an unaligned access can be so much slower with the data in cache.

As for the `callq` to `repz retq`, I would venture a guess that the CPU is able to identify that there are no data dependencies there and the data is never even stored; I'd argue that it probably never even gets executed, because the instruction should fit in the instruction cache and branch prediction cache and all. Arguably. Like I said, I'm not an expert.

I'd say run it through Intel's code analyzer tool.

https://software.intel.com/en-us/articles/intel-architecture...

Tangential video worth watching:

https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be...

Edit: actually, thinking about it, it's not unaligned access, it's unaligned math. I don't think that should affect performance at all? Fun.
I'm sorry, but like the other comment at the bottom, your guesses are so far from reality that they are hard to respond to. IACA is great for what it does, but it's a static analyzer and knows nothing about alignment. L0 doesn't even exist on modern Intel processors. Memory fences would change things, but aren't part of the problem as stated. And your guess that "it probably never even gets executed because the instruction should fit in instruction cache and branch prediction cache and all" just doesn't have any bearing on the way processors work.
Your disclaimer does indicate that you have the self-awareness that you are not an expert, but the fact that you are trying to make an argument would normally indicate that you think you understand what's happening to some extent. Rather than just guessing, I think you'd benefit from trying some things out and seeing what the results are. Play with perf, it's fun!
Educated guess: The processor is trying to prefetch instructions. This loop is much tighter than most code that would typically be written, so an incrementing loop causes a branch misprediction. The processor is still loading instructions, so when it goes to find out what to do next, it takes a cache miss and burns some time trying to figure out its next instruction. However, a function call is very slow (even to a nop function), and it could delay the processor just long enough for the prefetch to complete.
You are welcome to defend your guess with measurements, but your explanation comes across as "not even wrong". Incrementing the loop does not cause a branch prediction error (why would it?), the loop is small enough that it comes out of the loop cache (so no cache misses), and function calls are not slow (as the example itself shows). Worse, as far as I can tell, your explanation doesn't even explain why adding a call would make the loop run faster.
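The loop-cache claim is itself checkable with perf: Skylake exposes counters for how many uops came from the loop stream detector (LSD) versus the uop cache (DSB) and the legacy decoders (MITE). Again, `./loop` is a placeholder binary name:

```shell
# If lsd.uops dominates, the loop body is being replayed from the LSD and
# instruction fetch (and i-cache misses) are out of the picture entirely.
perf stat -e lsd.uops,idq.dsb_uops,idq.mite_uops ./loop
```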
2. I'm... guessing that gcc/clang are too dumb to be taught this and get the balance right.
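For what it's worth, gcc does expose knobs for this kind of alignment, even if it won't pick the magic value for you. The flags below are real gcc options; `bench.c` is a placeholder file name:

```shell
# Force 32-byte alignment of loop tops and function entry points.
gcc -O2 -falign-loops=32 -falign-functions=32 bench.c -o bench
```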
acqq | 8 years ago
https://news.ycombinator.com/item?id=6842338
It seems that the solution is not the top-voted comment but the one from mgraczyk.
[+] [-] inetknght|8 years ago|reply
First, the former appears to have at least one unaligned arithmetic:
> 400538: mov 0x200b01(%rip),%rdx # 601040 <counter>
...while the latter's equivalent instruction is 4-byte aligned:
> 40057d: mov 0x200abc(%rip),%rdx # 601040 <counter>
So, I would argue that's the biggest source of _speedup_ in the second case. However, I'm really interested in whether that's true since I don't see a memory fence; so the memory should be in L0 cache for both cases; I have trouble believing that an unaligned access can be so much slower with the data in cache.
As for the `callq` to `repz retq`, I would venture a guess that the CPU's able to identify that there are no data dependencies there and the data's never even stored; I'd argue that it probably never even gets executed because the instruction should fit in instruction cache and branch prediction cache and all. Arguably. Like I said, I'm not an expert.
I'd say run it through Intel's code analyzer tool.
https://software.intel.com/en-us/articles/intel-architecture...
Tangential video worth watching:
https://www.youtube.com/watch?v=2EWejmkKlxs&feature=youtu.be...
Edit: actually, thinking about it, it's not unaligned access, it's unaligned math. I don't think that should affect performance at all? Fun.