top | item 37245325

(no title)

fmstephe | 2 years ago

It's interesting that you say conditional move here.

I am confused by this behaviour, and although I definitely don't know what the answer is here; the non-lol version does have a CSEL (https://developer.arm.com/documentation/dui0802/b/CSEL) which is totally missing from the lol version.

Non-lol https://godbolt.org/z/ds1raTYc9

lol https://godbolt.org/z/c3afrb6bG

discuss

order

dataflow|2 years ago

Ah there you go, that's the conditional move on ARM. Yeah, those are slow.

tylerhou|2 years ago

"Conditional move is slow" is not the right takeaway. (In fact, conditional move can be much /faster/ than a branch in certain circumstances (e.g. hard-to-predict branches, like binary search).)

The reason why this code is slow is because the conditional move is on the critical path for resolving a loop-carried data dependency. In other words, to execute the assignment `maxV = (v if (v > maxV) else maxV)`, the CPU has to test v > maxV. But in order to make this test, the CPU has to wait until the conditional move in the previous loop iteration finishes (to know what the new value of maxV is). So the overall execution time of the loop is worse, because each iteration depends on the previous iteration. (This is analogous to why traversing a linked list is slower than an array.)

However, for the branch version, there is no data dependency on the previous loop, so the CPU really can (speculatively) execute multiple loop iterations in parallel. The the next loop iteration can start before the previous finishes. And with the large reorder buffers on modern CPUs, maybe up to 4-8 loop iterations can execute in parallel!

On the other hand, conditional moves would be much faster for a loop like:

  bigs := make([]int, len(numsA))
  for idx, v : range numsA {
    if numsA[idx] > numsB[idx] {
      bigs[idx] = numsA[idx]
    } else {
      bigs[idx] = numsB[idx]
    }
  }
There is no loop-carried data dependency, so the conditional moves can resolve in parallel. The above code with a conditional move is likely to be much faster than code with an equivalent branch. (And even if branches are perfectly predicted, the conditional move version will only be slightly slower, not 2X slower.)

fmstephe|2 years ago

Yeah, after reading the blog post I felt reasonably confident that this is a compiler bug (in terms of perf).

Hopefully someone with a deeper understanding can verify or discredit this here. I'm curious to see the fix for this (I assume it will generate a fix).