top | item 42937400

clamchowder | 1 year ago

"don't run any faster than a sequence of simpler instructions"

This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's Athlon through 10h family has 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.

For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need at least a shift and a dependent add. That's two extra cycles of latency.
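To make that concrete, here's what a scaled-indexed load of an 8-byte element looks like as a single instruction on x86-64/AArch64 versus a three-instruction dependency chain on base RV64I (register choices are just illustrative):

```asm
# x86-64: base + index*8 in one instruction
mov rax, [rdi + rsi*8]

# AArch64: same addressing, one instruction
ldr x0, [x0, x1, lsl #3]

# RV64I without extensions: shift, dependent add, then load
slli t0, a1, 3      # t0 = index * 8
add  t0, a0, t0     # t0 = base + index*8
ld   a0, 0(t0)
```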


brucehoult | 1 year ago

Ok, there exist cores that don't have a penalty for scaled indexed addressing (though many do). Or is it that they don't have any benefit from non-indexed addressing? Do they simply take a clock speed hit?

But that is all missing the point of "true but irrelevant".

You can't just compare the speed of an isolated scaled indexed load/store. No one runs software that consists only, or even mostly, of isolated scaled indexed loads/stores.

You need to show that there is a measurable and significant effect on overall execution speed of the whole program to justify the extra hardware of jamming all of that into one instruction.

A good start would be to modify the compiler for your x86 or Arm to not use those instructions and see if you can detect the difference on SPEC or your favourite real-world workload -- the same experiment that Cocke conducted on IBM 370 and Patterson conducted on VAX.

But even that won't catch the possibility that a RISC-V CPU might need slightly more clock cycles but the processor is enough simpler that it can clock slightly higher. Or enough smaller that you can use less energy or put more cores in the same area of silicon.

And as I said, in the cases where the speed actually matters it's probably in a loop and strength-reduced anyway.

It's so lazy and easy to say that for every single operation faster is better, but many operations are not common enough to matter.

dzaima | 1 year ago

So your argument isn't that it's irrelevant, but rather that it might be irrelevant, if you happen to have a core where the extra latency of a 64-bit adder on the load/store AGU pushes it just over to the next cycle.

Though I'd imagine that just having the extra cycle conditionally for indexed load/store instrs would still be better than having a whole extra instruction take up decode/ROB/ALU resources (and the respective power cost), or the mess that comes with instruction fusion.

And with RISC-V already requiring a 12-bit adder for loads/stores, and thus an increment/decrement for the top 52 bits, the extra latency of going to a full 64-bit adder is presumably quite a bit less than that of a full separate 64-bit adder. (and if the mandatory 64+12-bit adder already pushed the latency up by a cycle, a separate shNadd will result in two cycles of latency over the hypothetical adderless case, despite 1 cycle clearly being feasible!)

Even if the RISC-V way might be fine for tight loops, most code isn't in tight loops. And ideally most tight loops doing consecutive loads would vectorize anyway.

We're in a world where the latest Intel cores can do small immediate adds at rename, usually materializing them in consuming instructions, which I'd imagine is quite a bit of overhead for not that much benefit.

dzaima | 1 year ago

Note that Zba's sh1add/sh2add/sh3add take care of the problem of separate shift+add.
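Concretely, sh3add collapses the shift+add pair into one instruction, so an 8-byte indexed load goes from three instructions to two (registers here are illustrative):

```asm
# RV64I: two dependent ALU ops before the load
slli t0, a1, 3        # t0 = index * 8
add  t0, a0, t0       # t0 = base + index*8
ld   a0, 0(t0)

# With Zba: one shift-and-add, then the load
sh3add t0, a1, a0     # t0 = (a1 << 3) + a0
ld     a0, 0(t0)
```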

But yeah, modern x86-64 doesn't have any difference between indexed and plain loads[0], nor does Apple's M1[1] (nor even Cortex-A53, via some local running of dougallj's tests; though there's an extra cycle of latency if the scale doesn't match the load width, but that doesn't apply to typical usage).

Of course one has to wonder whether that's ended up costing something for the plain loads; it kinda saddens me seeing unrolled loops on x86 result in a spam of [r1+r2*8+const] addresses, with the CPU having to evaluate that arithmetic for each one, when typically the index could be moved out of the loop (though at the cost of needing to bump multiple pointers if there are multiple). But x86 does handle it, so I suppose there's not much downside. Of course, none of this applies to loads outside of tight loops.

I'd imagine at some point (if not already past 8-wide) the idea of "just go wider and spam instruction fusion patterns" will have to yield to adding more complex instructions to keep silicon costs sane.

[0]: https://uops.info/table.html?search=%22mov%20(r64%2C%20m64)%...

[1]: https://dougallj.github.io/applecpu/measurements/firestorm/L... vs https://dougallj.github.io/applecpu/measurements/firestorm/L...