dvas | 2 years ago

That got me curious about the ARM latency. I wonder whether it came from particular instructions that added latency, or from transfers between the registers and the memory subsystem internals. Also, on the off chance that you remember: did you use inline intrinsics, or let the compiler auto-optimize?

It would be interesting to test this on an ARM Mac and see whether different dependency chains show significant latency penalties, and how they interact with the reorder buffer.

raphlinus | 2 years ago

This is for the Cortex-A8, which was the chip in the Nexus One. I wrote the original version of the sound synthesis directly in ARM assembler[1]. It was very highly optimized: I remember using a cycle-counting app that flagged any dependency chain that would cause the processor to stall, and ultimately utilization was in the 90%+ range. Back in those days, processors were simple enough that you could do this kind of optimization by hand. By the time of the Cortex-A15 (Nexus 10 etc.), instruction issue was out-of-order and much harder to reason about.
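To illustrate the kind of dependency chain a cycle counter would flag, here is a minimal sketch in portable C, not the original assembler. The function names `sum_chained` and `sum_interleaved` are made up for this example. In the first, every add needs the result of the previous one; in the second, four independent accumulators give an in-order core other work to issue while an earlier result is still in flight. Whether a given operation actually stalls depends on its latency on the specific core, and a compiler is free to restructure either loop, so treat this as a picture of the idea rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add,
 * so the whole loop is a single serial dependency chain. */
uint32_t sum_chained(const uint32_t *x, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* Four accumulators: the four adds in each iteration are independent
 * of each other, so none of them has to wait for its neighbour. */
uint32_t sum_interleaved(const uint32_t *x, size_t n) {
    uint32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    /* Handle the 0..3 leftover elements. */
    for (; i < n; i++) {
        a0 += x[i];
    }
    return a0 + a1 + a2 + a3;
}
```

Both functions return the same value for the same input; only the shape of the dependency graph differs. Unsigned arithmetic is used so the reordering cannot change the result.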

The best current info I could find for the latency advice is [2]. Quoting, "Moving data from NEON to ARM registers is Cortex-A8 is expensive." Looking at [3] partially reveals the reason why: the NEON pipeline sits entirely after the integer pipeline, so moves from integer to NEON are cheap, but the reverse direction is potentially a large pipeline stall. This is an unusual design decision that, as far as I know, is not true of any other CPU. Edit: I found [4], which is a more authoritative source.

[1]: https://github.com/google/music-synthesizer-for-android/blob...

[2]: https://community.arm.com/support-forums/f/armds-forum/757/n...

[3]: https://www.design-reuse.com/articles/11580/architecture-and...

[4]: https://developer.arm.com/documentation/den0018/a/Optimizing...
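A hedged sketch of what the "keep it in NEON" advice can look like with intrinsics (the original code was hand-written assembler, so this is not the author's code). The function name `sum_f32` is invented for the example. The loop accumulates entirely in a NEON register, and the single `vget_lane_f32` at the end is the only place a value has to come back across to the ARM side. Non-NEON builds compile only the scalar loop.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Sum n floats. On NEON targets the running total stays in a NEON
 * register for the whole loop and is read out once at the end. */
float sum_f32(const float *x, size_t n) {
    size_t i = 0;
    float total = 0.0f;
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        /* Load and add four lanes; nothing leaves the NEON side here. */
        acc = vaddq_f32(acc, vld1q_f32(x + i));
    }
    /* Reduce four lanes to one, still inside NEON registers. */
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    half = vpadd_f32(half, half);
    /* The one and only read-out of a lane after the loop. */
    total = vget_lane_f32(half, 0);
#endif
    /* Scalar tail (or the whole array when NEON is unavailable). */
    for (; i < n; i++) {
        total += x[i];
    }
    return total;
}
```

The pattern to avoid on a core like the one described above would be reading a lane out inside the loop body, for example to make a branch decision on every iteration. How the compiler maps the final read-out to registers depends on the target and float ABI, so this only shows the structure.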

dvas | 2 years ago

Awesome reply, and thank you for the well put together answer linking to resources and for sharing your experience.

For Cortex-A8, from [4] and the others you have linked, it makes sense to me now how an instruction passing data between registers fills out the pipeline and then stalls.

Will have a peek at the ARMv8/ARMv9 architectures and see what they did there regarding SVE/SVE2.