(no title)
dvas | 2 years ago
It would be interesting to test this on an ARM Mac and see whether different dependency chains show significant latency penalties, and how that interacts with the reorder buffer.
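A rough sketch of the two loop shapes such a test might compare, in portable C. The function names and the multiply-add constants are made up for illustration, and the timing harness is left out; this only shows one long dependency chain next to four independent ones doing the same per-step work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary odd constants for the multiply-add step (an assumption,
   any values would do for a latency test). */
#define CHAIN_MUL 6364136223846793005ULL
#define CHAIN_ADD 1442695040888963407ULL

/* One long dependency chain: each step needs the previous result,
   so the loop is bound by the latency of the multiply-add. */
uint64_t serial_chain(uint64_t seed, size_t steps) {
    uint64_t x = seed;
    for (size_t i = 0; i < steps; i++) {
        x = x * CHAIN_MUL + CHAIN_ADD;
    }
    return x;
}

/* Four independent chains interleaved in one loop: an out-of-order core
   can overlap them, so throughput rather than latency should dominate. */
void four_chains(const uint64_t seeds[4], size_t steps, uint64_t out[4]) {
    assert(seeds != NULL && out != NULL);
    uint64_t a = seeds[0], b = seeds[1], c = seeds[2], d = seeds[3];
    for (size_t i = 0; i < steps; i++) {
        a = a * CHAIN_MUL + CHAIN_ADD;
        b = b * CHAIN_MUL + CHAIN_ADD;
        c = c * CHAIN_MUL + CHAIN_ADD;
        d = d * CHAIN_MUL + CHAIN_ADD;
    }
    out[0] = a;
    out[1] = b;
    out[2] = c;
    out[3] = d;
}
```

To turn this into a measurement, time `serial_chain(seed, 4 * n)` against `four_chains(seeds, n, out)`, which perform the same number of multiply-adds, and use the results so the compiler cannot discard the loops. Checking the generated assembly is worthwhile, since an optimizer may reshape either loop.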
raphlinus|2 years ago
The best current info I could find for the latency advice is [2]. Quoting, "Moving data from NEON to ARM registers is Cortex-A8 is expensive." Looking at [3] partially reveals the reason why: the NEON pipeline sits entirely after the integer pipeline, so moves from integer to NEON are cheap, but the reverse direction can cause a large pipeline stall. This is an unusual design decision that, as far as I know, is not shared by any other CPU. Edit: I found [4], which is a more authoritative source.
[1]: https://github.com/google/music-synthesizer-for-android/blob...
[2]: https://community.arm.com/support-forums/f/armds-forum/757/n...
[3]: https://www.design-reuse.com/articles/11580/architecture-and...
[4]: https://developer.arm.com/documentation/den0018/a/Optimizing...
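To make the NEON-to-ARM direction concrete, here is a minimal sketch in C with NEON intrinsics. The function names are hypothetical, and whether the compiler actually emits a lane move to a core register in the first loop is an assumption worth checking in the disassembly. Integer lanes are used on purpose, since extracting them is what moves data from a NEON register to an ARM register. A scalar fallback is included only so the sketch compiles on non-ARM machines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Shape to avoid on Cortex-A8: four NEON -> ARM lane moves per iteration,
   each of which may wait on the NEON pipeline. */
uint32_t sum_moving_each_iteration(const uint32_t *p, size_t n) {
    assert(n % 4 == 0);
    uint32_t total = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint32x4_t v = vld1q_u32(p + i);
        total += vgetq_lane_u32(v, 0);
        total += vgetq_lane_u32(v, 1);
        total += vgetq_lane_u32(v, 2);
        total += vgetq_lane_u32(v, 3);
    }
    return total;
}

/* Preferred shape: accumulate in a NEON register for the whole loop and
   move the lanes to ARM registers only once, at the end. */
uint32_t sum_staying_in_neon(const uint32_t *p, size_t n) {
    assert(n % 4 == 0);
    uint32x4_t acc = vdupq_n_u32(0);
    for (size_t i = 0; i < n; i += 4) {
        acc = vaddq_u32(acc, vld1q_u32(p + i));
    }
    return vgetq_lane_u32(acc, 0) + vgetq_lane_u32(acc, 1)
         + vgetq_lane_u32(acc, 2) + vgetq_lane_u32(acc, 3);
}

#else

/* Portable fallback: plain scalar sums, no NEON involved. */
static uint32_t scalar_sum(const uint32_t *p, size_t n) {
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += p[i];
    }
    return total;
}

uint32_t sum_moving_each_iteration(const uint32_t *p, size_t n) {
    assert(n % 4 == 0);
    return scalar_sum(p, n);
}

uint32_t sum_staying_in_neon(const uint32_t *p, size_t n) {
    assert(n % 4 == 0);
    return scalar_sum(p, n);
}

#endif
```

Both functions return the same sum; the difference is only in how often data crosses from the NEON register file back to the integer side, which is the direction the quoted advice calls expensive.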
dvas|2 years ago
Based on [4] and the other links you posted, the Cortex-A8 behavior makes sense to me now: an instruction that passes data between the register files has to wait for the pipeline to fill and then stalls.
I'll have a look at the ARMv8/ARMv9 architectures and see what they did there regarding SVE/SVE2.