kbolino | 11 days ago
The big question then is, why are ARM desktop (and server?) cores so far behind on wider SIMD support? It's not like Intel/AMD came up with these extensions for x86 yesterday; AVX2 is over 15 years old.
Aurornis|11 days ago
Very wide SIMD instructions require a lot of die space and a lot of power.
The AVX-512 implementation in Intel's Knights Landing took up 40% of the die area. (Source: https://chipsandcheese.com/p/knights-landing-atom-with-avx-5... which is an excellent site for architectural analysis.)
Most ARM desktop/mobile parts are designed to be low power and low cost. Spending valuable die space on large logic blocks for instructions that are rarely used isn't a good tradeoff for consumer apps.
Most ARM server parts are designed to have very high core counts, which requires small individual die sizes. Adding very wide SIMD support would grow die space of individual cores a lot and reduce the number that could go into a single package.
Supporting 256-bit or 512-bit instructions would be hard to do without interfering with the other design goals for those parts.
Even Intel has started dropping support for the wider AVX instructions in their smaller efficiency cores as a tradeoff to fit more of them into the same chip. For many workloads this is actually a good tradeoff. As this article mentions, many common use cases of high throughput SIMD code are just moving to GPUs anyway.
wtallis|11 days ago
That chip family was pretty much designed to provide just enough CPU power to keep the vector engines fed. So that 40% is an upper bound, what you get when you try to build a GPU out of somewhat-specialized CPU cores (which was literally the goal of the first generation of that lineage).
For a general purpose chip, there's no reason to spend that large a fraction of the area on the vector units. Something like the typical ARM server chips with lots of weak cores definitely doesn't need each core to have a vector unit capable of doing 512-bit operations in a single cycle, and probably would be better off sharing vector units between multiple cores. For chips with large, high-performance CPU cores (eg. x86), a 512-bit vector unit will still noticeably increase the size of a CPU core, but won't actually dwarf the rest of the core the way it did for Xeon Phi.
aseipp|11 days ago
kbolino|11 days ago
It does seem like server hardware is adopting SVE at least, even if it's not always paired with wider registers. There are lots of non-math-focused instructions in there that benefit many kinds of software that isn't transferable to a GPU.
formerly_proven|11 days ago
unknown|11 days ago
[deleted]
happyPersonR|11 days ago
Buy new chips next year! Haha :)
jsheard|11 days ago
camel-cdr|11 days ago
You can treat both SVE and RVV as a regular fixed-width SIMD ISA.
"Runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV or SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128 bits. They also have good predication support (like AVX-512), which allows them to mask off elements after a certain point.
If you want to emulate AVX2 with SVE or RVV, you might require that the hardware has a native vector length >=256, and then you always mask off the bits beyond 256, so the same code works on any native vector length >=256.
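To make the masking idea concrete, here is a minimal scalar sketch in Python. It is only a model of predication, not real SVE or RVV intrinsics; the names `make_predicate`, `predicated_add`, and `add_256` are made up for illustration, and 32-bit lanes are an assumption.

```python
# Scalar model of predicated, length-agnostic SIMD (NOT real SVE/RVV intrinsics).
LANE_BITS = 32        # assumed lane size for this sketch
EMULATED_BITS = 256   # the fixed width we want to emulate (AVX2-sized)


def make_predicate(native_bits):
    """Return one bool per hardware lane: True inside the 256-bit window."""
    if native_bits < EMULATED_BITS:
        raise ValueError("native vector length must be >= 256 bits")
    native_lanes = native_bits // LANE_BITS
    active_lanes = EMULATED_BITS // LANE_BITS
    return [i < active_lanes for i in range(native_lanes)]


def predicated_add(a, b, pred):
    """Lane-wise 32-bit add; inactive lanes are zeroed."""
    return [(x + y) & 0xFFFFFFFF if p else 0 for x, y, p in zip(a, b, pred)]


def add_256(a8, b8, native_bits):
    """Add two 8-lane vectors on 'hardware' with the given native width."""
    lanes = native_bits // LANE_BITS
    pred = make_predicate(native_bits)
    pad = [0] * (lanes - len(a8))          # lanes beyond 256 bits are unused
    full = predicated_add(a8 + pad, b8 + pad, pred)
    return full[:len(a8)]
```

Calling `add_256` with `native_bits` of 256, 512, or 1024 gives the same 8-lane result, which is the point: the predicate hides the extra width.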
Tuldok|11 days ago
kbolino|11 days ago
otherjason|11 days ago
0x000xca0xfe|11 days ago
hajile|11 days ago
If your code can go wide and has few branches (uses SIMD basically every cycle), either a GPU or matrix co-processor will handily beat the performance of several CPU cores all running together.
If your code can go wide but is branchy (uses bursts of SIMD between branches), wider becomes even less worth it. If it takes 4 cycles to put through a 256-bit SIMD instruction and there are some branches before the next one, splitting the work into two 128-bit instructions will either execute them in parallel in the same 4 cycles or, in the worst case, pipeline them to 5 cycles (that's just a single-instruction bubble in the FPU pipeline).
You can increase this differential by going to a 512-bit pipeline, but if the 512-bit work is only occasional, you can still match it with 4 SIMD units (the latest couple of ARM cores have 6 SIMD units). Pipelining out from 4 to 7 cycles means you need bubbles of at least 3 cycles to break even, which still doesn't seem too unusual.
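The cycle counts above can be reproduced with a toy model. This is a rough sketch, assuming fully pipelined units that each accept one instruction per cycle and that a narrow op has the same latency as a wide one; the function name `cycles` is made up for illustration.

```python
def cycles(num_ops, latency, units):
    """Cycles to finish num_ops independent ops on `units` pipelined units.

    Assumes each unit can start one op per cycle and every op takes
    `latency` cycles from issue to result.
    """
    if num_ops <= 0:
        return 0
    issue_rounds = -(-num_ops // units)   # ceiling division
    return latency + issue_rounds - 1


# One 256-bit op vs. two 128-bit ops (latency 4, as in the comment above).
one_wide = cycles(1, 4, 1)        # single wide instruction
two_parallel = cycles(2, 4, 2)    # two narrow ops on two units
two_pipelined = cycles(2, 4, 1)   # two narrow ops squeezed through one unit

# One 512-bit op's worth of work as four 128-bit ops.
four_parallel = cycles(4, 4, 4)
four_pipelined = cycles(4, 4, 1)
```

Under these assumptions the worst case for four narrow ops is 3 cycles behind the single wide op, which matches the "3-cycle bubbles to break even" figure.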
The one area where this seems to be potentially untrue is simulations working with loads of f64 numbers, which can consistently achieve high density with code just branchy enough to make GPUs inefficient. Most of these workloads are running on supercomputers though, and the ARM competitor here is the Fujitsu A64FX, which does have 512-bit SVE.
It's also worth noting that even modern x86 chips (by both AMD and Intel) seem to throttle under heavy 512-bit multi-core workloads. Reducing the clock speed in turn reduces the integer performance, which may make applications slower in some cases.
All of this is why ARM/Qualcomm/Apple's chips with 128-bit SIMD and a couple AMX/SME units are very competitive in most workloads even though they seem significantly worse on paper.
dlcarrier|11 days ago
Emulators also use them a lot, often in unintended ways, because they are very flexible. This is partially because the emulator itself can use the flexibility to optimize emulation, but also because hand-optimizing with SIMD instructions can significantly improve the performance of any application, which is necessary for the low-performance processors common in videogame consoles.
brigade|11 days ago
Most high-end ARM cores were 4x128b FMA, and Cortex-X925 goes to 6x128b FMA. Contrast that with Intel, which was 2x256b FMA for the longest time, then 2x512b FMA, with another 1-2 pipelines that can't do FMA.
But ultimately, 4x128b ≈ 2x256b, and 2x256b < 6x128b < 2x512b in throughput. Permute is a different factor though, if your algorithm cares about it.
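As a quick check of that ordering, peak FMA width per cycle is just pipes times vector width. A small sketch, counting only the FMA-capable pipelines listed above and ignoring clock speed, permutes, and everything else (the helper name is made up):

```python
def fma_bits_per_cycle(pipes, width_bits):
    """Peak vector bits processed per cycle by the FMA pipelines."""
    return pipes * width_bits


configs = {
    "4x128b": fma_bits_per_cycle(4, 128),
    "6x128b": fma_bits_per_cycle(6, 128),
    "2x256b": fma_bits_per_cycle(2, 256),
    "2x512b": fma_bits_per_cycle(2, 512),
}

for name, bits in sorted(configs.items(), key=lambda kv: kv[1]):
    print(f"{name}: {bits} bits/cycle")
```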
phonon|11 days ago
Cold_Miserable|11 days ago
leeter|11 days ago
kbolino|11 days ago
In Apple's case, they have both the GPU and the NPU to fall back on, and a more closed/controlled ecosystem that breaks backwards compatibility every few years anyway. But Qualcomm is not so lucky; Windows is far more open and far more backwards compatible. I think the bet is that there are enough users who don't need/care about that, but I would question why they would even want Windows in the first place, when macOS, ChromeOS, or even GNU/Linux are available.
jovial_cavalier|11 days ago
Also, it doesn't just speed up vector math. Compilers these days with knowledge of these extensions can auto-vectorize your code, so it has the potential to speed up every for-loop you write.
bhouston|11 days ago
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Dow...
My experience is that trying to get benefits from the vector extensions is incredibly hard and the use cases are very narrow. Having them in a standard BLAS implementation, sure, but outside of that I think they are not worth the effort.
jsheard|11 days ago
kbolino|11 days ago
SIMD is not limited to mathy linear algebra things anymore. Did you know that lookup tables can be accelerated with AVX2? A lot of branchy code can be vectorized nowadays using scatter/gather/shuffle/blend/etc. instructions. The benefits vary, but can be significant. I think a view of SIMD as just being a faster/wider ALU is out of date.
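One well-known example of the lookup-table trick is counting bits per byte with a 16-entry table and a byte shuffle. Below is a scalar Python model of the idea, not actual intrinsics: `shuffle_bytes` mimics what a `vpshufb`-style shuffle does within one 16-byte table, and both function names are made up for illustration.

```python
def shuffle_bytes(table, indices):
    """Model of a byte shuffle used as a 16-entry lookup table.

    For each index byte: if bit 7 is set the result is 0, otherwise the
    result is table[index & 0x0F]. On real hardware all lanes are done
    by one instruction; here we just loop.
    """
    if len(table) != 16:
        raise ValueError("table must have exactly 16 entries")
    return [0 if idx & 0x80 else table[idx & 0x0F] for idx in indices]


# Number of set bits for every 4-bit value 0..15.
POPCOUNT_NIBBLE = [bin(n).count("1") for n in range(16)]


def popcount_bytes(data):
    """Per-byte popcount via two nibble lookups, with no branch per element."""
    lo = [b & 0x0F for b in data]          # low nibbles as table indices
    hi = [(b >> 4) & 0x0F for b in data]   # high nibbles as table indices
    lo_counts = shuffle_bytes(POPCOUNT_NIBBLE, lo)
    hi_counts = shuffle_bytes(POPCOUNT_NIBBLE, hi)
    return [l + h for l, h in zip(lo_counts, hi_counts)]
```

The same pattern (split into small indices, look up, combine) covers things like hex decoding or character classification, which is why shuffles matter outside linear algebra.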
kccqzy|11 days ago
vintagedave|11 days ago
I've heard anecdotally that the old pre-LLVM Intel C++ Compiler also focused heavily on vectorisation and had some specific tradeoffs to achieve it. I think they use LLVM now too, and for all I know they've made modifications similar to ours. But we see a decent number of code patterns that can be, and now are, optimised.
adgjlsfhk1|11 days ago