(a) You mention that the NVidia docs push people to use libraries, etc. to get really high-performance CUDA kernels, rather than writing them themselves. My argument would be that SIMD is exactly the same - intrinsics are perfect if you're writing a BLAS implementation, but too low level for most developers thinking about a problem to make use of.
(b) You show a problem where autovectorisation fails because of branching, and jump straight to intrinsics as the solution, which you basically say are ugly. Looking at the intrinsic code, you're using a mask to deal with the branching. But there's a middle ground - almost always you would want to try restructuring the problem first, e.g. splitting up loops and turning conditions into masks - i.e. leaning into the SIMD paradigm (see the sketch below). This would also be the same advice in CUDA.
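For example, a minimal sketch of that restructuring (a hypothetical kernel, not the article's; the names and the loop are made up):

    #include <cstddef>

    // Branchy form: the `if` inside the loop often blocks autovectorization.
    float sum_pos_branchy(const float* x, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            if (x[i] > 0.0f) sum += x[i];
        return sum;
    }

    // Masked form: the condition becomes an unconditional select, which maps
    // directly onto SIMD blend/mask instructions and vectorizes much more easily.
    float sum_pos_masked(const float* x, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            sum += (x[i] > 0.0f) ? x[i] : 0.0f;
        return sum;
    }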
(c) As you've found, GCC actually performs quite poorly for x86-64 optimisations compared to Intel. It's not always clear cut though; the Intel compiler, for example, sacrifices IEEE 754 float precision and goes down to ~14 digits of precision in its defaults, because it sets the flags `-fp-model=fast -fma`. This is true of both the legacy and new Intel compiler. If you switch to `-fp-model=strict` then you may find that the results are closer.
(d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.
Regarding (b), I would never rely on auto vectorization because I have no insight into it. The only way to check if my code is written such that auto-vectorization can do its thing is to compile my program with every combination of compiler, optimization setting and platform I intend to support, disassemble all the resulting binaries, and analyze the disassembly to try to figure out if it did autovectorization in the way I expect. That's a terrible developer experience; writing intrinsics by hand is much easier, and more robust. I'd need to re-check every piece of autovectorized code after every change and after every compiler upgrade.
I just treat autovectorization like I treat every other fancy optimization that's not just constant folding and inlining: nice when it happens to work out, it probably happens to make my code a couple percent faster on average, but I absolutely can't rely on it in any place where I depend on the performance.
> (d) AVX512 is quite hardware specific. Some processors execute these instructions much better than others. It's really a collection of extensions, and you get frequency downclocking that's better/worse on different processors as these instructions are executed.
To re-iterate, this is our observation as well. The first AVX512 processors would execute such code quite fast for a short time, then overheat and throttle, leading to worse wall-time performance than the corresponding 256-bit AVX2 code.
I am not sure if there is a better way to find the fastest code path besides "measure on the target system", which of course comes with its own challenges.
Re (b) I'm curious what that middle ground is. Is there any simple refactor to help GCC to get rid of this `if`? (Note, ISPC did fine here)
(c) Just to be clear, all the code in the benchmark figures (baseline and SIMD) was compiled with fast-math flags.
Regarding (a), one of the points I wanted to get across was that it didn't feel as complicated to program, in the end, as I had thought. Porting to AVX-512 felt mechanical (hence the success of LLMs in one-shotting the whole thing).
This is a subjective opinion - it depends on the programmer's experience, etc. - so I won't dwell on it. I just wish more CPU programmers gave it a try.
If you wish to see some speedups using AVX512, without limiting yourself to C or C++, you might want to try ISPC (https://ispc.github.io/index.html).
You'll get aliasing rules that are sane from a performance perspective, multi-target binaries with dynamic dispatching, and a lot more control over the generated code.
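For a flavour of the model, a minimal sketch (hypothetical kernel; compiling with a multi-target flag along the lines of `--target=avx2-i32x8,avx512skx-i32x16` is what gives the dynamic dispatch mentioned above):

    // ISPC: the scalar-looking body runs across a gang of program instances.
    // `uniform` marks values shared by all lanes; `foreach` masks the ragged
    // edges of the iteration space automatically.
    export void saxpy(uniform float a, uniform float x[],
                      uniform float y[], uniform int n) {
        foreach (i = 0 ... n) {
            y[i] = a * x[i] + y[i];
        }
    }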
Hi, I actually mentioned ISPC several times there. And although I strenuously avoided crowning one approach "better" than the other, it is worth pointing out that 1) many of these benefits of ISPC can be had from explicit SIMD libraries like Google's Highway, and 2) ISPC (or any SIMT model) is a departure from how the underlying hardware works, and as the AI community is discovering with GPUs, this abstraction can sometimes be a lot more headache than it's worth.
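For comparison, a sketch of a simple kernel with Highway (this assumes its static-dispatch usage; real multi-target code would typically go through its foreach_target machinery, and remainder handling is omitted):

    #include <cstddef>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // Portable width: Lanes(d) is 16 floats on AVX-512, 8 on AVX2, etc.
    // Assumes n is a multiple of the vector width, for brevity.
    void Scale(const float* in, float* out, std::size_t n, float k) {
        const hn::ScalableTag<float> d;
        const auto vk = hn::Set(d, k);
        for (std::size_t i = 0; i < n; i += hn::Lanes(d))
            hn::Store(hn::Mul(hn::Load(d, in + i), vk), d, out + i);
    }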
> CUDA architects [...] happily exposed every ugly details of underlying hardware to the programmer if that allows a bit more performance.
After spending more than a decade dancing around all the underlying x86 hidden stuff for low-level optimization, I appreciate CUDA a lot. Everything is there under your total control. No more one size fits all. Higher barrier of entry but no surprises and less time spent debugging to figure out what landmine your code stepped into.
> In CPU world there is a desire to shield programmers from those low-level details, but I think there are two interesting forces at play now-a-days that’ll change it soon. On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.
The problem is that not all programming languages expose SIMD, and even when they do it is only a portable subset; additionally, the kind of skills required to use SIMD properly isn't something everyone is comfortable with.
I certainly am not - I still managed to get by with MMX and early SSE, and can manage shading languages, but that is about it.
The good news is that the portable subset of SIMD is all you really need anyway. If you go beyond the portable subset, you need per-architecture code writing and testing, and you're mostly talking about pretty small gains relative to the cost.
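As one concrete example of such a portable subset (this sketch assumes GCC's <experimental/simd> implementation of the Parallelism TS; the kernel is made up):

    #include <cstddef>
    #include <experimental/simd>

    namespace stdx = std::experimental;

    // native_simd<float> picks the widest sensible vector for the target
    // (-march=...) with no architecture-specific intrinsics in the source.
    void scale(float* xs, std::size_t n, float k) {
        using V = stdx::native_simd<float>;
        std::size_t i = 0;
        for (; i + V::size() <= n; i += V::size()) {
            V v(xs + i, stdx::element_aligned);   // vector load
            v *= k;
            v.copy_to(xs + i, stdx::element_aligned);
        }
        for (; i < n; ++i) xs[i] *= k;            // scalar tail
    }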
> The answer, if it’s not obvious from my tone already:), is 8%.
Not if the data is small and in cache.
> The performant route with AVX-512 would probably include the instruction vpconflictd, but I couldn’t really find any elegant way to use it.
I think the best way to do this is to duplicate sum_r and count 16 times, so each lane has a separate accumulation bucket and there can't be any conflicts. After the loop, you quickly do a sum reduction across the 16 buckets (sketch below).
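A scalar model of that layout (NBINS and the function names are hypothetical; the real inner loop would gather/scatter with AVX-512 using index bin*16 + lane, which by construction never collides across lanes):

    #include <stddef.h>

    #define NBINS 128
    #define LANES 16  /* AVX-512: 16 x 32-bit lanes */

    /* One accumulator column per SIMD lane: lane j only ever touches
     * sum_r[bin][j], so two lanes can never conflict on one slot. */
    static float sum_r[NBINS][LANES];
    static int   count[NBINS][LANES];

    /* Scalar stand-in for the vector loop body: element i maps to lane i % LANES. */
    void accumulate(const float *r, const int *bin, size_t n) {
        for (size_t i = 0; i < n; i++) {
            int lane = (int)(i % LANES);
            sum_r[bin[i]][lane] += r[i];
            count[bin[i]][lane] += 1;
        }
    }

    /* After the loop: fold the LANES copies back into one result per bin. */
    void reduce(float *sum_out, int *count_out) {
        for (int b = 0; b < NBINS; b++) {
            float s = 0.0f; int c = 0;
            for (int j = 0; j < LANES; j++) { s += sum_r[b][j]; c += count[b][j]; }
            sum_out[b] = s; count_out[b] = c;
        }
    }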
Yeah, N is big enough that the entire dataset doesn't fit in the cache, but the memory access pattern here is the next best thing: totally linear, predictable access. I remember seeing an L1d cache hit rate around 94%+.
Isn't it another way of saying what the author says in the previous paragraph, namely that "ideal SIMD speedup can only come from problems that are compute bound"?
If the cost of getting the input data into the cache is already large compared to processing it with the non-vectorized code, then SIMD cannot achieve a meaningful speedup. The opposite of this condition (processing is expensive compared to the cost of getting data into the cache) is basically the definition of "compute bound".
>On one hand, Dennard Scaling (aka free lunch) is long gone, hardware landscape is getting increasingly fragmented and specialized out of necessity, software abstractions are getting leakier, forcing developers to be aware of the lowest levels of abstraction, hardware, for good performance.
There are lots of people using JavaScript frameworks to build slow desktop and mobile software.
I wonder if the excess CO2 emitted by devices around the world running bloated software that has no need to be so (hullo MS Teams) could be expressed in numbers of transatlantic jet flights.
The initial example takes array pointers without the __restrict__ keyword/extension, so the compiler must assume they could alias the same memory and will generate defensive code.
Would be interesting to see if autovectorization performs better with that addition (sketch below).
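A follow-up suggestion in the thread was `auto aligned_p = std::assume_aligned<16>(p)`. A minimal sketch combining both hints (hypothetical function; std::assume_aligned is C++20, and 64-byte alignment is the natural choice for full AVX-512 vectors):

    #include <cstddef>
    #include <memory>

    // No-alias + alignment hints: the compiler can now vectorize without
    // runtime overlap checks or alignment peeling.
    void add(float* __restrict__ a, const float* __restrict__ b, std::size_t n) {
        float*       pa = std::assume_aligned<64>(a);  // C++20; 64 B = one ZMM vector
        const float* pb = std::assume_aligned<64>(b);
        for (std::size_t i = 0; i < n; ++i)
            pa[i] += pb[i];
    }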
What I get from this article is that the original intent of the C language still stands true.
Use C as a common denominator across platforms, without crazy optimizations (like tcc). If you need performance, specialize; C gives you the tools to call assembly (or use compiler intrinsics, or even inline assembly).
Complex compiler doing crazy optimizations, in my opinion, is not worth it.
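For instance, a tiny sketch of the intrinsics route (hypothetical function, assuming an AVX2 target):

    #include <immintrin.h>

    // Four double-precision adds in one instruction, spelled out explicitly;
    // no reliance on the optimizer to find the vectorization.
    void add4(const double* a, const double* b, double* out) {
        __m256d va = _mm256_loadu_pd(a);
        __m256d vb = _mm256_loadu_pd(b);
        _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
    }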
I remember being quite surprised that my implementation, which used manual stack updates, was much slower than what the compiler produced from recursion.
Turns out I was pushing and popping from the stack on every conceptual "recursive call", but the compiler figured out it could keep 2-3 recursion levels in registers and push/pop only 30% of the time; it had more stuff in memory than my version as well.
Even when I reduced memory reads/writes to ~50% of the recursive program's and kept most of the state in registers, the recursive program was faster anyway, due to just using more registers than me.
I realized then that I cannot reason about the microoptimizations at all if I'm coding in a high-level language like C or C++.
Hard to predict the CPU pipeline, sometimes profile guided optimization gets me there faster than my own silliness of assuming I can reason about it.
> Complex compiler doing crazy optimizations, in my opinion, is not worth it.
These optimisations live in the back-end, so they are also used for other languages that are higher-level or that cannot drop to assembler as easily. C is just one of the front-ends of modern compiler suites.
No. Assuming `k` is small enough, which in practice it often is, the arithmetic intensity of this kernel is 25-90 Flops/Byte, way above the roofline knee of any modern CPU.
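For a rough sense of scale (illustrative numbers, not measurements): the roofline knee sits at peak compute divided by memory bandwidth, so a core sustaining ~100 GFLOP/s of FP32 against ~50 GB/s of DRAM bandwidth has its knee at ~2 Flops/Byte - an order of magnitude below 25-90.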
See also this Zen 5 AVX-512 teardown: https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...