top | item 47061552

(no title)

jorvi | 11 days ago

Yeah, sounds like they're confusing AVX2 for AVX512. AVX2 has been common for a decade at least and greatly accelerates performance.

AVX512 is so kludgy that it usually leads to a detriment in performance due to the extreme power requirements triggering thermal throttling.

discuss

kimixa|11 days ago

AMD's implementation very much doesn't have that issue - it throttles slightly, maybe, but it's still a net benefit. The problem with Intel's implementation is that the throttling was immediate - and took noticeable time to then settle and actually start processing again - from any avx512 instruction, so the "occasional" avx512 instruction (in autovectorized code, or something like the occasional optimized memcpy or similar) was a net negative in performance. This meant that it only benefitted large chunks of avx512-heavy code, so this switching penalty was overcome.

But there's plenty in avx512 the really helps real algorithms outside the 512-wide registers - I think it would be perceived very differently if it was initially the new instructions on the same 256-wide registers - ie avx10 - in the first place, then extended to 512 as the transistor/power budgets allowed. AVX512 was just tying too many things together too early than "incremental extensions".

otherjason|11 days ago

See this correct comment above: https://news.ycombinator.com/item?id=47061696

AVX512 leading to thermal throttling is a common myth that from what I can tell traces its origins to a blog post about clock throttling on a particular set of low-TDP SKUs from the first generation of Xeon CPUs that supported it (Skylake-X), released over a decade ago: https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...

The results were debated shortly after that by well-known SIMD authors that were unable to duplicate the results: https://lemire.me/blog/2018/08/25/avx-512-throttling-heavy-i...

In practice, this has not been an issue for a long time, if ever; clock frequency scaling for AVX modes has been continually improved in subsequent Intel CPU generations (and even more so in AMD Zen 4/5 once AVX512 support was added).

adrian_b|11 days ago

That was true only for the 14-nm Intel Skylake derivatives, which had very bad management of the clock frequency and supply voltage, so they scaled down the clock prophylactically, for fear that they would not be able to prevent overheating fast enough.

All AMD Zen 4 and Zen 5 and all of the Intel CPUs since Ice Lake that support AVX-512, benefit greatly from using it in any application.

Moreover the AMD Zen CPUs have demonstrated clearly that for vector operations the instruction-set architecture really matters a lot. Unlike the Intel CPUs, the AMD CPUs use exactly the same execution units regardless whether they execute AVX2 or AVX-512 instructions. Despite this, their speed increases a lot when executing programs compiled for AVX-512 (in part for eliminating bottlenecks in instruction fetching and decoding, and in part because the AVX-512 instruction set is better designed, not only wider).

badgersnake|11 days ago

I think that's slightly old information as well, AVX512 works well on Zen5.

SecretDreams|11 days ago

Agree. It's only recently with modern architectures in the server space that avx512 has shown some benefit. But avx2 is legit and has been for a long time.

corysama|11 days ago

In gamedev it takes 7-10 years before you can require a new tech without getting a major backlash. AMD came out with AVX2 support in 2015. And, the (vocal minority) petitions to get AVX2 requirements removed from major games and VR systems are only now starting to quiet down.

So, in order to make use of users new fancy hardware without abandoning other users old and busted hardware, you have to support multiple back-ends. Same as it ever was.

Actually, a lot easier than it ever was today. Doom 3 famously required Carmack to reimplement the rendering 6 times to get the same results out of 6 different styles of GPUs that were popular at the time.

ARB Basic Fallback (R100) Multi-pass Minimal effects, no specular.

NV10 GeForce 2 / 4 MX, 5 Passes, Used Register Combiners.

NV20 GeForce 3 / 4 Ti, 2–3 Passes, Vertex programs + Combiners.

R200 Radeon 8500–9200, 1 Pass, Used ATI_fragment_shader.

NV30 GeForce FX Series, 1 Pass, Precision optimizations (FP16).

ARB2 Radeon 9500+ / GF 6+, 1 Pass, Standard high-end GLSL-like assembly.

https://community.khronos.org/t/doom-3/37313