I really enjoy this summary of the AVX512 instruction set evolution. It makes me curious to know more about the AVX512 microarchitecture evolution too. How practical is AVX512 likely to be over the coming years? How much do we have to worry about side effects such as the CPU clock speed throttling down when the wide datapaths are exercised?
Can someone wake me up when it is a no-brainer to use AVX512 instructions on CPUs that support them? :-)
I would point out that there _are_ consumer-grade chips with AVX-512 beyond the Xeons and Cannonlake: the Skylake-X chips such as the i7-7800X, which even happen to have two 512-bit FMA units, unlike some of the cheaper Xeons.
I will point out three things that make the Sunny Cove exciting to me, beyond the obvious new instructions (based on the available slides):
- There are now two ports for vector shuffles. For shuffle-heavy applications, which is often the case with bit-manipulation kernels, this is great news. This seems improved from Cannonlake.
- Integer division is drastically improved (this was already present on Cannonlake): it goes from ~30 uops to 4. Divisions that could take up to 90 cycles will now take <=18.
- There are now 4 LEA ports, up from 2, which is quite useful for address calculation and small integer multiplications.
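To make the LEA point concrete, here is a minimal C sketch (not from the slides) of the kind of arithmetic a single x86 LEA can encode: base + index*scale + displacement, with the scale limited to 1, 2, 4 or 8. The function names are made up for illustration, and whether a compiler actually emits LEA for them depends on the compiler and flags.

```c
#include <stdint.h>

/* x*3, x*5 and x*9 each have the shape x + x*scale with scale in {2, 4, 8},
   which matches LEA's base + index*scale addressing form. */
static uint64_t times3(uint64_t x) { return x + (x << 1); }
static uint64_t times5(uint64_t x) { return x + (x << 2); }
static uint64_t times9(uint64_t x) { return x + (x << 3); }

/* Array-style address arithmetic: base + index*8 + 16 also fits the
   base + index*scale + displacement form. */
static uint64_t element_offset(uint64_t base, uint64_t index) {
    return base + index * 8 + 16;
}
```

More ports that can execute this form means more of these small computations can issue per cycle without competing with the general ALUs.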
What is the state of compiler support for these very advanced instruction sets? Can the average developer benefit by basically adding a few compiler flags?
Also, each new Intel platform seems to bring additional instructions, but most software isn't made available in a wide range of microarchitecture-specific builds. Is there typically capability detection going on behind the scenes?
Some of the operations described here seem so specific that I have a hard time imagining compilers being able to spot the relevant patterns in source code that can make use of them (then again, I'm not a specialist). I guess these are explicitly coded for in assembly?
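On the capability-detection question, one common pattern is to build a baseline and a wide version of a hot function and pick one at startup. Below is a rough sketch in C, assuming GCC or Clang on x86-64 (it relies on their `__builtin_cpu_supports` and `target` attribute extensions). `sum_scalar`, `sum_avx512` and `pick_sum` are hypothetical names; on other compilers or architectures the sketch simply falls back to the scalar path. It also shows the "explicitly coded" route: intrinsics from `<immintrin.h>` rather than hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Portable fallback: runs on any CPU. */
static uint32_t sum_scalar(const uint32_t *v, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

#if HAVE_X86_DISPATCH
/* AVX-512F variant. The target attribute lets this one function use
   AVX-512 while the rest of the file is built for baseline x86-64. */
__attribute__((target("avx512f")))
static uint32_t sum_avx512(const uint32_t *v, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    size_t i = 0;
    for (; i + 16 <= n; i += 16)   /* 16 x 32-bit lanes per iteration */
        acc = _mm512_add_epi32(acc, _mm512_loadu_si512((const void *)(v + i)));
    uint32_t s = (uint32_t)_mm512_reduce_add_epi32(acc);
    for (; i < n; i++) s += v[i];  /* leftover elements */
    return s;
}
#endif

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Choose an implementation based on what the running CPU reports. */
static sum_fn pick_sum(void) {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return sum_avx512;
#endif
    return sum_scalar;
}
```

Calling `pick_sum()` once and caching the returned pointer keeps the check off the hot path. Both variants return the same result, so callers never need to know which one ran.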
At this moment, as a consumer, I'm more concerned about Spectre mitigation impact, power draw and cost - among other things.
And currently, AMD is definitely winning there, and it may be my processor of choice for the next few years, until Intel fixes shortcomings in all those areas. 10nm is a step in the right direction, but the price/performance ratio is nowhere close to the mark.
In the beginning you write that "VBMI [...] is the only extension that we’ve seen before – it’s in Cannonlake." but later you write that "VPOPCNTDQ is older (from the MIC product line)"
So which is it? Or am I misunderstanding something?
I wonder if this will lead to significant speed-ups in databases like ElasticSearch, where operations on large bitmaps (each bit representing a term’s presence in a document) are commonplace. Do you have any insight into this?
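For reference, the core loop in that kind of workload is often an AND followed by a population count, which is the operation VPOPCNTDQ vectorizes (a popcount per 32-bit or 64-bit lane). A portable scalar sketch in C follows; it is not how ElasticSearch implements it, and `popcount64` and `and_cardinality` are names invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Classic SWAR population count of one 64-bit word. */
static unsigned popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Number of positions set in both bitmaps, e.g. documents that contain
   term A and term B when each bit marks a term's presence in a document. */
static uint64_t and_cardinality(const uint64_t *a, const uint64_t *b,
                                size_t nwords) {
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += popcount64(a[i] & b[i]);
    return total;
}
```

A vector popcount would handle eight 64-bit words of this loop per instruction, so any speed-up depends on how much time a query really spends here rather than on memory traffic.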
You mentioned there are NUCs that have AVX-512, which surprised me. Those looking for such a NUC: search for NUCs with the Core i3-8121U CPU. This is a cheap AVX-512 entry point if you want to experiment. (~$400 for the NUC).
Do we know if Zen 2 is going to support AVX-512? Even if in 2x256-bit fashion... That could boost adoption as many enthusiasts will be changing platform soon.
I hope they release more information soon, like latency and throughput for all these AVX-512 instructions on Ice Lake.
No 2nd FMA really sucks; hope they add it when they do desktop.
I can't tell from the microarchitecture slide if the yellow box labeled "ALU" that is found on ports 0 and 5 refers to only integer ops, or if that includes float (add/mul).
glangdale | 6 years ago
lukego | 6 years ago
pbsd | 6 years ago
renaudg | 6 years ago
RcouF1uZ4gsC | 6 years ago
vkaku | 6 years ago
But that's just my take on that.
chithanh | 6 years ago
btown | 6 years ago
dmbaggett | 6 years ago
klyrs | 6 years ago
rb808 | 6 years ago
I feel like I met that one other guy in the world.
zeus_hammer | 6 years ago
yifanl | 6 years ago
One of these is not like the others
Narishma | 6 years ago
FullyFunctional | 6 years ago
glangdale | 6 years ago
bitL | 6 years ago
sandeatr | 6 years ago
sandeatr | 6 years ago
breadandcrumbel | 6 years ago