My conclusion (feel free to enlighten me if I am wrong) is that a system will profit by having more cores instead of AVX-512 for the same power consumption.
It is time to stop quoting this rant, which is rather far removed from actual practice and reality.
Specifically for sorting, Intel's sort and ours can be about 10x as fast as scalar.
AVX-512 has high power? Great! Power is work per time. Let's have more of that.
It is _useful_ work per _energy spent_ that we want to maximize, and there scalar code falls flat. Consider that OoO scheduling/control overhead is the real culprit; executing one instruction costs far more energy than the actual operation. SIMD divides that fixed cost over 16 elements, leading to 5-10x higher power efficiency.
More cores? The shared-memory fabric is already bursting at the seams (see on-chip bottlenecks of Genoa), we are already cramming in too many for the current memory interface.
What would actually be necessary: more bandwidth! A64FX CPUs actually beat GPUs in 2019 for supercomputing applications thanks to their HBM. Intel's Sapphire Rapids Max also has this. Let's build more of those, at a decent price. And start using the wide vector units; they are there precisely because lots of highly-clocked cores in one big NUMA domain is not always the best way forward.
AMD enabled AVX-512 without increasing their power consumption. To do that, their AVX-512 implementation runs 512-bit operations at half the rate of Intel's (i.e. peak FLOPS are the same as with 256-bit AVX2 registers).
Still, because the new instructions in AVX-512 are beneficial in many scenarios, the actual speedup is often 10-20% in code that benefits from them.
Then, 4 years later, AMD has the option to bump to full-speed AVX-512 if it is really needed. This is the same approach they initially took with AVX2.
A lot of the downclocking issues he was talking about then are less severe on newer Intel and AMD CPUs, which changes the calculus a lot.
You could probably find a workload where your conclusion is correct but I think the vast majority of workloads would be faster with AVX-512 if you have the time to leverage it.
When those benchmarks were first done, for random input, vqsort was faster once you get around ~20,000 items, but for 1,000 items the new AVX512 sort is 13x faster than vqsort.
As you read it, you'll see that the vqsort issue for small arrays has been fixed, and as of a few weeks ago or so, vqsort is now faster than the AVX512 sort for random.
I worked in embedded development, and sorting large lists of files was a surprisingly big bottleneck in many of our projects because of the very slow microcontrollers. Even worse, we couldn't cache the results because there was no memory to spare.
So I was tasked with improving that, and I had to write inline assembly abusing specific CPU instructions that could effectively do many more char comparisons per clock cycle. We ended up not using it for the usual reasons (not portable, hard to maintain), and our customers had to live with the 20-30 s delay when entering a large directory, versus the 5-6 s my code achieved.
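Roughly the flavor of that trick, sketched portably (my reconstruction, not the original inline assembly; `fast_strcmp` and its fixed-length interface are made up for illustration): compare 8 bytes per step via 64-bit loads, and fall back to byte-wise comparison only to locate a mismatch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Compare the first `len` bytes of two buffers, 8 bytes at a time.
// Word-sized loads only test *equality* (which is endianness-safe);
// the ordering of a mismatching pair is resolved byte-by-byte afterwards.
int fast_strcmp(const char* a, const char* b, std::size_t len) {
    std::size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);  // memcpy sidesteps alignment/aliasing rules
        std::memcpy(&wb, b + i, 8);
        if (wa != wb) break;         // mismatch is somewhere in these 8 bytes
    }
    for (; i < len; ++i) {           // locate the exact differing byte
        unsigned char ca = a[i], cb = b[i];
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    return 0;
}
```

The byte-wise tail also handles lengths that are not multiples of 8, so no special-casing is needed at the end of the buffer.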
(Sorry this has little to do with the topic at hand, I don't really know when sorting becomes a problem on a multicore 5ghz cpu)
Sorting is pretty common in the numerics world because a lot of algorithms and techniques can be optimized heavily for sorted inputs: you either get to skip steps or bisect the dataset. Sort of like how most fast FFT implementations will run 10-20% faster if you pad vectors to reach a power-of-2 length. A typical preprocess pipeline would involve splitting vectors into power-of-2 sizes or padding them to maximize cache line use, normalizing (often mapping to an integer domain), and sorting.
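A toy sketch of such a preprocess step (the function names and the zero-padding choice are mine, not from any particular library):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Round up to the next power-of-2 length.
std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Pad to a power-of-2 size, then sort. Sorted, power-of-2-sized input
// is what lets later pipeline stages bisect the data or skip work.
std::vector<double> preprocess(std::vector<double> v) {
    v.resize(next_pow2(v.size()), 0.0);  // zero-padding: a placeholder choice
    std::sort(v.begin(), v.end());
    return v;
}
```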
Sadly, the client CPUs released so far do not have AVX-512. I'm not planning any hardware upgrades for at least the next 3 years, so it's not relevant for me.
The 512-bit registers are the least important part of AVX-512. The much more important parts are the 32 vector registers, the mask registers (these are great for quicksort), compressed displacement (which shrinks instruction size in unrolled code), and a bunch of other goodies.
It can often be more than you would expect from the width increase, as lots of masking tools and instructions allow you to vectorize things more efficiently than could be done before.
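To illustrate why masks help quicksort, here is a scalar emulation of a mask-and-compress partition step (function names are mine; real AVX-512 code would do this per 16-lane vector with a compare-into-mask followed by a compress-store):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar emulation of a compress-store: pack the lanes whose mask bit
// is set into contiguous output, preserving order.
std::vector<int> compress(const std::vector<int>& lanes, std::uint32_t mask) {
    std::vector<int> out;
    for (std::size_t i = 0; i < lanes.size(); ++i)
        if (mask & (1u << i)) out.push_back(lanes[i]);
    return out;
}

// One partition step: compare against the pivot to build a mask
// (emulating a vector compare into a mask register), then compress
// the below-pivot lanes left and the rest right.
void partition_step(const std::vector<int>& lanes, int pivot,
                    std::vector<int>& left, std::vector<int>& right) {
    std::uint32_t lt = 0;
    for (std::size_t i = 0; i < lanes.size(); ++i)
        if (lanes[i] < pivot) lt |= 1u << i;
    for (int v : compress(lanes, lt)) left.push_back(v);
    for (int v : compress(lanes, ~lt)) right.push_back(v);
}
```

The hardware version does the compare, mask, and both compresses in a handful of instructions per 16 elements, with no data-dependent branches, which is where much of the vectorized-quicksort speedup comes from.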
janwas | 2 years ago
Top frequency reduction? Not since the first implementation on Skylake, and even there a non-issue, see https://github.com/google/highway/blob/master/hwy/contrib/so....
CyberRage | 2 years ago
Some workloads can be accelerated via AVX-512 as shown here by Anandtech:
https://www.anandtech.com/show/17601/intel-core-i9-13900k-an...
See how the AMD CPUs with AVX-512 enabled get a massive boost even with a similar or lower core count.
I would agree that most typical workloads don't benefit much from AVX-512; it requires software support and a good use case (wide, parallel SIMD).
NohatCoder | 2 years ago
TL;DR: it is not that AVX-512 isn't useful, but that the 512-bit part makes implementation prohibitively expensive.
daveoc64 | 2 years ago
https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...
The data set is a mixed bag though. Some types of system are going to be over-represented.
tecleandor | 2 years ago
Due to my lack of knowledge (and the heat of today :0) I can't guess which algorithms or processes do lots of sorting behind the scenes.
cmrdporcupine | 2 years ago
Anything using an ordered data structure (btrees, red-black trees, and the maps and sets built on them) is doing sorting.
Which, well, databases. Huge one.
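A minimal illustration: `std::map` is typically a red-black tree, so every insert does comparison work incrementally and every traversal is an in-order walk yielding sorted keys, which is exactly what a database index provides.

```cpp
#include <map>
#include <string>
#include <vector>

// Collect the keys of an ordered map. The tree keeps them sorted,
// so this traversal returns them in ascending order with no extra sort.
std::vector<int> keys_in_order(const std::map<int, std::string>& m) {
    std::vector<int> out;
    for (const auto& kv : m) out.push_back(kv.first);
    return out;
}
```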
hedora | 2 years ago
Also, it’s unfortunate that this stuff won’t run on most developer machines.
janwas | 2 years ago
Disclosure: I am the main author of vqsort, which is also a vectorized Quicksort but portable.