top | item 36452596

Intel Releases x86-SIMD-sort 2.0 With Faster AVX-512 Sorting, New Algorithms

120 points| ashvardanian | 2 years ago |phoronix.com

56 comments

[+] cientifico|2 years ago|reply

A interesting reading related to this is the thoughts of Linus about AVX-512 back in 2020: https://www.phoronix.com/news/Linus-Torvalds-On-AVX-512

My conclusion (feel free to enlighten me if I am wrong) is that a system will profit by having more cores instead of AVX-512 for the same power consumption.

[+] janwas|2 years ago|reply

(Frequent AVX-512 user, opinions are my own.)

It is time to stop quoting this rant, which is rather far removed from actual practice and reality.

Specifically for sorting, Intel's sort and ours can be about 10x as fast as scalar.

AVX-512 has high power? Great! Power is work per time. Let's have more of that. It is _useful_ work per _energy spent_ that we want to maximize, and there scalar code falls flat. Consider that OoO scheduling/control overhead is the real culprit; executing one instruction costs far more energy than the actual operation. SIMD divides that fixed cost over 16 elements, leading to 5-10x higher power efficiency.

Top frequency reduction? Not since the first implementation on Skylake, and even there a non-issue, see https://github.com/google/highway/blob/master/hwy/contrib/so....

More cores? The shared-memory fabric is already bursting at the seams (see on-chip bottlenecks of Genoa), we are already cramming in too many for the current memory interface.

What would actually be necessary: more bandwidth! A64FX CPUs actually beat GPUs in 2019 for supercomputing applications thanks to their HBM. Intel's Sappire Rapids Max also has this. Let's build more of those, at a decent price. And start using the wide vector units, they are there precisely because lots of highly-clocked cores in one big NUMA domain is not always the best way forward.

[+] gmokki|2 years ago|reply

AMD enabled AVX-512 without increasing their power consumption. To do that their AVX-512 runs only half of max speed compared Intel when using 512 bit registers (aka max flops is same as with AVX2 256 bit registers). Still, because the new instructions in AVX-512 are beneficial in many scenarios the actual speed up is still often 10-20% in code that benefits from the new instructions.

Then 4 years later AMD has the option to bump to full-speed AVX-512 if it is really needed. This is the same they did with AVX2 initially.

[+] jackmott42|2 years ago|reply

A lot of the downclocking issues he was talking about then are less severe now on newer Intel cpus and AMD cpus, which changes the calculus a lot.

You could probably find a workload where your conclusion is correct but I think the vast majority of workloads would be faster with AVX-512 if you have the time to leverage it.

[+] CyberRage|2 years ago|reply

heavily depends on the workload

Some workloads can be accelerated via AVX-512 as shown here by Anandtech:

https://www.anandtech.com/show/17601/intel-core-i9-13900k-an...

See how AMD CPUs with AVX-512 enabled some a massive boost even with similar/less cores

I would agree that most typical workloads don't benefit much from AVX-512, it requires software support and good use-case(wide parallel SIMD)

[+] formerly_proven|2 years ago|reply

"HPC doesn't exist and also FP performance is irrelevant" seems a wildly out of touch take from Linus.

[+] NohatCoder|2 years ago|reply

Fabian Giesen has a decent rant on AVX512. https://mastodon.gamedev.place/@rygorous/110572829749524388

Tldr it is not that it isn't useful, but the 512 bit part makes implementation prohibitively expensive.

[+] qayxc|2 years ago|reply

I won't say you're wrong, but the answer might depend heavily on the application.

[+] sagarm|2 years ago|reply

Faster, but still 15-25% slower than Highway's vqsort according to my somewhat amateur tests:

  ------------------------------------------------------------
  Benchmark                        Time     Bytes    Items
  ------------------------------------------------------------
  Hwy/random                     2.16 s  474.1M/s  62.1M/s
  Hwy/sorted_runs                2.29 s  446.7M/s  58.5M/s
  IntelX86SIMD/random            2.90 s  353.9M/s  46.3M/s
  IntelX86SIMD/sorted_runs       2.68 s  382.4M/s  50.1M/s

https://github.com/funrollloops/parallel-sort-bench

[+] eesmith|2 years ago|reply

See also "10~17x faster than what? A performance analysis of Intel's x86-simd-sort (AVX-512)" at https://github.com/Voultapher/sort-research-rs/blob/main/wri... posted here at https://news.ycombinator.com/item?id=36273544 and minor additional comments at https://news.ycombinator.com/item?id=36268877 .

When those benchmarks were first done, for random input, vqsort was faster once you get around ~20,000 items, but for 1,000 items the new AVX512 sort is 13x faster than vqsort.

As you read it, you'll see that the vqsort issue for small arrays has been fixed, and as of a few weeks ago or so, vqsort is now faster than the AVX512 sort for random.

[+] jaxrtech|2 years ago|reply

Is there an equivalent to caniuse.com for gauging the level of instruction set extension support on x86 and ARM in the wild?

[+] daveoc64|2 years ago|reply

The Steam Hardware Survey "Other Settings" section lists the prevalence of various extensions among users that have taken the survey:

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

The data set is a mixed bag though. Some types of system are going to be over-represented.

[+] tecleandor|2 years ago|reply

What kind of processes are heavily (or, at least, moderately) impacted by the performance improvement of sorting algorithms?

By my lack of knowledge (and the heat of today :0 ) can't guess what algorithms or processes do lots of sorting behind the scenes.

[+] tredre3|2 years ago|reply

I worked in embedded development and sorting large lists of files was a surprisingly big bottleneck for many of our projects because of the very slow microcontrollers. Even worse we couldn't cache the results because of no memory.

So I was tasked to improve that and I had to write inline assembly to abuse specific cpu instructions that could effectively do much more char comparisons per clock cycle. We ended up not using it because of the usual reasons (not portable, hard to maintain) and our customers had to live with the 20-30s delay when entering a large directory, versus the 5-6s my code achieved.

(Sorry this has little to do with the topic at hand, I don't really know when sorting becomes a problem on a multicore 5ghz cpu)

[+] eyegor|2 years ago|reply

Sorting is pretty common in the numerics world because a lot of algorithms or techniques can be optimized heavily with sorted inputs. You either get to skip steps or bisect the dataset. Sort of like how most fast fft implementations will run 10-20% faster if you pad vectors to reach a power of 2 length. A typical "preprocess pipeline" would involve splitting vectors into power of 2 sizes or padding them to maximize cache lines, normalizing (and often mapping to an integer domain), and sorting.

[+] djxfade|2 years ago|reply

Databases would be me suggestion

[+] cmrdporcupine|2 years ago|reply

What isn't impacted by sorting algorithms would be more of the question?

Anything using an ordered data structure -- btrees, red-black tree etc maps and sets, they're doing sorting.

Which, well, databases. Huge one.

[+] fulafel|2 years ago|reply

3D graphics.

[+] vkaku|2 years ago|reply

The client CPUs released until present do not have AVX-512 sadly. I'm not planning any hardware updates for the next 3 years at least. Not relevant for me at least.

[+] hedora|2 years ago|reply

Are these significantly faster than the 256 bit versions? They compare to non-simd, but that’s less interesting.

Also, it’s unfortunate that this stuff won’t run on most developer machines.

[+] janwas|2 years ago|reply

Yes, can be about 1.6x speedup: https://github.com/Voultapher/sort-research-rs/blob/main/wri... (cpp_intel_avx512).

Disclosure: I am the main author of vqsort, which is also a vectorized Quicksort but portable.

[+] adgjlsfhk1|2 years ago|reply

The 512 bit registers are the least important part of AVX-512. The much more important parts are the 32 registers, mask registers (these are great for quicksort), compressed displacement (which shrinks the instruction size for unrolled code), and a bunch of other goodies.

[+] unknown|2 years ago|reply

[deleted]

[+] mgaunard|2 years ago|reply

You don't need avx512 to implement a bitonic sort. I worked on several implementations with plain avx.

[+] jackmott42|2 years ago|reply

It can often be more than you would expect from the width increase, as lots of masking tools and instructions allow you to vectorize things more efficiently than could be done before.