dang|6 years ago
This is a list of articles—probably a good one, but HN is itself a list of articles, so this is too much indirection.
Lists don't make good HN submissions, because the only thing to discuss about them is the lowest common denominator of the items on the list [1], leading to generic discussion, which isn't as interesting as specific discussion [2].
It's better to pick the most interesting item from the list and submit that. You can always do it more than once, if there is more than one interesting item—but it's best to wait a while between such submissions, to let the hivemind caches clear.
[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Twinklebear|6 years ago
SIMD is used a ton in rendering applications and is starting to see more use in games too (through ISPC, for example).
I'd add to the list:
- Embree: https://www.embree.org/ Open source high-performance ray tracing kernels for CPUs using SIMD.
- OpenVKL: https://www.openvkl.org/ Similar to Embree (high-performance ray tracing kernels), but for volume traversal and sampling.
- ISPC: https://ispc.github.io/ an open source compiler for a SPMD language which compiles it to efficient SIMD code
- OSPRay: http://www.ospray.org/ A large project using SIMD throughout (via ISPC) for real time ray tracing for scientific visualization and physically based rendering.
- Open Image Denoise: https://openimagedenoise.github.io/ An open-source image denoiser using SIMD (via ISPC) for some image processing and denoising.
- (my own project) ChameleonRT: https://github.com/Twinklebear/ChameleonRT has an Embree + ISPC backend, using Embree for SIMD ray traversal and ISPC for vectorizing the rest of the path tracer (shading, texture sampling).
bityard|6 years ago
Starting to see? Back in Ye Olde 586 Days of the late 1990s, MMX was added to the Pentium architecture pretty much exclusively for 3D games and real-time audio/video decoding. (This was back when the act of playing an MP3 was no small chore for the average consumer CPU.) Intel made quite a big deal over MMX, including millions of dollars in TV ads aimed at the general population, despite the fact that software had to be built specifically to use MMX and that only certain kinds of software could benefit from it.
misnome|6 years ago
> ISPC: https://ispc.github.io/ an open source compiler for a SPMD language which compiles it to efficient SIMD code
I've been learning ispc lately and it does seem like a wonderful solution: you avoid having to build separate implementations for every instruction set, and you avoid per-compiler massaging to get it to recognise the vectorisation opportunities. The case for having a domain-specific language variant, and the story of why it was written (https://pharr.org/matt/blog/2018/04/30/ispc-all.html is a good read), is persuasive.
However, outside of the projects in the above list, it doesn't seem to have very wide usage. There are still commits coming in and responses to some issues, so it doesn't seem dead, but there are many issues untouched or untriaged. There isn't much discussion about using it, or people asking for advice; the mailing list gets about a message a month.
Is it just an extremely specialised domain? Is CUDA/OpenCL a more efficient solution in most cases where one would otherwise consider it? Or are there so many ASM/intrinsics experts out there that nobody bothers learning it?
apjana|6 years ago
It takes advantage of SIMD at the -O3 optimization level in its custom string copy function: https://github.com/jarun/nnn/blob/bc7a81921ed974a408d4de2cbf...
The function is used extensively in the program.
z0mbie42|6 years ago
I try not to include C or C++ projects other than for educational purposes (like the Mandelbrot set), because one of my life's goals is to help the world transition to a C- and C++-free world (other than for kernels...).
I believe that my role is to promote projects which are "building the new world", and thus we need to abandon and port away from all forms of insecure code.
burntsushi|6 years ago
ripgrep does, and it's a big reason why it edges out GNU grep in a lot of common cases, especially for case-insensitive searches. The most significant use of SIMD is the Teddy algorithm, which I copied from the Hyperscan project. I wrote up how it works here: https://github.com/BurntSushi/aho-corasick/blob/66f581583b69...
reikonomusha|6 years ago
A Common Lisp project that uses SIMD (specifically AVX2) is the Quantum Virtual Machine [1]. It’s a quantum computer simulator. Here [2] is part of the source that has the SIMD instructions.
It’s cool that, using SBCL, an implementation of Common Lisp, you can write compartmentalized assembly very easily in an otherwise extremely high-level language.
[1] https://github.com/rigetti/qvm
[2] https://github.com/rigetti/qvm/blob/master/src/impl/sbcl-avx...
corysama|6 years ago
The megahertz-scaling "Free Lunch" was declared dead 15 years ago [http://www.gotw.ca/publications/concurrency-ddj.htm] and it's only been getting deader. People are finally, grudgingly accepting that they must go parallel unless we want to see software performance stagnate permanently. For most people here, the issue has been obvious since before they learned to program. But still they are putting off learning how to deal with it. The first, obvious answer is threading. But, in my experience, SIMD is a bigger bang for the buck, for two reasons: 1) no synchronization problems, and 2) better cache utilization. It's not just that SIMD forces you to work in large, contiguous blocks. Fun fact: when you aren't using SIMD, you are only using a fraction of your L1 cache bandwidth!
A big challenge is that SIMD intrinsic-function APIs are weird. They have inscrutable function names and sometimes difficult semantics. What helped me greatly was going through the effort of writing #define wrappers for myself that just gave each function in SSE1-3 names that made sense to me. I don't expect many people to put in that effort. And, unfortunately, I don't have go-to recommendations for pre-existing libraries. Best I can do is:
https://github.com/VcDevel/Vc is working on being standardized into C++. It's great for processing medium-to-large arrays.
https://ispc.github.io/ is great for writing large, complicated SIMD features.
https://github.com/microsoft/DirectXMath is not actually tied to DirectX. It has a huge library of small-vector linear algebra (3D graphics math) functions. It used to be pretty tied to MS's compiler, but I believe they've been cleaning it up to work across compilers lately.
CyberDildonics|6 years ago
Can you say more about non-SIMD instructions not making full use of the L1 bandwidth? Is it just that even keeping all the integer units busy still doesn't equate to using all the bandwidth? I suppose that makes sense when adding up the numbers for clock cycles and bytes. I'm guessing this is not commonly pointed out since being limited by L1 cache bandwidth is so unlikely to be a program's main bottleneck.
TazeTSchnitzel|6 years ago
They are not alternatives to each other; they are orthogonal things, unless you're using a GPU.
z0mbie42|6 years ago
They can, but as explained in one of the articles (by Cloudflare, "On the dangers of Intel's frequency scaling"), SIMD in a multithreaded environment can cause performance problems due to CPU throttling.
So generally SIMD is used in single-threaded algorithms.
poorman|6 years ago
https://arrow.apache.org/
> Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing.
singhrac|6 years ago
Pretty much every neural network framework is aggressively SIMD-optimized (after all, that's kind of the point, besides autodiff), so I'm not sure why Tencent's framework was picked.
If you know of any, I want to hear about more fast SIMD-based CLI tools that can replace parts of my existing workflow (e.g. burntsushi's ripgrep or xsv).
z0mbie42|6 years ago
I picked a Chinese technology because such projects are rarely promoted but really great.
Regarding the CLI tools it's a great question and I have opened a ticket for a future issue: https://gitlab.com/bloom42/open_source_weekly/-/issues/14
nickysielicki|6 years ago
I wrote this a couple days ago: https://sielicki.github.io/posts/playing-around-with-autovec...
mynegation|6 years ago
Exactly this! I am glad this list exists, but the even more interesting question is why a list like this has to exist at all. Ideally it is up to the compiler to use the target architecture to its maximum potential.
Every item on this list is a library, compiler optimization, or an idiomatic abstraction waiting to happen.
gameswithgo|6 years ago
For Rust/C/C++: https://www.youtube.com/watch?v=4Gs_CA_vm3o
For C#: https://www.youtube.com/watch?v=8RcjQPbvvRU
dmos62|6 years ago
dahart|6 years ago
GPUs are SIMD machines; does that color whether it seems rare? Coming from a graphics background and working on GPUs, I am super biased, but I say it's very worth the hassle.
The easiest intro, IMO, is to check out and play around in ShaderToy. You don't have to know much about SIMD to write shaders, but once you start paying attention to how the machine works, you can make it go really fast.
In general, using something like CUDA is similar to C++; you just have to make sure all your threads do as close to the same thing as possible in order to see good perf.
tombert|6 years ago
If you are just hearing about it, there's a good chance that you're not doing any kind of elaborate matrix math, so it's tough to say if it's useful.
It's incredibly useful if you're doing a lot of stuff involving matrices (graphics, image/video processing, neural-net stuff, things like that). If you have any interest in that topic, it's absolutely worth learning how to use SIMD (or at least learning a library that takes advantage of SIMD in your language of choice).
xioxox|6 years ago
It's certainly useful for quite a lot of mathematical science code. In my experience, compilers are not very good at autovectorizing anything but the simplest loops, and writing SIMD intrinsics is necessary to obtain the maximum output of the processor.
haolez|6 years ago
> Java 8 64-bit. We recommend Oracle Java 8, but OpenJDK8 will also work (although a little slower).
Anyone have an idea why?
[1]https://github.com/questdb/questdb
veselin|6 years ago
I was wondering, generally: is SIMD a good idea for general-purpose CPUs? Imagine if current high-end CPUs had double the number of cores, no SIMD, but possibly higher frequency, and the algorithms that benefit from SIMD were all run on integrated accelerators instead.
At least as a side observer, it looks like a huge number of very large registers take up a large portion of a core, surely consuming a lot of power as well, just to sit idle while the core is running JavaScript. Can somebody with CPU architecture experience say what the real tradeoff is here?
zamadatix|6 years ago
Adding SIMD takes less space than adding cores, and the use case where you need double the cores on a many-core chip but aren't doing the same thing many times is pretty rare.
SIMD units don't need to consume power or limit the frequency of the rest of the chip while not being used, the same as when JavaScript is running on one boosted core and the other 63 are in powersave. While being used, SIMD units are more efficient than running 2x or 4x entire cores just to get the additional operations per clock.
jzelinskie|6 years ago
Reminder that nothing is a panacea: I've heard from game engine authors and cryptographers that on Intel chips _over-using_ SIMD can actually heat up the chip so much that the system adjusts the clock rate lower to cool down, and you can degrade performance beyond not using SIMD at all. Before hearing that, I had never considered the thermal properties of particular instructions.
corysama|6 years ago
It's not a problem for SSE and AVX1. But with AVX2/AVX-512, the deal is that you should not just dip your toe in with an occasional call to a small SIMD task using such heavy-hitting features. Either do enough SIMD work to overcome the down-clock, or use lower-end SIMD functionality for smaller tasks.
And even within AVX2/512 there are huge sets of added functionality that are really "AVX1-enhanced" without going wider. Those are fine to use without worrying about downclocking.
robocat|6 years ago
Great info from here: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...
“Intel cores can run in one of three modes: license 0 (L0) is the fastest (and is associated with the turbo frequencies written on the box), license 1 (L1) is slower and license 2 (L2) is the slowest. To get into license 2, you need sustained use of heavy 512-bit instructions, where sustained means approximately one such instruction every cycle. Similarly, if you are using 256-bit heavy instructions in a sustained manner, you will move to L1. The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1: the processor stops and changes its frequency as soon as an instruction is encountered.”
Be careful to benchmark real loads, as there are perverse interactions, e.g. “Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).” So a function using AVX512 can affect the speed of unrelated code (similar to thermal throttling).
saagarjha|6 years ago
The JVM does it to an extremely limited extent. Anything JITted doesn't have a lot of time to do autovectorization. Even GCC/LLVM are pretty limited at this, as it is just a hard problem, and doing it with floating point is problematic as it usually changes the result.
GordonS|6 years ago
tarr11|6 years ago
https://issues.apache.org/jira/browse/LUCENE-9027
FZ1|6 years ago
Or maybe this is limited to little personal projects, and not major libraries?
truth_seeker|6 years ago
gameswithgo|6 years ago
andrea_s|6 years ago
vmchale|6 years ago
https://github.com/vmchale/ats-codecount/blob/master/DATS/wc...