computerbuster | 1 year ago
Judging by the comments here, opinions on the usefulness of handwritten SIMD range from "totally unclear" to "mission critical". I'm seeing a lot on the "totally unclear" side, but not as much on the "mission critical" side, so I'll talk a bit about that.
FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.
dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.
While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten & compiler-generated SIMD can be up to 50% in some cases, so it is important.
I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.
janwas | 1 year ago
Example: our new matmul outperforms a well-known library for LLM inference, sometimes even when it uses AMX while ours uses AVX512BF16. Why? They seem to have some threading bottleneck, or maybe it's something else; it's hard to tell with a JIT involved.
This would not have happened if I had to write per-platform kernels. There are only so many hours in the day. Writing a single implementation using Highway enabled exploring more of the design space, including a new kernel type and an autotuner able to pick not only block sizes, but also parallelization strategies and their parameters.
Perhaps in a second step, one can then hand-tune some parts, but I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.
rbultje | 1 year ago
It should be obvious that both are pursued independently whenever it makes sense. The idea that one should precede the other or is more important than the other is simply untrue.
dundarious | 1 year ago
GCC and Clang support the vector_size attribute and overloaded arithmetic operators on those "vectorized" types, and a LOT more besides -- in fact, that's how intrinsics like _mm256_mul_ps are implemented: `#define _mm256_mul_ps(a,b) (__m256)((v8sf)(a) * (v8sf)(b))`. The utility of all of that is much, much greater than what's available in Zig.