top | item 35279274

AI’s compute fragmentation: what matrix multiplication teaches us

122 points | tzhenghao | 3 years ago | modular.com

44 comments

[+] BenoitP | 3 years ago
There's hope in intermediate representations, in OpenXLA:

https://opensource.googleblog.com/2023/03/openxla-is-ready-t...

> OpenXLA is an open source ML compiler ecosystem co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA. It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware

[+] junrushao1994 | 3 years ago
One thing I really love about XLA is GSPMD, which effectively enables scalable distributed training in practice. I was quite curious how it relates to matrix multiplication, though, given that XLA focuses more on graph-level optimization and basically offloads matmuls to other libraries like Triton and cuBLAS.
[+] brrrrrm | 3 years ago
> Hand-written assembly kernels don’t scale!

I used to think this. And I think, in theory, it is true. But the fact of the matter is, modern ML just doesn't use that many kernels. Every framework uses the same libraries (BLAS) and every library uses the same basic idea (maximally saturate FMA-like units).

Large language models are being run natively on commodity hardware with code written from scratch within days of their release (e.g. llama.cpp).

From a conceptual standpoint, it's really easy to saturate hardware in this domain, and it has been since 2014, when convolutions started being interpreted as matrix multiplications. Sure, the actual implementations can be tricky, but a single engineer (trained in it) can get that done for a specific piece of hardware in a couple of months.

Of course, the interesting problem is how to generalize kernel generation. I spent years working with folks trying to do just that. But, in retrospect, the actual value add from a system that does all this for you is quite low. It's a realization I've been struggling to accept :'(

[+] tysam_and | 3 years ago
We... do use kernels and kernel generation in the ML field, every day. I'm confused by a few of the points made here, unless the first sentence means '...doesn't use that many hand-written kernels'.

All of the convolutions we run are backed by kernels, some pre-built and customized/chosen from a list based on performance, and some dynamically generated. PyTorch 2.0, for example, decomposes and fuses operations, then uses OpenAI's Triton to dynamically generate a custom fused kernel that tends to be very efficient.

There are still hand-written kernels, even: the FlashAttention and memory-efficient attention papers both caused huge leaps forward because their authors manually worked through the inefficiencies of naive attention matmuls with respect to the hardware design and optimized them heavily.

I think generalized kernel generation may have more life in it than you suspect, though! It is a fascinating field and I do not know nearly enough about it. I hope someday to write my own Triton kernels and learn how Triton integrates as a dynamic compiler for PyTorch code. We certainly live in wild times. Crazy indeed.

[+] tzhenghao | 3 years ago
Good points, but I'd push back just a little.

> Sure, the actual implementations can be tricky, but a single engineer (trained in it) can get that done for a specific hardware in a couple months.

I want to agree with you on this, but in practice, it's...

1. Hard to hire an engineer with deep expertise in hand-written kernels. CUDA engineers are still hard to come by, and their supply doesn't scale with productionized AI engineering demand.

2. "A few months" is a tough pill to swallow from an engineering-roadmap POV, especially when models are deployed on a monthly basis. Most of the hand-tuning effort isn't scalable and has to be redone on most iterations. This is especially true in reinforcement learning and robotics.

> But, in retrospect, the actual value add from a system that does all this for you is quite low. It's a realization I've been struggling to accept.

Yeah, I remain neutral on this. On one hand, I can see the downside, especially having to invest significant engineering effort (see point 2 above). On the other hand, you won't really know the payoff until you start benchmarking these models (as you should).

[+] nitwit005 | 3 years ago
> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It’s simply an impossible task."

By committing it to a common library that a lot of people use? There are already multiple libraries with optimized matrix multiplication.

This is also exaggerating the expertise required. I'm not going to claim it's trivial, but you can genuinely google "intel avx-512 matrix multiplication" and find both papers and Intel samples.

[+] photochemsyn | 3 years ago
> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It’s simply an impossible task."

Naively, I wonder if this is the kind of problem that AI itself can solve, which is a rather singularity-approaching concept. Maybe there's too much logic involved and not enough training data on different configurations for that to work? A bit spooky, though, the thought of self-bootstrapping AI.

[+] dimatura | 3 years ago
There has been work on using AI for this at various levels: at the neural-architecture level (finding architectures with high throughput/low latency for a given piece of hardware), at the algorithm level (finding faster matrix multiplication routines), and at the hardware level (IIRC Google stated that the latest TPUs were partially designed with AI).
[+] junrushao1994 | 3 years ago
My take: optimizing matrix multiplication is not hard on modern architectures if you have the right abstraction. The code itself can be fragmented across different programming models, true, but the underlying techniques are not hard for a 2nd/3rd-year undergrad to understand. There are only a few important ones on GPU: loop tiling, pipelining, shared-memory swizzling, and memory coalescing. A properly designed compiler can let developers optimize matmuls within 100 lines of code.
[+] touisteur | 3 years ago
Looking at the effort plunked into things like CUTLASS, which still doesn't reach cuBLAS perf (which very few can beat, in the places where cuBLAS shines! which is... not that many...), and even cuDNN still eking out single-digit improvements regularly, I'd say this is probably harder than that. At least if you're reaching for >50% use of the 37 TFLOPS of an A40. If you're fine throwing more GPUs at the problem, sure.

Edit: I mean, when you still see papers every year with large improvements in perf, and things like 'we used tensor cores and managed to get back fp32 accuracy with 3 rounds of the things' (what?), I can attest it doesn't take 2 weeks to get this kind of result. And that's just getting started on tensor cores! And when on the Nvidia forums someone says 'nah, probably no improvement from using tensor cores for FFT' and you get a link to a paper with a significant perf improvement using tensor cores, I say we're just getting started.

[+] mathisfun123 | 3 years ago
> A properly designed compiler can allow developers to optimize matmuls within 100 lines of code.

Man, this is such a funny closing comment. What exactly do you think is involved in designing a compiler that enables devs to optimize matmuls, if not thousands of person-hours/years of very "fine-grained" perf research? What the "abstraction" people don't understand (because they only deal in abstractions) is that achieving performance involves literally the antithesis of abstraction: you need to understand your hardware down to the gate level (sometimes).

> loop tiling, pipelining, shared memory swizzle, memory coalescing

Have you ever applied any of these? The only way you could apply them as a generic algorithm (without consideration of your particular hardware) is with an autotuner; that is of course widely the route taken, but it's not an "understanding" of anything beyond guess-and-check.

[+] bee_rider | 3 years ago
The article seems to be missing a conclusion.

Writing assembly doesn’t scale across lots of platforms? Sure… the solution for matrix multiplication is to use the vendor’s BLAS.

If the vendor can't at least plop some kernels into BLIS, they don't want you to use their platform for matmuls... don't fight them.

[+] Nevermark | 3 years ago
Exactly.

The problem is already "solved" to almost everyone's satisfaction by being O(N), i.e. one optimized matrix math library per platform.

But if they can reduce that to O(1) by creating a tool that takes computing-hardware characteristics (core/compute topology, instructions, memory hierarchy, ...) and outputs state-of-the-art optimized matrix-multiply machine code, that would be a nice and useful result.

[+] gleenn | 3 years ago
I really like the Neanderthal library because it does a pretty good job of abstracting over Nvidia, AMD, and Intel hardware to provide matrix operations in an extremely performant manner on each one with the same code. Dragan goes into a lot of detail about the hardware differences, and his library provides some of the fastest implementations for the given hardware too. It's not a hand-wavy, half-baked performance abstraction; the code is really fast. https://github.com/uncomplicate/neanderthal
[+] bigbillheck | 3 years ago
Surely one solution is for each AI framework to itself understand the operating environment and choose the best implementation at run-time, much like they currently do.
[+] brucethemoose2 | 3 years ago
Yeah, well, tell all that to Nvidia, who very much likes the fragmentation and wants to keep things that way.
[+] misnome | 3 years ago
And they developed this fragmentation by... building good tools, good documentation, and comprehensively supporting them for 15 years in a way that makes people feel safe building on top of them.

It's not fragmentation, they built a moat.

[+] spookie | 3 years ago
It's not like other vendors have made meaningful efforts in alternatives. AMD still hasn't released RDNA3 support for ROCm, their open compute platform. Hell, I don't even think RDNA2 has proper support as of now.

There's also the issue of poor documentation and learning material in the wild.

[+] dekhn | 3 years ago
They are the one vendor who had the insight ~20 years ago to invest long-term in GPUs, and they have continuously made impressive products while supporting a cross-platform developer base. For this, I reward them with my $$$ (both work and home).
[+] EntrePrescott | 3 years ago
> performance has become increasingly constrained by memory latency, which has grown much slower than processing speeds.

Sounds like they would oddly prefer memory latency to grow at least as fast as processing speeds, which would be terrible. Obviously, memory latency has actually decreased, just not enough.

So it seems likely they made a mistake and actually meant that memory latency has decreased more slowly than processing speeds have increased; in other words, it is not memory latency but memory random-access throughput (which, to a rough approximation, is proportional to the inverse of memory latency) that has grown much more slowly than processing speeds.

[+] b34r | 3 years ago
Chad Jarvis is an AI-generated name if I’ve ever heard one