
Aiter: AI Tensor Engine for ROCm

179 points | hochmartinez | 1 year ago | rocm.blogs.amd.com | reply

88 comments

[+] bayindirh|1 year ago|reply
I just want to remind everyone that El Capitan, Frontier and LUMI supercomputers are powered by AMD instinct cards.

El Capitan is #1 in TOP500. Frontier is #2, LUMI is #8.

ROCm development is probably driven mainly by the needs of these supercomputers' users currently.

So, we're seeing the tip of the iceberg.

Also ROCm packages continue to land on Debian, so there's more than meets the eye.

Note: Search "AMD Instinct" at https://top500.org/lists/top500/list/2024/11/. There are way more systems.

[+] slavik81|1 year ago|reply
> ROCm packages continue to land on Debian, so there's more than meets the eye

I've been volunteering with Debian to help package ROCm for four years now, but today it officially became my full-time job. AMA.

[+] brrrrrm|1 year ago|reply
Do supercomputers mostly run in fp64? At fp8 an H100 hits 2 petaflops, and with only 1000 of them you’ve got more compute power than El Capitan (in raw flop count).
[+] wkat4242|1 year ago|reply
They support their workstation cards pretty poorly, though. I have a Radeon VII Pro and it's already deprecated in ROCm; it's not even 3 years old. They could really learn a lesson from Nvidia, which supports old cards going far back and supports every card, not just a few hand-picked business models.
[+] saagarjha|1 year ago|reply
> ROCm development is probably driven mainly by the needs of these supercomputers' users currently.

Seems like a problem since AMD wants to go after AI capex?

[+] Tpt|1 year ago|reply
If I understand correctly, this library provides some Torch kernels customized for AMD hardware. Why haven't they just upstreamed them to PyTorch for better adoption? Also, they seem to demo usage with Torch default eager execution mode and not Torch JIT/TorchScript. Is this library compatible with TorchScript?
[+] microtonal|1 year ago|reply
I think a lot of stuff will get upstreamed eventually. PyTorch just moves slower and since it’s a stable library, I think it cannot rapidly adopt something like fused MoE until the dust has settled a little and it’s clear what the API would look like long-term.

I think it’s ok that stuff is tried first in Torch extensions. That’s how Flash Attention started after all and the same is true for newer kernels in CUDA-land (fused MoE, MLA, Marlin, etc.).

With regards to TorchScript, that’s really legacy - torch.compile is where it’s at. This post seems to suggest that the kernels work with torch.compile: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...
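
To make the eager-vs-compiled point concrete, here is a generic sketch (plain PyTorch, not aiter-specific): torch.compile wraps an ordinary eager-mode module and compiles it lazily on first call, which is why kernels that already work in eager mode can usually be captured too.

```python
import torch

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 8)

    def forward(self, x):
        # In eager mode this line could just as well call a custom kernel
        # (e.g. a tuned GEMM) instead of the stock linear op.
        return torch.relu(self.fc(x))

model = TinyMLP()
compiled = torch.compile(model)  # compilation happens lazily, on first call
```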

[+] barrenko|1 year ago|reply
I really don't understand why they can't just work with the existing OSS developers who are pulling their hair out trying to make AMD devices work, instead of doing it this way. It's like Mozilla and its questionable decisions.
[+] kouteiheika|1 year ago|reply
> Why haven't they just upstreamed them to PyTorch for better adoption?

They don't seem to care, or don't understand how to get broader adoption.

For some reason AMD's management is dead set on targeting only the high end of the market. Like, for example, look at this blog post. Which model are they testing? DeepSeek R1, the 671B behemoth that no normal person can run. Or look at any of their tutorials/docs and see which GPUs they support - it's always either unobtanium-grade enterprise GPUs or high-end workstation cards that no one buys. And if your strategy is to target only the super rich, then a little jank in the software isn't really all that punishing - if you can afford to drop a few million on GPUs, then you can also afford to hire someone to spend a few weeks getting AMD's software to work/getting it tuned by tweaking the two dozen environment variables they seem to like so much/etc.

[+] imtringued|1 year ago|reply
That would make the kernels the PyTorch Foundation's problem, and they would have to set up CI infrastructure around AMD GPUs to maintain these kernels. For whatever reason, AMD really wants to keep everything in-house even though that has been a losing strategy so far.
[+] carbocation|1 year ago|reply
I'm not a Python expert, but this feels very odd to me: both the *init* construction and the return [tgemm.mm](http://tgemm.mm/)(input, self.weight, self.bias, None, None) call, which looks like leaked markdown to me:

    from aiter.tuned_gemm import tgemm
    import torch
    
    class LinearLayer(torch.nn.Module):
     def **init**(self, in_features, out_features):
      super(LinearLayer, self).**init**()
      self.weight = torch.nn.Parameter(torch.randn(out_features, in_features).cuda())
      self.bias = torch.nn.Parameter(torch.randn(out_features).cuda())
    
     def forward(self, input):
      input = input.cuda()
      return [tgemm.mm](http://tgemm.mm/)(input, self.weight, self.bias, None, None)
[+] Lerc|1 year ago|reply
I was puzzling over the code wondering why they .cuda() everything like that when I realised that was only the beginning of the weirdness.

I'm assuming the scrambled annotations were due to some odd chain of things the code went through on the way to becoming a post.

Maybe they did it as a parable about the problems of having many layers of abstraction causing processes with unintended consequences?

[+] cavisne|1 year ago|reply
Yeah this is AMD in a nutshell. A bunch of fluffy descriptions and then the only concrete example would clearly never run.

EDIT: They fixed the code pretty quickly

[+] evertedsphere|1 year ago|reply
yep the syntax highlighting / doc hyperlinking clearly broke there (or, less charitably, whatever llm produced that prose had a moment)

it's __init__ of course
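
untangling the markdown damage, the snippet presumably meant something like this (a reconstruction - F.linear stands in for aiter's tgemm.mm here so the sketch runs without an AMD GPU):

```python
import torch
import torch.nn.functional as F

class LinearLayer(torch.nn.Module):
    def __init__(self, in_features, out_features):  # __init__, mangled to **init** above
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        self.bias = torch.nn.Parameter(torch.randn(out_features))

    def forward(self, input):
        # The original calls tgemm.mm(input, self.weight, self.bias, None, None);
        # F.linear computes the same y = x @ W^T + b on any device.
        return F.linear(input, self.weight, self.bias)

layer = LinearLayer(8, 4)
out = layer(torch.randn(2, 8))
```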

[+] fock|1 year ago|reply
also why is it calling .cuda() to move tensors onto an AMD device? I suppose this is because it's based on HIP - which comes with its own set of problems, but that's ROCm for the masses I guess.

Also, tgemm.mm has to be a torch module (at first I thought this was some low-level library they now have a preview of, because there is a ROCm torch already ...), which is evident from the table just before the summary. That table also smells like they're mostly focused on inference...

EDIT: seems the official ROCm torch is also based on HIP.

[+] dailykoder|1 year ago|reply
Still waiting for ROCm on my cheap Radeon RX 7600. It would be nice to play around with it a little. I know this card is nothing fancy. There's a GitHub issue somewhere where they announced a Linux port to consumer cards, but last time I checked (a few days ago) it still wasn't available.
[+] oynqr|1 year ago|reply
I used ROCm on an RX 7600 a month after launch. Having no official support doesn't at all mean it doesn't work.
[+] Rounin|1 year ago|reply
You should be able to make it think you have another card:

    export HSA_OVERRIDE_GFX_VERSION=10.3.0

The possible values are said to be:

    # gfx1030 = "10.3.0"
    # gfx900  = "9.0.0"
    # gfx906  = "9.0.6"
    # gfx908  = "9.0.8"
    # gfx90a  = "9.0.a"
[+] slavik81|1 year ago|reply
Use the PyTorch Nightly build. The ROCm libraries themselves have been built for the RX 7600 (gfx1102) since ROCm 5.4/5.5, but PyTorch itself wasn't enabled until a few weeks ago. The RX 7600 is still not 'officially supported' on Linux, but I have an RX 7600 XT and I haven't encountered any issues in my (admittedly intermittent) use of the card in AI applications. You may, however, find the 8GB of VRAM in the non-XT version to be a limitation.
[+] fancyfredbot|1 year ago|reply
Wow, it sure sounds like a mess under there. They used 4 different languages?

Using one high-level language plus assembly sounds fine, but four feels incoherent. Would love to know why this happened.

"This infrastructure is built upon a variety of underlying technologies, including Triton, CK (Compute Kernel), ASM (Assembly), and HIP (Heterogeneous Interface for Portability)."

[+] jfim|1 year ago|reply
That's not exactly unusual, for example pytorch has Python, C++, C, and Cuda.
[+] daeken|1 year ago|reply
Those aren't four different languages. CK and HIP are both just libraries.
[+] shihab|1 year ago|reply
Wait, did they get their own library name wrong? CK should be Composable Kernel, I can’t find anything called compute kernel anywhere
[+] yu3zhou4|1 year ago|reply
Really interesting - how does it compare to tinygrad's support for AMD GPUs?
[+] WhitneyLand|1 year ago|reply
Performance increased 100% on an MI300X running a large LLM.

On one hand, cool. On the other hand: wow, they have been leaving a lot of performance on the table.

How does the performance compare to Nvidia now?

[+] mjburgess|1 year ago|reply
Anyone tried any of this on a few 7900 XTXs (or have familiarity with this hardware and platform)? I've just purchased 6 for some small-scale experimentation. For the next machine I'm thinking I'll use AMD Radeon PRO W7900s (to get 128 GB of VRAM per machine).
[+] almostgotcaught|1 year ago|reply
Just export HSA_OVERRIDE_GFX_VERSION=11.0.0 and things should mostly work. Off the top of my head, some of the fp8 types aren't supported but <shrug>
[+] manjunaths|1 year ago|reply
I have a 7900 GRE, which is the same except with less memory. I run Gemma 3, Llama 3.1, the QwQ models and the DeepSeek distilled models using llama.cpp. They run fine; I especially like the new Gemma3-27b-Q6 (20 GB model), I get 2 tok/s on it.

I have also run Hunyuan3D-2 and generated 3D models. You have to separate out the model-generation and texture-generation phases, but it works.

I run ComfyUI and bootleg gguf models. This is all on Windows. Now even WSL2 works, so I'm using Ubuntu 24.04 on Windows 11 to run Hunyuan3D-2.

For LLMs, llama.cpp native binaries are available. Everything just works out of the box.

[+] fngarrett|1 year ago|reply
We have a dual W7800 system in-house as our `gfx1100` rig. I'll try to install and run through the tests sometime this week.
[+] fragebogen|1 year ago|reply
Silly question perhaps, but is this a true CUDA equivalent? Why (not)?
[+] moralestapia|1 year ago|reply
This is equivalent to something like cuDNN, a CUDA library.

Aiter is a ROCm library.

ROCm is the thing that is like CUDA, but for AMD.

[+] randomNumber7|1 year ago|reply
Why is everyone using the GPUs of this other company for AI?