top | item 37068983


junrushao1994|2 years ago

One of the authors here. Glad it’s on HackerNews!

There are two points I personally wanted to make through this project:

1) With a sufficiently optimized software stack, AMD GPUs can be cost-efficient enough to use in LLM serving; 2) ML compilation (MLC) techniques, through the underlying TVM Unity software stack, are the best fit for cross-hardware, generalizable performance optimizations and for quickly delivering time-to-market value.

So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too.

htirwklj4523432|2 years ago

The numbers look amazing.

Can you comment on how difficult it was to achieve this, and what the relative advantages between the cards are? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space, largely because of the amazing job Nvidia pulled off with cuDNN and its convolution kernels.

LLMs etc., OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.
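To make that workload shape concrete, here is a toy pure-Python sketch (not MLC LLM's actual quantization scheme) of weights stored as 4-bit ints plus a per-row scale, with inference reduced to GEMM-style multiply-adds:

```python
# Toy sketch, pure Python -- not MLC LLM's actual quantization scheme.
# Illustrates the workload: weights held as 4-bit ints plus a per-row
# scale, with inference dominated by GEMM-style multiply-adds.

def quantize_4bit(row):
    """Symmetric per-row 4-bit quantization: ints in [-7, 7] plus a scale."""
    scale = max(abs(w) for w in row) / 7
    return [round(w / scale) for w in row], scale

def matvec_q4(q_weights, x):
    """Matvec over rows stored as (quantized ints, scale)."""
    return [scale * sum(q * xi for q, xi in zip(qrow, x))
            for qrow, scale in q_weights]

W = [[0.12, -0.55, 0.30], [0.91, 0.02, -0.44]]
x = [1.0, 2.0, 3.0]
dense = [sum(w * xi for w, xi in zip(row, x)) for row in W]  # full precision
approx = matvec_q4([quantize_4bit(row) for row in W], x)     # 4-bit weights
```

The quantized matvec stays within the per-row scale's rounding error of the full-precision result, which is why 4-bit weight storage is workable for transformer inference.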

junrushao1994|2 years ago

> Can you comment on how difficult it was to achieve this, and what the relative advantages b/w cards?

Thanks for asking! I personally believe TVM Unity is a proper software stack for ML compilation (MLC), and its existing optimizations (e.g. TensorCore offloading) can be transparently transferred to AMD/Intel/Apple/mobile GPUs without too much engineering effort.

Of course my claim is limited to ML workloads. Not an expert outside the ML world, so I couldn't say for general HPC.

gsuuon|2 years ago

Congrats Junru! I'm not on AMD but love seeing progress in this project. Excited for batched inference -- I didn't think it'd be useful for me but I've realized batched inference is also useful for a single user / edge device workload.

Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)

junrushao1994|2 years ago

This is amazing to hear, Steven! (Sorry, I locked myself out of Discord a couple of days ago...) I'm sure there's a bunch of features missing, like the biased sampling you mentioned, and I'm more than happy to merge PRs if you'd love to contribute :)

PeterStuer|2 years ago

Thank you for this work. I will be staying on nvidia for now, but applaud any progress towards much needed credible competition in the consumer/enthusiast AI hardware space.

One question: given your experience, when would you predict near parity in software stack support between the different platforms, so that the choice of GPU becomes mostly one of price/performance? It does not need to be like AMD/Intel in the CPU market, where a consumer has no doubts about software compatibility, but let's say like the gaming GPU market, where a game having problems on a GPU architecture is a newsworthy exception that is quickly corrected.

PeterStuer|2 years ago

Honestly at a loss why this got downvoted.

JonChesterfield|2 years ago

Did the ROCm 5.6 toolchain work for you out of the box? If not, what sort of hacking / hand holding did it need?

I don't know whether there's an LLM inference benchmark in the CI suite; if not, perhaps something like this should be included in it.

crowwork|2 years ago

Yes, it works out of the box, and the blog contains a prebuilt Python package that you can try out.

Const-me|2 years ago

Have you tested Vulkan API on the 7900 XTX? Was it faster or slower than ROCm?

junrushao1994|2 years ago

Generally speaking, I expect Vulkan to be slower than ROCm, given that Vulkan is designed for generic gaming across GPU vendors. So the takeaway is: whenever ROCm is available and usable, we should use ROCm. The same goes for CUDA vs Vulkan.
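That preference order could be captured in a tiny helper like the following (a hypothetical sketch, not part of MLC LLM's actual API):

```python
# Hypothetical helper -- not MLC LLM's actual API -- encoding the rule
# above: prefer the vendor-native stack (CUDA or ROCm) when present,
# falling back to Vulkan as the portable, but typically slower, option.

PREFERENCE = ["cuda", "rocm", "vulkan"]

def pick_backend(available):
    """Return the most preferred backend present on this machine."""
    for backend in PREFERENCE:
        if backend in available:
            return backend
    raise RuntimeError("no supported GPU backend found")
```

For example, `pick_backend({"rocm", "vulkan"})` returns `"rocm"` on a 7900 XTX with ROCm installed, while a Windows box with only Vulkan drivers would fall through to `"vulkan"`.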

KingOfCoders|2 years ago

Can I use two at the same time? Two 7900 XTXs would be the price of one 4090, but with much higher performance (260 tok/sec).

sullx|2 years ago

This is coming! Others at OctoML and in the TVM community and I are actively working on multi-GPU support in the compiler and runtime. Here are some of the merged and active PRs on the multi-GPU (multi-device) roadmap:

- Support in TVM’s graph IR (Relax): https://github.com/apache/tvm/pull/15447
- Support in TVM’s loop IR (TensorIR): https://github.com/apache/tvm/pull/14862
- Distributed dialect of TVM’s graph IR for multi-node (GSPMD-type): https://github.com/apache/tvm/pull/15289

The first target will be LLMs on multiple NVIDIA GPUs, but as with all of the MLC-LLM effort, the approach will generalize to other hardware, including AMD's wonderful hardware.
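The core idea behind multi-device execution can be sketched in plain Python, with list slices standing in for devices (an illustration of tensor-parallel sharding in general, not TVM's actual runtime):

```python
# Toy sketch of tensor-parallel execution -- "devices" here are just
# list slices, not TVM/MLC's actual multi-GPU runtime. A weight matrix
# is sharded row-wise; each device computes its slice of the output,
# and a gather concatenates the partial results.

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def shard_rows(W, num_devices):
    """Split W's rows into contiguous chunks, one per device."""
    k, r = divmod(len(W), num_devices)
    shards, start = [], 0
    for d in range(num_devices):
        end = start + k + (1 if d < r else 0)
        shards.append(W[start:end])
        start = end
    return shards

def parallel_matvec(W, x, num_devices=2):
    # Each "device" computes its partial output independently.
    partials = [matvec(shard, x) for shard in shard_rows(W, num_devices)]
    # Gather: concatenate the per-device slices into the full output.
    return [y for part in partials for y in part]

W = [[1, 2], [3, 4], [5, 6]]
x = [1, 1]
assert parallel_matvec(W, x) == matvec(W, x)  # sharded == single-device
```

The real systems also need communication collectives (all-reduce for row-parallel layers, all-gather as here) between devices, which is what the distributed dialect work above addresses.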

tails4e|2 years ago

When you say best performance on Nvidia, do you mean against any other method of running this model on an Nvidia card?

brucethemoose2|2 years ago

I can confirm this: MLC is shockingly fast on my RTX 2060.

The catch is:

- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)

- There is no CPU offloading (or splitting onto an iGPU) like llama.cpp has yet (unless it's new and I missed it).

junrushao1994|2 years ago

Yeah, we tried out popular solutions like exllama and llama.cpp, among others that support inference of 4-bit quantized models.

bravura|2 years ago

Thanks! Just curious: why is there no "team" or "about us" page? It's nice sharing credit, but it is also a little unsettling when blog posts do not name contributors.

Good work, though. And you have an active community on GitHub, congratulations.

junrushao1994|2 years ago

Well, I'm very much into true open source, and my belief is that any contributor is automatically part of the team :)

melony|2 years ago

Does it work with WSL2?

junrushao1994|2 years ago

It really depends on how good ROCm support for WSL2 is. Our team doesn't have a Windows machine, so we couldn't verify it ourselves, but if you get ROCm set up properly on WSL2, MLC LLM should work out of the box.

crowwork|2 years ago

You can also try out the Vulkan backend, which we know should work on Windows, although the speed might be slower than with ROCm.

gsuuon|2 years ago

FWIW, I did get the CUDA backend running via WSL2.