junrushao1994|2 years ago
There are two points I personally wanted to make through this project:
1) With a sufficiently optimized software stack, AMD GPUs can be cost-efficient for LLM serving; 2) ML compilation (MLC) techniques, built on the underlying TVM Unity software stack, are the best fit for performance optimizations that generalize across hardware while quickly delivering time-to-market value.
So far, to the best of our knowledge, MLC LLM delivers the best performance across NVIDIA and AMD GPUs in single-batch inference on quantized models, and batched/distributed inference is on the horizon too.
htirwklj4523432|2 years ago
Can you comment on how difficult it was to achieve this, and what the relative advantages between the cards are? AFAIR, AMD cards were not deemed competitive with Nvidia in the DL space largely because of the amazing job Nvidia pulled off with cuDNN and its convolution kernels.
LLMs etc., OTOH, don't really depend on convolutions (at least the pure transformer bits), and instead depend a lot more on plain old GEMM + low-bit float/int compute.
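To make the point concrete, here is a toy single-head self-attention layer in NumPy (shapes and weights are made up): every heavy operation is a plain matmul, which is why transformer inference leans on GEMM throughput rather than convolution kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d = 8, 16                      # sequence length, model width
x = rng.standard_normal((seq, d))   # token activations

Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))

q, k, v = x @ Wq, x @ Wk, x @ Wv    # three GEMMs
scores = q @ k.T / np.sqrt(d)       # another GEMM

# softmax is the only non-GEMM step, and it's cheap elementwise work
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = (weights @ v) @ Wo            # two more GEMMs
print(out.shape)                    # (8, 16)
```

Swap the convolutions of a CNN for this matmul pipeline and the hardware question becomes mostly "who has the fastest (low-bit) GEMM", not "who has the best conv kernels".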
junrushao1994|2 years ago
Thanks for asking! I personally believe TVM Unity is a proper software stack for ML compilation (MLC), and its existing optimizations (e.g. TensorCore offloading) can be transparently transferred to AMD/Intel/Apple/mobile GPUs without too much engineering effort.
Of course my claim is limited to ML workloads. Not an expert outside the ML world, so I couldn't say for general HPC.
gsuuon|2 years ago
Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)
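For readers unfamiliar with the term: "biased sampling" here means nudging or restricting the next-token distribution so decoding stays on a grammar or template, as guidance does. A minimal NumPy sketch of the general idea (this is not ad-llama's or guidance's actual code; `biased_sample` and the allowed-token set are illustrative assumptions):

```python
import numpy as np

def biased_sample(logits, allowed_ids, rng):
    """Mask out every token not in allowed_ids, then sample."""
    biased = np.full_like(logits, -np.inf)
    biased[allowed_ids] = logits[allowed_ids]   # keep only allowed tokens
    probs = np.exp(biased - biased.max())       # softmax over survivors
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
logits = rng.standard_normal(10)                # fake model output
token = biased_sample(logits, allowed_ids=[2, 5, 7], rng=rng)
assert token in {2, 5, 7}                       # can never go off-grammar
```

Real implementations derive `allowed_ids` from a grammar or schema at each decoding step rather than hard-coding them.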
junrushao1994|2 years ago
PeterStuer|2 years ago
One question: given your experience, when would you predict near parity in software-stack support between the different platforms, so that the choice of GPU becomes mostly one of price/performance? It doesn't need to be like AMD/Intel in the CPU market, where a consumer has no doubts about software compatibility, but say like the gaming GPU market, where a game having problems on one GPU architecture is a newsworthy exception that is quickly corrected.
JonChesterfield|2 years ago
I don't know whether there's an LLM inference benchmark in the CI suite; if not, perhaps something like this should be included in it.
junrushao1994|2 years ago
crowwork|2 years ago
Const-me|2 years ago
junrushao1994|2 years ago
KingOfCoders|2 years ago
sullx|2 years ago
- Support in TVM’s graph IR (Relax): https://github.com/apache/tvm/pull/15447
- Support in TVM’s loop IR (TensorIR): https://github.com/apache/tvm/pull/14862
- Distributed dialect of TVM’s graph IR for multi-node (GSPMD-type): https://github.com/apache/tvm/pull/15289
The first target will be LLMs on multiple NVIDIA GPUs, but as with all of the MLC LLM effort, the approach will generalize to other hardware, including AMD's wonderful hardware.
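A toy way to picture the GSPMD-style sharding those PRs are building toward: split a weight matrix column-wise across "devices", run a local GEMM on each, and gather the partial outputs. Here plain NumPy arrays stand in for GPUs; this is an illustration of the partitioning idea, not TVM's actual distributed API.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # activations, replicated everywhere
W = rng.standard_normal((8, 6))        # weight matrix to shard

shards = np.split(W, 2, axis=1)        # each "GPU" holds half the columns
partials = [x @ w for w in shards]     # local GEMM on each device
y = np.concatenate(partials, axis=1)   # all-gather of the outputs

assert np.allclose(y, x @ W)           # matches the unsharded result
```

The compiler's job in the GSPMD-type dialect is to pick shardings like this automatically and insert the required communication (here, the concatenate) between devices.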
tails4e|2 years ago
brucethemoose2|2 years ago
The catch is:
- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)
- There is no CPU offloading (or splitting onto an iGPU) like Llama.cpp has yet (unless it's new and I missed it).
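On the first point: "somewhat different" quantization generally means a different flavor of the same family of schemes. A minimal sketch of group-wise symmetric int4 quantization, roughly the kind of thing both projects do (the exact formats differ per project; this is neither MLC's nor Llama.cpp's actual code):

```python
import numpy as np

def quantize_group(w, bits=4):
    """Symmetric per-group quantization: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for int4
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q.astype(np.int8), scale

def dequantize_group(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)        # one weight group
q, scale = quantize_group(w)
w_hat = dequantize_group(q, scale)

# round-to-nearest keeps per-element error within half a quantization step
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Small differences in group size, scale encoding, or zero-point handling are exactly what make perplexity comparisons between implementations worth running.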
junrushao1994|2 years ago
bravura|2 years ago
Good work though. And you have an active community on GitHub, congratulations.
junrushao1994|2 years ago
postmeta|2 years ago
might be interesting to team up
melony|2 years ago
junrushao1994|2 years ago
crowwork|2 years ago
gsuuon|2 years ago