Really doesn't surprise me that much. Llama.cpp seems like an OK first pass, but I assume there is plenty of performance left on the table in terms of graph optimizations and optimizing properly for the memory hierarchy.
I like this humility: "AITemplate is co-created by Meta engineers: Bing Xu, Ying Zhang, Hao Lu, Yang Chen, and Terry Chen, with major contributions coming from more talented engineers."
The thing with AMD is that the selection of GPUs that can actually run your GPGPU code is typically much more limited than with Nvidia, so things work on far fewer consumer GPUs. Here it's limited to "CDNA2 (MI-210/250) GPUs". Those are priced comparably to Nvidia's A100, at $10k+ per card.
To give more credit, I know your company was working on your own optimizations, which are prior art. It's possible that you made a 180 GB/s shader on your own (quite slow compared to my 319 GB/s). Or that the 319 GB/s was used, but the self-attention bottleneck was non-negligible.
Starting with a trained PyTorch model, it builds optimized C++ binaries for running inference (not training) on NVIDIA and AMD GPUs. Various optimizations are mentioned, so presumably models run faster than they would via regular PyTorch.
Very much not an expert here but what I understand is that most deep learning frameworks (PyTorch, Tensorflow, etc.) have some overhead associated with them just being on the graphics card. This takes PyTorch code and removes the overhead by translating the network into a "native" language for the card (CUDA for NVIDIA).
It doesn't really help understand what they are, but for completeness CUDA is an acronym for "Compute Unified Device Architecture" while HIP is "Heterogeneous-compute Interface for Portability"
Would like to see updated benchmarks supporting the claimed performance gains in this project. They currently show torch 1.12, which is rather weak compared to the latest 2.0 with torch.compile().
antinucleon | 2 years ago
mhh__ | 2 years ago
brucethemoose2 | 2 years ago
Is 8-bit/4-bit support in the works? Will it work with bitsandbytes out of the box? Speedy inference is great, but in practice many users are running the biggest ~4-bit LLM that will fit into their RAM/VRAM pool these days. This is why llama.cpp is so good: it's (AFAIK) the only implementation that will split a 4-bit quantized model so easily.
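For anyone curious what such a 4-bit format looks like conceptually, here is a minimal pure-Python sketch of blockwise quantization in the spirit of llama.cpp's Q4 formats (one scale per block of weights). Function names and details are illustrative, not llama.cpp's actual on-disk layout:

```python
# Toy blockwise 4-bit quantization: one float scale per block, weights stored
# as signed 4-bit integers in [-8, 7]. Illustrative only.

def quantize_q4(weights, block_size=32):
    """Quantize a flat list of floats; returns a list of (scale, ints) blocks."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / 7.0 or 1.0  # avoid div-by-zero
        q = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, q))
    return blocks

def dequantize_q4(blocks):
    """Reverse the mapping: scale each 4-bit int back to a float."""
    return [q * scale for scale, qs in blocks for q in qs]
```

The per-block scale is what makes this usable at all: a single global scale would destroy small weights, while 32-element blocks keep the quantization error local.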
ralfd | 2 years ago
yeison | 2 years ago
sroussey | 2 years ago
thewataccount | 2 years ago
huevosabio | 2 years ago
yeison | 2 years ago
brucethemoose2 | 2 years ago
- Apache TVM (mlc-llm is a good demo)
- Hidet (a torch.compile backend)
- Alibaba BladeDISC
- Nvidia TensorRT (a classic, but much less of a nightmare to install now)
- Torch MLIR (SHARK has some demos/implementations)
jahewson | 2 years ago
brucethemoose2 | 2 years ago
Pytorch Eager Mode with some optimizations: ~6it/s
Pytorch Inductor (torch.compile with dynamic=True): ~7it/s
AITemplate: ~9it/s
All of them support changing settings and such, albeit with some work-in-progress bugs/caveats.
That is 512x512 on a 2060, so I would expect the gains to be bigger on newer GPUs with more overhead to take advantage of.
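Working those it/s figures out explicitly:

```python
# Relative speedups from the iterations-per-second numbers above.

def speedup(new_its, base_its):
    return new_its / base_its

eager, inductor, ait = 6.0, 7.0, 9.0
print(f"Inductor vs eager: {speedup(inductor, eager):.2f}x")   # ~1.17x
print(f"AITemplate vs eager: {speedup(ait, eager):.2f}x")      # 1.50x
```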
maxilevi | 2 years ago
sosodev | 2 years ago
homarp | 2 years ago
HIP: AMD's GPU 'framework' (their CUDA-like API).
This takes neural networks defined in Python and converts them to C++ code calling CUDA / HIP for maximum inference speed.
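As a toy illustration of that codegen idea (entirely made-up names; AITemplate's real generated C++ is far more elaborate), one can think of it as walking an op graph and emitting backend-specific launch calls:

```python
# Toy sketch: "lower" a tiny op graph to a C++-ish call sequence for CUDA or
# HIP. The kernel/launch names are placeholders for illustration only.

def emit(graph, backend="cuda"):
    launch = {"cuda": "cudaLaunchKernel", "hip": "hipLaunchKernel"}[backend]
    lines = [f"// generated for {backend}"]
    for op in graph:
        lines.append(f"{launch}({op}_kernel, grid, block, args);")
    return "\n".join(lines)
```

Because CUDA and HIP are deliberately near-identical APIs, targeting both from one codegen path is mostly a matter of swapping prefixes, which is roughly what HIP was designed to enable.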
hintymad | 2 years ago
throwaway14356 | 2 years ago
Havoc | 2 years ago
rerx | 2 years ago
born-jre | 2 years ago
I wonder how well tinygrad's approach will work out. Op fusion sounds easy: just walk a graph, pattern-match it, and lower to hardware-provided ops?
Anyway, if anyone wants to understand the philosophy behind tinygrad, this file is a great start: https://github.com/geohot/tinygrad/blob/master/docs/abstract...
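The "walk a graph, pattern match, lower" idea can be sketched in a few lines of Python. This is a toy IR with made-up op names, nothing like tinygrad's real abstractions:

```python
# Toy op fusion: scan a linear op sequence and collapse runs of adjacent
# elementwise ops into one fused op that a backend could codegen as a single
# kernel. Op names are illustrative.

ELEMENTWISE = {"mul", "add", "relu"}

def fuse(ops):
    """ops: list of (op_name, ...) tuples in execution order."""
    fused, run = [], []
    for op in ops:
        if op[0] in ELEMENTWISE:
            run.append(op)          # extend the current fusible run
        else:
            if run:                 # flush the run before a non-fusible op
                fused.append(("fused_elementwise", run))
                run = []
            fused.append(op)
    if run:
        fused.append(("fused_elementwise", run))
    return fused
```

The hard part in practice isn't this walk; it's deciding which fusions are legal and profitable (shapes, reduction boundaries, register pressure), which is where the simple pattern-match story gets complicated.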
philipturner | 2 years ago
LLaMA is a memory-bound AI model, where the dominant factor in execution time is how fast the processor transfers weights from RAM to registers. LLaMA.cpp uses a misaligned memory pattern that's painful to the RAM I/O interface and requires many CPU instructions to re-align. It also has access to 1/2 the bandwidth of the GPU on Apple's highest-end chips, making GPU *theoretically* 2x faster without algorithm changes.
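The 2x claim follows from simple roofline arithmetic: if the weights must be streamed from memory once per generated token, throughput is capped at bandwidth divided by model size. A quick sketch (illustrative numbers, not measurements):

```python
# Roofline-style ceiling for a memory-bound LLM: tokens/s <= bandwidth / bytes
# of weights streamed per token. Numbers below are for illustration only.

def tokens_per_second(bandwidth_gb_s, n_params_billion, bits_per_weight):
    model_gb = n_params_billion * bits_per_weight / 8  # GB streamed per token
    return bandwidth_gb_s / model_gb

# e.g. a 7B model at 4 bits is 3.5 GB of weights; at 180 GB/s the ceiling is
# ~51 tok/s, and doubling effective bandwidth roughly doubles that ceiling.
```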
Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here:
https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen...
I later explained the technique to write very optimized kernels for a specific quantization format. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision means of measuring bandwidth utilization, which is critical for designing such GPU kernels.
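The basic shape of such a bandwidth measurement is: move a large buffer many times, keep the best time, and divide bytes by seconds. A crude host-side Python sketch of the math only; real GPU measurements need device timers, the actual kernels, and far more care:

```python
# Crude effective-bandwidth measurement: time a large buffer copy and count
# both the read and write traffic. Host-side illustration only.

import time

def measure_bandwidth_gb_s(n_bytes=64 * 1024 * 1024, iters=5):
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        dst = bytes(src)                 # one read + one write of n_bytes
        best = min(best, time.perf_counter() - t0)
    assert len(dst) == n_bytes           # keep the copy from being elided
    return (2 * n_bytes) / best / 1e9    # GB/s, counting read + write
```

Taking the best of several iterations (rather than the mean) is the usual trick for filtering out scheduling noise when you want the hardware's ceiling rather than a typical run.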
philipturner | 2 years ago
However, for whatever reason, when Georgi Gerganov started work on the Metal backend, you made a product announcement almost a day later. That seems like more than a coincidence; there must be some logical explanation.
samstave | 2 years ago
I am losing my bibliography, etymology and vocabulary with every single AI advancement article.
Where can I learn AI vocab, please?
-
I need an FN AI teacher to just give me daily updates on AI verbiage, models, etc...
Hey AI - if you're so smart, build a podcast that teaches me about yourself and how to be a better meat parent who made you.
skirmish | 2 years ago
iaw | 2 years ago
What I'm not sure about is what "HIP" is in this context.
The way I'm reading this, it's the difference between running code in an interpreter vs. on the bare metal (for the GPU).
femto113 | 2 years ago
bagels | 2 years ago
cypress66 | 2 years ago
pavelstoev | 2 years ago
iaw | 2 years ago
bguberfain | 2 years ago