top | item 36167948

Render a neural network into CUDA/HIP code

177 points| fzliu | 2 years ago |github.com | reply

65 comments

[+] antinucleon|2 years ago|reply
AITemplate's original designer is here. We quit Meta in January and started HippoML (https://hippoml.com/). We just disclosed our new engine's performance on LLMs: https://blog.hippoml.com/large-language-model-inference-from... On Apple M2 Max, our new engine's encode/decode is 13.8x/2.4x faster than llama.cpp.
[+] mhh__|2 years ago|reply
Really doesn't surprise me that much. llama.cpp seems like an OK first pass, but I assume there is loads of time left on the table in terms of graph optimizations and optimizing properly for the memory hierarchy.
[+] brucethemoose2|2 years ago|reply
Very interesting.

Is 8-bit/4-bit support in the works? Will it work with bitsandbytes out of the box? Speedy inference is great, but in practice many users are running the biggest ~4-bit LLM that will fit into their RAM/VRAM pool these days. This is why llama.cpp is so good: it's (AFAIK) the only implementation that will split a 4-bit quantized model so easily.
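For context on why 4-bit matters so much for fitting models: weight-only int4 quantization packs two 4-bit values per byte, halving the footprint of int8. A minimal sketch of the packing in plain Python (real schemes like those in bitsandbytes and llama.cpp also store per-group scales and zero points, which this omits):

```python
def pack_int4(values):
    """Pack unsigned 4-bit ints (0..15) into bytes, two per byte."""
    assert len(values) % 2 == 0
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((hi << 4) | lo)  # high nibble holds the second value
    return bytes(out)

def unpack_int4(packed):
    """Inverse of pack_int4."""
    vals = []
    for b in packed:
        vals.append(b & 0x0F)
        vals.append(b >> 4)
    return vals

weights = [7, 0, 15, 3, 1, 9]
packed = pack_int4(weights)
assert len(packed) == len(weights) // 2
assert unpack_int4(packed) == weights
```

At 0.5 bytes/weight, a 7B-parameter model needs roughly 3.5 GB for weights (plus scale metadata), which is what lets the largest models squeeze into a consumer RAM/VRAM pool.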

[+] ralfd|2 years ago|reply
What is your planned business model here?
[+] yeison|2 years ago|reply
Will this have some similarities to what Mojo is trying to solve?
[+] sroussey|2 years ago|reply
Would it work with instructor-xl or similar which is designed for embeddings and retrieval? On device for privacy is key.
[+] thewataccount|2 years ago|reply
Do you know how its speed compares to exllama, specifically with an Nvidia GPU, by chance?
[+] huevosabio|2 years ago|reply
Any idea how hippo, AI Template and TVM compare in performance?
[+] yeison|2 years ago|reply
Did Facebook invest in this? Is that why it's under facebookincubator?
[+] brucethemoose2|2 years ago|reply
Also here are some other interesting projects in the ML compilation space:

- Apache TVM (mlc-llm is a good demo)

- Hidet (a torch.compile backend)

- Alibaba BladeDISC

- Nvidia TensorRT (a classic, but much less of a nightmare to install now)

- Torch MLIR (SHARK has some demos/implementations)

[+] brucethemoose2|2 years ago|reply
I just ran a 512x512 Stable Diffusion benchmark with this yesterday

Pytorch Eager Mode with some optimizations: ~6it/s

Pytorch Inductor (torch.compile with dynamic=True): ~7it/s

AITemplate: ~9it/s

All of them support changing settings and such, albeit with some work in progress bugs/caveats.

That is 512x512 on a 2060, so I would expect the gains to be bigger on newer GPUs with more overhead to take advantage of.
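For reference, "it/s" figures like these are typically measured by timing a fixed number of denoising iterations after a warm-up pass, since the first call pays one-time compilation and caching costs. A minimal sketch of that methodology, with `step()` as a hypothetical stand-in for one U-Net iteration:

```python
import time

def step():
    # hypothetical stand-in for one denoising iteration
    sum(i * i for i in range(10_000))

def iters_per_second(n_iters=50, warmup=5):
    for _ in range(warmup):          # warm-up: compile/cache effects
        step()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        step()
    return n_iters / (time.perf_counter() - t0)

print(f"{iters_per_second():.1f} it/s")
```

On a real GPU you would also synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the clock, since kernel launches are asynchronous.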

[+] maxilevi|2 years ago|reply
Did you try TensorRT?
[+] sosodev|2 years ago|reply
The latency improvements are impressive but the ability to run models beyond their typical memory limitations is way cooler.
[+] homarp|2 years ago|reply
CUDA: NVIDIA GPU 'framework'

HIP: AMD GPU 'framework'

This takes neural networks defined in Python and converts them to C++ code calling CUDA/HIP for maximum inference speed.
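"Render into CUDA code" here means source-level code generation: walk the network graph and emit specialized C++/CUDA text, then compile it into a binary. A toy illustration of the idea (not AITemplate's actual templates) that emits a bias-add kernel with the bias length baked in as a constant:

```python
KERNEL_TEMPLATE = """\
__global__ void {name}(const float* x, const float* b, float* y, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + b[i % {bias_len}];
}}
"""

def render_bias_add(name: str, bias_len: int) -> str:
    """Emit specialized CUDA source for a fixed bias length."""
    return KERNEL_TEMPLATE.format(name=name, bias_len=bias_len)

src = render_bias_add("bias_add_768", 768)
print(src)
```

Baking shapes in as compile-time constants is a big part of the win over an interpreter: the C++ compiler can then unroll and vectorize for that exact size instead of branching on runtime shape information.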

[+] hintymad|2 years ago|reply
I like this humility: "AITemplate is co-created by Meta engineers: Bing Xu, Ying Zhang, Hao Lu, Yang Chen, and Terry Chen, with major contributions coming from more talented engineers."
[+] Havoc|2 years ago|reply
Interesting that AMD GPUs seem to be 1st class citizens here. Consumer class gear is much cheaper per unit of VRAM by the looks of it
[+] rerx|2 years ago|reply
The thing with AMD is that the selection of GPUs that can actually run your GPGPU code is typically much more limited than with Nvidia, so things work on far fewer consumer GPUs. Here it's limited to "CDNA2 (MI-210/250) GPUs". Those are priced comparably to Nvidia's A100, at $10k+ per card.
[+] born-jre|2 years ago|reply
At first glance I thought maybe it's like tinygrad, but it looks like it has many more ops than tinygrad, though most map to underlying hardware-provided ops?

I wonder how well tinygrad's approach will work out. Op fusion sounds easy: just walk a graph, pattern match it, and lower to hardware-provided ops?

Anyway if anyone wants to understand the philosophy behind tinygrad, this file is great start https://github.com/geohot/tinygrad/blob/master/docs/abstract...
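The "walk a graph, pattern match, lower" idea really can be sketched in a few lines. Here ops are nested tuples, and a matched multiply-followed-by-add is rewritten into a single fused node (the classic FMA fusion); real compilers do the same with many more patterns and a proper IR:

```python
def fuse(node):
    """Recursively rewrite ('add', ('mul', a, b), c) -> ('fma', a, b, c)."""
    if not isinstance(node, tuple):
        return node
    node = tuple(fuse(child) for child in node)  # lower children first
    if node[0] == "add" and isinstance(node[1], tuple) and node[1][0] == "mul":
        _, (_, a, b), c = node
        return ("fma", a, b, c)
    return node

graph = ("add", ("mul", "x", "w"), "bias")
assert fuse(graph) == ("fma", "x", "w", "bias")
```

The hard part in practice isn't the rewrite itself but deciding when fusion pays off on a given memory hierarchy, which is where the framework-level tuning comes in.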

[+] philipturner|2 years ago|reply
antinucleon lol

LLaMA is a memory-bound AI model, where the dominant factor in execution time is how fast the processor transfers weights from RAM to registers. LLaMA.cpp uses a misaligned memory pattern that's painful to the RAM I/O interface and requires many CPU instructions to re-align. It also has access to 1/2 the bandwidth of the GPU on Apple's highest-end chips, making GPU *theoretically* 2x faster without algorithm changes.
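The memory-bound claim is easy to check with back-of-envelope arithmetic: if decoding one token requires streaming all the weights once, the upper bound on decode speed is bandwidth divided by model size. A sketch for a hypothetical 7B-parameter model at 4 bits per weight, assuming roughly 400 GB/s of M2 Max unified memory bandwidth:

```python
params = 7e9
bytes_per_weight = 0.5                    # 4-bit quantization
model_bytes = params * bytes_per_weight   # 3.5e9 bytes of weights
bandwidth = 400e9                         # M2 Max unified memory, bytes/s (approx.)

# Upper bound: every generated token streams all weights once.
tokens_per_second = bandwidth / model_bytes
print(f"decode upper bound: {tokens_per_second:.0f} tok/s")
```

Any misaligned access pattern that wastes a fraction of that bandwidth shows up directly as a proportional drop in tokens per second, which is why the alignment point above matters.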

Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here:

https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen...

I later explained the technique to write very optimized kernels for a specific quantization format. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision means of measuring bandwidth utilization, which is critical for designing such GPU kernels.
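Measuring achieved bandwidth is conceptually just bytes moved divided by wall time. A CPU-side sketch of the idea (a real GPU measurement, as in the linked PR, would time a Metal or CUDA kernel with device-side timestamps instead):

```python
import time

n = 256 * 1024 * 1024                  # 256 MiB source buffer
src = bytearray(n)

t0 = time.perf_counter()
dst = bytes(src)                       # one full read + write pass
elapsed = time.perf_counter() - t0

gb_per_s = (2 * n) / elapsed / 1e9     # count both read and write traffic
print(f"~{gb_per_s:.1f} GB/s memcpy bandwidth")
```

Comparing a kernel's measured GB/s against the device's theoretical peak tells you how close to the memory-bound ceiling it runs.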

[+] philipturner|2 years ago|reply
To give more credit, I know your company was working on your own optimizations, which are prior art. It's possible that you made a 180 GB/s shader on your own (quite slow compared to my 319 GB/s). Or that the 319 GB/s was used, but the self-attention bottleneck was non-negligible.

However, for whatever reason, when Georgi Gerganov started work on the Metal backend, you made a product announcement almost a day later. That seems like a non-coincidence, and there must be some logical explanation.

[+] samstave|2 years ago|reply
ELI5 what this means?

I am losing my bibliography, etymology and vocabulary with every single AI advancement article.

Where learn AI vocab, please?

-

I need an FN AI teacher to just give me daily updates on AI and verbiage, models, etc...

Hey AI - if you're so smart, build a podcast that teaches me about yourself and how to be a better meat parent who made you.

[+] skirmish|2 years ago|reply
Starting with a trained PyTorch model, it builds optimized C++ binaries for running inference (not training) on Nvidia and AMD GPUs. Various optimizations are mentioned, so presumably models run faster than via regular PyTorch.
[+] iaw|2 years ago|reply
Very much not an expert here, but what I understand is that most deep learning frameworks (PyTorch, Tensorflow, etc.) have some overhead beyond the raw computation on the graphics card. This takes PyTorch code and removes that overhead by translating the network into a "native" language for the card (CUDA for NVIDIA).

What I'm not sure is what "HIP" is in this context.

The way I'm reading this is it's the difference between running code in an interpreter vs. on the bare metal (for the GPU)
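The interpreter-vs-bare-metal analogy is apt: eager mode dispatches each op separately (one kernel launch and one intermediate buffer per op), while compiled code fuses them into a single pass. A pure-Python sketch of the difference for `y = relu(x*w + b)`:

```python
def eager(xs, w, b):
    # three separate passes with two intermediate lists,
    # analogous to three kernel launches in eager mode
    t1 = [x * w for x in xs]
    t2 = [t + b for t in t1]
    return [max(t, 0.0) for t in t2]

def fused(xs, w, b):
    # one pass, no intermediates, analogous to a single generated kernel
    return [max(x * w + b, 0.0) for x in xs]

xs = [-1.0, 0.5, 2.0]
assert eager(xs, 2.0, 1.0) == fused(xs, 2.0, 1.0)
```

The results are identical; the fused version just avoids the per-op dispatch and the round trips of intermediates through memory, which is where much of the framework overhead lives.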

[+] femto113|2 years ago|reply
It doesn't really help understand what they are, but for completeness CUDA is an acronym for "Compute Unified Device Architecture" while HIP is "Heterogeneous-compute Interface for Portability"
[+] bagels|2 years ago|reply
What is an FN AI?
[+] cypress66|2 years ago|reply
I don't see any comparisons with torch.compile. Kind of unfair to compare it to eager mode.
[+] pavelstoev|2 years ago|reply
Would like to see updated benchmarks supporting the claimed performance gains in this project. It currently shows torch 1.12, which is rather weak compared to the latest 2.0 and torch.compile().
[+] iaw|2 years ago|reply
Anyone seen details on whether this can handle splitting a model across GPUs?