Really doesn't surprise me that much. Llama.cpp seems like an OK first pass, but I assume there is plenty of performance left on the table in terms of graph optimizations and optimizing properly for the memory hierarchy.
I like this humility: "AITemplate is co-created by Meta engineers: Bing Xu, Ying Zhang, Hao Lu, Yang Chen, and Terry Chen, with major contributions coming from more talented engineers."
The thing with AMD is that the selection of GPUs that can actually run your GPGPU code is typically much more limited than with Nvidia, so things work on far fewer consumer GPUs. Here it's limited to "CDNA2 (MI-210/250) GPUs". Those are priced comparably to Nvidia's A100, at $10k+ per card.
To give more credit, I know your company was working on your own optimizations, which are prior art. It's possible that you made a 180 GB/s shader on your own (quite slow compared to my 319 GB/s). Or that the 319 GB/s was used, but the self-attention bottleneck was non-negligible.
Starting with a trained PyTorch model, it builds optimized C++ binaries for running inference (not training) on NVIDIA and AMD GPUs. Various optimizations are mentioned, so presumably models run faster than they would via regular PyTorch.
Very much not an expert here but what I understand is that most deep learning frameworks (PyTorch, Tensorflow, etc.) have some overhead associated with them just being on the graphics card. This takes PyTorch code and removes the overhead by translating the network into a "native" language for the card (CUDA for NVIDIA).
It doesn't really help understand what they are, but for completeness CUDA is an acronym for "Compute Unified Device Architecture" while HIP is "Heterogeneous-compute Interface for Portability"
Would like to see updated benchmarks supporting the claimed performance gains in this project. They currently show torch 1.12, which is rather weak compared to the latest 2.0 with torch.compile().
antinucleon | 2 years ago
mhh__ | 2 years ago
brucethemoose2 | 2 years ago
Is 8-bit/4-bit support in the works? Will it work with bitsandbytes out of the box? Speedy inference is great, but in practice many users are running the biggest ~4-bit LLM that will fit into their RAM/VRAM pool these days. This is why llama.cpp is so good: it's (AFAIK) the only implementation that will split a 4-bit quantized model so easily.
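For anyone curious what such a 4-bit format looks like conceptually, here is a minimal pure-Python sketch of blockwise quantization in the spirit of llama.cpp's Q4 formats (one scale per block of weights). Function names and details are illustrative, not llama.cpp's actual on-disk layout:

```python
# Toy blockwise 4-bit quantization: one float scale per block, weights stored
# as signed 4-bit integers in [-8, 7]. Illustrative only.

def quantize_q4(weights, block_size=32):
    """Quantize a flat list of floats; returns a list of (scale, ints) blocks."""
    blocks = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / 7.0 or 1.0  # avoid div-by-zero
        q = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, q))
    return blocks

def dequantize_q4(blocks):
    """Reverse the mapping: scale each 4-bit int back to a float."""
    return [q * scale for scale, qs in blocks for q in qs]
```

The per-block scale is what makes this usable at all: a single global scale would destroy small weights, while 32-element blocks keep the quantization error local.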
ralfd | 2 years ago
yeison | 2 years ago
sroussey | 2 years ago
thewataccount | 2 years ago
huevosabio | 2 years ago
yeison | 2 years ago
brucethemoose2 | 2 years ago
- Apache TVM (mlc-llm is a good demo)
- Hidet (a torch.compile backend)
- Alibaba BladeDISC
- Nvidia TensorRT (a classic, but much less of a nightmare to install now)
- Torch MLIR (SHARK has some demos/implementations)
jahewson | 2 years ago
brucethemoose2 | 2 years ago
Pytorch Eager Mode with some optimizations: ~6it/s
Pytorch Inductor (torch.compile with dynamic=True): ~7it/s
AITemplate: ~9it/s
All of them support changing settings and such, albeit with some work-in-progress bugs/caveats.
That is 512x512 on a 2060, so I would expect the gains to be bigger on newer GPUs with more overhead to take advantage of.
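Working those it/s figures out explicitly:

```python
# Relative speedups from the iterations-per-second numbers above.

def speedup(new_its, base_its):
    return new_its / base_its

eager, inductor, ait = 6.0, 7.0, 9.0
print(f"Inductor vs eager: {speedup(inductor, eager):.2f}x")   # ~1.17x
print(f"AITemplate vs eager: {speedup(ait, eager):.2f}x")      # 1.50x
```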
maxilevi | 2 years ago
sosodev | 2 years ago
homarp | 2 years ago
HIP: AMD's GPU 'framework' (their CUDA-like API).
This takes neural networks defined in Python and converts them to C++ code calling CUDA / HIP for maximum inference speed.
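As a toy illustration of that codegen idea (entirely made-up names; AITemplate's real generated C++ is far more elaborate), one can think of it as walking an op graph and emitting backend-specific launch calls:

```python
# Toy sketch: "lower" a tiny op graph to a C++-ish call sequence for CUDA or
# HIP. The kernel/launch names are placeholders for illustration only.

def emit(graph, backend="cuda"):
    launch = {"cuda": "cudaLaunchKernel", "hip": "hipLaunchKernel"}[backend]
    lines = [f"// generated for {backend}"]
    for op in graph:
        lines.append(f"{launch}({op}_kernel, grid, block, args);")
    return "\n".join(lines)
```

Because CUDA and HIP are deliberately near-identical APIs, targeting both from one codegen path is mostly a matter of swapping prefixes, which is roughly what HIP was designed to enable.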
hintymad | 2 years ago
throwaway14356 | 2 years ago
Havoc | 2 years ago
rerx | 2 years ago
born-jre | 2 years ago
I wonder how well tinygrad's approach will work out. Op fusion sounds easy: just walk a graph, pattern-match it, and lower to hardware-provided ops?
Anyway, if anyone wants to understand the philosophy behind tinygrad, this file is a great start: https://github.com/geohot/tinygrad/blob/master/docs/abstract...
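The "walk a graph, pattern match, lower" idea can be sketched in a few lines of Python. This is a toy IR with made-up op names, nothing like tinygrad's real abstractions:

```python
# Toy op fusion: scan a linear op sequence and collapse runs of adjacent
# elementwise ops into one fused op that a backend could codegen as a single
# kernel. Op names are illustrative.

ELEMENTWISE = {"mul", "add", "relu"}

def fuse(ops):
    """ops: list of (op_name, ...) tuples in execution order."""
    fused, run = [], []
    for op in ops:
        if op[0] in ELEMENTWISE:
            run.append(op)          # extend the current fusible run
        else:
            if run:                 # flush the run before a non-fusible op
                fused.append(("fused_elementwise", run))
                run = []
            fused.append(op)
    if run:
        fused.append(("fused_elementwise", run))
    return fused
```

The hard part in practice isn't this walk; it's deciding which fusions are legal and profitable (shapes, reduction boundaries, register pressure), which is where the simple pattern-match story gets complicated.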
philipturner | 2 years ago
LLaMA is a memory-bound AI model, where the dominant factor in execution time is how fast the processor transfers weights from RAM to registers. LLaMA.cpp uses a misaligned memory pattern that's painful to the RAM I/O interface and requires many CPU instructions to re-align. It also has access to 1/2 the bandwidth of the GPU on Apple's highest-end chips, making GPU *theoretically* 2x faster without algorithm changes.
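The 2x claim follows from simple roofline arithmetic: if the weights must be streamed from memory once per generated token, throughput is capped at bandwidth divided by model size. A quick sketch (illustrative numbers, not measurements):

```python
# Roofline-style ceiling for a memory-bound LLM: tokens/s <= bandwidth / bytes
# of weights streamed per token. Numbers below are for illustration only.

def tokens_per_second(bandwidth_gb_s, n_params_billion, bits_per_weight):
    model_gb = n_params_billion * bits_per_weight / 8  # GB streamed per token
    return bandwidth_gb_s / model_gb

# e.g. a 7B model at 4 bits is 3.5 GB of weights; at 180 GB/s the ceiling is
# ~51 tok/s, and doubling effective bandwidth roughly doubles that ceiling.
```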
Funny how you announced it the exact day after I open-sourced a high-bandwidth 4-bit GEMV kernel for LLaMA. If anyone wants to see how to achieve 180+ GB/s on the M1/M2 Max, you can reproduce the methodology here:
https://github.com/ggerganov/llama.cpp/pull/1642#issuecommen...
I later explained the technique to write very optimized kernels for a specific quantization format. *It's very unlikely my code was copied verbatim*, but it was probably used as inspiration and/or a reference. I also disclosed a high-precision means of measuring bandwidth utilization, which is critical for designing such GPU kernels.
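The basic shape of such a bandwidth measurement is: move a large buffer many times, keep the best time, and divide bytes by seconds. A crude host-side Python sketch of the math only; real GPU measurements need device timers, the actual kernels, and far more care:

```python
# Crude effective-bandwidth measurement: time a large buffer copy and count
# both the read and write traffic. Host-side illustration only.

import time

def measure_bandwidth_gb_s(n_bytes=64 * 1024 * 1024, iters=5):
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        dst = bytes(src)                 # one read + one write of n_bytes
        best = min(best, time.perf_counter() - t0)
    assert len(dst) == n_bytes           # keep the copy from being elided
    return (2 * n_bytes) / best / 1e9    # GB/s, counting read + write
```

Taking the best of several iterations (rather than the mean) is the usual trick for filtering out scheduling noise when you want the hardware's ceiling rather than a typical run.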
philipturner | 2 years ago
However, for whatever reason, when Georgi Gerganov started work on the Metal backend, you made a product announcement almost a day later. That seems like more than a coincidence; there must be some logical explanation.
samstave | 2 years ago
I am losing my bibliography, etymology and vocabulary with every single AI advancement article.
Where can I learn AI vocab, please?
-
I need an FN AI teacher to just give me daily updates on AI verbiage, models, etc...
Hey AI - if you're so smart, build a podcast that teaches me about yourself and how to be a better meat parent who made you.
skirmish | 2 years ago
iaw | 2 years ago
What I'm not sure about is what "HIP" is in this context.
The way I'm reading this, it's the difference between running code in an interpreter vs. on the bare metal (for the GPU).
femto113 | 2 years ago
bagels | 2 years ago
cypress66 | 2 years ago
pavelstoev | 2 years ago
iaw | 2 years ago
bguberfain | 2 years ago