Easier said than done. Even with Google-level resources, TPU support for PyTorch is patchy (https://arxiv.org/abs/2309.07181). The device abstraction is not great; it assumes CUDA in unexpected places.
The Groq AI chip startup has solved this problem. They don't use hand-written kernels at all; instead they use a compiler, and they have the top speed in the world on LLaMA2-70B: 240 tokens/s.
Other interesting Groq tidbits: their models are deterministic. The whole system, up to thousands of chips, runs in sync on the same clock, and memory access and the network are directly controlled, without caches or intermediaries, so they also run deterministically.
That speeds up communication and allows automatic synchronisation across thousands of chips running as one single large chip. The compiler does all the orchestration and optimisation, and they can predict the exact performance of an architecture at compile time.
What makes Groq different is that they started from the compiler, and only later designed the hardware.
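To make the compile-time-performance claim concrete, here's a toy sketch (my own illustration, not Groq's actual toolchain): if every op has a known, fixed latency and there are no caches or dynamic arbitration, a compiler can assign each op an exact start cycle from the dependency graph and predict total runtime before anything executes. All op kinds and latencies below are made up.

```python
# Assumed per-op latencies in cycles (hypothetical numbers).
OP_LATENCY = {"load": 4, "matmul": 16, "store": 4}

def schedule(ops):
    """Statically schedule ops given as (name, kind, deps) triples.

    Returns a {name: start_cycle} map and the total makespan in cycles.
    Because nothing is dynamic, this "compile-time" answer is exact.
    """
    start, finish = {}, {}
    for name, kind, deps in ops:
        # An op can start as soon as all of its dependencies have finished.
        begin = max((finish[d] for d in deps), default=0)
        start[name] = begin
        finish[name] = begin + OP_LATENCY[kind]
    return start, max(finish.values())

program = [
    ("a", "load", []),          # two loads with no deps run in parallel
    ("b", "load", []),
    ("c", "matmul", ["a", "b"]),
    ("d", "store", ["c"]),
]
starts, total = schedule(program)
# total is known before execution: 4 (parallel loads) + 16 + 4 = 24 cycles
```

The same property is what lets a fully static design extend across chips: network transfers become just another op with a fixed latency in the schedule.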
What is the pass rate on torchbench? That gives a more realistic measure of how good a vendor's PyTorch support is.
All the big chip startups have their own pytorch compiler that works on the examples they write themselves. From what I've seen of Groq it doesn't appear to be any different.
The problem is that pytorch is incredibly permissive in what it lets users do. torch.compile is itself very new and far from optimal.
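A toy tracer (my own illustration, not torch.compile's actual machinery) shows why that permissiveness hurts: recording tensor ops via operator overloading captures straight-line code fine, but data-dependent control flow forces a bail-out, because the trace only ever sees one branch.

```python
GRAPH = []  # ops recorded while tracing

class Proxy:
    """A symbolic stand-in for a tensor; it records ops instead of computing."""
    def __init__(self, name):
        self.name = name
    def __add__(self, other):
        GRAPH.append(("add", self.name, other.name))
        return Proxy(f"({self.name}+{other.name})")
    def __bool__(self):
        # A branch condition depends on a runtime value the tracer doesn't
        # have; a real compiler must fall back to eager execution here.
        raise RuntimeError("graph break: data-dependent control flow")

def straight_line(x, y):
    return x + y            # traceable: one "add" node lands in GRAPH

def data_dependent(x, y):
    if x + y:               # forces __bool__ on a symbolic value
        return x
    return y

straight_line(Proxy("x"), Proxy("y"))
try:
    data_dependent(Proxy("x"), Proxy("y"))
except RuntimeError as e:
    broke = str(e)          # the tracer had to give up
```

Users write the second kind of code in PyTorch all the time, which is why every vendor's "works on our examples" compiler looks good until it meets real models.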
PyTorch/XLA is such a pain to use. And once you go TPU, you need the same energy to switch back, so you can't quickly test out how it performs on your problem.
One of the big reasons custom hardware solutions struggle.
IMO, you'd have better luck as a hardware vendor implementing an LLM toolchain and bypassing a general-purpose DL framework. At the very least you should be able to post impressive results with this approach, rather than with a half-baked PyTorch port.
Even after prioritising TensorFlow, Keras, JAX, etc., they can still afford a very large team working on torch_xla, and can still hedge their bets with a separate team on torch_mlir.
https://www.youtube.com/@GroqInc/videos