
LukeB42 | 2 years ago

>you'd need a pretty custom toolchain

Just port/extend pytorch to make it easy to utilize fused ops.
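For context, "fused ops" here means combining several elementwise kernels into one pass over the data, avoiding intermediate buffers and extra kernel launches. A minimal pure-Python illustration (not PyTorch code) of what a fusing compiler does:

```python
# Illustrative only: what "op fusion" buys you, in plain Python.
# Unfused: three separate passes over the data (three kernel launches
# and three round-trips through memory on a real accelerator).
def unfused(xs, w, b):
    t1 = [x * w for x in xs]          # mul kernel
    t2 = [t + b for t in t1]          # add kernel
    return [max(t, 0.0) for t in t2]  # relu kernel

# Fused: one pass computes relu(x*w + b) directly, with no
# intermediate buffers -- the single kernel a fusing compiler emits.
def fused(xs, w, b):
    return [max(x * w + b, 0.0) for x in xs]

print(fused([1.0, -2.0, 3.0], 2.0, 1.0))  # [3.0, 0.0, 7.0]
```

In PyTorch itself, `torch.compile` (via the TorchInductor backend) performs this kind of elementwise fusion automatically.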

elwypea | 2 years ago

Easier said than done. Even with Google-level resources, TPU support for PyTorch is patchy (https://arxiv.org/abs/2309.07181). The device abstraction is not great and assumes CUDA in unexpected places.

visarga | 2 years ago

The AI chip startup Groq has solved this problem. They don't use hand-written kernels at all; instead they use a compiler, and they have the fastest speed in the world on LLaMA2-70B: 240 tokens/s.

https://www.youtube.com/@GroqInc/videos

Other interesting Groq tidbits: their models are deterministic. The whole system, up to thousands of chips, runs in sync on the same clock, and memory access and networking are directly controlled without caches or other intermediaries, so they also run deterministically.

That speeds up communication and allows automatic synchronisation across thousands of chips running as one single large chip. The compiler does all the orchestration and optimisation, and they can predict the exact performance of an architecture at compile time.

What makes Groq different is that they started from the compiler, and only later designed the hardware.

omneity | 2 years ago

PyTorch XLA is such a pain to use. And once you go TPU, you need the same energy to switch back, so you can't quickly test out how it performs on your problem.
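For readers who haven't tried it, part of that switching cost is the extra device and stepping boilerplate that torch_xla requires and plain CUDA PyTorch doesn't. A hedged sketch (the real training calls are shown in comments; the snippet falls back to a plain string when torch_xla isn't installed so it runs anywhere):

```python
# Sketch of the torch_xla device dance; falls back gracefully when
# torch_xla isn't installed so this snippet runs anywhere.
try:
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()   # e.g. an xla:0 TPU device
    HAVE_XLA = True
except ImportError:
    device = "cpu"             # plain PyTorch path
    HAVE_XLA = False

# On TPU, a training step additionally needs XLA-specific calls so the
# lazily recorded graph actually executes:
#   loss.backward()
#   xm.optimizer_step(optimizer)   # instead of optimizer.step()
#   xm.mark_step()                 # flush the lazy XLA graph
# None of this exists in the CUDA path, which is the switching cost
# described above.
print(device, HAVE_XLA)
```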

adolph | 2 years ago

Typically Google resources go to TF before PyTorch, no?