> It is also important to note that, until recently, the GenAI industry’s focus has largely been on training workloads. In training workloads, CUDA is very important, but when it comes to inference, even reasoning inference, CUDA is not that important, so the chances of expanding the TPU footprint in inference are much higher than those in training (although TPUs do really well in training as well – Gemini 3 being the prime example).

Does anyone have a sense of why CUDA is more important for training than inference?
augment_me|3 months ago
Once you have trained, you have a frozen feed-forward network: the weights are fixed, so you can just load them in and run data over them. These weights can be duplicated across any number of devices and just sit there running inference on new data.
If this turns out to be the future use case for NNs (it is today), then Google is better positioned.
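To make the "frozen weights" point concrete, here is a minimal sketch (my own illustration, not from the comment above): after training, the parameters are just constants, so inference is a pure function of (params, new data) that can be compiled once and copied to any number of devices. The layer sizes and the names `init_params` / `forward` are made up for the example.

```python
import jax
import jax.numpy as jnp

def init_params(key, layer_sizes):
    # Random weights standing in for a trained, frozen network.
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (n_in, n_out)) * 0.02,
                       jnp.zeros(n_out)))
    return params

def forward(params, x):
    # Plain feed-forward pass over fixed weights: no gradients, no optimizer state.
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return x @ w + b

params = init_params(jax.random.PRNGKey(0), [128, 256, 10])
# jit compiles the pure inference function once; since the weights never change,
# the same params and compiled program can be replicated on as many devices as you like.
infer = jax.jit(forward)
batch = jnp.ones((32, 128))
print(infer(params, batch).shape)  # (32, 10)
```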
rbanffy|3 months ago
A real shame, BTW, that all that silicon doesn't do FP32 (very well). Once training is no longer in such demand, we could use all that number crunching for climate models and weather prediction.
llm_nerd|3 months ago
Further, it's worth noting that Ironwood, Google's v7 TPU, supports only up to BF16 (a 16-bit floating-point format with the range of FP32 but much less precision). Many training processes rely on larger types and quantize later, so this breaks a lot of assumptions. Yet Google surprised everyone and actually trained Gemini 3 with just that type, so I think a lot of people are reconsidering their assumptions.
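If you want to see the range-vs-precision trade-off concretely, here is a small check I put together (an illustration, not anything from the TPU docs): bfloat16 keeps FP32's 8-bit exponent, so its dynamic range is essentially the same, but it only has 7 mantissa bits, so it carries roughly 3 decimal digits of precision.

```python
import jax.numpy as jnp

# Same dynamic range as FP32, far beyond FP16's 65504 ceiling...
print(jnp.finfo(jnp.bfloat16).max)   # ~3.39e38
print(jnp.finfo(jnp.float32).max)    # ~3.40e38
print(jnp.finfo(jnp.float16).max)    # 65504.0

# ...but much coarser precision: machine epsilon is 2**-7.
print(jnp.finfo(jnp.bfloat16).eps)   # 0.0078125
print(jnp.finfo(jnp.float32).eps)    # ~1.19e-07

# A small increment that FP32 keeps is rounded away entirely in BF16.
a = jnp.array(1.0, dtype=jnp.float32) + jnp.array(1e-3, dtype=jnp.float32)
b = jnp.array(1.0, dtype=jnp.bfloat16) + jnp.array(1e-3, dtype=jnp.bfloat16)
print(a, b)  # ~1.001 vs 1.0
```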
imtringued|3 months ago
Another factor is that training is always done with large batches, while inference batching depends on the number of concurrent users. This means training tends to be compute bound, where supporting the latest data types is critical, whereas inference speed is often bottlenecked by memory bandwidth, which does not lend itself to product differentiation. If you put the same memory into your chip as your competitor, the difference is going to be way smaller.
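A rough back-of-envelope calculation of why that is (my numbers, purely illustrative): for a single weight matrix, the FLOPs scale with the batch size but the bytes of weights you have to read do not, so small-batch inference ends up limited by memory bandwidth while large training-style batches are limited by compute.

```python
# Arithmetic intensity (FLOPs per byte of weights read) for one matmul,
# with made-up layer dimensions just to show the trend.
d_in, d_out = 8192, 8192      # hypothetical layer size
bytes_per_weight = 2          # BF16

def arithmetic_intensity(batch_size):
    flops = 2 * batch_size * d_in * d_out           # multiply-accumulates
    weight_bytes = d_in * d_out * bytes_per_weight  # weights read once per batch
    return flops / weight_bytes

for batch_size in (1, 8, 64, 512):
    print(batch_size, arithmetic_intensity(batch_size))
# batch 1 (a single user decoding) -> 1 FLOP/byte: memory bandwidth is the wall.
# batch 512 (training-style batches) -> 512 FLOPs/byte: raw compute is the wall.
```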
qcnguy|3 months ago
Once you settle on a design, doing ASICs to accelerate it might make sense. But I'm not sure the gap is that big; the article says some things that aren't really true of datacenter GPUs (Nvidia DC GPUs haven't wasted hardware on graphics-related stuff for years).
never_inline|3 months ago
> numerical stability

What does it even mean in a neural net context? Would also be nice to expand on it a bit.