> It is also important to note that, until recently, the GenAI industry’s focus has largely been on training workloads. In training workloads, CUDA is very important, but when it comes to inference, even reasoning inference, CUDA is not that important, so the chances of expanding the TPU footprint in inference are much higher than those in training (although TPUs do really well in training as well – Gemini 3 being the prime example).

Does anyone have a sense of why CUDA is more important for training than inference?
augment_me|3 months ago
Once you have trained, you have a frozen feed-forward network: the weights are fixed, so you can just load them in and run data over them. These weights can be duplicated across any number of devices and just sit there running inference on new data.
If this turns out to be the future use case for NNs (it is today), then Google is better positioned.
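To make the "frozen weights" point concrete, here is a minimal sketch (my own illustration, not from the comment above): after training, the parameters are just constants, so inference is a pure function of (params, new data) that can be compiled once and copied to any number of devices. The layer sizes and the names `init_params` / `forward` are made up for the example.

```python
import jax
import jax.numpy as jnp

def init_params(key, layer_sizes):
    # Random weights standing in for a trained, frozen network.
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (n_in, n_out)) * 0.02,
                       jnp.zeros(n_out)))
    return params

def forward(params, x):
    # Plain feed-forward pass over fixed weights: no gradients, no optimizer state.
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return x @ w + b

params = init_params(jax.random.PRNGKey(0), [128, 256, 10])
# jit compiles the pure inference function once; since the weights never change,
# the same params and compiled program can be replicated on as many devices as you like.
infer = jax.jit(forward)
batch = jnp.ones((32, 128))
print(infer(params, batch).shape)  # (32, 10)
```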
rbanffy|3 months ago
A real shame, BTW, that all that silicon doesn't do FP32 (very well). Once training is no longer in such demand, we could use all that number crunching for climate models and weather prediction.
llm_nerd|3 months ago
Further, it's worth noting that Ironwood, Google's v7 TPU, supports only up to BF16 (a 16-bit floating-point format with the range of FP32 but much less precision). Many training processes rely on larger types and quantize later, so this breaks a lot of assumptions. Yet Google surprised everyone and actually trained Gemini 3 with just that type, so I think a lot of people are reconsidering their assumptions.
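If you want to see the range-vs-precision trade-off concretely, here is a small check I put together (an illustration, not anything from the TPU docs): bfloat16 keeps FP32's 8-bit exponent, so its dynamic range is essentially the same, but it only has 7 mantissa bits, so it carries roughly 3 decimal digits of precision.

```python
import jax.numpy as jnp

# Same dynamic range as FP32, far beyond FP16's 65504 ceiling...
print(jnp.finfo(jnp.bfloat16).max)   # ~3.39e38
print(jnp.finfo(jnp.float32).max)    # ~3.40e38
print(jnp.finfo(jnp.float16).max)    # 65504.0

# ...but much coarser precision: machine epsilon is 2**-7.
print(jnp.finfo(jnp.bfloat16).eps)   # 0.0078125
print(jnp.finfo(jnp.float32).eps)    # ~1.19e-07

# A small increment that FP32 keeps is rounded away entirely in BF16.
a = jnp.array(1.0, dtype=jnp.float32) + jnp.array(1e-3, dtype=jnp.float32)
b = jnp.array(1.0, dtype=jnp.bfloat16) + jnp.array(1e-3, dtype=jnp.bfloat16)
print(a, b)  # ~1.001 vs 1.0
```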
imtringued|3 months ago
Another factor is that training is always done with large batches, while inference batching depends on the number of concurrent users. This means training tends to be compute bound, where supporting the latest data types is critical, whereas inference speed is often bottlenecked by memory bandwidth, which does not lend itself to product differentiation. If you put the same memory into your chip as your competitor, the difference is going to be way smaller.
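A rough back-of-envelope calculation of why that is (my numbers, purely illustrative): for a single weight matrix, the FLOPs scale with the batch size but the bytes of weights you have to read do not, so small-batch inference ends up limited by memory bandwidth while large training-style batches are limited by compute.

```python
# Arithmetic intensity (FLOPs per byte of weights read) for one matmul,
# with made-up layer dimensions just to show the trend.
d_in, d_out = 8192, 8192      # hypothetical layer size
bytes_per_weight = 2          # BF16

def arithmetic_intensity(batch_size):
    flops = 2 * batch_size * d_in * d_out           # multiply-accumulates
    weight_bytes = d_in * d_out * bytes_per_weight  # weights read once per batch
    return flops / weight_bytes

for batch_size in (1, 8, 64, 512):
    print(batch_size, arithmetic_intensity(batch_size))
# batch 1 (a single user decoding) -> 1 FLOP/byte: memory bandwidth is the wall.
# batch 512 (training-style batches) -> 512 FLOPs/byte: raw compute is the wall.
```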
qcnguy|3 months ago
Once you settle on a design, doing ASICs to accelerate it might make sense. But I'm not sure the gap is that big; the article says some things that aren't really true of datacenter GPUs (Nvidia DC GPUs haven't wasted hardware on graphics-related stuff for years).
never_inline|3 months ago
> numerical stability

What does it even mean in a neural net context? Would also be nice to expand on it a bit.