nehalem | 10 months ago

Not knowing much about special-purpose chips, I would like to understand whether chips like this would give Google a significant cost advantage over the likes of Anthropic or OpenAI when offering LLM services. Is similar technology available to Google's competitors?


heymijo|10 months ago

GPUs are very good for pretraining but inefficient for inference.

Why?

For each new word a transformer generates, it has to move the entire set of model weights from memory to the compute units. For a 70 billion parameter model with 16-bit weights, that means moving approximately 140 gigabytes of data to generate a single word.

GPUs have off-chip memory. That means a GPU has to push data across a chip-to-memory bridge for every single word it creates. This architectural choice is an advantage for graphics processing, where large amounts of data need to be stored but not necessarily accessed as rapidly for every single computation. It's a liability in inference, where quick and frequent data access is critical.
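The weight-movement argument above can be put in rough numbers. A minimal sketch, where the bandwidth figure is an assumption (a published spec-sheet-style number for a modern HBM GPU, not something stated in the thread), and which ignores that a 140 GB model would in practice be sharded across several GPUs:

```python
# Back-of-envelope: memory-bandwidth ceiling on single-stream decoding.
# All numbers illustrative; real deployments shard the model and batch requests.

params = 70e9           # 70B-parameter model (from the comment above)
bytes_per_weight = 2    # 16-bit weights
weight_bytes = params * bytes_per_weight   # ~140 GB streamed per token

# Assumed HBM bandwidth for a single modern datacenter GPU (~3.35 TB/s,
# roughly an H100-class part; treat this as a placeholder).
hbm_bandwidth = 3.35e12

# At batch size 1, every generated token requires reading all weights once,
# so memory bandwidth alone caps the decode rate at:
tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"~{tokens_per_sec:.1f} tokens/s upper bound at batch size 1")
```

The point is that at batch size 1 the compute units sit mostly idle: the token rate is set by how fast weights can be streamed, not by FLOPs.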

Listening to Andrew Feldman of Cerebras [0] is what helped me grok the differences. Caveat, he is a founder/CEO of a company that sells hardware for AI inference, so the guy is talking his book.

[0] https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...

latchkey|10 months ago

Cerebras (and Groq) have the problem of using too much die for compute and not enough for memory. Their method of scaling is to fan out the compute across more physical space. This takes more datacenter space, power, and cooling, which is a huge issue. Funny enough, when I talked to Cerebras at SC24, they told me their largest customers are for training, not inference. They just market it as an inference product, which is even more confusing to me.

I wish I could say more about what AMD is doing in this space, but keep an eye on their MI4xx line.

ein0p|10 months ago

Several incorrect assumptions in this take. For one thing, 16-bit is not necessary. For another, 140 GB/token holds only if your batch size is 1 and your sequence length is 1 (no speculative decoding). Nobody runs LLMs like that on those GPUs - if you do, compute utilization becomes ridiculously low. With a batch size greater than 1 and speculative decoding, the arithmetic intensity of the kernels is much higher, and having weights "off chip" is not that much of a concern.
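The batching point can be sketched with a toy arithmetic-intensity calculation. This is a simplification that counts only weight traffic and matmul FLOPs (it ignores KV-cache reads, attention, and activations), purely to illustrate how batching amortizes weight movement:

```python
# Toy arithmetic-intensity model for decode: FLOPs performed per byte of
# weights read from memory. Simplified and illustrative only.

params = 70e9
bytes_per_weight = 2
weight_bytes = params * bytes_per_weight

def flops_per_byte(batch_size):
    # Each token costs ~2 FLOPs per parameter (multiply + accumulate).
    # The weights are read from memory once and reused for every sequence
    # in the batch, so FLOPs scale with batch size while bytes do not.
    flops = 2 * params * batch_size
    return flops / weight_bytes

for b in (1, 8, 64):
    print(f"batch {b:3d}: {flops_per_byte(b):.0f} FLOPs/byte")
```

At batch 1 the kernel does about 1 FLOP per byte moved, deep in the memory-bound regime; larger batches (and speculative decoding, which similarly evaluates several tokens per weight pass) multiply the work done per byte, which is the commenter's point.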

hanska|10 months ago

The Groq interview was good too. Seems that the thought process is that companies like Groq/Cerebras can run the inference, and companies like Nvidia can keep/focus on their highly lucrative pretraining business.

https://www.youtube.com/watch?v=xBMRL_7msjY

avrionov|10 months ago

NVIDIA operates at roughly 70% margins right now. Not paying that premium and having an alternative to NVIDIA is beneficial. We just don't know by how much.

kccqzy|10 months ago

I might be misremembering here, but Google's own AI models (Gemini) don't use NVIDIA hardware in any way, training or inference. Google bought a large amount of NVIDIA hardware only for Google Cloud customers, not for itself.

xnx|10 months ago

Google has a significant advantage over other hyperscalers because Google's AI data centers are much more compute cost efficient (capex and opex).

claytonjy|10 months ago

Because of the TPUs, or due to other factors?

What even is an AI data center? are the GPU/TPU boxes in a different building than the others?

cavisne|10 months ago

Nvidia has ~60% margins on their datacenter chips. So TPUs have quite a bit of headroom to save Google money without being as good as Nvidia GPUs.
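The headroom argument works even with unflattering assumptions about the in-house chip. A sketch with made-up normalized numbers (only the ~60% margin figure comes from the comment):

```python
# Normalized cost comparison: buying a GPU at margin vs. building your own
# chip. All figures illustrative.

gpu_price = 100.0                 # what a customer pays NVIDIA, normalized
margin = 0.60                     # ~60% margin per the comment
gpu_build_cost = gpu_price * (1 - margin)   # NVIDIA's cost: ~40

# Suppose the in-house chip is only half as cost-efficient per unit of
# work, so it takes twice the silicon cost to match one GPU:
inhouse_cost = gpu_build_cost / 0.5          # ~80

print(f"buy GPU: {gpu_price:.0f}, build in-house equivalent: {inhouse_cost:.0f}")
```

Even a chip that is markedly worse per dollar of build cost can come out ahead of paying the vendor's margin, which is why hyperscalers keep funding their own silicon.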

No one else has access to anything similar, Amazon is just starting to scale their Trainium chip.

buildbot|10 months ago

Microsoft has the MAIA 100 as well. No comment on their scale/plans though.

baby_souffle|10 months ago

There are other AI/LLM 'specific' chips out there, yes. But the thing about ASICs is that you need one for each *specific* task. Eventually we'll hit an equilibrium, but for now, the stuff Cerebras is best at is not what TPUs are best at is not what GPUs are best at…

monocasa|10 months ago

I don't even know if eventually we'll hit an equilibrium.

The end of Moore's law pretty much dictates specialization; it just shows up first in fields with less ossification.