top | item 38078660

sharms | 2 years ago

It looks like these models won't be as useful for LLM inference, which is heavily memory-bandwidth constrained. The MacBook Pro page shows the M3 at 100GB/s, 150GB/s, and 300GB/s vs. the M2 at 200GB/s and 400GB/s. 400GB/s is available on the M3 if you opt for the high-GPU config, but it's interesting to see bandwidth go down across all of these models.

minimaxir | 2 years ago

The presentation mentioned dynamic GPU caching: that seems like something transformer models would like.

monocasa | 2 years ago

Could be, but I'd like to hear more information about what it actually entails.

My gut feeling is that it's something like Z compression, but using the large amount of privileged software (basically a whole RTOS) they run on the GPU to dynamically allocate pages, so that "vram" (scare quotes intentional) allocations don't require giant arenas.

If that's the case, I'm not sure that ML will benefit. Most ML models are pretty good about actually touching everything they allocate, in which case lazy allocations won't help you much and may actually hurt startup latency.
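A minimal sketch of why lazy (demand-paged) allocation buys little here: with an anonymous mmap, pages aren't backed by physical memory until first touch, but a model that really touches everything it allocates faults every page in anyway. The arena size below is illustrative, not anything Apple-specific:

```python
import mmap

PAGE = mmap.PAGESIZE          # typically 4096 bytes
N_PAGES = 1024                # pretend this is a "vram" arena

# Reserve a large region. With demand paging, no physical memory is
# committed yet -- this is the entire win of a lazy allocation.
buf = mmap.mmap(-1, PAGE * N_PAGES)

# An ML workload "touches everything it allocates": writing one byte
# per page forces the kernel to back every page, so the lazy
# reservation saves nothing -- you just moved the fault cost to
# startup.
for i in range(N_PAGES):
    buf[i * PAGE] = 1

touched = sum(1 for i in range(N_PAGES) if buf[i * PAGE] == 1)
print(touched)  # -> 1024: every page is now resident
buf.close()
```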

brucethemoose2 | 2 years ago

In addition to what mono said, llama.cpp allocates everything up front with "--mlock".

Llama.cpp (and MLC) have to read all the model weights from RAM for every token. Batching aside, there's no way around that.
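Since every generated token streams the full weight set from memory once, memory bandwidth puts a hard ceiling on single-stream decode speed. A back-of-the-envelope sketch (the model size and bandwidth figures are illustrative assumptions, not benchmarks):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on unbatched decode speed: each token requires
    reading every weight from RAM once, so tokens/s <= bandwidth / size."""
    return bandwidth_gb_s / model_size_gb

# A 7B-parameter model quantized to ~4 bits/weight is roughly 3.5 GB.
model_gb = 3.5

for name, bw in [("100 GB/s (base M3)", 100),
                 ("400 GB/s (top M3 config)", 400)]:
    print(f"{name}: <= {max_tokens_per_sec(bw, model_gb):.0f} tok/s")
```

Real throughput lands below this ceiling (compute, KV-cache reads, and overhead all cost extra), but it shows why the bandwidth numbers upthread matter so much.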