This is only the base model; there are no Pro/Max upgrades yet. The memory bandwidth is 153 GB/s, which is not enough to run viable open-source LLM models properly.
153 GB/s is not bad at all for a base model; the Nvidia DGX Spark has only 273 GB/s memory bandwidth despite being billed as a desktop "AI supercomputer".
Models like Qwen 3 30B-A3B and GPT-OSS 20B, both quite decent, should be able to run at 30+ tokens/sec at typical (4-bit) quantizations.
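Back-of-envelope, assuming roughly ~3B active parameters for Qwen3-30B-A3B (the "A3B" suffix) and ~3.6B active for GPT-OSS 20B (my rough figures), the bandwidth ceiling at 4-bit works out something like this:

```python
# Rough ceiling on decode speed for a bandwidth-bound MoE model: only the
# *active* expert weights need to be read per token. Active-param counts
# below are assumptions; real throughput lands well under this bound.

def max_tokens_per_sec(active_params: float, bits_per_weight: float,
                       bandwidth_gbps: float) -> float:
    """Bandwidth divided by bytes read per token (ignores KV cache etc.)."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

for name, active in [("Qwen3-30B-A3B", 3.0e9), ("GPT-OSS 20B", 3.6e9)]:
    print(f"{name}: ~{max_tokens_per_sec(active, 4, 153.6):.0f} tok/s ceiling")
```

Even at a third of the theoretical ceiling, that leaves comfortable headroom above 30 tokens/sec.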
Even at 1.8x the base memory bandwidth and 4x the memory capacity, Nvidia spent a lot of time talking about how you can pair two DGXs together with the 200G NIC to slowly run quantized versions of the models everyone was actually interested in.
Neither product actually qualifies for the task IMO, and that doesn't change just because two companies advertised them as such instead of just one. The absolute highest end Apple Silicon variants tend to be a bit more reasonable, but the price advantage goes out the window too.
Looks like the base M5 has LPDDR5X-9600, which works out to 153.6 GB/s, up from the base M4's 120 GB/s (LPDDR5X-7500). The Pro/Max versions have more memory controllers: 16, 24, and 32 channels respectively. The 32-channel top-end M5 will have 614 GB/s by my calculations.
It would take 48 channels of LPDDR5X-9600 to match a 3090's memory bandwidth, so the situation is unlikely to change for a couple of years, until DDR6 arrives, I guess.
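The channel math above can be sketched quickly, assuming 16-bit LPDDR5X channels (each channel moves 2 bytes per transfer); the M5 channel counts are the estimates from the comment, not confirmed specs:

```python
# LPDDR5X bandwidth per 16-bit channel = transfer rate (MT/s) * 2 bytes.
# Channel widths and M5 channel counts are assumptions for illustration.

def bandwidth_gbps(channels: int, mt_per_s: int, channel_bits: int = 16) -> float:
    return channels * mt_per_s * (channel_bits // 8) / 1000  # GB/s

print(bandwidth_gbps(8, 9600))   # base M5 estimate: 153.6 GB/s
print(bandwidth_gbps(32, 9600))  # 32-channel top end: 614.4 GB/s
print(bandwidth_gbps(48, 9600))  # 48 channels: 921.6 GB/s, near a 3090's ~936
```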
Yeah, that's my main bottleneck too. Constantly at 90%+ RAM utilization with my 64GiB (VMs, IDEs etc.). Hoping to go with at least 128GiB (or more) once M5 Max is released.
Models are made of "parameters", which are really the weights of a large neural network. For each token generated, every parameter needs to be read from memory and fed through the CPU/GPU.
So if you have a 7B parameter model at 16-bit precision, that's 14 GB of weights to read for every token. If you only have 153 GB/s of memory bandwidth, you'll cap out at ~11 tokens/sec, regardless of how much processing power you have.
You can of course quantize to 8-bit or even 4-bit, or use a smaller model, but doing so makes your model dumber. There's a trade-off between performance and capability.
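A quick sketch of that trade-off, using the same back-of-envelope rule (every weight byte read once per token, so tokens/sec ≤ bandwidth / model size in bytes):

```python
# Bandwidth-bound decode cap: tokens/sec <= bandwidth / bytes of weights.
# 7B params at 16-bit = 14 GB read per token; quantizing shrinks that.

def cap_tokens_per_sec(params: float, bits: int,
                       bandwidth_gb: float = 153.0) -> float:
    gb_per_token = params * bits / 8 / 1e9
    return bandwidth_gb / gb_per_token

for bits in (16, 8, 4):
    print(f"7B @ {bits}-bit: ~{cap_tokens_per_sec(7e9, bits):.0f} tok/s cap")
```

Halving the bits roughly doubles the cap, which is exactly why 4-bit quants are so popular on bandwidth-starved hardware, quality loss and all.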
You might be interested in LLM Systems, a course that covers how LLMs work at the hardware level and what optimizations can improve their efficiency: https://llmsystem.github.io/llmsystem2025spring/
The models (weights and activations and caches) can fill all the memory you have and more, and to a first (very rough) approximation every byte needs to be accessed for each token generated. You can see how that would add up.
I highly recommend Andrej Karpathy's videos if you want to learn details.
Enough or not, they do describe it like this in an image caption:
"M5 is Apple’s next-generation system on a chip built for AI, resulting in a faster, more efficient, and more capable chip for the 14-inch MacBook Pro, iPad Pro, and Apple Vision Pro."
You don’t want to be bandwidth-bound, sure. But it all depends on how much compute power you have to begin with. 153 GB/s is probably not enough bandwidth for an RTX 5090, but for the entry laptop/tablet chip M5? It’s likely plenty.
"Properly" means at some arbitrary speed that the writer would describe as "fast" or "fast enough". If you have a lower demand for speed they'll run fine.