When running on Apple silicon you want to use MLX, not llama.cpp as this benchmark does. Performance is much better than what's plotted there and seems to keep improving.
Power consumption is almost 10x smaller for Apple.
VRAM is more than 10x larger.
Price-wise, for running the same size models, Apple is cheaper.
The upper limit (larger models, longer context) is far higher for Apple (with Nvidia you can easily put in 2x cards; beyond that it becomes a complex setup no ordinary person can manage).
Am I missing something, or is Apple simply better for local LLMs right now?
There is a plateau where you simply need more compute and the M4 cores are not enough, so even if they have enough RAM for the model, the tokens/sec is not useful.
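To make the plateau concrete: single-stream decoding is typically memory-bandwidth-bound, so a rough ceiling on tokens/sec falls out of dividing bandwidth by model size. Here is a back-of-the-envelope sketch; the model size and bandwidth figures are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope decode-speed estimate (assumed numbers, not measured).
# Single-stream decoding is usually memory-bandwidth-bound: every generated
# token must stream the full set of active weights from memory.

def decode_tokens_per_sec(model_bytes: float, mem_bw_bytes_per_sec: float) -> float:
    """Upper bound on single-stream tokens/sec if purely bandwidth-bound."""
    return mem_bw_bytes_per_sec / model_bytes

# Hypothetical example: a 70B-parameter model at 4-bit (~40 GB of weights)
# on a machine with ~800 GB/s of memory bandwidth.
print(decode_tokens_per_sec(40e9, 800e9))  # ~20 tokens/sec ceiling
```

Prompt processing (prefill) is a different story: it is compute-bound, which is where weaker GPU cores hit the plateau even when RAM is plentiful.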
I'm trying to find out about that as well, as I'm considering a local LLM for some heavy prototyping. I don't mind which hardware I buy, but it's on a relative budget, and energy efficiency is also not a bad thing. It seems the Ultra can do 40 tokens/sec on DeepSeek and nothing even comes close at that price point.
You are missing something. This is a single stream of inference. You can load up the Nvidia card with at least 16 inference streams and get a much higher aggregate throughput in tokens/sec.
This is just a single-user chat experience benchmark.
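The point about multiple streams can be sketched with simple arithmetic: with batching, each weight read from memory serves many concurrent requests, so the per-token memory cost is amortized until compute saturates. The numbers and the efficiency factor below are illustrative assumptions only.

```python
# Sketch of why batched inference raises aggregate throughput (assumed numbers).
# Each weight fetched from memory is reused across every request in the batch,
# so aggregate tokens/sec grows roughly with batch size until compute-bound.

def aggregate_tokens_per_sec(single_stream_tps: float, batch: int,
                             efficiency: float = 0.8) -> float:
    """Rough aggregate throughput with `batch` concurrent streams.

    `efficiency` is a hand-wavy factor covering scheduling and
    per-request attention/KV-cache overhead that batching cannot amortize.
    """
    return single_stream_tps * batch * efficiency

# Hypothetical: 25 tok/s single stream, 16 concurrent requests.
print(aggregate_tokens_per_sec(25.0, 16))  # 320.0 aggregate tokens/sec
```

This is why a single-user benchmark understates what an Nvidia card can do when serving many requests at once.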