In my experience, llama.cpp on the CPU (on Linux) is very slow compared to running the same models on the GPU or NPU of my M1 MacBook Pro via Metal (or maybe it's the unified memory that allows the speedup?).
Even with 12 threads on my 5900X (I've tried using all 24 SMT threads - that doesn't really seem to help) with the dolphin-2.5-mixtral-8x7b.Q5_K_M model, my MacBook Pro is around 5-6x faster in terms of tokens per second...
It seems to be around 3 tokens/s on my laptop, which is faster than the average human reads, but not too fast of course.
On a desktop with a mid-range GPU used for offloading, I can get around 12 tokens/s, which is plenty fast for chatting.
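For reference, thread count and GPU offloading in llama.cpp are set via CLI flags; a rough sketch of the kind of invocations being compared (model filename and layer count here are illustrative, not taken from the thread):

```shell
# CPU-only run: pin to 12 worker threads; using all SMT siblings
# rarely helps, since inference is memory-bandwidth bound
./llama-cli -m dolphin-2.5-mixtral-8x7b.Q5_K_M.gguf -t 12 -p "Hello" -n 128

# Partial GPU offload: move some layers to VRAM with -ngl;
# tune the layer count to whatever fits on your card
./llama-cli -m dolphin-2.5-mixtral-8x7b.Q5_K_M.gguf -t 12 -ngl 20 -p "Hello" -n 128
```

On Apple Silicon builds with Metal enabled, offloading happens by default, and the tokens/s figures quoted above come from the timing summary llama.cpp prints after generation.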
ilaksh|2 years ago
berkut|2 years ago
TheMatten|2 years ago