Given what we just saw of the DeepSeek team squeezing a lot of extra performance out of a more efficient GPU implementation, and given that the model is still optimized for GPU rather than CPU, is it unreasonable to think that the $6k setup described still leaves some performance on the table that better optimization for these particular CPUs could recover?
ryao|1 year ago
https://github.com/ggerganov/llama.cpp/issues/11333
The TL;DR is that llama.cpp’s NUMA support is suboptimal, which hurts performance relative to what this machine should deliver. A single-socket version would likely perform better until it is fixed. After it is fixed, a dual-socket machine would likely run at the same speed as a single-socket machine.
If someone implemented a GEMV that scales with NUMA nodes (i.e., something like PBLAS, but for the data types used in inference), it might be possible to get higher performance from a dual-socket machine than we get from a single-socket machine.
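To make the idea concrete, here is a rough sketch of the row-partitioned GEMV I have in mind. It is pure Python for illustration only and is not how llama.cpp does it: the function names (`split_rows`, `gemv_shard`, `numa_style_gemv`) are hypothetical, the "nodes" are just worker threads, and nothing here actually pins threads or memory to a NUMA node. A real version would presumably need node-local allocation and thread affinity plus kernels for the quantized data types. The point is only the data layout: each node owns a contiguous block of rows of the weight matrix and produces the matching slice of the output, so the large matrix reads stay local and only the small input vector is shared.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(n_rows, n_nodes):
    """Return contiguous (start, stop) row ranges, one per hypothetical NUMA node."""
    base, extra = divmod(n_rows, n_nodes)
    ranges, start = [], 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional row each.
        stop = start + base + (1 if node < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def gemv_shard(shard, x):
    """Compute the output slice for one node's block of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in shard]


def numa_style_gemv(matrix, x, n_nodes=2):
    """Row-partitioned y = matrix @ x, with one worker per hypothetical node."""
    ranges = split_rows(len(matrix), n_nodes)
    # Each slice is a separate list; it stands in for a node-local allocation.
    shards = [matrix[start:stop] for start, stop in ranges]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        parts = list(pool.map(gemv_shard, shards, [x] * n_nodes))
    # Output slices are concatenated in node order to rebuild the full vector.
    return [y for part in parts for y in part]


if __name__ == "__main__":
    weights = [[1, 2], [3, 4], [5, 6]]
    print(numa_style_gemv(weights, [1, 1], n_nodes=2))
```

Because GEMV is memory-bandwidth bound, the hope would be that each socket streams only its own rows from its own memory, which is why a layout like this could in principle scale with the number of nodes.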
snovv_crash|1 year ago
menaerus|1 year ago
telotortium|1 year ago