top | item 42867650

samvher | 1 year ago

Given what we just saw of the DeepSeek team squeezing a lot of extra performance out of a more efficient GPU implementation, and given that the model is still optimized for GPU rather than CPU: is it unreasonable to think that in the $6k setup described, some performance might still be left on the table that could be squeezed out with better optimization for these particular CPUs?

ryao | 1 year ago

The answer to your question is yes. There is an open issue with llama.cpp about this very thing:

https://github.com/ggerganov/llama.cpp/issues/11333

The TLDR is that llama.cpp’s NUMA support is suboptimal, which hurts performance on this machine versus what it should be. A single-socket version would likely perform better until it is fixed. After it is fixed, a dual-socket machine would likely run at the same speed as a single-socket machine.

If someone implemented a GEMV that scales with NUMA nodes (i.e. PBLAS, but for the data types used in inference), it might be possible to get higher performance from a dual socket machine than we get from a single socket machine.
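The PBLAS-style distribution described above can be sketched as a row-partitioned GEMV: each NUMA node owns a contiguous block of rows of the weight matrix, so all memory traffic for its partial result is node-local. This is an illustrative toy in plain Python (sequential, with the per-node parallelism only noted in comments), not llama.cpp's actual kernel; the function name and structure are assumptions for illustration.

```python
# Toy sketch of a row-partitioned GEMV (y = W @ x), the distribution
# scheme PBLAS uses. Each "node" owns a contiguous block of rows of W.
# In a real NUMA-aware kernel, each block would be computed by threads
# pinned to that node, over weights first-touched in that node's memory.

def gemv_partitioned(W, x, num_nodes):
    """Compute y = W @ x by splitting W's rows across NUMA nodes."""
    n_rows = len(W)
    rows_per_node = (n_rows + num_nodes - 1) // num_nodes  # ceil division
    y = [0.0] * n_rows
    for node in range(num_nodes):
        start = node * rows_per_node
        stop = min(start + rows_per_node, n_rows)
        # This inner block is the per-node work: reads of W[start:stop]
        # stay local to `node`; only x and y[start:stop] cross nodes.
        for i in range(start, stop):
            y[i] = sum(w * v for w, v in zip(W[i], x))
    return y

# Example: 3x2 matrix split across 2 "nodes" (rows 0-1 and row 2).
result = gemv_partitioned([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1.0, 1.0], 2)
```

Since the vector x is read-only and shared, only the weight matrix (the dominant memory traffic during inference) needs to be distributed, which is why this scheme can scale with the number of NUMA nodes.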

snovv_crash | 1 year ago

No, because the bottleneck is RAM bandwidth. The weights are already quantized and are otherwise essentially random, so they can't be compressed in any meaningful way.

menaerus | 1 year ago

How much bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.

telotortium | 1 year ago

Maybe a little, but FLOPs and memory bandwidth don't lie.