I can answer question 3. Prompt processing (how fast your input is parsed) is largely compute-bound, while inference (how fast the LLM generates its answer) is largely memory-bandwidth-bound. So a faster CPU might read your question quicker, but it will answer roughly as slowly as a cheap CPU paired with the same RAM.

I have a Ryzen 3 4100. I just tested Qwen2.5-Coder-32B-Instruct-Q3_K_S.gguf with llama.cpp.
CPU-only:
54.08 t/s prompt eval
2.69 t/s inference
---
CPU + 52/65 layers offloaded to GPU (RTX 3060 12GB):
166.79 t/s prompt eval
6.62 t/s inference
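A rough sanity check on why inference tracks memory bandwidth: each generated token has to stream (roughly) all the model weights through memory once, so bandwidth divided by model size gives an upper bound on tokens per second. Here is a minimal sketch; the bandwidth figures (~51 GB/s for dual-channel DDR4-3200, ~360 GB/s for an RTX 3060) and the ~14 GB size of the Q3_K_S quant are assumed ballpark numbers, not measured values:

```python
# Back-of-envelope: generation speed <= memory bandwidth / bytes read per token.
# Assumption: each token streams the full weight file once (ignores KV cache,
# activations, and compute overhead, so real speeds come in below this bound).

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation t/s from the weight-streaming model."""
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 14.0  # assumed size of a Q3_K_S 32B GGUF

cpu_bound = max_tokens_per_sec(51.2, MODEL_GB)   # dual-channel DDR4-3200 (assumed)
gpu_bound = max_tokens_per_sec(360.0, MODEL_GB)  # RTX 3060 12GB spec (assumed)

print(f"CPU-only upper bound: {cpu_bound:.1f} t/s")
print(f"GPU upper bound:      {gpu_bound:.1f} t/s")
```

Under these assumptions the CPU bound works out to a few tokens per second, which is consistent with the 2.69 t/s measured above; the measured GPU-offload number is well below the GPU bound because a third of the layers still run from system RAM.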