(no title)
roosgit | 1 year ago
I have an RTX 3060(12GB) and 32GB RAM. Just ran Qwen2.5-14B-Instruct-Q4_K_M.gguf in llama.cpp with flash attention enabled and 8K context. I get get 845t/s for prompt processing and 25t/s for generation.
For a while I even ran llama.cpp without a GPU (don't recommend it for diffusion) and with the same model (Qwen2.5 14B) I would get 11t/s for processing and 4t/s for generation. Acceptable for chats with short questions/instructions and answers.
No comments yet.