lifeinthevoid | 5 months ago
I've kept a single GPU so I can still play a bit with light local models, but I no longer use it for serious work.
imiric|5 months ago
The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.
The issue is that the quality of the models I'm able to self-host pales in comparison to SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate lower-quality output overall. These issues plague all "AI" models, but they are particularly evident in open-weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but running those would require investing much more into this hobby than I'm willing to.
I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.
elsombrero|5 months ago
I also tried it with Claude Code via claude-code-router, and it's pretty fast. Roo Code uses bigger contexts, so it's generally slower than Claude Code, but I like the workflow better.
this is my snippet for llama-swap
```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server
      -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
      --split-mode layer --tensor-split 0.48,0.52
      --flash-attn on
      -c 82000 --ubatch-size 512
      --cache-type-k q4_1 --cache-type-v q4_1
      -ngl 99 --threads -1
      --port ${PORT} --host 0.0.0.0
      --no-mmap
      -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K
      -ngld 99
      --kv-unified
```
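llama-swap routes requests by model name through an OpenAI-compatible API, so a client just names "glm45-air" and llama-swap starts the matching llama-server. A minimal sketch of the request body a client would send (the endpoint path and any port are assumptions here, not from the comment above):

```python
import json

# Hypothetical payload for llama-swap's OpenAI-compatible
# /v1/chat/completions endpoint. The "model" field must match the key
# under "models:" in the llama-swap config so the right backend is
# launched (and swapped in if another model is currently loaded).
payload = {
    "model": "glm45-air",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,  # streaming makes 20-30 tk/s feel responsive
}
body = json.dumps(payload)
print(body)
```

Tools like claude-code-router or Roo Code fill in this payload for you; the only llama-swap-specific part is keeping the model name in sync with the config key.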
ThatPlayer|5 months ago
Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload MoE layers to the CPU? I can run gpt-oss-120b (5.1B active) on my 4080 and get a usable ~20 tk/s. Had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running
mycall|5 months ago
or ~2.2M tk/day. This is how we should be thinking about it imho.
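The figure follows directly from the 20-30 tk/s quoted upthread; a quick back-of-the-envelope check at the 25 tk/s midpoint:

```python
# Sustained throughput (tokens/second) -> tokens/day.
def tokens_per_day(tk_per_s: float) -> float:
    return tk_per_s * 60 * 60 * 24  # seconds in a day

# 25 tk/s is the midpoint of the 20-30 tk/s range above.
print(f"{tokens_per_day(25):,.0f}")  # 2,160,000 -> ~2.2M tk/day
```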
NicoJuicy|5 months ago
It's pretty good.