top | item 45344539


lifeinthevoid | 5 months ago

I built a similar system; in the meantime I've sold one of the RTX 3090s. Local inference is fun and feels liberating, but it's also slow, and once I was used to the immense power of the giant hosted models, the fun quickly disappeared.

I've kept a single GPU so I can still play a bit with light local models, but no longer for serious use.


imiric|5 months ago

I have a setup similar to the author's, with 2x 3090s.

The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.

The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.

elsombrero|5 months ago

On my 2x 3090s I am running GLM-4.5 Air at Q1. It gets ~300 tk/s prompt processing and 20-30 tk/s generation, works pretty well with Roo Code in VS Code, rarely misses tool calls, and produces decent-quality code.

I also tried using it with Claude Code via claude-code-router, and it's pretty fast. Roo Code uses bigger contexts, so it's generally slower than Claude Code, but I like the workflow better.

This is my snippet for llama-swap:

```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server
        -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M
        --split-mode layer --tensor-split 0.48,0.52
        --flash-attn on
        -c 82000 --ubatch-size 512
        --cache-type-k q4_1 --cache-type-v q4_1
        -ngl 99 --threads -1
        --port ${PORT} --host 0.0.0.0
        --no-mmap
        -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K
        -ngld 99 --kv-unified
```

ThatPlayer|5 months ago

> behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.

Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload the MoE expert layers to the CPU? I can run gpt-oss-120b (5.1B active parameters) on my 4080 and get a usable ~20 tk/s. I had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running
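A minimal sketch of that setup, assuming a built llama.cpp; the model path, context size, and the number of expert layers kept on CPU are placeholders you'd tune until the rest fits in VRAM:

```shell
# Offload everything to the GPU by default (-ngl 99), but keep the
# MoE expert tensors of the first N layers in system RAM instead.
# Raise --n-cpu-moe until the model no longer overflows VRAM.
llama-server \
  -m gpt-oss-120b.gguf \
  -ngl 99 \
  --n-cpu-moe 30 \
  -c 16384
```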

mycall|5 months ago

> 20-30 tk/s

or ~2.2M tk/day. This is how we should be thinking about it, imho.
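A quick back-of-envelope check of that figure, assuming the midpoint of the 20-30 tk/s range sustained around the clock:

```python
# Daily token throughput at a sustained generation rate.
# 25 tk/s is an assumed midpoint of the 20-30 tk/s range above.
tokens_per_sec = 25
seconds_per_day = 60 * 60 * 24
tokens_per_day = tokens_per_sec * seconds_per_day
print(f"{tokens_per_day:,}")  # 2,160,000 -> roughly 2.2M tokens/day
```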

NicoJuicy|5 months ago

If you have a 24 GB 3090, try out qwen:30b-a3b-instruct-2507-q4_K_M (Ollama).

It's pretty good.
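For reference, running it under Ollama is a one-liner (this pulls the model on first use; the tag is the one from the comment above):

```shell
ollama run qwen:30b-a3b-instruct-2507-q4_K_M
```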

naabb|5 months ago

tbf I also run that on a 16 GB 5070 Ti at 25 tk/s; it's amazing how fast it runs on consumer-grade hardware. I think you could push up to a bigger model, but I don't know enough about local llama.

jszymborski|5 months ago

Don't need a 3090, it runs really fast on an RTX 2080 too.

nenenejej|5 months ago

Graphics cards are so expensive (at list price) that they end up cheap to own (little depreciation, thanks to a liquid resale market).

Our_Benefactors|5 months ago

Did you really claim GPUs have zero depreciation? That’s obviously false.