roxolotl | 1 day ago

What method are you using to do that? I’ve been playing with llama.cpp a lot lately and trying to figure out the cleanest options for getting a solid context window on 32gb vram and 64gb system ram.

jychang | 1 day ago

32GB VRAM is more than enough for Qwen 3.5 35b.

You can just load the Q4_K_XL model as normal and put all tensors on the GPU, without any -ot or --cpu-moe flags.

If you need a massive context for some reason, where model + KV cache won't fit in 32GB, then use -ot to move the FFN MoE expert tensors for 1-2 layers into RAM. You'll take a speed hit (those params get loaded from slower system RAM instead of fast VRAM), but it'll work.
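A rough sketch of what that looks like (model filename, context size, and layer choice are placeholders, not from the thread). The -ot / --override-tensor flag takes a regex=buffer pair; the snippet below shows the invocation in comments and then checks which GGUF tensor names the regex would actually catch:

```shell
# Hypothetical llama.cpp invocation (sketch; adjust paths and sizes):
#
#   llama-server -m qwen-Q4_K_XL.gguf -ngl 99 -c 65536 \
#       -ot 'blk\.(0|1)\.ffn_.*_exps\.=CPU'
#
# The pattern sends the MoE expert FFN tensors of layers 0 and 1 to
# CPU-backed buffers; everything else stays in VRAM (-ngl 99).
# Dry-run the regex against some typical GGUF tensor names:
printf '%s\n' \
  'blk.0.ffn_gate_exps.weight' \
  'blk.1.ffn_up_exps.weight' \
  'blk.2.ffn_down_exps.weight' \
  'blk.0.attn_q.weight' \
| grep -E 'blk\.(0|1)\.ffn_.*_exps\.'
# Matches only the layer-0 and layer-1 expert tensors;
# layer 2 and the attention weights stay on GPU.
```

Widening the `(0|1)` alternation to more layers frees more VRAM for KV cache, at the cost of more weights being read from slower system RAM each token.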

roxolotl | 1 day ago

Nice, ok, I'll play with that. I'm mostly just learning what's possible. Qwen 3.5 35b has been great without any customizations, but it's interesting to learn what the options are.