erelong|1 day ago

What kind of hardware does HN recommend or like to run these models?

suprjami|1 day ago

The cheapest option is two 3060 12GB cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window (rough math at the end of this comment).

If you want to spend twice as much for more speed, get a 3090/4090/5090.

If you want long context, get two of them.

If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.
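
Rough weights-only math for the dual-3060 case, as a sketch; the ~4.5 bits per weight is an assumption for a Q4_K-style quant, and the ~2GB runtime overhead is a guess:

    # Rough weights-only VRAM estimate; real usage adds runtime overhead
    # and KV cache on top of this.
    def weight_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8  # 1e9 params * bytes/param ~= GB

    for params in (27, 35):
        weights = weight_gb(params, 4.5)  # Q4_K-style quants land around 4.5 bpw
        spare = 24 - weights - 2          # two 12GB cards, minus ~2GB overhead (assumed)
        print(f"{params}B @ ~4.5 bpw: ~{weights:.0f} GB weights, ~{spare:.0f} GB left for context")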

barrkel|1 day ago

RTX 6000 Pro Blackwell, not Ada, for 96GB.

chr15m|23 hours ago

Thanks, this is a great summary of the tradeoffs!

dajonker|1 day ago

The Radeon R9700 with 32 GB of VRAM is relatively affordable for the amount of VRAM, and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise, if you have the money to burn, get a 5090 for speeeed and relatively low noise, especially if you limit power usage.

cyberax|1 day ago

I have a pair of Radeon AI PRO R9700s with 32GB, and so far they have been a pleasure to use. Drivers work out of the box, and they are completely quiet when idle. They are capped at 300W, so even at 100% utilization they are not too loud.

I was thinking about adding after-market liquid cooling for them, but they're fine without it.

zozbot234|1 day ago

It depends. How long are you willing to wait for an answer? And how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?

throwdbaaway|23 hours ago

For 27B, just get a used 3090 and hop on to r/LocalLLaMA. You can run a 4bpw quant at full context with Q8 KV cache.
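
For a sense of why the Q8 KV cache matters, here's a back-of-the-envelope sketch; the layer/head numbers are placeholders I'm assuming for a model in this size class with grouped-query attention, not the actual config:

    # KV cache grows linearly with context; an 8-bit cache halves it vs FP16.
    def kv_cache_gb(tokens, layers, kv_heads, head_dim, bytes_per_elem):
        return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3  # 2x for K and V

    for ctx in (32_768, 131_072):
        fp16 = kv_cache_gb(ctx, layers=48, kv_heads=8, head_dim=128, bytes_per_elem=2)
        q8 = kv_cache_gb(ctx, layers=48, kv_heads=8, head_dim=128, bytes_per_elem=1)
        print(f"{ctx:>7} tokens: ~{fp16:.0f} GB at FP16 vs ~{q8:.0f} GB at Q8")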

xienze|1 day ago

It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB of VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I could probably do 256K, since I still have around 4GB of VRAM free). Prompt processing is something like 1K tokens/second, and generation is around 100 tokens/second. Plenty fast for agentic use via Opencode.
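
At those rates, a rough per-turn latency estimate; the prompt and response sizes are made-up figures for a typical agentic coding turn:

    prefill_tps = 1_000       # prompt processing, tokens/second (from above)
    decode_tps = 100          # generation, tokens/second (from above)
    prompt_tokens = 20_000    # assumed: repo context plus conversation so far
    response_tokens = 800     # assumed: a diff or a tool call

    prefill_s = prompt_tokens / prefill_tps
    decode_s = response_tokens / decode_tps
    print(f"~{prefill_s + decode_s:.0f}s per turn ({prefill_s:.0f}s prefill + {decode_s:.0f}s decode)")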

msuniverse2026|1 day ago

I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon ROCm support for my card (6900 XT). Is AMD capable of anything these days?

elorant|1 day ago

Macs or a Strix Halo. Unless you want to go lower than 8-bit quantization, where any GPU with 24GB of VRAM would probably run it.

CamperBob2|1 day ago

I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.
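
The weights-only arithmetic behind that pairing, as a sketch; quantization overhead, KV cache, and activations come on top of these numbers:

    # Weights only; KV cache and runtime overhead come on top of these numbers.
    print(f"27B dense, BF16 (16-bit): {27 * 16 / 8:.0f} GB")   # ~54 GB
    print(f"122B MoE, 4-bit:          {122 * 4 / 8:.0f} GB")   # ~61 GB
    print(f"122B MoE, 6-bit:          {122 * 6 / 8:.1f} GB")   # ~91.5 GB -- tight on 96 GB
    # With a MoE, all expert weights stay resident even though only a few
    # experts are active per token, so the full 122B counts toward VRAM.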

I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.

Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.

MarsIronPI|1 day ago

I've had good experience with GLM-4.7 and GLM-5.0. How would you compare them with Qwen 3.5? (If you have any experience with them.)