top | item 44628517

(no title)

lhl | 7 months ago

Strix Halo does not run a 70B Q6 dense model at real-time conversational speed - it has a real-world MBW of about 210 GB/s. A 40GB Q4 will clock just over 5 tok/s. A Q6 would be slower.

It will run some big MoEs at a decent speed (eg, Llama 4 Scout 109B-A17B Q4 at almost 20 tok/s). The other issue is its prefill - only about 200 tok/s due to having only very under-optimized RDNA3 GEMMs. From my testing, you usually have to trade off pp for tg.

If you are willing to spend $10K for hardware, I'd say you are much better off w/ EPYC and 12-24 channels of DDR5, and a couple fast GPUS for shared experts and TFLOPS. But, unless you are doing all-night batch processing, that $10K is probably better spent on paying per token or even renting GPUs (especially when you take into account power).

Of course, there may be other reasons you'd want to inference locally (privacy, etc).

discuss

moffkalast|7 months ago

Yeah it's only really viable for chat use cases, coding is the most demanding in terms of generation speed, to keep the workflow usable it needs to spit out corrections in seconds, not minutes.

I use local LLMs as much as possible myself, but coding is the only use case where I still entirely defer to Claude, GPT, etc. because you need both max speed and bleeding edge model intelligence for anything close to acceptable results. When Qwen-3-Coder lands + having it on runpod might be a low end viable alternative, but likely still a major waste of time when you actually need to get something done properly.