I've got the Unsloth q4_K_XL 35B running in llama.cpp on an i9/64 GB/4090 machine, doing double-digit tokens per second with a 90k+ token context window available. The model is completely in VRAM.
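For anyone wanting to reproduce a setup like this, the llama-server invocation looks roughly like the sketch below; the GGUF filename is a placeholder, and exact flags vary a bit between llama.cpp builds:

    # Offload every layer to the GPU and ask for a ~90k context.
    # If the KV cache doesn't fit in 24 GB at that length, quantizing it
    # with -ctk q8_0 -ctv q8_0 (plus flash attention) buys headroom.
    llama-server -m ./model-35B-Q4_K_XL.gguf \
      -ngl 99 \
      -c 92160 \
      --port 8080

llama-server exposes an OpenAI-compatible API under /v1, so opencode or any similar client can point straight at it.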
It is slow but usable via opencode on an MBP M3 Max with 48 GB. So I guess hosted is still the better option for most people.
chvid|19 hours ago
The local models are considerably better relative to the hosted ones than they were six months ago. Benchmark-maxing or not, stuff is definitely happening in this area.