Same? Not quite as good as that. But Google's Gemma 3 27B is highly similar to their last Flash model. The latest Qwen3 variants are very good, and for my needs at least they are the best open coders. But really, here's the thing:
There are so many varieties, specialized for different tasks or simply different in performance.
Maybe we'll get to a one-size-fits-all model at some point, but for now trying out a few can pay off. It also starts to build a better sense of the ecosystem as a whole.
For running them: if you have an Nvidia GPU w/ 8GB of VRAM you can probably run a bunch of them, quantized. It gets a bit esoteric once you get into quantization varieties, but generally speaking you should find out which integer & float math your GPU has optimized support for, then choose the largest quantized model that matches that support and still fits in VRAM. Most often that's what will perform best in both speed and quality, unless you need to run more than one model at a time.
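To put rough numbers on "largest quantized model that still fits in VRAM", here's a back-of-the-envelope sketch. The helper names, the 1.5 GiB overhead reserve, and the list of bit widths are my own assumptions for illustration. It only counts weight storage at a nominal bits-per-weight; real quant formats carry some extra per-block data, the KV cache grows with context length, and it says nothing about which math your GPU is optimized for, so treat the result as a ceiling rather than a guarantee.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def largest_fitting_bits(params_billion, vram_gib, overhead_gib=1.5,
                         candidates=(16, 8, 6, 5, 4, 3, 2)):
    """Highest nominal bit width whose weights fit in VRAM, or None.

    overhead_gib is an assumed reserve for KV cache, activations, and
    whatever else is using the GPU; tune it for your own setup.
    """
    budget = vram_gib - overhead_gib
    for bits in sorted(candidates, reverse=True):
        if weights_gib(params_billion, bits) <= budget:
            return bits
    return None


if __name__ == "__main__":
    # Hypothetical pairings of model size (billions of params) and VRAM (GiB).
    for params, vram in [(8, 8), (30, 16), (70, 8)]:
        print(params, vram, largest_fitting_bits(params, vram))
```

With these assumptions, a 30B model on a 16GB card only fits at around 4 bits per weight, which lines up with "heavily quantized".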
To give you a reference point on model choice, performance, GPU, etc.: one of my systems runs an Nvidia RTX 4080 w/ 16GB VRAM. Using Qwen 3 Coder 30B, heavily quantized, I get about 60 tokens per second.
I get tolerable performance out of a quantized gpt-oss 20b on an old RTX 3050 I have kicking around (I want to say 20-30 tokens/s, or faster when the cache is effective). It's appreciably faster on the 4060. On the 3050 it's not quite ideal for more interactive agentic coding, but it's approaching it, and it fits nicely into "coding in the background while I fiddle on something else" territory.
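For a feel of what those tokens/s figures mean in wall-clock terms, here's a tiny sketch. The 1500-token response length is a made-up example, and this only covers generation; prompt processing time is ignored.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response, ignoring prompt processing."""
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    response_tokens = 1500  # hypothetical size of one agentic coding reply
    for tps in (20, 25, 60):
        secs = generation_seconds(response_tokens, tps)
        print(f"{tps} tok/s: {secs:.0f} s")
```

A minute or so per reply is fine for background work, and noticeably less pleasant when you're waiting on every turn.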
The run-at-home suggestion was in the context of spending $2k/mo. At that price you get your money back on self-hosted hardware at a much more reasonable pace than at $20/mo (or even $200/mo).
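The payback math is simple enough to sketch. The $5000 hardware figure is just an illustrative number, and the function and its optional power-cost parameter are my own framing, not anything precise.

```python
import math


def breakeven_months(hardware_cost, monthly_api_spend, monthly_power_cost=0.0):
    """Whole months until hardware pays for itself, or None if it never does."""
    savings = monthly_api_spend - monthly_power_cost
    if savings <= 0:
        return None
    return math.ceil(hardware_cost / savings)


if __name__ == "__main__":
    hardware = 5000  # hypothetical rig price in dollars
    for spend in (2000, 200, 20):
        print(f"${spend}/mo: {breakeven_months(hardware, spend)} months")
```

Under those assumptions it's about 3 months at $2k/mo, roughly 2 years at $200/mo, and over 20 years at $20/mo, which is the whole point.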
Well, there's an open source GPT model you can run locally. I don't think running models locally is all that cheap, though, considering top-of-the-line GPUs used to be $300 and now you're lucky to get the best GPU for under $2000. The better models require a lot more VRAM. Macs can run them pretty decently, but now you're spending $5000, and you could have just bought a rig with a 5090 and mediocre desktop RAM, because Sam Altman has ruined the RAM pricing market.
I got some decent mileage out of aider and Gemma 27B. The one-shot output wasn't quite as good, but I don't have to worry about paying per token or hitting plan limits, so I felt freer to let it devise a plan, run it in a loop, etc.
Not having to worry about token limits is surprisingly cognitively freeing. I don’t have to worry about having a perfect prompt.
ineedasername|15 days ago
Twirrim|15 days ago
saratogacx|15 days ago
giancarlostoro|14 days ago
Our_Benefactors|14 days ago
everforward|14 days ago
joquarky|15 days ago
colonCapitalDee|15 days ago