This is a good time to promote running your own models. I have been running my own models locally, and I would wager a local model will meet 85-95% of your needs if you really learn to use it. These models have gotten great. For anyone wanting to get into this, the smartest consumer-friendly models to run were just released: check out Qwen3.5, the 27B and 35B variants. They are small, and I recommend running full Q8 quants.

The easiest way to run these without dealing with complex GPU setups is to get a Mac. For the example I gave, a 64GB Mac will handle it well. If you are really cash-strapped, you can manage with 32GB but will have to run lower-resolution quants. If you are not cash-strapped, get at least 128GB, and if possible 256GB. The models are so good you will regret not getting a better system.

You can join the r/LocalLlama community on Reddit to learn more, but this is pretty easy: grab llama.cpp, then grab a GGUF quant from huggingface.co. The Unsloth quants are great: https://huggingface.co/unsloth/models
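As a rough sizing check (my own rule of thumb, not official figures): a Q8 GGUF needs roughly one byte per parameter for the weights, plus some quantization overhead and headroom for the KV cache:

```python
def q8_footprint_gb(params_b, overhead=1.10, kv_cache_gb=4.0):
    """Rough memory needed to run a Q8 GGUF: ~1 byte per parameter
    for weights, ~10% quantization overhead, plus KV-cache headroom.
    All factors here are approximations, not measured figures."""
    return params_b * overhead + kv_cache_gb

for params in (27, 35):
    print(f"{params}B @ Q8: ~{q8_footprint_gb(params):.0f} GB")
```

By that estimate the 35B at Q8 fits comfortably in 64GB but not in 32GB, which is why lower-memory machines need smaller quants.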
0xbadcafebee|1 day ago
A laptop with an iGPU and loads of system RAM has the advantage of being able to use system RAM in addition to VRAM to load models (assuming your GPU driver supports it, which most do AFAIK), so load up as much system RAM as you can. The downside is that system RAM is slower than a discrete card's dedicated GDDR memory. These GPUs would be the Radeon 890M and Intel Arc (previous generations are still decently good, if that's more affordable for you).
A laptop with a discrete GPU will not be able to load models as large directly to GPU, but with layer offloading and a quantized MoE model, you can still get quite fast performance with modern low-to-medium-sized models.
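The offloading arithmetic can be sketched like this (the even per-layer split is an approximation; llama.cpp's `-ngl` flag is what controls how many layers go to the GPU):

```python
def layers_on_gpu(vram_gb, model_gb, n_layers, reserve_gb=1.5):
    """Estimate a reasonable -ngl value for llama.cpp: how many of
    the model's layers fit in VRAM, keeping some headroom for the
    KV cache and CUDA/Metal buffers. Assumes equally sized layers."""
    per_layer = model_gb / n_layers
    usable = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable / per_layer))

# e.g. a ~18 GB Q4 model with 48 layers on a 12 GB card
print(layers_on_gpu(12, 18, 48))
```

The remaining layers run on the CPU from system RAM, which is where a quantized MoE model helps: only a small fraction of its weights are active per token.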
Do not get less than 32GB RAM for any machine, and max out the iGPU machine's RAM. Also try to get a bigass NVMe drive, as you will likely be downloading a lot of big models and should be using a VM with Docker containers; all of that steals away quite a bit of drive space.
Final thought: before you spend thousands on a machine, consider that there are at least a dozen companies that provide non-Anthropic/non-OpenAI models in the cloud, many of which are dirt cheap because of how fast and good open weights are now. Do the math before you purchase a machine; unless you are doing 24/7/365 inference, the cloud is vastly more cost-effective.
bjackman|1 day ago
Oh yeah, seems obvious now that you've said it, but this is a great point.
I'm constantly thinking "I need to get into local models but I dread spending all that time and money without having any idea if the end result would be useful".
But obviously the answer is to start playing with open models in the cloud!
asymmetric|1 day ago
Do you have some links?
Also I assume the privacy implications are vastly different compared to running locally?
winternewt|1 day ago
timschmidt|1 day ago
Power consumption? Don't ask. A subscription is cheaper.
zepearl|1 day ago
My performance when using an RTX 5070 12GiB VRAM, Ryzen 7 9700X 8 cores CPU, 32GiB DDR5 6000MT (2 sticks):
So currently my sweet spot is "qwen3:30b-a3b" - even if the model doesn't completely fit on the GPU, it's still fast enough. "qwen3.5" was disappointing so far, but maybe things will change in the future (maybe Ollama needs some special optimizations for the 3.5-series?).

I would therefore deduce that the most important thing is the amount of VRAM, and that performance would be similar even when using an older GPU (e.g. an RTX 3060, which also has 12GiB of VRAM)?
Performance without a GPU, tested by using a Ryzen 9 5950X 16 cores CPU, 128GiB DDR4 3200 MT:
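That deduction matches the usual rule of thumb: decode speed is mostly memory-bandwidth bound, since each generated token streams all active weights through memory once. A rough sketch (the bandwidth and bytes-per-parameter figures are assumptions, not benchmarks):

```python
def est_tok_per_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Upper-bound decode speed: every generated token reads all
    active weights through memory once, so speed is roughly
    bandwidth divided by the bytes of active parameters."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# dense 27B at Q8 on dual-channel DDR5-6000 (~96 GB/s, assumed)
print(est_tok_per_s(96, 27, 1.0))   # a few tok/s
# a 30B-A3B MoE (~3B active params) at ~Q4 (~0.6 bytes/param)
print(est_tok_per_s(96, 3, 0.6))    # tens of tok/s
```

This is why a 30B-A3B MoE stays usable even when it spills into system RAM: only ~3B parameters are active per token, while a dense 27B on CPU crawls.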
siquick|1 day ago
Cold boot times are around 5m but if your usage periods are predictable it can work out ok. Works out at $2 an hour.
Still far more expensive than a ChatGPT sub.
segmondy|1 day ago
Keyframe|1 day ago
I'd even come from another angle: what are my options if I want a decent coding agent, on the level of what Claude does, at any given price? Let's say a few tens of thousands of dollars? I've had a limited look at what's available to run locally, and nothing is on par.
atwrk|1 day ago
khalic|1 day ago
But right now, a Mac is the easiest way because of their memory architecture.
am17an|1 day ago
AussieWog93|1 day ago
That said, last time I tried local LLMs (around when gpt-oss came out) it still seemed super gimmicky (or at least niche, I could imagine privacy concerns would be a big deal for some). Very few use cases where you want an LLM but can't benefit immensely from using SOTA models like Claude Opus.
asmor|1 day ago
As much as I love owning my stack, you'd have to use so much of this to break even vs an inference provider/aggregator with open frontier-ish models. (and personally, I want to use as little as possible)
computerex|1 day ago
giancarlostoro|1 day ago
Also: because Apple, in their infinite wisdom, despite giving you a fan, turn it on very lazily (I swear it has to hit 100°C before it comes on) and give you zero control over fan settings, you may want to snag something like TG Pro for the Mac. I wound up buying a license for it; it lets you define the temperature at which your fans spin up, and even gives you manual control.
On my 24GB RAM MacBook Pro I have about 16GB available for inference. I use Zed with LM Studio as the back-end. I primarily just use Claude Code, but as you note, I'm sure a beefier Mac with more RAM could handle way more.
There's a few models that are interesting on the Mac with LM Studio that let you call tooling, so it can read your local files and write and such:
mistralai/mistralai-3-3b - this one's 4.49GB, so I can increase its context window. Not sure if it auto-compacts or not; I've only just started testing it.
zai-org/glm-4.6v-flash - This one is 7.09GB, same thing, only just started testing it.
mistralai/mistral-3-14b-reasoning - This one is 15.2GB just shy of the max, so not a TON of wiggle room, but usable.
If you're Apple, or a company that builds things for Macs or other devices: please build something to help with airflow / cooling for the MBP / Mac Mini. It feels ridiculous that it becomes a 100°C device, and I'm not so sure that's great for device health if you want to run inference for longer than the norm.
I will probably buy a new Mac whenever the inference speeds increase at a dramatic enough rate. I sure hope Apple is considering serious options for increasing inference speed.
duskwuff|1 day ago
hypercube33|1 day ago
gambiting|1 day ago
I have a base model M4 Mac Mini and it absolutely does have a fan inside it.
elorant|1 day ago
ddxv|1 day ago
I've wanted to try some of the more recent 8B models for local tab completion or agentic, any experience with those kinds of smaller models?
lioeters|1 day ago
So far I'm using it conversationally, and scripting with tools. I wrote a simple chat interface / REPL in the terminal. But it's not integrated with code editor, nor agentic/claw-like loops. Last time I tried an open-source Codex-like thing, a popular one but I forget its name, it was slow and not that useful for my coding style.
It took some practice but I've been able to get good use out of it, for learning languages (human and programming), translation, producing code examples and snippets, and sometimes bouncing ideas like a rubber-duck method.
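A minimal version of that kind of terminal REPL, assuming a local OpenAI-compatible server (llama-server's default port here; the URL and the history-trimming cutoff are my assumptions, not the commenter's setup):

```python
import json
import urllib.request

# llama-server's default endpoint; LM Studio would be :1234/v1 (assumed)
URL = "http://localhost:8080/v1/chat/completions"

def trim_history(history, max_msgs=20):
    """Keep the system prompt plus the most recent messages so the
    conversation stays within a small local model's context window."""
    if len(history) <= max_msgs:
        return history
    return history[:1] + history[-(max_msgs - 1):]

def chat_once(history, base_url=URL):
    """Send the running conversation to a local OpenAI-compatible
    server and return the assistant's reply text."""
    payload = {"messages": history, "stream": False}
    req = urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def repl():
    history = [{"role": "system", "content": "You are a helpful assistant."}]
    while True:
        user = input("> ").strip()
        if user in ("exit", "quit"):
            break
        history.append({"role": "user", "content": user})
        history = trim_history(history)
        reply = chat_once(history)
        history.append({"role": "assistant", "content": reply})
        print(reply)

if __name__ == "__main__":
    repl()
```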
segmondy|1 day ago
ZYZ64738|1 day ago
untested:
https://github.com/xaskasdf/ntransformer
setopt|1 day ago
mcv|1 day ago
I haven't tried pure text models, but 27B sounds painful for my system.
drivebyhooting|1 day ago
Macuyiko|1 day ago
segmondy|1 day ago
2001zhaozhao|1 day ago
magicalhippo|1 day ago
However, models respond very differently, and there are tricks you can do like limiting quantization of certain layers. Some models can generally behave fine down into sub-Q4 territory, while others don't do well below Q8 at all. And then you have the way it was quantized on top of that.
So either find some actual benchmarks, which can be rare, or you just have to try.
As an example, Unsloth recently released some benchmarks[1] which showed Qwen3.5 35B tolerating quantization very well, except for a few layers which were very sensitive.
edit: Unsloth has a page detailing their updated quantization method here[2], which was just submitted[3].
[1]: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
[2]: https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
[3]: https://news.ycombinator.com/item?id=47192505
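The idea behind those dynamic quants can be sketched with some size arithmetic: keep most weights at a low bit-width and only the sensitive layers at higher precision. The bits-per-weight figures below are illustrative, not Unsloth's actual recipe:

```python
def mixed_quant_gb(params_b, main_bpw, sensitive_frac, sensitive_bpw):
    """Approximate GGUF size when most layers use a low-bit quant
    but a sensitive fraction of weights is kept at higher precision
    (the idea behind 'dynamic' quants). bpw = bits per weight."""
    avg_bpw = (1 - sensitive_frac) * main_bpw + sensitive_frac * sensitive_bpw
    return params_b * avg_bpw / 8

# 35B model: everything at ~4.5 bpw vs. keeping 10% of weights at 8 bpw
print(f"{mixed_quant_gb(35, 4.5, 0.0, 8):.1f} GB")
print(f"{mixed_quant_gb(35, 4.5, 0.10, 8):.1f} GB")
```

The point is that protecting the sensitive layers costs only a couple of GB while avoiding most of the quality loss of a uniform low-bit quant.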
segmondy|1 day ago
you can always try evals and see if you have a q6 or q4 that can perform better than your q8. for smaller models i go q8. for bigger ones when i run out of memory I then go q6/q6/q4 and sometimes q3. i run deepseek/kimi-q4 for example.
I suggest beginners start with q8 so they get the best quality and aren't disappointed. It's simple to use q8 if you have the memory; choice fatigue and confusion come in once you start trying to pick other quants...
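That heuristic, as a sketch (the bits-per-weight numbers for the k-quants are rough approximations):

```python
# Approximate bits per weight for common GGUF quants (rough figures)
QUANT_BPW = {"q8_0": 8.5, "q6_k": 6.6, "q4_k_m": 4.8, "q3_k_m": 3.9}

def pick_quant(params_b, mem_budget_gb, headroom=1.2):
    """Take the highest-precision quant whose weights (plus ~20%
    headroom for KV cache etc.) still fit in the memory budget."""
    for name, bpw in sorted(QUANT_BPW.items(), key=lambda kv: -kv[1]):
        if params_b * bpw / 8 * headroom <= mem_budget_gb:
            return name
    return None

print(pick_quant(27, 64))   # plenty of room, so q8
print(pick_quant(27, 24))   # has to step down
```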
unmole|1 day ago
bjackman|1 day ago
I have always taken plenty of care to try and avoid becoming dependent on big tech for my lifestyle. Succeeded in some areas, failed in others.
But now AI is a part of so many things I do and I'm concerned about it. I'm dependent on Android but I know with a bit of focus I have a clear route to escape it. Ditto with GMail. But I don't actually know what I'd do tomorrow if Gemini stopped serving my needs.
I think for those of us that _can_ afford the hardware it is probably a good investment to start learning and exploring.
One particular thing I'm concerned about is that right now I use AI exclusively through the clients Google picked for me, coz it makes financial sense. (You don't seem to get free bubble money if you buy tokens via API billing, only consumer accounts). This makes me a bit of a sheep and it feels bad. There's so much innovation happening and basically I only benefit from it in the ways Google chooses.
(Admittedly I don't need local models to fix that particular issue, maybe I should just start paying the actual cost for tokens).
AussieWog93|1 day ago
The cash burn comes from models ballooning in size - they spend (as an example, not actual numbers) 100M on training + inference for the lifetime of Sonnet 3.5, make 200M from subscriptions/api keys while it's SOTA, but then have to somehow come up with 1B to train Opus 4.0.
To run some other back-of-the-envelope calcs: GLM 4.7 Air (the previous "good" local LLM) can generate ~70 tok/s on a Mac Mini. That equates to ~2,200 million tokens per year.
Openrouter charge $0.40 per million tokens, so theoretically if you were using that Mac mini at 100% utilisation you'd be generating $880 per annum "worth" of API usage.
Assuming a power draw of something like 50W, you're only looking at ~440kWh per annum. At 20c per kWh that's ~$90 on power, plus $499 to get the hardware itself. Depreciate that $499 hardware cost over 3 years and you're looking at ~$260 per year to generate ~$880 in inference income.
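The same back-of-the-envelope numbers, written out (all figures are from the comment above, and this assumes 100% utilisation around the clock):

```python
SECONDS_PER_YEAR = 3600 * 24 * 365

tok_per_s = 70                                 # GLM 4.7 Air on a Mac Mini
tokens_per_year = tok_per_s * SECONDS_PER_YEAR # ~2.2 billion tokens
api_value = tokens_per_year / 1e6 * 0.40       # OpenRouter at $0.40/Mtok

power_kwh = 0.050 * 24 * 365                   # 50 W, always on
yearly_cost = power_kwh * 0.20 + 499 / 3       # power + hardware over 3 yrs

print(f"~${api_value:.0f} of inference for ~${yearly_cost:.0f}/yr")
```

Real utilisation will be far below 100%, which is the catch: at, say, 10% duty cycle the "income" drops to ~$88/yr while most of the cost remains.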
segmondy|1 day ago
ZenoArrow|1 day ago