top | item 47201333

mstaoru | 1 day ago

I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to running local AI). I have a certain deep-research question (in a field that is deeply familiar to me) that I ask when I want to gauge a model's knowledge.

So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 seconds; Opus is very detailed and comes in at about 2-3 minutes.

Today I ran the question against the local qwen3.5:35b-a3b - it puffed away for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off at any moment.

Wonder what I'm doing wrong? How am I supposed to use this for any agentic coding on a large enough codebase? It would take days (and a 3M Peltor X5A) to produce anything useful.

lm28469|1 day ago

> Wonder what I'm doing wrong?

You're comparing ~100B-parameter open models running on a consumer laptop with private models of at the very least 1T parameters running on racks of bleeding-edge professional GPUs.

Local agentic coding is closer to "shit me the boilerplate for an Android app", not "deep research questions", especially on your machine.

vlovich123|1 day ago

The hardware difference explains runtime performance differences, not task performance.

Speculation is that the frontier models are all below 200B parameters, but a 2x size difference wouldn't fully explain task performance differences.

shlomo_z|20 hours ago

I'll add that AI labs put a lot of resources into letting the model search the web; that makes a big difference.

delaminator|1 day ago

Looks at the headline: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

aspenmartin|1 day ago

Well, Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest-generation Nvidia inference hardware, and the local models, even the largest, are still not quite at the frontier. The ones you can plausibly run on a laptop (where "plausible" really means "45 minutes and making my laptop sound like it is going to take off at any moment") are further behind still. Like they said, you're getting Sonnet 4.5 performance, which is two generations ago; speaking from experience, Opus 4.6 is night and day compared to Sonnet 4.5.

zozbot234|1 day ago

> Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment.

But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so that's what you're going to do. There is such a thing as a useful middle ground that may not give you the absolute best performance but will be found broadly acceptable and still be quite viable for a home lab.

adam_patarino|14 hours ago

The biggest gaps are not in hardware or model size. There is a lot of fallacious reasoning in the industry: most people believe bigger is better, for model size, compute, tools, etc.

The reality in ML is that small models can perform better at a narrow problem set than large ones.

The key is the narrow problem set. Opus can write you a poem, create a shopping list, and analyze your massive code base.

We trained our model to only focus on coding with our specific agent harness, tools, and context engine. And it’s small enough to fit on an M2 16GB. It’s as good as sonnet 4.5 and way better than qwen3.5:35b-a3b

Our beta will be out soon / rig.ai

amritananda|2 hours ago

No benchmarks, no information about training methods/datasets, template placeholder vibe-coded website. Waste of time.

wolvoleo|1 day ago

Well, first of all, you're running a long, intense task on a thermally constrained machine. Your MacBook Pro is optimised for portability and battery life, not max performance under load. And Apple's obsession with thinness overrules thermal performance for them. Short peaks will be OK, but a 45-minute task will thoroughly saturate the cooling system.

Even on servers this can happen. At work we have a 2U server with two 250W-class GPUs, and I found that by pinning the case fans at 100% I can get 30% more performance out of GPU tasks, which translates to several days faster for our use case. It does mean I can literally hear the fans screaming in the hallway outside the equipment room, but OK lol. Who cares. A laptop just can't compare.

Something with a desktop GPU, or even better something with HBM3, would run much better. Local models get slow when you use a ton of context, and the memory bandwidth of a MacBook Pro, while better than a typical PC's, is still not amazing.

And yeah the heaviest tasks are not great on local models. I tend to run the low hanging fruit locally and the stuff where I really need the best in the cloud. I don't agree local models are on par, however I don't think they really need to be for a lot of tasks.
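A rough back-of-envelope for the bandwidth point above: every generated token has to stream the active weights through memory once, so memory bandwidth puts a hard ceiling on decode speed. The numbers below (a ~400 GB/s figure for the M3 Max, the quantization widths) are illustrative assumptions, not measurements.

```python
# Decode-speed ceiling: tokens/sec <= bandwidth / bytes_read_per_token.
# Ignores KV cache reads, prompt processing, and compute limits, so real
# numbers will be lower.

def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float, bits: int) -> float:
    """Upper bound on tokens/sec given bandwidth (GB/s), active params
    (billions), and weight quantization width (bits)."""
    bytes_per_token_gb = active_params_b * bits / 8  # GB of weights read per token
    return bandwidth_gb_s / bytes_per_token_gb

# Assumed ~400 GB/s for an M3 Max; a3b = ~3B active params at 8-bit
print(round(decode_ceiling_tok_s(400, 3, 8)))   # ~133 tok/s ceiling
# The 122B-A10B MoE at 4-bit: 10B active params, 0.5 bytes each
print(round(decode_ceiling_tok_s(400, 10, 4)))  # ~80 tok/s ceiling
```

This is also why MoE models with few active parameters feel fast locally even when the full weight set barely fits in memory.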

pamcake|1 day ago

To your point, one can get a great performance boost by propping the laptop onto a roost-like stand in front of a large fan. Nothing like a cooling system actually built for sustained load but still.

meatmanek|23 hours ago

I've seen reports of qwen3.5-35b-a3b spending a ton of time reasoning if the context window is nearly empty-- supposedly it reasons less if you provide a long system prompt or some file contents, like if you use it in a coding agent.

I'm too GPU-poor to run it, but r/LocalLLaMa is full of people using it.
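If you want to try the workaround above, one way is to pre-fill the context with a long system prompt when calling Ollama's /api/chat endpoint, on the theory that the model reasons less when the window isn't nearly empty. A minimal sketch; the model tag and prompt text are assumptions, adjust to whatever you actually pulled.

```python
# Build a chat request for Ollama's /api/chat endpoint with a long system
# prompt pre-filling the context. POST the JSON to
# http://localhost:11434/api/chat with any HTTP client.
import json

system_prompt = (
    "You are a coding assistant working in a large Python repository.\n"
    "Relevant file contents:\n"
    "..."  # paste real file contents here to pad the context
)

payload = {
    "model": "qwen3.5:35b-a3b",  # model tag from the thread; swap for yours
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Summarize the failure modes of X."},
    ],
    "stream": False,
}

request_body = json.dumps(payload)  # send this as the POST body
```

No idea whether this fully tames the thinking phase, but it's a cheap experiment.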

boutell|15 hours ago

Can confirm. I gave it a variant of the car wash question on a MacBook M4 with 32 GB of RAM. It produced output at a conversational speed, sure, but that started with 6 minutes of thinking output. 6 minutes.

On the plus side, it did figure out the question even without the first sentence that's intended as a bit of a giveaway.

regularfry|13 hours ago

There's definitely something wrong with the thinking mode on this one. I wouldn't be surprised if it gets fixed, either by qwen themselves or with a fine-tune.

__mharrison__|1 day ago

Were you using mlx-lm? I've had good performance with that on Macs. (Sadly, the lead developer just left Apple.)

Admittedly, I haven't tried these models on my Mac, but I have on my DGX Spark, and they ran fine. I didn't see the slowdown you're mentioning.

mstaoru|18 hours ago

(I think) yes, via the latest Open WebUI + Ollama.

zozbot234|1 day ago

Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.

stavros|1 day ago

I can never see the point, though. Performance isn't anywhere near Opus, and even that gets confused following instructions or making tool calls in demanding scenarios. Open weights models are just light years behind.

I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.

mstaoru|11 hours ago

So it's back to the original question: why spend $5-10k on the Studio when it will still be 10x slower and half as intelligent vs. $20/month Sonnet? What is the point (besides privacy) of using local models for coding now?

PS: I can understand that isolated "valuable" problems like sorting a photo collection or feeding a cat via ESPHome can be solved with local models.

wat10000|1 day ago

I have a laptop already, so that's what I'm going to use.

xtn|23 hours ago

I think knowledge of frontier research certainly scales with the number of parameters. Also, US labs can pay more money to have researchers provide training data in these frontier research areas.

On the other hand, if open-source models and MacBooks really could be as powerful as those SOTA models from Google, etc., then the stock prices of many companies would have already collapsed.

muyuu|20 hours ago

Depending on the specificity of the research, having a model with fewer parameters will come with a higher penalty. If you want a model to perform better at something specific while staying smaller, generally it will take specific training to achieve that.

notreallya|1 day ago

Sonnet 4.5 level isn't Opus 4.6 level, simple as

culi|1 day ago

Well, you can't run Gemini Pro or Opus 4.6 locally, so are you comparing a locally run model to cloud platforms?

holoduke|20 hours ago

Your Gemini or Opus question got sent to a Texas datacenter, where it was queued and processed by a subunit of 80 H200 140GB 1000W cards running a many-billion- or trillion-parameter model. It took less than 200 ms to process a single request. Your Claude client decided to spawn 30 sub-agents and iterated over a total of 90 requests, totalling about 45,000 ms. Now compare that to your 100B-transistor CPU doing something similar. Yes, that would be slow.

mstaoru|18 hours ago

Right, it was more of a rhetorical question :) My point being - how are these local models really useful to me now? Is the Only Way ™ to sell my house and build an 8x5090 monster? How does that compare to $20/month Opus? (Privacy aside.)

The second-order thought from this is... will we get value-based price leveling soon? If the alternative to a hosted LLM is a $10-20k+ machine with $500+ monthly energy bills, will hosted prices asymptotically climb to reflect this reality?

Something to think about.

rienko|1 day ago

Use a larger model like Qwen3.5-122B-A10B quantized to 4/5/6 bits depending on how much context you desire; use the MLX versions if you want the best tok/s on Mac hardware.

If you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100 GB), my guess is the results will be much better than 35b-a3b's.
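To see why the quantization width matters on a 128 GB machine, here is the standard back-of-envelope: weight memory is roughly params × bits / 8, ignoring KV cache and runtime overhead. A sketch under those assumptions:

```python
# Approximate weight-memory footprint of a quantized model:
# params (billions) * bits / 8 gives decimal GB, excluding KV cache
# and framework overhead, which also need headroom.

def weights_gb(params_billion: float, bits: float) -> float:
    return params_billion * bits / 8  # billions of params * bytes/param = GB

for bits in (4, 5, 6):
    print(f"122B @ {bits}-bit: ~{weights_gb(122, bits):.0f} GB")
# 4-bit ~= 61 GB, 5-bit ~= 76 GB, 6-bit ~= 92 GB: lower-bit quants leave
# correspondingly more room for long context on a 128 GB machine.
```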

furyofantares|1 day ago

Can you try asking Sonnet 4.5 the same question, since that is what this model is claimed to be on par with?

andxor|1 day ago

You're not doing anything wrong. The Chinese models are not as good as advertised. Surprise surprise!

gigatexal|23 hours ago

I have the exact same hardware and was going to do the same thing with the 122B model ... I'll just keep paying Anthropic; the models are just that good. Trying out Gemini too. But I won't pay OpenAI, as they're going to be helping Pete Hegseth develop autonomous killing machines.

CamperBob2|1 day ago

Try the 27B dense model. It will likely do much better than the 35B MoE with only 3B active parameters.

Also, performance on research-y questions isn't always a good indicator of how the model will do for code generation or agent orchestration.

regularfry|13 hours ago

Currently sat waiting for the unsloth fixed quants to drop, but I'm on the edge of my seat for this.