
Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

442 points| lostmsu | 1 day ago |venturebeat.com

251 comments


Aurornis|22 hours ago

If you're new to this: all of the open source models are playing benchmark optimization games. Every new open-weight model comes with promises of being as good as something SOTA from a few months ago, and then they always disappoint in actual use.

I've been playing with Qwen3-Coder-Next and the Qwen3.5 models since they were each released.

They are impressive, but they are not performing at Sonnet 4.5 level in my experience.

I have observed that they're configured to be very tenacious. If you can carefully constrain the goal with some tests they need to pass and frame it in a way to keep them on track, they will just keep trying things over and over. They'll "solve" a lot of these problems in the way that a broken clock is right twice a day, but there's a lot of fumbling to get there.

That said, they are impressive for open source models. It's amazing what you can do with self-hosted now. Just don't believe the hype that these are Sonnet 4.5 level models because you're going to be very disappointed once you get into anything complex.

kir-gadjello|21 hours ago

Respectfully, from my experience and a few billion tokens consumed, some open source models really are strong and useful. Specifically StepFun-3.5-flash: https://github.com/stepfun-ai/Step-3.5-Flash

I'm working on a pretty complex Rust codebase right now, with hundreds of integration tests and nontrivial concurrency, and stepfun powers through.

I have no relation to stepfun, and I'm saying this purely from deep respect to the team that managed to pack this performance in 196B/11B active envelope.

lend000|16 hours ago

Yes and no. "Last-gen" (like, from 6 months ago) frontier models do still tend to outperform the best open source models. But some models, especially GLM-5, really have captured whatever circuitry drives pattern matching in the models they were trained off of.

I like this benchmark that competes models against one another in competitive environments, which seems like it can't really be gamed: https://gertlabs.com

wolvoleo|21 hours ago

All models are doing that. Not only the open source ones.

I bet the cloud ones are doing it a lot more because they can also affect the runtime side which the open source ones can't.

dimgl|19 hours ago

I'm using Qwen 3.5 27b on my 4090 and let me tell you: this is the first time I am seriously blown away by coding performance on a local model. Local models are almost always unusable. Not this time though...

rudhdb773b|20 hours ago

Are there any up-to-date offline/private agentic coding benchmark leaderboards?

If the tests haven't been published anywhere and are sufficiently different from standard problems, I would think the benchmarks would be robust to intentional over-optimization.

Edit: These look decent and generally match my expectations:

https://www.apex-testing.org/

chaboud|20 hours ago

"When a measure becomes a target, it ceases to be a good measure."

Goodhart's law shows up with people, in system design, in processor design, in education...

Models are going to be over-fit to the tests unless scruples or practical application realities intervene. It's a tale as old as machine learning.

warpspin|10 hours ago

Hmm, I second this. I haven't compared Qwen3.5 122B yet, but I played around with OpenCode + Qwen3-Coder-Next yesterday and did manual comparisons with Claude Code, and Claude Code is still far ahead in overall felt "intelligence quality".

crystal_revenge|20 hours ago

> they always disappoint in actual use.

I’ve switched to using Kimi 2.5 for all of my personal usage and am far from disappointed.

Aside from being much cheaper than the big names (yes, I’m not running it locally, but like that I could) it just works and isn’t a sycophant. Nice to get coding problems solved without any “That’s a fantastic idea!”/“great point” comments.

At least with Kimi my understanding is that beating benchmarks was a secondary goal to good developer experience.

ekjhgkejhgk|8 hours ago

I've been trying to get these things to run locally and use tools. Am I right in understanding that it's impossible for these things to use tools from within llama.cpp? Do I need another "thing" to run the models? What exactly is the mechanism by which the models become aware that they're somewhere where they have tools available? So many questions...

amelius|22 hours ago

Are you saying that the benchmarks are flawed?

And could quantization maybe partially explain the worse than expected results?

noosphr|21 hours ago

It's not just the open source ones.

The only benchmarks worth anything are dynamic ones which can be scaled up.

baq|12 hours ago

they're distilling claude and openai obviously.

that said, sonnet 4.5 is not a good model today, March 1st 2026. (it blew my mind on its release day, September 29th, 2025.)

ekianjo|18 hours ago

> That said, they are impressive for open source models.

there is nothing open "source" about them. They are open weights, that's all.

eurekin|21 hours ago

Very good point. I'm playing with them too and got to the same conclusion.

jackblemming|21 hours ago

Death by KPIs. Management makes it too risky to do anything but benchmaxx. It will be the death of American AI companies too. Eventually, people will notice models aren’t actually getting better and the money will stop flowing. However, this might be a golden age of research as cheap GPUs flood the market and universities have their own clusters.

mstaoru|23 hours ago

I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge model's knowledge.

So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.

Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.

Wonder what I'm doing wrong... How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.

lm28469|23 hours ago

> Wonder what am I doing wrong?

You're comparing ~100B-parameter open models running on a consumer laptop vs. private models with, at the very least, 1T parameters running on racks of bleeding-edge professional GPUs.

Local agentic coding is closer to "shit me the boilerplate for an android app", not "deep research questions", especially on your machine.

aspenmartin|23 hours ago

Well, Opus and Gemini are probably running on multiple H200 equivalents, maybe hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest-generation Nvidia inference hardware, and the local models, even the largest, are still not quite at the frontier. The ones you can plausibly run on a laptop (where "plausible" really means "45 minutes and making my laptop sound like it is going to take off at any moment") are further behind still. Like they said, you're getting Sonnet 4.5 performance, which is two generations ago; speaking from experience, Opus 4.6 is night and day compared to Sonnet 4.5.

adam_patarino|9 hours ago

The biggest gaps are not in hardware or model size. There is a lot of logical fallacy in the industry. Most people believe bigger is better. For model size, compute, tools, etc.

The reality in ML is that small models can perform better at a narrow problem set than large ones.

The key is the narrow problem set. Opus can write you a poem, create a shopping list, and analyze your massive code base.

We trained our model to only focus on coding with our specific agent harness, tools, and context engine. And it’s small enough to fit on an M2 16GB. It’s as good as sonnet 4.5 and way better than qwen3.5:35b-a3b

Our beta will be out soon / rig.ai

wolvoleo|21 hours ago

Well, first of all, you're running a long, intense task on a thermally constrained machine. Your MacBook Pro is optimised for portability and battery life, not max performance under load. And Apple's obsession with thinness overrules thermal performance for them. Short peaks will be OK, but a 45-minute task will thoroughly saturate the cooling system.

Even on servers this can happen. At work we have a 2U sized server with two 250W class GPUs. And I found that by pinning the case fans at 100% I can get 30% more performance out of GPU tasks which translates to several days faster for our usecase. It does mean I can literally hear the fans screaming in the hallway outside the equipment room but ok lol. Who cares. But a laptop just can't compare.

Something with a desktop GPU or even better something with HBM3 would run much better. Local models get slow when you use a ton of context and the memory bandwidth of a MacBook Pro while better than a pc is still not amazing.

And yeah the heaviest tasks are not great on local models. I tend to run the low hanging fruit locally and the stuff where I really need the best in the cloud. I don't agree local models are on par, however I don't think they really need to be for a lot of tasks.

meatmanek|18 hours ago

I've seen reports of qwen3.5-35b-a3b spending a ton of time reasoning if the context window is nearly empty-- supposedly it reasons less if you provide a long system prompt or some file contents, like if you use it in a coding agent.

I'm too GPU-poor to run it, but r/LocalLLaMa is full of people using it.

__mharrison__|22 hours ago

Were you using mlx-lm? I've had good performance with that on Macs. (Sadly, the lead developer just left Apple.)

Admittedly, I haven't tried these models on my Mac, but I have on my DGX Spark, and they ran fine. I didn't see the slowdown you're mentioning.

zozbot234|23 hours ago

Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.

xtn|18 hours ago

I think knowledge of frontier research certainly scales with the number of parameters. Also, US labs can pay more money to have researchers provide training data in these frontier research areas.

On the other hand, if open source models and MacBooks really could be as powerful as the SOTA models from Google, etc., then the stock prices of many companies would already have collapsed.

muyuu|15 hours ago

Depending on the specificity of the research, having a model with fewer parameters will come with a higher penalty. If you want a model to perform better at something specific while staying smaller, generally it will take specific training to achieve that.

notreallya|23 hours ago

Sonnet 4.5 level isn't Opus 4.6 level, simple as

holoduke|15 hours ago

Your Gemini or Opus question got sent to a Texas datacenter, where it got queued and processed by a subunit of 80 H200 140GB 1000W cards running a many-billion- or trillion-parameter model. It took less than 200ms to process a single request. Your Claude client decided to spawn 30 sub-agents and iterated over a total of 90 requests, totalling about 45,000ms. Now compare that to your ~100B-transistor laptop chip doing something similar. Yes, that would be slow.

culi|23 hours ago

Well you can't run Gemini Pro or Opus 4.6 locally so are you comparing a locally run model to cloud platforms?

rienko|22 hours ago

Use a larger model like Qwen3.5-122B-A10B quantized to 4/5/6 bits depending on how much context you desire; MLX versions if you want the best tok/s on Mac hardware.

If you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100GB), my guess is the results are much better than 35b-a3b.

furyofantares|23 hours ago

Can you try asking Sonnet 4.5 the same question, since that is what this model is claimed to be on par with?

andxor|22 hours ago

You're not doing anything wrong. The Chinese models are not as good as advertised. Surprise surprise!

gigatexal|18 hours ago

I have the exact same hardware. Was going to do the same thing with the 122B model ... I'll just keep paying Anthropic; the models are just that good. Trying out Gemini too. But won't pay OpenAI, as they're going to be helping Pete Hegseth develop autonomous killing machines.

CamperBob2|22 hours ago

Try the 27B dense model. It will likely do much better than the 35B MoE with only 3B active parameters.

Also, performance on research-y questions isn't always a good indicator of how the model will do for code generation or agent orchestration.

Paddyz|19 hours ago

[deleted]

alexpotato|1 day ago

I recently wrote a guide on getting:

- llama.cpp

- OpenCode

- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)

working on a M1 MacBook Pro (e.g. using brew).

It was a bit finicky to get all of the pieces working together, so hopefully this can be reused with these newer models.

https://gist.github.com/alexpotato/5b76989c24593962898294038...

freeone3000|23 hours ago

We can also run LM Studio and get it installed with one search and one click, exposed through an OpenAI-compatible API.
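Once a local server like LM Studio (or llama.cpp's llama-server) is up, the endpoint speaks the OpenAI chat-completions format. A hedged sketch; the model name is a placeholder, and the port assumes LM Studio's default:

```shell
# Query a locally served model through the OpenAI-compatible endpoint.
# LM Studio listens on port 1234 by default; llama-server uses 8080.
# The model name below is a placeholder -- list loaded models via GET /v1/models.
curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5-35b-a3b",
        "messages": [{"role": "user", "content": "Write a haiku about RAM."}],
        "temperature": 0.7
      }'
```

Because the API shape matches OpenAI's, most agent tooling can be pointed at the local server just by overriding the base URL.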

kpw94|23 hours ago

On my 32GB Ryzen desktop (recently upgraded from 16GB before the RAM prices went up another +40%), did the same setup of llama.cpp (with Vulkan extra steps) and also converged on Qwen3-Coder-30B-A3B-Instruct (also Q4_K_M quantization)

On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.

I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?

robby_w_g|23 hours ago

Does your MBP have 32 GB of ram? I’m waiting on a local model that can run decently on 16 GB

copperx|1 day ago

How fast does it run on your M1?

jackcosgrove|20 hours ago

I am a total neophyte when it comes to LLMs, and only recently started poking around into the internals of them. The first thing that struck me was that float32 dimensions seemed very generous.

I then discovered what quantization is by reading a blog post about binary quantization. That seemed too good to be true. I asked Claude to design an analysis assessing the fidelity of 1, 2, 4, and 8 bit quantization. Claude did a good job, downloading 10,000 embeddings from a public source and computing a similarity score and correlation coefficient for each level of quantization against the float32 SoT. 1 and 2 bit quantizations were about 90% similar and 8 bit quantization was lossless given the precision Claude used to display the results. 4 bit was interesting as it was 99% similar (almost lossless) yet half the size of 8 bit. It seemed like the sweet spot.

This analysis took me all of an hour so I thought, "That's cool but is it real?" It's gratifying to see that 4 bit quantization is actually being used by professionals in this field.
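For the curious, the shape of that analysis is easy to reproduce. A minimal sketch, with random Gaussian vectors standing in for real embeddings and a uniform symmetric quantizer as an assumption (the exact scheme used above isn't specified):

```python
# Sketch of the quantization-fidelity experiment: quantize vectors to
# 1/2/4/8 bits, then measure cosine similarity against the float originals.
import math
import random

def quantize(vec, bits):
    """Quantize/dequantize a vector; 1-bit keeps only the sign."""
    if bits == 1:
        return [math.copysign(1.0, x) for x in vec]
    levels = 2 ** bits - 1
    scale = max(abs(x) for x in vec) or 1.0
    step = 2 * scale / levels
    return [round(x / step) * step for x in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

random.seed(0)
vecs = [[random.gauss(0, 1) for _ in range(256)] for _ in range(100)]
for bits in (1, 2, 4, 8):
    sims = [cosine(v, quantize(v, bits)) for v in vecs]
    print(bits, round(sum(sims) / len(sims), 4))
```

With this toy setup the similarity climbs steeply from 1-bit to 4-bit and is essentially saturated by 8-bit, matching the "4-bit is the sweet spot" observation.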

deepsquirrelnet|20 hours ago

4-bit quantization on newer nvidia hardware is being supported in training as well these days. I believe the gpt-oss models were trained natively in MXFP4, which is a 4-bit floating point / e2m1 (2-exponent, 1 bit mantissa, 1 bit sign).

It doesn't seem terribly common yet though. I think it is challenging to keep it stable.

[1] https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...

[2] https://www.opencompute.org/documents/ocp-microscaling-forma...
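The e2m1 element format is small enough to decode by hand. A sketch of the per-element decoding per the OCP Microscaling (MX) spec: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit:

```python
# Decode a 4-bit FP4 e2m1 value: [sign | exp(2) | mantissa(1)], exponent bias 1.
def decode_e2m1(nibble: int) -> float:
    sign = -1.0 if (nibble >> 3) & 1 else 1.0
    exp = (nibble >> 1) & 0b11
    mant = nibble & 1
    if exp == 0:                               # subnormal: no implicit leading 1
        return sign * (mant / 2.0)
    return sign * (1.0 + mant / 2.0) * 2.0 ** (exp - 1)

# The eight positive magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
print(sorted(decode_e1m1 if False else decode_e2m1(n) for n in range(8)))
```

In MXFP4 each block of 32 such elements additionally shares one 8-bit power-of-two scale, which is what keeps the tiny dynamic range usable.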

regularfry|8 hours ago

There's also work on ternary models that's quite interesting, because the arithmetic operations are super fast and they're extremely cache efficient. Well worth looking into if that's the sort of thing that interests you.

silisili|19 hours ago

Mind sharing any resources? I've been thinking about trying to understand them better myself.

tymscar|20 hours ago

Thats cool.

I do wonder where that extra acuity you get from 1% more shows up in practice. I hate how I have basically no way to intuitively tell that because of how much of a black box the system is

pram|13 hours ago

I decided to try Qwen3.5 122B in LM Studio with Opencode and I am impressed. It's not super slow (M4 Max/128GB) and it's pretty close to how Claude Code feels. Getting pretty good code analysis, definitely feels Sonnet-esque. I'm hyped completely local alternatives are getting so good.

jjcm|21 hours ago

Getting better, but definitely not there yet, nor near Sonnet 4.5 performance.

What these open models are great for is narrow, constrained domains with good input/output examples. I typically use them for things like prompt expansion, sentiment analysis, and reformatting or rearranging the flow of code.

What I found they have trouble with is going from ambiguous description -> solved problem. Qwen 3.5 is certainly the best of the OSS models I've found (beating out GPT 120b OSS which was the previous king), and it's just starting to demonstrate true intelligence in unbound situations, but it isn't quite there yet. I have a RTX 6000 pro, so Qwen 3.5 is free for me to run, but I tend to default to Composer 1.5 if I want to be cheap.

The trend however is super encouraging. I bought my vid card with the full expectation that we'll have a locally running GPT 5.2 equiv by EoY, and I think we're on track.

solarkraft|1 day ago

Smells like hyperbole. A lot of people making such claims don't seem to have continued real-world experience with these models, or seem to have very weird standards for what they consider usable.

Up until relatively recently, while people had already long been making these claims, it came with the asterisk of "oh, but you can't practically use more than a few K tokens of context".

derekp7|23 hours ago

"Create a single page web app scientific RPN calculator"

Qwen 3.5 122b/a10b (at q3 using unsloth's dynamic quant) is so far the first model I've tried locally that produces a really usable RPN calculator app. Other models (even larger ones that I can run on my Strix Halo box) tend to either not implement the stack right, have non-functional operation buttons, or, most commonly, produce a keypad that looks like a Picasso painting (i.e., the 10-key pad has buttons missing or mapped all over the keypad area).

This seems like such a simple test, but I even just tried it in ChatGPT (whatever model they serve up when you don't log in), and it didn't even have any numerical input buttons. Claude Sonnet 4.6 did get it correct too, but that is the only other model I've used that gets this question right.

tempest_|1 day ago

Qwen3-Coder-30B-A3B-Instruct is good, I think, for inline IDE integration or operating on small functions or library code, but I don't think you will get too far with the one-shot feature implementation that people are currently doing with Claude or whatever.

__mharrison__|22 hours ago

I used the 35b model to create a polars implementation of PCA (no sklearn or imports other than math and polars). In less than 10 minutes I had the code. This is impressive to me considering how poorly all models were with polars until very recently. (They always hallucinated pandas code.)

oscord|22 hours ago

The SWE chart is missing Claude on the front page; interesting way to present your data. Mix and match at will. Grown-up people showing public-school-level sneakiness. That fact alone disqualifies your LLM. Business/marketing leaders usually are brighter than average developers... so there.

shell0x|19 hours ago

Can't wait to try that out locally. Keen to reduce my dependence on American products and services.

plastic3169|14 hours ago

Anyone have recommendations on EU services where one could run open models before buying expensive hardware?

zos_kia|14 hours ago

Koyeb (recently acquired by Mistral if I'm not mistaken) have GPUs you can rent by the minute and they also have one-click deploy of some open models.

nu11ptr|23 hours ago

Thinking about getting a new MBP M5 Max 128GB (assuming they are released next week). I know "future proofing" at this stage is near impossible, but for writing Rust code locally (likely using Qwen 3.5 on MLX for now), the AIs have convinced me this is probably my best choice for the immediate term with some level of longevity, while retaining portability (not strictly needed, but nice to have). Alternatively I was considering RTX options or a Mac Studio, but was leaning towards Apple for the unified memory. What does HN think?

pamcake|20 hours ago

> What does HN think?

Thermals. Your workloads will be throttled hard once it inevitably runs hot. See comments elsewhere in thread about why LLMs on laptops like MBP is underwhelming. The same chips in even a studio form factor would perform much better.

nl|19 hours ago

Strix Halo machines are a good option too if you are at all price sensitive. AMD (with all the downsides of that for AI work) but people are getting decent performance from them.

Also Nvidia Spark.

shell0x|19 hours ago

I have a Mac Studio with 128GB and a M4 Max and I'd recommend it. The power usage is also pretty good, but you may not care if you live somewhere where energy is cheap.

cmenge|21 hours ago

I've been mulling the same, but decided against (for now)

Using Claude Code Max 20 so ROI would be maybe 2+ years.

CC gives me unlimited coding in 4-6 windows in parallel. Unsure if any local model would beat (or even match) that, both in terms of quality and speed.

I wouldn't gamble on it now. With a subscription, I can change any time. With the machine, you risk that some great insane model comes out but needs 138GB, and then you'll pay for both.

syntaxing|21 hours ago

A big part that a lot of local users forget is that inference is hard. Maybe you have the wrong temperature. Maybe you have the wrong min-p. Maybe you have the wrong template. Maybe the implementation in llama.cpp has a bug. Maybe Q4 or even Q8 just won't compare to BF16. The reality is, there are so many knobs to LLM inferencing, and any of them can make the experience worse. It's not always the model's fault.
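As a concrete example, llama.cpp's llama-server exposes most of these knobs as flags. A sketch; the model path and the sampling values are placeholders, since the right settings come from each model's card:

```shell
# Placeholder invocation: substitute your GGUF path and the sampling values
# the model card recommends. --jinja applies the chat template embedded in
# the GGUF, which is often the difference between "broken" and "fine".
llama-server -m ./qwen3.5-35b-a3b-Q4_K_M.gguf \
  --temp 0.7 \
  --min-p 0.05 \
  --top-p 0.8 \
  --ctx-size 32768 \
  --jinja \
  -ngl 99   # offload all layers to the GPU if they fit
```

Getting any one of these wrong, most notably the chat template, can make a good model look broken.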

mark_l_watson|1 day ago

The new 35b model is great. That said, it has slight incompatibilities with Claude Code. It is very good for tool use.

johnnyApplePRNG|1 day ago

Claude code is designed for anthropic models. Try it with opencode!

stavros|23 hours ago

Have you tried the 122B one?

erelong|1 day ago

What kind of hardware does HN recommend or like to run these models?

suprjami|1 day ago

The cheapest option is two 3060 12G cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window.

If you want to spend twice as much for more speed, get a 3090/4090/5090.

If you want long context, get two of them.

If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.

dajonker|1 day ago

Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.

zozbot234|1 day ago

It depends. How much are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?

throwdbaaway|18 hours ago

For 27B, just get a used 3090 and hop on to r/LocalLLaMA. You can run a 4bpw quant at full context with Q8 KV cache.

xienze|1 day ago

It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I would probably be able to do 256K since I still have like 4GB of VRAM free). The prompt processing is something like 1K tokens/second and generates around 100 tokens/second. Plenty fast for agentic use via Opencode.
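The arithmetic behind "it fits" is easy to sketch. The architecture numbers below are hypothetical placeholders (the real layer/head counts come from the model's config.json, and llama.cpp adds compute-buffer overhead on top):

```python
# Back-of-envelope VRAM estimate: quantized weights plus KV cache.
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> float:
    # Factor of 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

GIB = 1024 ** 3
weights = model_bytes(35e9, 4.5)            # 35B params at ~4.5 bits/weight (Q4_K_M-ish)
kv = kv_cache_bytes(48, 4, 128, 131072, 1)  # hypothetical dims, 128K context, Q8 KV cache
print(f"weights ~{weights / GIB:.1f} GiB, KV cache ~{kv / GIB:.1f} GiB")
```

Under those assumptions the total lands just under 25 GiB, which is roughly why a Q4 model of this class with a quantized KV cache squeezes into a 24 GB card.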

elorant|1 day ago

Macs or a strix halo. Unless you want to go lower than 8-bit quantization where any GPU with 24GBs of VRAM would probably run it.

CamperBob2|1 day ago

I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.

I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.

Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.

OkWing99|15 hours ago

Can someone who has done this, simplify and say what specs we need on a `local computer` to run and test this, with a reasonable speed?

Excluding MBP M5 128GB.

regularfry|8 hours ago

I've got the unsloth q4_K_XL 35b running in llama.cpp on an i9/64G/4090 machine doing double-digit tokens per second with a 90k+ token context window available. The model's completely in VRAM.

chvid|14 hours ago

It is slow but usable via opencode on a mbp m3 max 48 gb. So I guess hosted is still the better option for most people.

The local models are considerably better relative to the hosted ones compared to 6 months ago. Bench maxing or not - stuff is happening in this area for sure.

sunkeeh|23 hours ago

Qwen3.5-122B-A10B BF16 GGUF = 224GB. The "80Gb VRAM" mentioned here will barely fit Q4_K_S (70GB), which will NOT perform as shown on benchmarks.

Quite misleading, really.

CamperBob2|22 hours ago

The larger 3.5 quants are actually pretty close to the full-blown 397B model's performance, at least looking at the numbers. Qwen 3.5 seems more tolerant of quantization than most.

car|22 hours ago

Can it do FizzBuzz in Brainfuck? Thus far all local models have tripped over their feet or looped out.

CamperBob2|16 hours ago

122B-A10B-UD-Q4-K-XL generated https://pastebin.com/j3ddfNtS -- but I can't get it to do anything in a couple of online interpreters. Guessing it wasn't trained on a lot of Brainfuck code.

Edit: it looks like the flagship models work by writing a C or Python program to do the bookkeeping. I don't have Qwen set up to use tools, and even Opus 4.6 shits the bed when told to do it without tools [1], so not too surprising that it didn't work.

1: https://claude.ai/share/1f5289ae-decd-4dfa-98fd-0d34346008c6 -- I interrupted it and told it not to use a C/Python program or any other tools to generate the Brainfuck code, and it gave me an error message after about 10 minutes that wasn't logged to the chat.

kristianpaul|23 hours ago

From https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b : "Qwen3.5-27B: For this guide we will be utilizing Dynamic 4-bit which works great on 18GB RAM"

kristianp|23 hours ago

18GB was an odd 3-channel one-off for the M3 Pros. I guess there's a bunch of them out there, but how slow would the 27B be on it, given that it's a dense model rather than an MoE?

aliljet|1 day ago

Is this actually true? I want to see actual evals that match this up with Sonnet 4.5.

gunalx|23 hours ago

Qwen 3.5 is really decent, outside of some weird failures on some scaffolding with seemingly differently trained tools.

Strong vision and reasoning performance, and the 35b-a3b model runs pretty OK on a 16GB GPU with some CPU layers.

lubitelpospat|17 hours ago

All right guys, this is your time - what consumer device do you use for local LLM inference? GPU poor answers only

piyh|19 hours ago

Unsloth is working magic with the qwen quants

karmasimida|20 hours ago

Raw scale of parameters is POWER, you can't get performance out of a small model from a much larger one.

hsaliak|20 hours ago

No it does not. None of these models have the “depth” that the frontier models have across a variety of conversations, tasks and situations. Working with them is like playing snakes and ladders, you never know when it’s going to do something crazy and set you back.

jbellis|22 hours ago

this is bullshit with a kernel of truth.

none of the qwen 3.5 models are anywhere near sonnet 4.5 class, not even the largest 397b.

BUT 27b is the smartest local-sized model in the world by a wide wide margin. (35b is shit. fast shit, but shit.)

benchmarks are complete, publishing on Monday.

throwdbaaway|19 hours ago

I would say 27B matches with Sonnet 4.0, while 397B A17B matches with Opus 4.1. They are indeed nowhere near Sonnet 4.5, but getting 262144 context length at good speed with modest hardware is huge for local inference.

Will check your updated ranking on Monday.

dimgl|19 hours ago

You mean 35B A3B? If this is shit, this is some of the best shit out I've seen yet. Never in a million years did I think I'd have an LLM running locally, actually writing code on my behalf. Accurately too.

jgalt212|8 hours ago

Our shop cannot use cloud models for sensitive data and code. For shops like ours, we continue to be impressed and appreciative of the progress in open-source / self-hosted models.

jedisct1|11 hours ago

Qwen3-Coder-Next also remains amazing as a local model.

If you want to use small models for coding, I'd highly recommend Swival https://swival.dev which was explicitly optimized for these.

renewiltord|18 hours ago

In practice I have not seen this. Sonnet is incredible performance. No open model is close. Hosted open models are so much worse that I end up spending more because of inferior intelligence.

pstuart|19 hours ago

One highly annoying facet of the hardware is that AMD's support for the NPU under Linux is currently non-existent, which abandons 50 of the 126 TOPS of stated AI capability. They seem to think that Windows support is good enough. Grrrrrr.

PunchyHamster|23 hours ago

I asked it to recite "potato" 100 times because I wanted to benchmark CPU vs. GPU speed. It's on line 150 of its planning. It has recited the requested thing 4 times already and started drafting the 5th response.

...yeah I doubt it

lachiflippi|23 hours ago

Qwen3.5 pretty much requires a long system prompt, otherwise it goes into a weird planning mode where it reasons for minutes about what to do, and double and triple checks everything it does. Both Gemini's and Claude Opus 4.6's prompts work pretty well, but are so long that whatever you're using to run the model has to support prompt caching. Asking it to "Say the word "potato" 100 times, once per line, numbered.", for example, results in the following reasoning, followed by the word "potato" in 100 numbered lines, using the smallest (and therefore dumbest) quant unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS:

"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.

I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."

lumirth|23 hours ago

well hold on now, maybe it’s onto something. do you really know what it means to “recite” “potato” “100” “times”? each of those words could be pulled apart into a dissertation-level thesis and analysis of language, history, and communication.

either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though

xenospn|1 day ago

Are there any non-Chinese open models that offer comparable performance?

MarsIronPI|23 hours ago

I think you could look into Mistral. There's also GPT-OSS, but I'm not sure how well it stacks up.

What's your problem with Chinese LLMs?

regularfry|8 hours ago

It's not only "non-Chinese" to think about here. There's nobody really touching Qwen in the single-GPU size class and there hasn't been for a couple of generations.

vagrantJin|4 hours ago

Why would anyone care if it's Chinese? No one uses ChatGPT because it's from the US.

culi|23 hours ago

All the Western ones are closed, while all the Chinese ones are open. The only exception is the European Mistral, but the performance of that model is not very satisfactory. Hopefully they make some improvements soon.

shell0x|19 hours ago

What's the problem with Chinese models? The models are already open which makes them more trustworthy than the American closed models.

Paddyz|19 hours ago

[deleted]

mmis1000|16 hours ago

From my personal experience, qwen 30b a3b understands commands quite well as long as the input isn't big enough to ruin the attention (I feel the boundary is somewhere between 8,000 and 12,000 tokens?). But that isn't really a bug in the model itself. A smaller model just has a shorter memory; it's simply a physical restriction.

I ran a mixed extraction, cleaning, translation, and formatting task at work with an average 6,000-token input. And so far, only 30b a3b is smart enough not to miss job details (most of the time).

I later refactored the task into multiple passes using a smaller model, though. Making the job simpler is still a better strategy for getting clean output if you can change the pipeline.

u1hcw9nx|1 day ago

[deleted]

ramon156|1 day ago

Ironically, Chinese models so far have been less lobotomized than OAI's and Anthropic's models.