
How to run Qwen 3.5 locally

487 points | Curiositry | 3 days ago | unsloth.ai

163 comments

[+] moqizhengz|3 days ago|reply
Running 3.5 9B on my ASUS 5070 Ti 16 GB with LM Studio gives a stable ~100 tok/s. This outperforms the majority of online LLM services, and the actual output quality matches the benchmarks. This model is really something: it's the first time I've ever had a usable model on consumer-grade hardware.
[+] smokel|2 days ago|reply
> This outperforms the majority of online llm services

I assume you mean outperforms in speed on the same model, not in usability compared to other more capable models.

(For those who are getting their hopes up on using local LLMs to be any replacement for Sonnet or Opus.)

[+] throwdbaaway|3 days ago|reply
There are Qwen3.5 27B quants in the range of 4 bits per weight, which fit into 16 GB of VRAM. The quality is comparable to Sonnet 4.0 from summer 2025. Inference speed is very good with ik_llama.cpp, and still decent with mainline llama.cpp.
[+] the_duke|2 days ago|reply
What context length and related performance are you getting out of this setup?

At least 100k context without huge degradation is important for coding tasks. Most "I'm running this locally" reports only cover testing with very small context.

[+] lukan|3 days ago|reply
What exact model are you using?

I have a 16GB GPU as well, but have never run a local model so far. According to the table in the article, 9B at 8-bit (-> 13 GB) and 27B at 3-bit both seem to fit in memory. Or is more space required for context etc.?
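Yes, context takes extra VRAM on top of the weights, mostly for the KV cache. A rough back-of-envelope estimate (the layer/head numbers below are illustrative placeholders, not confirmed Qwen 3.5 9B values; check the model's config.json):

```python
# Rough VRAM estimate: model weights + KV cache + compute buffers.
# Architecture numbers below are illustrative placeholders, NOT
# confirmed Qwen 3.5 9B values -- check the model's config.json.

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """K and V tensors, per layer, for the whole context (f16 = 2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

weights_gb = 13.0                     # 9B at 8-bit, per the article's table
cache_gb = kv_cache_gb(n_layers=36, n_kv_heads=8, head_dim=128, ctx_len=32768)
total = weights_gb + cache_gb + 0.5   # ~0.5 GB for scratch/compute buffers
print(f"KV cache: {cache_gb:.2f} GB, total: {total:.2f} GB")
```

With these assumed numbers, a 13 GB 8-bit model plus a 32k-token f16 KV cache already overshoots 16 GB, which is why people either shrink the context, quantize the KV cache, or drop to a smaller weight quant.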

[+] yangikan|3 days ago|reply
Do you point Claude Code at this? The orchestration seems to be very important.
[+] jadbox|2 days ago|reply
Did you figure out how to fix Thinking mode? I had to turn it off completely as it went on forever, and I tried to fix it with different parameters without success.
[+] bluerooibos|2 days ago|reply
These smaller models are fine for Q&A-type stuff but are basically unusable for anything agentic like large file modifications, coding, or second-brain-type stuff - they need so much handholding. I'd be interested to see a demo of what the larger versions can do on better hardware though.
[+] y42|2 days ago|reply
> consumer-grade hardware

Not disagreeing per se, but a quick look at the installation instructions confirms what I assumed:

Yeah, you can run a highly quantized version on your 2020 Nvidia GPU. But:

- When inferencing, it occupies your whole machine. At least you get a modern interactive heating feature in your flat.

- You need to follow eleven-thousand nerdy steps to get it running; my mum is really looking forward to that.

- Not to mention the pain you went through installing Nvidia drivers; nothing my mum will ever manage in the near future.

... and all this to get something that merely competes with Haiku.

Don't get me wrong - I am exaggerating, I know. It's important to have competition and the opportunity to run "AI" on your own metal. But this reminds me of the early days of smartphones and my old XDA Neo. Sure, it was damn smart, and I remember all those jealous faces because of my "device from the future." But oh boy, it was also a PITA maintaining it.

Here we are now. Running AI locally is a sneak peek into the future. But as long as you need a CS degree and hardware worth a small car to achieve reasonable results, it's far from mainstream. Therefore, "consumer-grade hardware" sounds like a euphemism here.

I like how we nerds are living in our bubble celebrating this stuff while 99% of mankind still doomscrolls through Facebook, laughing at (now AI-generated) brain rot.

(No offense (ʘ‿ʘ)╯)

[+] mingodad|2 days ago|reply
I'm still a bit confused, because it says "All uploads use Unsloth Dynamic 2.0", but then, looking at the available options for e.g. 4 bits, there are:

IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB

And there's no explanation of what they are or what tradeoffs they have, but the tutorial explicitly used Q4_K_XL with llama.cpp.

I'm using a Mac mini M4 16GB, and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my tests with Qwen3.5-4B-UD-Q4_K_XL show it's a lot more chatty. I'm basically using it in chat mode for basic man-page-style questions.

I understand that each user has their own specific needs, but it would be nice to have a place that lists typical models/hardware with their common config parameters and memory usage.

Even on specific Reddit channels it's a bit of a nightmare: lots of talk but no clear, concrete config/usage examples.

I've been following this topic heavily for the last 3 months, and I see more confusion than clarification.

Right now I'm getting good cost/benefit results with the Qwen CLI with a coder model in the cloud, and I'm constantly watching for when a local model on affordable hardware with environmentally friendly energy consumption arrives.

[+] danielhanchen|2 days ago|reply
Oh, https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks might be helpful - it provides benchmarks for Q4_K_XL vs Q4_K_M etc., comparing disk space vs KL divergence (a proxy for how close a quant is to the original full-precision model).

Q4_0 and Q4_1 were supposed to provide faster inference, but tests showed they reduced accuracy by quite a bit, so they are deprecated now.

Q4_K_M and UD-Q4_K_XL are the same, just _XL is slightly bigger than _M

The naming convention is _XL > _L > _M > _S > _XS

[+] PhilippGille|2 days ago|reply
> would be nice to have a place that lists typical models/hardware with their common config parameters and memory usage

https://www.localscore.ai from Mozilla Builders was supposed to be this, but I guess there aren't enough users - I didn't find any Qwen 3.5 entries yet.

[+] ay|2 days ago|reply
I tried qwen3.5:4b in Ollama on my 4-year-old Mac M1 with my own coding harness, and it exhibited pretty decent tool calling, but it is a bit slow and seemed a little confused by the more complex tasks (also, I have it code Rust, which might add complexity). The task was "find the debug that does X and make it conditional on whichever variable is controlled by the CLI '/debug foo'" - I didn't do much with it after that.

It may be interesting to try a 6-bit quant of qwen3.5-35b-a3b - I had pretty good results running it on a single 4090 - for obvious reasons I didn't try it on the old Mac.

I am using an 8-bit quant of qwen3.5-27b as more or less the main engine for the past ~week and am quite happy with it - but that requires more memory/GPU power.

HTH.

[+] antirez|2 days ago|reply
My private benchmarks, using DeepSeek replies to coding problems as a baseline, with Claude Opus as judge. However, when reading these percentages, consider that the no-think setup is much faster and may be more practical for most situations.

    DeepSeek API -- 100%
    qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
    qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
    qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
    qwen3.5:27b-q8_0 (thinking) -- 75.3%
I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot reply evaluations; the model was not put in a context where it could iterate as an agent.
[+] throwdbaaway|2 days ago|reply
Yours is the only benchmark that puts 35B A3B above 27B. Time for human judgement to verify? For example, if you look at the thinking traces, there might be logical inconsistencies in the prompts, which then tripped up the 27B more when reasoning. This will also be reflected in the score when thinking is disabled, but we can sort of debug with the thinking traces.
[+] alansaber|2 days ago|reply
Maybe a reductive question but are there any thinking models that don't (relatively) add much latency?
[+] d4rkp4ttern|2 days ago|reply
For every new interesting open model, I try to test PP (prompt processing) and TG (token generation) speeds via llama-cpp/server in Claude Code (which can have at least 15-30K tokens of context due to its system prompt, tools, etc.), on my good old M1 Max 64GB MacBook.

With the latest llama-cpp built from source and the latest unsloth quants, the TG speed of Qwen3.5-35B-A3B is around half that of Qwen3-30B-A3B (with 33K tokens of initial Claude Code context), so the older Qwen3 is much more usable.

Qwen3-30B-A3B (Q4_K_M):

  - PP: 272 tok/s | TG: 25 tok/s @ 33k depth

  - KV cache: f16

  - Cache reuse: follow-up delta processed in 0.4s
Qwen3.5-35B-A3B (Q4_K_M):

  - PP: 395 tok/s | TG: 12 tok/s @ 33k depth

  - KV cache: q8_0

  - Cache reuse: follow-up delta processed in 2.7s (requires --swa-full)
Qwen3.5's sliding window attention uses significantly less RAM and delivers better response quality, but at 33k context depth it generates at half the tok/s of the standard-attention Qwen3-30B.

Full llama-server and Claude-Code setup details here for these and other open LLMs:

https://pchalasani.github.io/claude-code-tools/integrations/...
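A setup along those lines can be sketched as a single llama-server invocation. The model filename and context size are illustrative; -ngl, -ctk/-ctv, and --swa-full are llama.cpp flags (verify against your build's --help):

```shell
# Hypothetical model path; flag names are llama.cpp's llama-server flags.
# -c: context size; -ngl: layers to offload to GPU/Metal;
# -ctk/-ctv: quantize the KV cache to q8_0 to save RAM;
# --swa-full: keep the full sliding-window-attention KV cache so that
#             cached-prefix reuse works across Claude Code turns.
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 40960 -ngl 99 -ctk q8_0 -ctv q8_0 \
  --swa-full --port 8080
```

The q8_0 KV cache trades a little accuracy for roughly half the cache memory of f16, which matters at 33k+ depths.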

[+] regularfry|2 days ago|reply
I definitely get the impression there's something not quite right with qwen3.5 in llama.cpp. It's impressive but just a bit off. A patch landed yesterday which helped though.
[+] Twirrim|3 days ago|reply
I've been finding it very practical to run the 35B-A3B model on an 8GB RTX 3050; it's pretty responsive and does a good job on the coding tasks I've thrown at it. I need to grab the freshly updated models - the older one seems to occasionally get stuck in a loop with tool use, which they suggest they've fixed.
[+] fy20|3 days ago|reply
I guess you are offloading to system RAM? What tokens per second do you get? I've got an old gaming laptop with an RTX 3060; sounds like it could work well as a local inference server.
[+] ufish235|3 days ago|reply
Can you give an example of some coding tasks? I had no idea local was that good.
[+] fragmede|3 days ago|reply
Which models would that be?
[+] Curiositry|3 days ago|reply
Qwen3.5 9B seems to be fairly competent at OCR and text-formatting cleanup running in llama.cpp on CPU, albeit slowly. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I had with Ollama) on an old 1650 Ti with 4GB VRAM (it tries to allocate too much memory).
[+] AllegedAlec|1 day ago|reply
I found that the drivers I had were no longer compatible with the newer kernels. After upgrading to newer drivers it was able to offload again.
[+] acters|3 days ago|reply
I have a 1660 Ti, and the CachyOS + aur/llama.cpp-cuda package is working fine for me. With about 5.3 GB of usable memory, I find that the 35B model is by far the most capable one, and it performs just as fast as the 4B model that fits entirely on my GPU. I did try the 9B model and it was surprisingly capable. However, 35B is still better in some of my own anecdotal test cases. Very happy with the improvement. However, I notice that Qwen 3.5 is about half the speed of Qwen 3.
[+] dunb|2 days ago|reply
Are you running with all the --fit options and it's not working correctly? You could try looking at how many layers it attempts to offload and manually adjust from there: walk down --n-gpu-layers with a bash script until it loads.
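The walk-down idea can be sketched like this. try_load here is a stub so the loop is self-contained; in practice you'd replace its body with a real load attempt, e.g. `llama-cli -m model.gguf -ngl "$1" -n 1 -p hi >/dev/null 2>&1` (model path hypothetical):

```shell
# Decrement --n-gpu-layers until the model loads without OOM.
try_load() {
  # Stub for illustration: pretend anything above 20 layers
  # exceeds the card's VRAM. Replace with a real llama-cli run.
  [ "$1" -le 20 ]
}

ngl=33   # start from the model's total layer count (assumed here)
while [ "$ngl" -gt 0 ] && ! try_load "$ngl"; do
  ngl=$((ngl - 1))
done
echo "usable --n-gpu-layers: $ngl"
```

A binary search would be faster, but since each failed load takes only seconds, the linear walk is usually fine.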
[+] lioeters|2 days ago|reply
> GPU offloading working

I had this issue which in my case was solved by installing a newer driver. YMMV.

  sudo apt install nvidia-driver-570
[+] WhyNotHugo|2 days ago|reply
If you’re building from source, the vulkan backend is the easiest to build and use for GPU offloading.
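For reference, a minimal sketch of that build path (the GGML_VULKAN cmake option is from llama.cpp's build docs; the model path and layer count are placeholders):

```shell
# Build llama.cpp with the Vulkan backend (no CUDA toolkit needed),
# then offload a limited number of layers to fit a small GPU.
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-cli -m model.gguf -ngl 16 -p "hello"
```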
[+] tasuki|2 days ago|reply
How does one choose between "fewer parameters and less quantization" vs "more parameters and more quantization" ?
[+] labcomputer|2 days ago|reply
There were some benchmarks a few years ago from, IIRC, the people behind either llama.cpp or Ollama (I forget which).

The basic rule of thumb is that more parameters is always better, with diminishing returns as you get down to 2-3 bits per parameter. This is purely based on model quality, not inference speed.
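The rule of thumb can be sanity-checked with simple arithmetic: weight memory is roughly parameters times bits per weight divided by 8 (ignoring KV cache, runtime overhead, and the per-block metadata that makes real GGUF files slightly larger):

```python
# Back-of-envelope weight memory for "fewer params, higher precision"
# vs "more params, lower precision". Real GGUF files run a bit larger
# due to per-block scales/metadata.

def weight_gb(params_billion, bits_per_weight):
    return params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 = GB

for params, bits in [(9, 8), (27, 3), (27, 4)]:
    print(f"{params}B @ {bits}-bit ~= {weight_gb(params, bits):.1f} GB")
```

So a 27B model at ~3 bits lands near the same footprint as a 9B at 8 bits, and per the rule of thumb above the 27B would usually win on quality.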

[+] paoliniluis|2 days ago|reply
It's just about finding the sweet spot between answer accuracy, available VRAM, and tokens per second.
[+] PeterStuer|2 days ago|reply
I am running both Qwen-coder-next and Qwen 3.5 locally. Not too bad, but I always have Opus 4.6 check their output, as the Qwen family tends to hallucinate non-existent library features in amounts similar to the Claude 3.5 / GPT-4 era.

The combo of free long-running tasks on Qwen overnight, with steering and corrections from Opus, works for me.

I guess I could just do Opus/Sonnet for my Claude Code back-end, but I specifically want to keep local open weights models in the loop just in case the hosted models decide to quit on e.g. non-US users.

[+] brcmthrowaway|2 days ago|reply
How did they solve the hallucination? Reasoning tokens?
[+] b89kim|2 days ago|reply
I’ve been benchmarking GGUF quants for Python tasks under some hardware configs.

  - 4090 : 27b-q4_k_m
  - A100: 27b-q6_k
  - 3*A100: 122b-a10b-q6_k_L
Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition on MuJoCo code with the default presence penalty. 27b-q4_k_m on a 4090 generates 30-35 tok/s with good quality.
[+] _qua|2 days ago|reply
For roughly equivalent memory sizes, how does one choose between the bit depth and the model size?
[+] jedisct1|2 days ago|reply
Qwen3.5-27B works amazingly well with https://swival.dev now that the unsloth quants have fixed the tool calling issues.

I still like and mainly use Qwen3-Coder-Next, though, as it seems to be generally more reliable.

[+] adsharma|2 days ago|reply
So many variants of these models. The GGUFs from Unsloth don't work with Ollama. Perhaps wait a bit for the latest llama.cpp to be picked up by downstream projects.

If you're on a 16GB Mac mini, what's a good variant to run?

[+] rurban|2 days ago|reply
We ran it locally on a free H100, with vLLM and opencode, and it performed awfully. Now we are running gpt-oss-120b, which is better but still far behind Opus 4.6, the only coding model that is better than our most experienced senior dev. gpt-5.3-codex is more at the Sonnet level on complicated C code: bearable, but still many stupidities. gpt-oss is hilariously stupid, but might work for simple TypeScript, React, or Python tasks.

For vision, Qwen is the best - our go-to vision model.

[+] veritascap|2 days ago|reply
How does scaffolding work with these local models? Skills, commands, rules, etc. do they all work similarly? (It’s probably obvious but I haven’t delved into local LLMs yet.)
[+] vvram|2 days ago|reply
What would be optimal HW configurations/systems recommended?
[+] benbojangles|2 days ago|reply
I'm running Qwen3.5:0.8b locally on an Orange Pi Zero 2W using llama.cpp; it runs just fine on CPU only. When I want Vulkan GPU acceleration, I run qwen3.5:2b locally on a Meta Quest 3 with zeroclaw, and I've saved myself hundreds of $$$ over buying a low-power computer. I recommend people stop shopping around for inflated Mac Minis and look at getting a used Android phone to load local models on.
[+] ilaksh|2 days ago|reply
Anyone providing hosted inference for 9B? I'm just trying to save the operational effort of renting a GPU, since this is a business use case that doesn't have real GPUs available right now. I don't see the small ones on OpenRouter. Maybe there will be a RunPod serverless or normal pod template or something.

Also, does 9B at 8-bit or 6-bit run with very low latency on a 4090?

[+] mongrelion|1 day ago|reply
By anyone do you mean a well-established business or any entity willing to serve you?
[+] latrine5526|2 days ago|reply
I have a 5090D and got ~140 tok/s output when running qwen-3.5-9b-heretic in LM Studio.

I disabled thinking and configured the translate plugin in my browser to use the LM Studio API.

It performs way better than Google Translate in accuracy. The speed is a little slower, but sufficient for me.
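That browser-plugin setup boils down to posting a chat completion to LM Studio's OpenAI-compatible endpoint. A minimal sketch (the default port 1234 and the model name are assumptions; use whatever LM Studio's server tab shows):

```python
# Minimal sketch: use LM Studio's OpenAI-compatible local server as a
# translator. Port 1234 is LM Studio's default; model name is assumed.
import json
import urllib.error
import urllib.request

def translation_request(text, target_lang="English",
                        model="qwen-3.5-9b-heretic"):
    """Build a chat-completion payload asking for the translation only."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Translate the user's text to {target_lang}. "
                        "Reply with the translation only."},
            {"role": "user", "content": text},
        ],
        "temperature": 0,
    }

def translate(text, base_url="http://localhost:1234/v1"):
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(translation_request(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(translate("Bonjour le monde"))
    except urllib.error.URLError:
        print("LM Studio server not reachable on localhost:1234")
```

Temperature 0 keeps the output deterministic, which is what you want for translation.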

[+] edg5000|2 days ago|reply
How does 397B-A17B compare against frontier models? Did anybody try? It probably needs serious hardware that most people don't have.