Qwen3-4B-Thinking-2507

nisten|6 months ago

If you want to have an opinion on it,

just install lmstudio and run the q8_0 version of it i.e. here https://huggingface.co/bartowski/Qwen_Qwen3-4B-Instruct-2507....

you can even run it on a 4gb raspberry pi Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf https://lmstudio.ai/

Keep in mind if you run it at the full 262144 tokens of context youll need ~65gb of ram.

Anyway if you're on mac you can search for "qwen3 4b 2507 mlx 4bit" and run the mlx version which is often faster on m chips. Crazy impressive what you get from a 2gb file in my opinion.

It's pretty good for summaries etc, can even make simple index.html sites if you're teaching students but it can't really vibecode in my opinion. However for local automation tasks like summarizing your emails, or home automation or whatever it is excellent.

It's crazy that we're at this point now.

esafak|6 months ago

Thank you. To spare Mac readers time:

mlx 4bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 5bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 6bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

mlx 8bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...

edit: corrected the 4b link

magnat|6 months ago

> if you run it at the full 262144 tokens of context youll need ~65gb of ram

What is the relationship between context size and RAM required? Isn't the size of RAM related only to number of parameters and quantization?

Aeroi|6 months ago

how about on apple silicon for the iphone

film42|6 months ago

Is there a crowd-sourced sentiment score for models? I know all these scores are juiced like crazy. I stopped taking them at face value months ago. What I want to know is if other folks out there actually use them or if they are unreliable.

hnfong|6 months ago

Besides the LM Arena Leaderboard mentioned by a sibling comment, if go to the r/LocalLlama/ subreddit, you can very unscientifically get a rough sentiment of the performance of the models by reading the comments (and maybe even check the upvotes). I think the crowd's knee-jerk reaction is unreliable though, but that's what you asked for.

nurettin|6 months ago

This has been around for a while https://lmarena.ai/leaderboard/text/coding

klohto|6 months ago

openrouter usage stats

esafak|6 months ago

This one should work on personal computers! I'm thankful for Chinese companies raising the floor.

johndhi|6 months ago

[deleted]

frontsideair|6 months ago

According to the benchmarks, this one is improved in every one of them compared to the previous version, some better than 30B-A3B. Definitely worth a try, it’ll easily fit into memory and token generation speed will be pleasantly fast.

GaggiX|6 months ago

There is a new Qwen3-30B-A3B, you are compare it to the old one. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507

gok|6 months ago

So this 4B dense model gets very similar performance to the 30B MoE variant with 7.5x smaller footprint.

smallerize|6 months ago

It gets similar performance to the old version of the 30B MoE model, but not the updated version. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507

svnt|6 months ago

It is interesting to think about how they are achieving these scores. The evals are rated by GPT-4.1. Beyond just overfitting to benchmarks, is it possible the models are internalizing how to manipulate the ratings model/agent? Is anyone manually auditing these performance tables?

tolerance|6 months ago

Is there like a leaderboard or power rankings sort of thing that tracks these small open models and assigns ratings or grades to them based on particular use cases?

esafak|6 months ago

https://artificialanalysis.ai/leaderboards/models?open_weigh...

jampa|6 months ago

I am reading this right, is this model way better than Gemma 3n[1]? (For only the benchmarks that are common among the models)

=====

LiveCodeBench

E4B IT: 13.2

Qwen: 55.2

===== AIME25

E4B IT: 11.6

Qwen: 81.3

[1]: https://huggingface.co/google/gemma-3n-E4B

meatmanek|6 months ago

Reasoning models do a lot better at AIME than non-reasoning models, with o3 mini getting 85% and 4o-mini getting 11%. It makes some sense that this would apply to small models as well.

Demiurge|6 months ago

I've been trying this today, and I'm getting a lot of hallucinations for suggestions. However, the analysis of problems really quite good.

61 comments