Magistral — the first reasoning model by Mistral AI

941 points | meetpateltech | 9 months ago | mistral.ai | reply

424 comments

[+] danielhanchen|9 months ago|reply
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF

ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL

or

./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99

Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!

Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral
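For anyone wiring these settings into code, here's a minimal sketch of building a chat-completion request with the sampling values recommended above. The payload shape assumes an OpenAI-compatible server (such as llama-server or Ollama); that assumption is mine, not from the comment.

```python
# Sketch of an OpenAI-compatible chat-completion payload using the
# sampling settings recommended above. The model name matches the GGUF
# repo; the endpoint/payload shape is an assumption, not from the comment.
def build_request(prompt: str) -> dict:
    return {
        "model": "unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,  # recommended for Magistral
        "top_p": 0.95,       # recommended for Magistral
    }

payload = build_request("Why is the sky blue?")
print(payload["temperature"], payload["top_p"])  # → 0.7 0.95
```

This dict would be sent as the JSON body of a POST to the server's `/v1/chat/completions` endpoint (again assuming an OpenAI-compatible server).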

[+] pu_pe|9 months ago|reply
Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is struggling to keep up with the state-of-the-art.
[+] hmottestad|9 months ago|reply
With how amazing the first R1 model was and how little compute they needed to create it, I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...
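For context, majority voting @64 (self-consistency) just means sampling many independent reasoning traces and keeping the most common final answer. A minimal sketch, with made-up sample answers for illustration:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most common final answer among sampled completions."""
    return Counter(answers).most_common(1)[0][0]

# 64 hypothetical final answers extracted from 64 sampled reasoning traces
samples = ["42"] * 40 + ["41"] * 15 + ["43"] * 9
print(majority_vote(samples))  # → 42
```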

[+] epolanski|9 months ago|reply
Jm2c but I feel conflicted about this arms race.

You can be 6-12 months behind the best in class without having burned tens of billions; I see that as an engineering win.

I absolutely understand those who say "yeah, but customers will only use the best", I get it, but is market share for businesses that lose money forever really that valuable?

[+] jasonthorsness|9 months ago|reply
Even if it isn't as capable, having a model whose training you control is probably strategically important for every major region of the world. But it can only fall so far behind before it effectively stops working in the eyes of users.
[+] tootie|9 months ago|reply
As an occasional user of Mistral, I find their model to give generally excellent results and pretty quickly. I think a lot of teams are now overly focused on winning the benchmarks while producing worse real results.
[+] littlestymaar|9 months ago|reply
> Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison.

That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).

[+] funnym0nk3y|9 months ago|reply
Thought so too. I don't know how it could be different, though. They are competing against behemoths like OpenAI or Google but have only 200 people. Even Anthropic has over 1,000 people. DeepSeek has fewer than 200 people, so that comparison seems fair.
[+] wafngar|9 months ago|reply
But they have built a fully "independent" pipeline. DeepSeek and others probably trained on GPT-4, o1, or similar model outputs.
[+] segmondy|9 months ago|reply
are you really going to compare a 24B model to a 700B+ model?
[+] fiatjaf|9 months ago|reply
This reads like an AI-generated comment. What do you mean by "benchmarks suggest"? The benchmarks are very clear and presented right there in the page.
[+] mrtksn|9 months ago|reply
Europe isn't going to catch up in tech as long as its market is open to US tech giants. Tech doesn't have marginal costs, so you want to build a thing in one place and sell it everywhere, and when the infrastructure and talent are already in the US, EU tech is destined for niche products.

The UK has a bit of it, France has some, and that's it. The only viable alternatives are countries that have issues with the US, namely China and Russia. China has come up with strong competitors and is on the cutting edge.

Also, it doesn't have anything to do with regulations. All 50 US states have the same American regulations, yet it's all happening in one; some other states happen to host some infrastructure, but that's true for the rest of the world too.

If the EU/US relationship gets to Trump/Musk levels, then the EU can have the cutting-edge stuff.

Most influential AI researchers are from Europe (incl. the UK), Israel, and Canada anyway. Ilya Sutskever just the other day gave a speech at his alma mater in Canada, for example. Andrej Karpathy is Slovak. Lots of Brits, French, Poles, Chinese, Germans, etc. are among the pioneers. A significant portion of the talent is already non-American; they just need a reason to be somewhere other than the US. The Chinese got their reason, and with the state of affairs in the world, I wouldn't be surprised if Europeans got theirs in less than three and a half years.

[+] tensor|9 months ago|reply

[deleted]

[+] atemerev|9 months ago|reply
"EU is leading in regulation", they say.

I don't know what they are thinking.

[+] dwedge|9 months ago|reply
Their OCR model was really well hyped and coincidentally came out right when I had a batch of 600-page PDFs to OCR. They were all monospace text, yet for some reason the OCR kept missing it.

I tried it: 80% of the "text" was recognised as images and output as whitespace, so most of the result was empty. It was much, much worse than Tesseract.

A month later I got the bill for that crap and deleted my account.

Maybe this is better, but I'm over hype marketing from Mistral.

[+] megalomanu|9 months ago|reply
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical. Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in 34-37 seconds. The output quality was slightly lower but still acceptable for us. We'll continue testing, but the early results are promising. I'm glad to see Mistral prioritizing speed over raw power; there's definitely a need for that.
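A common pattern when JSON validity matters more than raw capability is a parse-and-retry loop around the model call. A sketch with a stubbed `generate` function standing in for the real magistral-medium or o4-mini client (the stub and its output are illustrative, not a real API):

```python
import json

def generate(prompt: str) -> str:
    # Placeholder for the real model call (magistral-medium, o4-mini, ...).
    return '{"status": "ok", "items": [1, 2, 3]}'

def generate_json(prompt: str, retries: int = 3) -> dict:
    """Call the model and retry until the output parses as JSON."""
    last_err = None
    for _ in range(retries):
        try:
            return json.loads(generate(prompt))
        except json.JSONDecodeError as err:
            last_err = err
    raise ValueError(f"no valid JSON after {retries} attempts: {last_err}")

print(generate_json("Summarize the order as JSON")["status"])  # → ok
```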
[+] nbardy|9 months ago|reply
I bet you can close the gap with a finetune.

Should be quite easy if you have some o4-mini results sitting around.

[+] kamranjon|9 months ago|reply
I am curious why you would choose a reasoning model for JSON generation?

I was recently working on a user facing feature using self-hosted Gemma 27b with VLLM and was getting fully formed JSON results in ~7 seconds (even that I would like to optimize further) - obviously the size of the JSON is important but I’d never use a reasoning model for this because they’re constantly circling and just wasting compute.

I haven’t really found a super convincing use-case for reasoning models yet, other than a chat style interface or an assistant to bounce ideas off of.

[+] simonw|9 months ago|reply
Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/
[+] atxtechbro|9 months ago|reply
Hi Simon,

What explains the huge difference between the two pelicans riding bicycles? Was the rough one the small version running locally, and the pretty good one the bigger model through the API?

Thanks, Morgan

[+] internet_points|9 months ago|reply
> I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that's attractive to the business clients they are most interested in appealing to.

but then someone found that, at least for distilled models,

> correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness

https://arxiv.org/pdf/2505.13792

i.e. the conclusion doesn't necessarily follow from the reasoning. So is there still value in seeing the reasoning? There may be useful information in it, but I'm not sure it can be read as a typical human chain of reasoning; maybe it should be interpreted more as a loud multi-party discussion of the relevant subject, which may have informed the conclusion but not necessarily led to it.

OTOH, considering the effects of automation fatigue vs human oversight, I guess it's unlikely anyone will ever look at the reasoning in practice, except to summarily verify that it's there and tick the boxes on some form.

[+] christianqchung|9 months ago|reply
I don't understand why the benchmark selections are so scattered and limited. It only compares Magistral Medium with DeepSeek V3, R1, and the closed-weight Mistral Medium 3. Why did they leave off Magistral Small entirely, along with comparisons against Alibaba's Qwen or the mini versions of o3 and o4?
[+] elAhmo|9 months ago|reply
When they include comparisons, it is always a deliberate decision what to show and, more importantly, what not to show. If they had data that would show better performance compared to those models, there is no reason for them to not emphasize that.
[+] CobrastanJorji|9 months ago|reply
Etymological fun: both "mistral" and "magistral" mean "masterly."

Mistral comes from the Occitan for masterly, although today, as far as I know, it's only used in English when talking about Mediterranean winds.

Magistral is just the adjective form of "magister," so "like a master."

If you want to make a few bucks, maybe look up some more obscure synonyms for masterly and pick up the domain names.

[+] arnaudsm|9 months ago|reply
I wish the charts included Qwen3, the current SOTA in reasoning.

Qwen3-4B almost beats Magistral-22B on the 4 available benchmarks, and Qwen3-30B-A3B is miles ahead.

[+] SparkyMcUnicorn|9 months ago|reply
30-A3B is a really impressive model.

I throw tasks at it running locally to save on API costs, and it's possibly better than anything we had a year or so ago from closed source providers. For programming tasks, I'd rank it higher than gpt-4o

[+] poorman|9 months ago|reply
Is there a popular benchmark site people use? Because I had to test all of these by hand, and `Qwen3-30B-A3B` still seems like the best model I can run in that parameter/memory range.
[+] resource_waste|9 months ago|reply
No surprise on my end. Mistral has been basically useless due to other models always being better.

But it's European, so it's a point of pride.

Relevance or not, we will keep hearing the name as a result.

[+] devmor|9 months ago|reply
I would agree, Qwen3 is definitely the most impressive "reasoning" model I've evaluated so far.
[+] alister|9 months ago|reply
As a quick test of logical reasoning and basic Wikipedia-level knowledge, I asked Mistral AI the following question:

A Brazilian citizen is flying from Sao Paulo to Paris, with a connection in Lisbon. Does he need to clear immigration in Lisbon or in Paris or in both cities or in neither city?

Mistral AI said that "immigration control will only be cleared in Paris," which I think is wrong.

After I pointed it to the Wikipedia article on this topic[1], it corrected itself to say that "immigration control will be cleared in Lisbon, the first point of entry into the Schengen Area."

I tried the same question with Meta AI (Llama 4) and it did much worse: It said that the traveler "wouldn't need to clear immigration in either Lisbon or Paris, given the flight connections are within the Schengen Area", which is completely incorrect.

I'd be interested to hear if other LLMs give a correct answer.

[1] https://en.wikipedia.org/wiki/Schengen_Area#Air_travel
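If you want to run the same probe across several models, a crude harness with a keyword-based correctness check is enough to start. The checker and the sample answers below are illustrative, not a rigorous eval:

```python
def is_correct(answer: str) -> bool:
    """Crude check: a correct answer must name Lisbon (the first point
    of entry into the Schengen Area) and not claim no check is needed."""
    text = answer.lower()
    return "lisbon" in text and "neither" not in text

# Hypothetical answers resembling the ones reported above
answers = {
    "mistral": "Immigration control will only be cleared in Paris.",
    "llama4": "No immigration check is needed; both cities are in Schengen.",
    "expected": "Immigration is cleared in Lisbon, the first point of entry.",
}
for model, answer in answers.items():
    print(model, is_correct(answer))
```

A real eval would of course need a more careful grader, but this is enough to flag the two wrong answers above.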

[+] rafram|9 months ago|reply
Is the number of em-dashes in this marketing copy indicative of the kind of output that the model produces? If so, might want to tone it down a bit.
[+] bee_rider|9 months ago|reply
How many other open-weights reasoning models are there?

Is it possible to run multiple reasoning models on one problem? (Why not? I guess).

Another funny thought is: they release their Small model, and kept their Medium as a premium service. I wonder if you could do chains with Medium run occasionally, linked together by local runs of Small?

[+] nake13|9 months ago|reply
The Magistral Small can fit within a single RTX 4090 or a 32GB RAM MacBook once quantized.
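The back-of-envelope arithmetic: 24B weights at roughly 4.5 bits each (an assumed figure, in the ballpark of Q4_K-style quants) come to about 13.5 GB, which leaves headroom for the KV cache on a 24 GB card:

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a model's weights after quantization."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(quantized_size_gb(24, 4.5))   # → 13.5  (GB, roughly a Q4_K-style quant)
print(quantized_size_gb(24, 16.0))  # → 48.0  (GB, unquantized fp16)
```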
[+] desireco42|9 months ago|reply
One cool thing about this model, which I installed locally, is that it supports other languages well, so it should be a pleasant conversation partner.

BTW, I am personally a fan of Mistral because, while it is not the top model, it produces good results and, most importantly, it is super fast. Just go to its chat and be amazed; the quick responses really save a lot of time.

[+] diggan|9 months ago|reply
The only mention of tools I could find is this:

> it significantly improves project planning, backend architecture, frontend design, and data engineering through sequenced, multi-step actions involving external tools or API.

I'm guessing this means it was trained with tool calling? And if so, does that mean it does tool calling within the thinking/reasoning, or within the main text? Seems unclear

[+] Oras|9 months ago|reply
Would be interesting to see a comparison with Qwen 32B. I found it a fantastic local model (ollama).
[+] SV_BubbleTime|9 months ago|reply
Last year, fit was important. This year, inference speed is key.

Proofreading an email at four tokens per second, great.

Spending a half hour to deep research some topic with artifacts and MCP tools and reasoning at four tokens per second… a bad time.

[+] epic9x|9 months ago|reply
This thing is crazy fast.
[+] awongh|9 months ago|reply
Interesting that their niche seems to be small parameter models.
[+] skeptrune|9 months ago|reply
Fully open reasoning traces are useful. Happy there is a vendor out there shipping that feature.
[+] 5mv2|9 months ago|reply
The featured accuracy benchmarks exclude every model that matters except DeepSeek, which is quite telling about this new model's performance.

This makes it yet another example of European companies building great products but fumbling marketing.

Mistral's edge is speed. It's a real pleasure to use because it answers in ~1s what takes other models 5-8s, which makes for a much better experience. But instead of focusing on it, they bury it far down the post.

Try it and see if you like the speed! Note that the speed advantage only applies to queries that don't require web search; Mistral is significantly slower there, leaving only a ~5-second advantage over 2 minutes of research in the queries I benchmarked against Grok.

[+] dominicrose|9 months ago|reply
How would you use a fast AI?

My current use of AI is to generate code, or translate some code from one programming language to another, which I can then improve (instead of writing it from scratch). Speed isn't necessary for this. It's a nice-to-have, but only if it doesn't come at the cost of quality.

Also, as unfair as it "might" be, we do expect a fast AI not to be as good, don't we? So I wouldn't focus on that in the marketing. I think speed would be easier to sell as something extra you would pay for, because then you'd expect the quality to remain the same or better.

[+] funnym0nk3y|9 months ago|reply
That is reasonable though. Comparing the product of a small company with little resources with giants like Google and OpenAI in a field where most advances are due to more and more expensive models is nonsense.
[+] rfv6723|9 months ago|reply
I tried thinking with websearch on their website.

It has similar speed to o4-mini with search on ChatGPT, and o4-mini gave me much better results.