Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!
Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral
Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is struggling to keep up with the state-of-the-art.
With how amazing the first R1 model was and how little compute they needed to create it, I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.
Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.
If you look at Mistral investors[0], you will quickly understand that Mistral is far from being European. My understanding is it is mainly owned by US companies with a few other companies from EU and other places in the world.
You can be 6/12 months later, and have not burned tens of billions compared to the best in class, I see it an engineering win.
I absolutely understand those that say "yeah, but customers will only use the best", I see it, but is market share of forever money losing businesses that valuable?
Even if it isn't as capable, having a model with control over training is probably strategically important for every major region of the world. But it could only fall so far behind before it effectively doesn't work in the eyes of the users.
As an occasional user of Mistral, I find their model to give generally excellent results and pretty quickly. I think a lot of teams are now overly focused on winning the benchmarks while producing worse real results.
> Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison.
That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).
Thought so too. I don't know how it could be different though. They are competing against behemoths like OpenAI or Google, but have only 200 people. Even Anthropic has over 1000 people. DeepSeek has less than 200 people so the comparison seems fair.
This reads like an AI-generated comment. What do you mean by "benchmarks suggest"? The benchmarks are very clear and presented right there in the page.
Europe isn't going to catch up in tech as long as its market is open to US tech giants. Tech doesn't have marginal costs, so you want to have one of it in one place and sell it everywhere and when the infra and talent is already in US, EU tech is destined to do niche products.
UK has a bit of it, France has some and that's it. The only viable alternatives are countries who have issues with US and that is China and Russia. China have come up with strong competitors and it is on cutting edge.
Also, it doesn't have anything to do with regulations. 50 US States have the American regulations, its all happening in 1 and some other states happen to host some infrastructure but that's true for rest of of the world too.
If the EU/US relationship gets to Trump/Musk level, then EU can have the cutting edge stuff.
Most influential AI researchers are from Europe(inc. UK), Israel and Canada anyway. Ilya Sutskever just the other day gave speech at his alma matter @Canada for example. Andrej Karpathy is Slovakian. Lot's of Brits, French, Polish, Chinese, German etc. are among the pioneers. Significant portion of the talent is non-American already, they just need a reason to be somewhere else than US to have it outside the US. Chinese got their reason and with the state of the affairs in the world I wouldn't be surprised if Europeans gets theirs in less than 3 and a half years.
Their OCR model was really well hyped and coincidentally came out at the time I had a batch of 600 page pdfs to OCR. They were all monospace text just for some reason the OCR was missing.
I tried it, 80% of the "text" was recognised as images and output as whitespace so most of it was empty. It was much much worse than tesseract.
A month later I got the bill for that crap and deleted my account.
Maybe this is better but I'm over hype marketing from mistral
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical.
Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in 34–37 seconds. The output quality was slightly lower but still remain acceptable for us.
We’ll continue testing, but the early results are promising. I'm glad to see Mistral prioritizing speed over raw power, there’s definitely a need for that.
I am curious why you would choose a reasoning model for JSON generation?
I was recently working on a user facing feature using self-hosted Gemma 27b with VLLM and was getting fully formed JSON results in ~7 seconds (even that I would like to optimize further) - obviously the size of the JSON is important but I’d never use a reasoning model for this because they’re constantly circling and just wasting compute.
I haven’t really found a super convincing use-case for reasoning models yet, other than a chat style interface or an assistant to bounce ideas off of.
What's the huge difference between the two pelicans riding bicycles? Was one running locally the small version vs the pretty good one running the bigger one thru the API?
> I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that's attractive to the business clients they are most interested in appealing to.
but then someone found that, at least for distilled models,
> correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness
ie. the conclusion doesn't necessarily follow from the reasoning. So is there still value in seeing the reasoning? There may be useful information in the reasoning, but I'm not sure it can be interpreted by humans as a typical human chain of reasoning, maybe it should be interpreted more as a loud multi-party discussion on the relevant subject which may have informed the conclusion but not necessarily lead to it.
OTOH, considering the effects of automation fatigue vs human oversight, I guess it's unlikely anyone will ever look at the reasoning in practice, except to summarily verify that it's there and tick the boxes on some form.
I don't understand why the benchmark selections are so scattered and limited. It only compares Magistral Medium with Deepseek V3, R1, and the other close weighted Mistral Medium 3. Why did they leave off Magistral Small entirely, alongside comparisons with Alibaba Qwen or the mini versions of o3 and o4?
When they include comparisons, it is always a deliberate decision what to show and, more importantly, what not to show. If they had data that would show better performance compared to those models, there is no reason for them to not emphasize that.
I throw tasks at it running locally to save on API costs, and it's possibly better than anything we had a year or so ago from closed source providers. For programming tasks, I'd rank it higher than gpt-4o
Is there a popular benchmark site people use? Becaues I had to test all these by hand and `Qwen3-30B-A3B` still seems like the best model I can run in that relative parameter space (/memory requirements).
As a quick test of logical reasoning and basic Wikipedia-level knowledge, I asked Mistral AI the following question:
A Brazilian citizen is flying from Sao Paulo to Paris, with a connection in Lisbon. Does he need to clear immigration in Lisbon or in Paris or in both cities or in neither city?
Mistral AI said that "immigration control will only be cleared in Paris," which I think is wrong.
After I pointed it to the Wikipedia article on this topic[1], it corrected itself to say that "immigration control will be cleared in Lisbon, the first point of entry into the Schengen Area."
I tried the same question with Meta AI (Llama 4) and it did much worse: It said that the traveler "wouldn't need to clear immigration in either Lisbon or Paris, given the flight connections are within the Schengen Area", which is completely incorrect.
I'd be interested to hear if other LLMs give a correct answer.
How many other open-weights reasoning models are there?
Is it possible to run multiple reasoning models on one problem? (Why not? I guess).
Another funny thought is: they release their Small model, and kept their Medium as a premium service. I wonder if you could do chains with Medium run occasionally, linked together by local runs of Small?
One cool think about this model, that I installed locally is that supports well other languages as well as it should be pleasant conversation partner.
BTW I am personally fan of Mistral, because while it is not the top model, it produces good results and the most important thing is that it is super fast, just go to it's chat and be amazed. It really saves a lot of time to have quick response.
This doesn't really explain what "reasoning" means in the context of genAI, or how it's done by this product. Are there any good sources to learn more about what "reasoning model" means outside of marketing-speak?
> it significantly improves project planning, backend architecture, frontend design, and data engineering through sequenced, multi-step actions involving external tools or API.
I'm guessing this means it was trained with tool calling? And if so, does that mean it does tool calling within the thinking/reasoning, or within the main text? Seems unclear
The featured accuracy benchmarks exclude every model that matter except DeepSeek, which is quite telling about this new model's performance.
This makes it yet another example of European companies building great products but fumbling marketing.
Mistral's edge is speed. It's a real pleasure to use because it answers in ~1s what takes other models 5-8s, which makes for a much better experience. But instead of focusing on it, they bury it far down the post.
Try it and see if you like the speed! Note that the speed advantage only applies to queries that don't require web-search, as Mistral is significantly slower on this one, leading to a ~5 seconds advantage over 2 minutes of research for the queries I benchmarked with Grok.
My current use of AI is to generate code - or translate some code from a programming language to another - which I can then improve (instead of writing it from stratch). Speed isn't necessary for this. It's a nice-to-have but only if it's not at the cost of quality.
Also, as unfair as it "might" be, we do expect a fast AI not to be as good, don't we? So I wouldn't focus on that in the marketing.
I think speed would be easier to sell as something extra you would pay for, because then you'd expect the quality to remain the same or better.
That is reasonable though. Comparing the product of a small company with little resources with giants like Google and OpenAI in a field where most advances are due to more and more expensive models is nonsense.
[+] [-] danielhanchen|9 months ago|reply
ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL
or
./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99
Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!
Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral
[+] [-] pu_pe|9 months ago|reply
[+] [-] hmottestad|9 months ago|reply
Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...
[+] [-] melicerte|9 months ago|reply
[0] https://tracxn.com/d/companies/mistral-ai/__SLZq7rzxLYqqA97j... (edited for typo)
[+] [-] epolanski|9 months ago|reply
You can be 6/12 months later, and have not burned tens of billions compared to the best in class, I see it an engineering win.
I absolutely understand those that say "yeah, but customers will only use the best", I see it, but is market share of forever money losing businesses that valuable?
[+] [-] jasonthorsness|9 months ago|reply
[+] [-] tootie|9 months ago|reply
[+] [-] littlestymaar|9 months ago|reply
That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).
[+] [-] funnym0nk3y|9 months ago|reply
[+] [-] wafngar|9 months ago|reply
[+] [-] segmondy|9 months ago|reply
[+] [-] fiatjaf|9 months ago|reply
[+] [-] unknown|9 months ago|reply
[deleted]
[+] [-] mrtksn|9 months ago|reply
UK has a bit of it, France has some and that's it. The only viable alternatives are countries who have issues with US and that is China and Russia. China have come up with strong competitors and it is on cutting edge.
Also, it doesn't have anything to do with regulations. 50 US States have the American regulations, its all happening in 1 and some other states happen to host some infrastructure but that's true for rest of of the world too.
If the EU/US relationship gets to Trump/Musk level, then EU can have the cutting edge stuff.
Most influential AI researchers are from Europe(inc. UK), Israel and Canada anyway. Ilya Sutskever just the other day gave speech at his alma matter @Canada for example. Andrej Karpathy is Slovakian. Lot's of Brits, French, Polish, Chinese, German etc. are among the pioneers. Significant portion of the talent is non-American already, they just need a reason to be somewhere else than US to have it outside the US. Chinese got their reason and with the state of the affairs in the world I wouldn't be surprised if Europeans gets theirs in less than 3 and a half years.
[+] [-] unknown|9 months ago|reply
[deleted]
[+] [-] tensor|9 months ago|reply
[deleted]
[+] [-] atemerev|9 months ago|reply
I don't know what they are thinking.
[+] [-] dwedge|9 months ago|reply
I tried it, 80% of the "text" was recognised as images and output as whitespace so most of it was empty. It was much much worse than tesseract.
A month later I got the bill for that crap and deleted my account.
Maybe this is better but I'm over hype marketing from mistral
[+] [-] megalomanu|9 months ago|reply
[+] [-] nbardy|9 months ago|reply
Should be quiet easy if you have some o4-mini results sitting around.
[+] [-] kamranjon|9 months ago|reply
I was recently working on a user facing feature using self-hosted Gemma 27b with VLLM and was getting fully formed JSON results in ~7 seconds (even that I would like to optimize further) - obviously the size of the JSON is important but I’d never use a reasoning model for this because they’re constantly circling and just wasting compute.
I haven’t really found a super convincing use-case for reasoning models yet, other than a chat style interface or an assistant to bounce ideas off of.
[+] [-] simonw|9 months ago|reply
[+] [-] atxtechbro|9 months ago|reply
What's the huge difference between the two pelicans riding bicycles? Was one running locally the small version vs the pretty good one running the bigger one thru the API?
Thanks, Morgan
[+] [-] internet_points|9 months ago|reply
but then someone found that, at least for distilled models,
> correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness
https://arxiv.org/pdf/2505.13792
ie. the conclusion doesn't necessarily follow from the reasoning. So is there still value in seeing the reasoning? There may be useful information in the reasoning, but I'm not sure it can be interpreted by humans as a typical human chain of reasoning, maybe it should be interpreted more as a loud multi-party discussion on the relevant subject which may have informed the conclusion but not necessarily lead to it.
OTOH, considering the effects of automation fatigue vs human oversight, I guess it's unlikely anyone will ever look at the reasoning in practice, except to summarily verify that it's there and tick the boxes on some form.
[+] [-] christianqchung|9 months ago|reply
[+] [-] elAhmo|9 months ago|reply
[+] [-] CobrastanJorji|9 months ago|reply
Mistral comes from Occitan for masterly, although today as far as I know it's only used in English when talking about mediterranean winds.
Magistral is just the adjective form of "magister," so "like a master."
If you want to make a few bucks, maybe look up some more obscure synonyms for masterly and pick up the domain names.
[+] [-] arnaudsm|9 months ago|reply
Qwen3-4B almost beats Magistral-22B on the 4 available benchmarks, and Qwen3-30B-A3B is miles ahead.
[+] [-] SparkyMcUnicorn|9 months ago|reply
I throw tasks at it running locally to save on API costs, and it's possibly better than anything we had a year or so ago from closed source providers. For programming tasks, I'd rank it higher than gpt-4o
[+] [-] poorman|9 months ago|reply
[+] [-] resource_waste|9 months ago|reply
But its European, so its a point of pride.
Relevance or not, we will keep hearing the name as a result.
[+] [-] devmor|9 months ago|reply
[+] [-] alister|9 months ago|reply
A Brazilian citizen is flying from Sao Paulo to Paris, with a connection in Lisbon. Does he need to clear immigration in Lisbon or in Paris or in both cities or in neither city?
Mistral AI said that "immigration control will only be cleared in Paris," which I think is wrong.
After I pointed it to the Wikipedia article on this topic[1], it corrected itself to say that "immigration control will be cleared in Lisbon, the first point of entry into the Schengen Area."
I tried the same question with Meta AI (Llama 4) and it did much worse: It said that the traveler "wouldn't need to clear immigration in either Lisbon or Paris, given the flight connections are within the Schengen Area", which is completely incorrect.
I'd be interested to hear if other LLMs give a correct answer.
[1] https://en.wikipedia.org/wiki/Schengen_Area#Air_travel
[+] [-] rafram|9 months ago|reply
[+] [-] bee_rider|9 months ago|reply
Is it possible to run multiple reasoning models on one problem? (Why not? I guess).
Another funny thought is: they release their Small model, and kept their Medium as a premium service. I wonder if you could do chains with Medium run occasionally, linked together by local runs of Small?
[+] [-] nake13|9 months ago|reply
[+] [-] desireco42|9 months ago|reply
BTW I am personally fan of Mistral, because while it is not the top model, it produces good results and the most important thing is that it is super fast, just go to it's chat and be amazed. It really saves a lot of time to have quick response.
[+] [-] GuinansEyebrows|9 months ago|reply
[+] [-] pier25|9 months ago|reply
https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...
[+] [-] diggan|9 months ago|reply
> it significantly improves project planning, backend architecture, frontend design, and data engineering through sequenced, multi-step actions involving external tools or API.
I'm guessing this means it was trained with tool calling? And if so, does that mean it does tool calling within the thinking/reasoning, or within the main text? Seems unclear
[+] [-] Oras|9 months ago|reply
[+] [-] SV_BubbleTime|9 months ago|reply
Proofreading an email at four tokens per second, great.
Spending a half hour to deep research some topic with artifacts and MCP tools and reasoning at four tokens per second… a bad time.
[+] [-] DSingularity|9 months ago|reply
[+] [-] epic9x|9 months ago|reply
[+] [-] awongh|9 months ago|reply
[+] [-] skeptrune|9 months ago|reply
[+] [-] 5mv2|9 months ago|reply
This makes it yet another example of European companies building great products but fumbling marketing.
Mistral's edge is speed. It's a real pleasure to use because it answers in ~1s what takes other models 5-8s, which makes for a much better experience. But instead of focusing on it, they bury it far down the post.
Try it and see if you like the speed! Note that the speed advantage only applies to queries that don't require web-search, as Mistral is significantly slower on this one, leading to a ~5 seconds advantage over 2 minutes of research for the queries I benchmarked with Grok.
[+] [-] dominicrose|9 months ago|reply
My current use of AI is to generate code - or translate some code from a programming language to another - which I can then improve (instead of writing it from stratch). Speed isn't necessary for this. It's a nice-to-have but only if it's not at the cost of quality.
Also, as unfair as it "might" be, we do expect a fast AI not to be as good, don't we? So I wouldn't focus on that in the marketing. I think speed would be easier to sell as something extra you would pay for, because then you'd expect the quality to remain the same or better.
[+] [-] funnym0nk3y|9 months ago|reply
[+] [-] rfv6723|9 months ago|reply
It has similar speed with o4-mini with search on chatgpt, and o4-mini gave me much better result.