
Voxtral Transcribe 2

1012 points | meetpateltech | 1 month ago | mistral.ai

241 comments

[+] simonw|1 month ago|reply
This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.

I spoke fast and dropped in some jargon, and it got it all right. Here is what I said, and it transcribed it exactly, WebAssembly spelling included:

> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?

[+] tekacs|1 month ago|reply
Having built with and tried every voice model over the last three years, real-time and non-real-time... this is off the charts compared to anything I've seen before.

And open weight too! So grateful for this.

[+] Oras|1 month ago|reply
Thank you for the link! Mistral's own playground does not have a microphone; it just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.

I tried speaking in two languages at once, and it picked it up correctly. Truly impressive for real-time.

[+] skykooler|1 month ago|reply
Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".
[+] jaggederest|1 month ago|reply
It can transcribe the fast section of Eminem's Rap God - really, really impressive.
[+] carbocation|1 month ago|reply
This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.
[+] pyprism|1 month ago|reply
Wow, that’s weird. I tried Bengali, but the text was transcribed into Hindi! I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.
[+] espadrine|1 month ago|reply
It is quite impressive.

I saw the same impressive performance about 7 months ago here: https://kyutai.org/stt

Looking at the architecture of Voxtral 2, it seems to take a page from Kyutai's delayed streams modeling.

The reason the delay is configurable is that you can delay the text stream by a variable number of audio tokens. Each audio token is worth 80 ms of audio: the audio is converted to a spectrogram, fed to a convnet, and passed through a transformer audio encoder; the encoded audio embedding (one per 80 ms) then goes into a text transformer, which outputs a text embedding that is converted to a text token. Each text token is therefore also worth 80 ms, and a special [STREAMING_PAD] token is emitted when no word should be produced.

There is no cross-attention in either Kyutai's STT or Voxtral 2, unlike Whisper's encoder-decoder design!
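
For intuition, here is a minimal PyTorch sketch of this kind of delayed-stream decoder. All dimensions, layer counts, and names are made up for illustration; none of it is taken from Mistral's or Kyutai's actual implementations:

    import torch
    import torch.nn as nn

    N_MELS = 80      # mel-spectrogram bins (assumed)
    D_MODEL = 256    # hypothetical model width
    VOCAB = 32000    # hypothetical vocab; includes a [STREAMING_PAD] id

    class DelayedStreamSTT(nn.Module):
        def __init__(self):
            super().__init__()
            # convnet turns spectrogram frames into one embedding per 80 ms
            self.conv = nn.Conv1d(N_MELS, D_MODEL, kernel_size=3, padding=1)
            enc = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
            self.audio_encoder = nn.TransformerEncoder(enc, num_layers=2)
            # decoder-only text transformer over the audio embeddings:
            # no cross-attention, just causal self-attention
            dec = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
            self.text_transformer = nn.TransformerEncoder(dec, num_layers=2)
            self.lm_head = nn.Linear(D_MODEL, VOCAB)

        def forward(self, mel):                 # mel: (batch, N_MELS, frames)
            x = self.conv(mel).transpose(1, 2)  # (batch, frames, D_MODEL)
            # causal masks keep both stacks streaming: each 80 ms step
            # only attends to audio up to "now"
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            audio_emb = self.audio_encoder(x, mask=mask)
            h = self.text_transformer(audio_emb, mask=mask)
            # one text token per 80 ms frame; the configurable delay comes
            # from training the model to emit [STREAMING_PAD] until it has
            # heard enough audio to commit to a word
            return self.lm_head(h)

    model = DelayedStreamSTT()
    mel = torch.randn(1, N_MELS, 50)            # 50 frames = 4 s of audio
    print(model(mel).shape)                     # torch.Size([1, 50, 32000])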

[+] sheepscreek|1 month ago|reply
I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything: jargon, capitalization, all of it. Now I’m looking forward to doing that with 100% local inference!
[+] GolDDranks|1 month ago|reply
I can't get that demo to work. Tried with both Firefox and Chrome.
[+] Barbing|1 month ago|reply
Doesn’t seem to work in Safari on iOS 26.2, iPhone 17 Pro, with just about everything extra disabled.
[+] darkwater|1 month ago|reply
It's really nice, although I got a sentence in French when I was speaking Italian but corrected myself in the middle of a word.

But I'm definitely going to keep an eye on this for local-only STT for Home Assistant.

[+] rafram|1 month ago|reply
Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.
[+] mentalgear|1 month ago|reply
This is where European multilingual intelligence truly shines!
[+] colordrops|1 month ago|reply
Is this demo running fully in the browser?
[+] dmix|1 month ago|reply
> At approximately 4% word error rate on FLEURS and $0.003/min

Amazon's transcription service is $0.024 per minute - a pretty big difference: https://aws.amazon.com/transcribe/pricing/

[+] mdrzn|1 month ago|reply
Is it 0.003 per minute of audio uploaded, or "compute minute"?

For example, fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second", which (at 10-25x realtime) is far cheaper than all the competitors.
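
Working that out with the numbers in these two comments (the 10-25x realtime factors are the claimed throughput, not something I've measured):

    price_per_compute_second = 0.00125     # fal.ai Whisper pricing
    for speedup in (10, 25):               # claimed realtime factors
        compute_seconds = 60 / speedup     # compute time per audio minute
        cost = compute_seconds * price_per_compute_second
        print(f"{speedup}x realtime: ${cost:.4f} per audio minute")
    # 10x realtime: $0.0075 per audio minute
    # 25x realtime: $0.0030 per audio minute

So at the claimed speeds it lands between $0.003 and $0.0075 per audio minute - the same ballpark as Voxtral's quoted $0.003/min, and well under Amazon's $0.024/min, assuming Voxtral's price is indeed per audio minute.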

[+] iagooar|1 month ago|reply
In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukrainian? Belarusian? I would understand if an American company launched this, but for a company so proud of its European roots, I think it should have better support for major European languages.

I tried English + Polish:

> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.

[+] loire280|1 month ago|reply
They don't claim to support Polish, but they do support Russian.

> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.

I wonder how much having languages with the same roots (e.g. the romance languages in the list above or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both share some roots with Russian) affect the parameter count?

[+] lm28469|1 month ago|reply
> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.

Try sticking to the supported languages

[+] tdb7893|1 month ago|reply
Yeah, it's too bad. Apparently it only performs well in certain languages: "The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch"
[+] yko|1 month ago|reply
That's a mix of Polish and Ukrainian in the transcript. Now, if I try speaking Ukrainian, I get a transcript in Russian every time. That's upsetting.
[+] Cthulhu_|1 month ago|reply
Cracking non-English or accented / mispronounced English is the white whale of speech-to-text, I think; I don't know about you, but in our day-to-day chats there's a lot of jargon, randomly inserted English words, etc. And when people speak English it's often what I call expat English, which is what you get when non-native speakers only speak the language with other non-native speakers.

Add poor microphone quality (using a laptop to broadcast a presentation to a room audience isn't great) and you get a perfect storm of untranscribable presentations and meetings.

All I want from e.g. Teams is a good transcript and, more importantly, a clever summary. Because when you think about it, all the words spoken in a meeting, written down, would be pages and pages of content that nobody would want to read in full.

[+] moffkalast|1 month ago|reply
I'm not sure why, but their multilingual performance in general has usually been below average. For a French company, their models are not even close to being the best in French; they're even outdone by the likes of Qwen. I don't think they're focusing on anything but English; the rest is just marketing.
[+] mystifyingpoi|1 month ago|reply
TBH ChatGPT does the same when I mix Polish and English. I generally get some Cyrillic characters, and it gets super confused.
[+] DaedalusII|1 month ago|reply
Polish logically should be rendered in Cyrillic, as Cyrillic orthography more closely matches the sounds and consonant structure of Slavic languages like Polish and Russian, although this has never been done, for church reasons. Maybe this is confusing the AI.
[+] pietz|1 month ago|reply
Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally and it's hard to imagine there's something even better.
[+] janalsncm|1 month ago|reply
I noticed that this model is multilingual and understands 13 languages. For many use cases, we probably only need a single language, and the extra 12 simply add latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.

https://aclanthology.org/2025.findings-acl.87/

[+] m463|1 month ago|reply
I don't know. What about words borrowed from other languages? I think a cross-language model could improve lots of things.

For example, "here it is, voila!" "turn left on el camino real"

[+] popalchemist|1 month ago|reply
It doesn't make sense to have a language-restricted transcription model because of code switching. People aren't machines; we don't stick to our native languages without fail. Even monolingual people move in and out of their native language when using "borrowed" words and phrases. A single-language model will often fail to deal with that.
[+] mnbbrown|1 month ago|reply
Incroyable! Competitive with (if not better than) Deepgram Nova-3, and much better than AssemblyAI and ElevenLabs in basically all cases on our internal streaming benchmark.

The dataset is ~100 8 kHz call recordings with gnarly UK accents (which I consider to be the final boss of English-language ASR). It seems like it's SOTA.

Where it does fall down is the latency distribution, but I'm testing against the API - running it locally will no doubt improve that.

[+] owenbrown|1 month ago|reply
The other demos didn't work for me, so I made https://github.com/owenbrown/transcribe - it's just a Python script to test the streaming.

Wow, Voxtral is amazing. It will be great when someone stitches this up so an LLM starts thinking, researching for you, before you actually finish talking.

Like, create a conversation partner with sub-0.5-second latency. For example, you ask it a multi-part question and, as soon as you finish talking, it gives you the answer to the first part while it looks up the rest of the answer, then stitches it together so that there's no break.

The 2-3 second latency of existing voice chatbots is a non-starter for most humans.
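
Here's a rough sketch of that speculative pattern; transcribe_stream and llm_complete below are hypothetical stand-ins for a real streaming STT client and a real LLM client, not anything Voxtral ships:

    import asyncio

    async def transcribe_stream():
        # Hypothetical: yields growing partial transcripts as the user speaks.
        partials = ["What's the capital of France",
                    "What's the capital of France, and when was it founded?"]
        for partial in partials:
            await asyncio.sleep(0.3)   # stand-in for streaming STT latency
            yield partial

    async def llm_complete(prompt):
        # Hypothetical LLM call; replace with a real client.
        await asyncio.sleep(1.0)
        return f"answer to: {prompt!r}"

    async def speculative_chat():
        task = None
        async for partial in transcribe_stream():
            if task is not None:
                task.cancel()          # new words arrived: restart speculation
            # start answering the partial question before the user finishes
            task = asyncio.create_task(llm_complete(partial))
        # by end-of-speech the final request is already in flight
        print(await task)

    asyncio.run(speculative_chat())

In the real thing you'd keep partial results instead of throwing them away, but even this naive restart-on-update loop hides most of the LLM latency behind the tail of the user's speech.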

[+] mdrzn|1 month ago|reply
There's no comparison to Whisper Large v3 or other Whisper models.

Is it better? Worse? Why do they only compare to GPT-4o mini Transcribe?

[+] yko|1 month ago|reply
Played with the demo a bit. It's really good at English, and detects language change on the fly. Impressive.

But whatever I tried, it could not recognise my Ukrainian and would default to Russian with absolutely ridiculous transcriptions. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in the training material, and zero Ukrainian. Made me really sad.

[+] jiehong|1 month ago|reply
It’s nice, but the previous version wasn’t actually that great compared to Parakeet, for example.

We need better independent comparisons to see how it performs against the latest Qwen3-ASR, and so on.

I can no longer take at face value the cherry-picked comparisons from companies showing off their new models.

For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.

[+] serf|1 month ago|reply
things I hate:

"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"

So, you don't mean 'try this out', you mean 'buy this product'.

Let's not act like it's a free sampler.

I can't comment on the model: I'm not giving them money.

[+] satvikpendem|1 month ago|reply
Looks like this model doesn't do realtime diarization; what model should I use if I want that? So far I've only seen paid models do diarization well. I've heard about Nvidia NeMo but haven't tried it, and I don't even know where to try it out.
[+] fph|1 month ago|reply
Is there an open-source Android keyboard that would support it? Everything I find is based on Whisper, which is from 2022 - ages ago given how fast AI is evolving.
[+] barrell|1 month ago|reply
Very happy with all the Mistral work. I feel like I'm always one release behind them. When they released Mistral 3, I commented saying how excited I was to try it out [1].

Well, I'm happy to report I integrated the new Mistral 3 and have been truly astounded by the results. I'm still not a big fan of the model with respect to factual information - it seems to be especially confident and especially wrong if left to its own devices - but with http://phrasing.app I do most of the data aggregation myself and just use an LLM to format it. Mistral 3 was a drop-in replacement with 3x the quality (it was already very, very good), a 0% error rate for my use case (an issue where it occasionally went off the rails was entirely solved), and it sticks to my formatting guidelines perfectly (which even gpt-5-pro failed on). Plus it was somehow even cheaper.

I'm using Scribe v2 at the moment for STT, but I'm very excited now to try integrating Voxtral Transcribe. The language support is a little lacking for my use cases, but I can always fall back to Scribe and amortize the cost across languages. I was actually due to work on the transcription part of phrasing very soon, so look forward to my (hopefully) glowing review on their next HN launch! XD

[1] https://news.ycombinator.com/item?id=46121889#46122612

[+] XCSme|1 month ago|reply
Is it me, or is an error rate of 3% really high?

If you transcribe a minute of conversation, you'll have something like 5 words transcribed wrongly. In an hour-long podcast, that's around 300 wrongly transcribed words.
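
Back of the envelope, assuming roughly 150 spoken words per minute (a typical conversational rate; that assumption is mine, not from the announcement):

    wer = 0.03               # 3% word error rate
    wpm = 150                # assumed conversational speaking rate
    print(wer * wpm)         # ~4.5 wrong words per minute
    print(wer * wpm * 60)    # ~270 wrong words in an hour-long podcast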

[+] gwerbret|1 month ago|reply
I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields of endeavor. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, and law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.
[+] aavci|1 month ago|reply
What are the cheapest device specs that this could realistically run on?
[+] antirez|1 month ago|reply
Italian is, I believe, the most phonetically advanced human language. It strikes the right compromise between information density, understandability, and the ability to speak much faster to compensate for the redundancy. It's as if it had error correction built in. Note that it not only has the lowest error rate here, but it is also underrepresented in most datasets.
[+] cyp0633|1 month ago|reply
It performs well on Mandarin audio transcription, considering it's a European company. It's weird, though, that it keeps adding spaces between individual Chinese characters and mixing traditional and simplified characters.