On a related note, for anyone interested in this who wants better performance today:
I messed with a combo of Whisper and ChatGPT. I took a Whisper transcript and asked ChatGPT to fix mistranscriptions using the context of the transcript and based on potential phonetic issues. I asked it to replace transcribed words that don't make sense with "[unintelligible]", which improved the output even more.
Transcription error rate was almost nonexistent, even on the smallest whisper model.
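A minimal sketch of this post-processing pass. The prompt wording is illustrative (the original comment doesn't give the exact instructions used), and the actual chat-API call is only indicated in a comment rather than executed:

```python
def build_cleanup_prompt(transcript: str) -> str:
    """Instruction prompt asking an LLM to repair an ASR transcript."""
    return (
        "Below is an automatic speech-recognition transcript. Fix likely "
        "mistranscriptions using the surrounding context and plausible "
        "phonetic confusions. Replace any word that makes no sense in "
        "context with \"[unintelligible]\". Return only the corrected "
        "transcript.\n\n" + transcript
    )

# The prompt would then be sent as a single chat message, e.g. with the
# openai package's chat completion API (not executed here).
prompt = build_cleanup_prompt("The whether was nice today.")
print(prompt.endswith("The whether was nice today."))  # True
```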
> USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) on widely-spoken languages like English and Mandarin, but also languages like Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, Nzema to name a few.
I find this part to be the most impressive thing in the OP. Most of those languages are spoken by fewer than 0.1% of the world and Nzema is spoken by less than half a million people. Where are they even getting enough training data to figure those languages out?
They used data from YouTube, with multiple selection methods: a little supervised data for 72 languages (90k hours of audio with text labels provided), some pseudo-supervised data in English (100k hours; the English labels were generated by having a model do the labeling), and a LOT of unsupervised data in 568 languages (12.1 million hours). That last group has no labels (aka captions) available, just the audio, but they create pseudo-labels by using the FLEURS dataset. I'm not really sure how this works, as FLEURS itself only has 102 languages... I guess, in other words, it just makes some phonetic impression of how that language should be written and then compares itself to how well it can do that impression?
We've been researching different speech models at Scrimba, and went for Whisper on our own infrastructure. A few days ago I stumbled onto Deepgram, which blows Whisper out of the water in terms of speed and accuracy (we need high-precision word-level timestamps). I thought their claim of being 80x faster than Whisper had to be hyperbole, but it turned out to be true for us. Would recommend checking it out for anyone who needs performant speech-to-text.
I saw Deepgram's claims as well and believed them too; then I tried it, and it was TERRIBLE. Don't believe them. It only does well on the benchmark they trained it on. It is faster, but the quality is terrible.
Yeah, I'm not sure why people get so hyped up about Whisper. In production use it's middling at best, and there are commercial offerings that handily beat it in both accuracy and speed.
If it's the auto CC currently used by YouTube... I think it needs a few billion more (or less) sentences.
It is comically bad, with nonsense words that don't make any sense in context about every other sentence in English, and an absolute inability to produce a coherent thought in Japanese.
The auto CC used by YouTube has improved greatly. When I started using it years ago it was almost unintelligible, but now it's impressive. (At least in English and Italian it's practically perfect now.)
USM has a word error rate of ~15% on US English according to the article. That means it's getting roughly one or two words wrong in every sentence. If you're seeing wrong words in every other sentence it's doing better than you'd expect.
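The arithmetic above can be made concrete. Word error rate is (substitutions + insertions + deletions) divided by the reference length, computed here with a standard word-level edit distance; the example sentence is invented for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Two wrong words in a 13-word sentence is roughly the quoted 15% WER:
ref = "the quick brown fox jumps over the lazy dog near the old barn"
hyp = "the quick brown fox jumps over a lazy dog near the old barge"
print(round(wer(ref, hyp), 3))  # 0.154
```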
Agree on this. I wonder how threads like this even get off the ground; anyone who watches about 10 minutes of auto-captioned videos can see how awful it is.
Whisper also manages to add punctuation and line breaks, and can attribute the speech to a particular speaker. The YouTube version is all lower case and has no punctuation. That "simple" change would make things so much better in YouTube CC.
These "pure speech" models could really benefit from being coupled to a large language model like ChatGPT.
YouTube live transcriptions are terrible, because they get confused by homophones and can't follow the context in a sentence.
In the same manner that DALL-E joined an LLM to an image generator, they ought to train a combined speech model + LLM so that the uncertainties in the speech model output are disambiguated by the LLM.
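A toy illustration of the idea: the ASR front end emits several acoustically plausible hypotheses, and a language model picks the one that is most probable as text. The "LM" here is just a hand-made bigram table standing in for a real model; all scores are invented:

```python
# Invented bigram probabilities; unseen bigrams get a small default.
BIGRAM_SCORES = {
    ("write", "code"): 0.9,
    ("right", "code"): 0.1,
    ("turn", "right"): 0.9,
    ("turn", "write"): 0.05,
}

def lm_score(sentence: str) -> float:
    """Product of bigram scores; a proxy for LM likelihood."""
    words = sentence.split()
    score = 1.0
    for a, b in zip(words, words[1:]):
        score *= BIGRAM_SCORES.get((a, b), 0.01)
    return score

def rescore(hypotheses: list[str]) -> str:
    """Pick the acoustically plausible hypothesis the LM likes best."""
    return max(hypotheses, key=lm_score)

print(rescore(["i write code", "i right code"]))        # i write code
print(rescore(["turn right here", "turn write here"]))  # turn right here
```

Real systems do essentially this with n-best lists or lattices, just with a much stronger language model.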
This is exactly part of the Google USM approach, although these pretrained models are significantly smaller than ChatGPT. They reference this paper [1], which contains more details on the pretrained text-only alignment with the speech model.
Are you sure about that? I've seen YouTube captions understand homophones, and Google Assistant definitely can (though that may be linked to some other system).
Surely the speech recognition model itself learns some basic language statistics in order to recognise homophones?
Interesting that they enriched the training data by asking people to point out YouTube videos in specific languages for which they needed data.
> YT-513-U: We create an additional dataset called YT-513-U to ensure coverage of lower resource languages in our pre-training dataset. We reached out to vendors and native speakers to identify YT videos containing speech in specific long tail languages, collecting a dataset of unlabeled speech in 513 languages. [1]
That's cool, but Whisper is open source and I can run it today on my machine (even without a GPU) - it gives great results even compiled to WebAssembly and running in the browser with smaller models.
Totally free.
This needs to be much better to make sense and their own graphs show only marginal improvements in specific scenarios.
For my sci-fi story (alpha readers wanted; see profile), I used Whisper to transcribe an interview of a Malawian President. From there, I created a vocabulary comprised of only the president's words, which I used almost exclusively when writing his speech.
The results from Whisper are incredible, with very few mistakes. Though it did get Nelson Mandela's first name wrong (transcribed as Nesson). What's more, Whisper finished transcribing a 60-minute audio stream in 20 minutes on commodity hardware (T1000 G8 NVIDIA GPU). Broadly, here are the steps I used:
* Download and install podman.
* Download and install git.
* Download and install curl.
* Open a command prompt.
* Run the following commands to containerize Whisper:
It's interesting because while evaluating Whisper for an ASR task I found it to have some entertaining hallucinations when provided with silent or garbled audio.
For instance, this was added to the transcription of a silent section of audio:
> Hello everyone welcome to my channel. Today Im going to show you how to make a very simple and easy recipe. I hope you enjoy the video. If you like my recipe dont forget to subscribe to my channel
It makes me wonder how much of Whisper is trained on audio from YouTube, which was transcribed by this model.
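One common mitigation for these silence hallucinations: Whisper's `transcribe()` output includes per-segment `no_speech_prob` and `avg_logprob` fields, and segments produced over silence tend to score badly on both. The thresholds below are illustrative guesses, not tuned values, and the sample segments are fabricated:

```python
# Illustrative thresholds; tune on your own audio.
NO_SPEECH_THRESHOLD = 0.6
LOGPROB_THRESHOLD = -1.0

def drop_likely_hallucinations(segments: list[dict]) -> list[dict]:
    """Filter Whisper segments that were probably decoded from silence."""
    kept = []
    for seg in segments:
        if (seg["no_speech_prob"] > NO_SPEECH_THRESHOLD
                and seg["avg_logprob"] < LOGPROB_THRESHOLD):
            continue  # high no-speech probability + low confidence: drop
        kept.append(seg)
    return kept

# Fabricated segments mimicking the shape of result["segments"]:
segments = [
    {"text": "actual speech", "no_speech_prob": 0.05, "avg_logprob": -0.3},
    {"text": "welcome to my channel", "no_speech_prob": 0.95, "avg_logprob": -1.8},
]
print([s["text"] for s in drop_likely_hallucinations(segments)])
# ['actual speech']
```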
Wow, training on a dataset of 12 million hours is quite impressive! I can only imagine the engineering feats required to accomplish that. To put it into perspective, Whisper was trained on 680K hours, Speechmatics' Ursa was trained on 1M hours, and AssemblyAI trained Conformer-1 on 650K hours. I hope Meta is also working on something similar!
That being said, Speaker Diarisation is still a problem that hasn't been fully solved. As of yet, AI hasn't been able to outperform humans in this area.
I don't understand why Google doesn't let people use this in the wild; that's the best-case scenario for improving it. I've seen great pull requests for Whisper and lots of cool mods to help.
It feels like watching an ad for a product with no 0800 number, QR code, or web URL to go and use it. So frustrating.
There are a lot of text-to-speech models on Hugging Face. Some are really great and support many languages! They can be run offline easily; you just need a GPU and some Python modules. https://huggingface.co/models?pipeline_tag=text-to-speech
I would not use Whisper as a good yardstick for English-language transcription. I'm not sure what the hubbub is all about, but for me, as a non-native English speaker, Whisper is not very impressive. There are engines out there that produce a far better word error rate on my speech than Whisper does.
Maybe it works well with native speakers? But since it's supposed to be so multilingual I hoped that it would work well with my accented speech... maybe that's a wrong conclusion to draw.
What alternative to whisper would you recommend?
I just recently looked at what’s available, and most of what I found was much worse - what did I miss?
This is an impressive feat. I wish auto captioning were even better, though. At least for Japanese videos I find it to be less than great. If you throw auto-translation on top of that (which it's impressive YouTube attempts at all), it falls pretty flat.
It would work better if you could feed speech embeddings to the translation model directly, since it has more language knowledge to choose what the original was more likely to say.
(Of course that might just lead to picking more common translated phrases.)
Humans train to recognize speech on much smaller datasets. If a small human is awake 16 hrs/day, that amounts to at most 5,840 hrs/year, or 58,400 hrs per 10 years. Why do mathematical models use more data and produce lower-quality results? Is it because they don't understand the meaning of words?
My personal explanation is that large ML models are stateless while the growing brain is a (very) stateful thing. Stateless imitation will never be able to fully replicate a stateful system.
We evolved into having these brains that learn in stages, where the early years are responsible for core functions (walking, running, talking, etc). There's a lot of input (all the senses) that feeds learning, while in later years we mostly just take for granted whatever we learnt as children.
mikeortman | 2 years ago
akiselev | 2 years ago
est31 | 2 years ago
user3939382 | 2 years ago
somebee | 2 years ago
sason | 2 years ago
Also, have you heard of Conformer-1 by AssemblyAI[1]? It was released a few days ago and supposedly scored higher than Whisper on various benchmarks.
[1]: https://www.assemblyai.com/blog/conformer-1/
pantalaimon | 2 years ago
Simon321 | 2 years ago
abraxas | 2 years ago
Whisper is mostly an academic toy.
leetharris | 2 years ago
Lowest WER in the industry, cheaper than Whisper API, and we have an on-prem solution
elif | 2 years ago
GaggiX | 2 years ago
onion2k | 2 years ago
pcurve | 2 years ago
But it has come a long way since then.
ClimaxGravely | 2 years ago
2h | 2 years ago
kerpotgh | 2 years ago
jiggawatts | 2 years ago
reliableturing | 2 years ago
[1] https://arxiv.org/abs/2209.15329
spullara | 2 years ago
IshKebab | 2 years ago
kleiba | 2 years ago
andy_ppp | 2 years ago
lysozyme | 2 years ago
1. https://arxiv.org/abs/2303.01037
gigel82 | 2 years ago
thangalin | 2 years ago
* Download MP3 file (e.g., filename.mp3).
* Run the following command to produce a transcription:
mmcwilliams | 2 years ago
braindead_in | 2 years ago
anigbrowl | 2 years ago
novaRom | 2 years ago
ironfootnz | 2 years ago
orwellg1984 | 2 years ago
MayeulC | 2 years ago
Is there any good quality, open source multi-language TTS setup available? Flite works, but doesn't sound very good.
neop1x | 2 years ago
Ruthalas | 2 years ago
[1] https://github.com/rhdunn/espeak
jerpint | 2 years ago
practice9 | 2 years ago
abraxas | 2 years ago
pantalaimon | 2 years ago
emadda | 2 years ago
https://bigwav.app
iandanforth | 2 years ago
astrange | 2 years ago
codedokode | 2 years ago
wasabi991011 | 2 years ago
But no human can understand 300+ languages, especially not at 10 years old.
Still a very approximate comparison, but if you multiply your upper bound by 300, that gives 17.5 million hours, so more than what was used to train.
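Spelling out the arithmetic in this subthread as a quick sanity check:

```python
# Hours of audio exposure a child could plausibly get, per the comment above.
hours_awake_per_day = 16
hours_per_year = hours_awake_per_day * 365   # 5,840
hours_per_decade = hours_per_year * 10       # 58,400

# Scaling the single-human upper bound to the ~300 languages a model covers.
languages = 300
total_hours = hours_per_decade * languages   # 17,520,000 -- above USM's 12.1M
print(hours_per_year, hours_per_decade, total_hours)
```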
vkazanov | 2 years ago
verbify | 2 years ago
b0afc375b5 | 2 years ago