item 35365399

Universal Speech Model

542 points | rzk | 2 years ago | sites.research.google

198 comments

[+] mikeortman|2 years ago|reply
On a related note, for anyone interested in this who wants better performance today:

I messed with a combo of Whisper and ChatGPT. I took a Whisper transcript and asked ChatGPT to fix mistranscriptions using the context of the transcript and based on potential phonetic issues. I asked it to replace transcribed words that don't make sense with "[unintelligible]", which improved the output even more.

Transcription error rate was almost nonexistent, even on the smallest whisper model.
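Claims like "almost nonexistent" are usually quantified as word error rate (WER): the word-level edit distance between the reference and the hypothesis, divided by the reference length. A minimal, self-contained sketch of that standard computation:

```python
# Word error rate (WER): word-level Levenshtein distance divided by
# the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the bat sat on a mat"))  # 2 errors / 6 words
```

One caveat with the LLM post-correction trick: replacing words with "[unintelligible]" counts as substitutions under this metric, so it improves readability without necessarily lowering the measured WER.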

[+] akiselev|2 years ago|reply
> USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) on widely-spoken languages like English and Mandarin, but also languages like Punjabi, Assamese, Santhali, Balinese, Shona, Malagasy, Luganda, Luo, Bambara, Soga, Maninka, Xhosa, Akan, Lingala, Chichewa, Nkore, Nzema to name a few.

I find this part to be the most impressive thing in the OP. Most of those languages are spoken by fewer than 0.1% of the world and Nzema is spoken by less than half a million people. Where are they even getting enough training data to figure those languages out?

[+] est31|2 years ago|reply
They used data from YouTube, with multiple selection methods. A little supervised data for 72 languages (90k hours of audio with text labels provided), some pseudo-supervised data in English (100k hours; the English labels were generated by having a model do the labeling), and a LOT of unsupervised data in 568 languages (12.1 million hours). That last group has no labels (aka captions) available, just the audio, but they create pseudo-labels by using the FLEURS dataset. I'm not really sure how this works, as FLEURS itself only has 102 languages... I guess, in other words, it just makes some phonetic impression of how that language should be written and then compares itself to how well it can do that impression?
[+] user3939382|2 years ago|reply
I’m familiar with Lingala and Punjabi, they have millions of speakers. The others might too.
[+] somebee|2 years ago|reply
We've been researching different speech models at Scrimba, and went for Whisper on our own infrastructure. A few days ago I stumbled onto Deepgram, which blows Whisper out of the water in terms of speed and accuracy (we need high-precision word-level timestamps). I thought their claim of being 80x faster than Whisper had to be hyperbole, but it turned out to be true for us. Would recommend checking it out for anyone who needs performant speech-to-text.
[+] sason|2 years ago|reply
80x faster than Whisper is an incredible feat. How is Deepgram's transcription accuracy?

Also, have you heard of Conformer-1 by Assembly-AI[1]? It released a few days ago and supposedly scored higher than Whisper on various benchmarks.

[1]: https://www.assemblyai.com/blog/conformer-1/

[+] pantalaimon|2 years ago|reply
The nice thing about whisper is that it runs locally.
[+] Simon321|2 years ago|reply
I saw Deepgram's claims as well and believed them too; then I tried it, and it was TERRIBLE. Don't believe them. It only does well on the benchmark they trained it on. It is faster, but the quality is terrible.
[+] abraxas|2 years ago|reply
Yeah, I'm not sure why people get so hyped up about Whisper. In production use it's middling at best, and there are commercial offerings that handily beat it in both accuracy and speed.

Whisper is mostly an academic toy.

[+] leetharris|2 years ago|reply
I work at Rev.AI, check us out

Lowest WER in the industry, cheaper than Whisper API, and we have an on-prem solution

[+] elif|2 years ago|reply
If it's the auto CC currently used by YouTube... I think it needs a few billion more (or less) sentences.

It is comically bad, with nonsense words that don't make any sense in context about every other sentence in English, and an absolute inability to produce a coherent thought in Japanese.

[+] GaggiX|2 years ago|reply
The auto CC used by YouTube has improved greatly; when I started using it years ago it was almost unintelligible, but now it's impressive. (At least in English and Italian it's practically perfect now.)
[+] onion2k|2 years ago|reply
USM has a word error rate of ~15% on US English according to the article. That means it's getting roughly one or two words wrong in every sentence. If you're seeing wrong words in every other sentence it's doing better than you'd expect.
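A rough sanity check on that claim (my own arithmetic, not from the article, and it treats word errors as independent, which real ASR errors aren't since they cluster):

```python
# If each word is wrong independently with probability p = 0.15 (15% WER),
# how many errors does a typical sentence carry, and how often is a
# sentence completely error-free?
p = 0.15
for sentence_len in (10, 15, 20):
    expected_errors = p * sentence_len
    p_clean = (1 - p) ** sentence_len
    print(f"{sentence_len} words: ~{expected_errors:.1f} errors, "
          f"{p_clean:.0%} chance of a clean sentence")
```

At 15% WER, a 10-word sentence carries ~1.5 errors on average and has only about a 20% chance of being error-free, which matches the point above: a wrong word in merely "every other sentence" would be better than the reported rate predicts.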
[+] ClimaxGravely|2 years ago|reply
That was my immediate thought as well. The Japanese-to-English CCs are absolutely terrible to this day.
[+] 2h|2 years ago|reply
Agree on this. I wonder how threads like this even get off the ground. Anyone who watches about 10 minutes of auto-captioned videos can see how awful it is.
[+] kerpotgh|2 years ago|reply
Whisper also manages to add punctuation and line breaks, and can attribute the speech to a particular speaker. The YouTube version is all lower case and has no punctuation. That "simple" change alone would make YouTube CC so much better.
[+] jiggawatts|2 years ago|reply
These "pure speech" models could really benefit from being coupled to a large language model like ChatGPT.

YouTube live transcriptions are terrible, because they get confused by homonyms and can't follow the context in a sentence.

In the same manner that DALL-E joined an LLM to an image generator, they ought to train a combined speech model + LLM so that the uncertainties in the speech model output are disambiguated by the LLM.

[+] reliableturing|2 years ago|reply
This is exactly part of Google's USM approach, although these pretrained models are significantly smaller than ChatGPT. They reference this paper [1], which contains more details on aligning a pretrained text-only model with the speech model.

[1] https://arxiv.org/abs/2209.15329

[+] spullara|2 years ago|reply
Use whisper. It doesn't get confused in that way.
[+] IshKebab|2 years ago|reply
Are you sure about that? I've seen YouTube captions understand homophones, and Google Assistant definitely can (though that may be linked to some other system).

Surely the speech recognition model itself learns some basic language statistics in order to recognise homophones?

[+] kleiba|2 years ago|reply
There are no "pure speech" models. All ASR uses language models.
[+] andy_ppp|2 years ago|reply
Same with spell check, with context and phonetics you could make autocorrection better than any human.
[+] lysozyme|2 years ago|reply
Interesting that they enriched the training data by asking people to point out YouTube videos in specific languages for which they needed data

> YT-513-U: We create an additional dataset called YT-513-U to ensure coverage of lower resource languages in our pre-training dataset. We reached out to vendors and native speakers to identify YT videos containing speech in specific long tail languages, collecting a dataset of unlabeled speech in 513 languages. [1]

1. https://arxiv.org/abs/2303.01037

[+] gigel82|2 years ago|reply
That's cool, but Whisper is open source and I can run it today on my machine (even without a GPU) - it gives great results even compiled to WebAssembly and running in the browser with smaller models.

Totally free.

This needs to be much better to make sense and their own graphs show only marginal improvements in specific scenarios.

[+] thangalin|2 years ago|reply
For my sci-fi story (alpha readers wanted; see profile), I used Whisper to transcribe an interview of a Malawian President. From there, I created a vocabulary comprised of only the president's words, which I used almost exclusively when writing his speech.

The results from Whisper are incredible, with very few mistakes. Though it did get Nelson Mandela's first name wrong (transcribed as Nesson). What's more, Whisper finished transcribing a 60-minute audio stream in 20 minutes on commodity hardware (T1000 G8 NVIDIA GPU). Broadly, here are the steps I used:

* Download and install podman.

* Download and install git.

* Download and install curl.

* Open a command prompt.

* Run the following commands to containerize Whisper:

    git clone https://github.com/lablab-ai/whisper-api-flask whisper
    cd whisper
    mv Dockerfile Containerfile
    podman build --network="host" -t whisper .
    podman run --network="host" -p 5000:5000 whisper
* Download MP3 file (e.g., filename.mp3).

* Run the following command to produce a transcription:

    curl -F "file=@filename.mp3" http://localhost:5000/whisper
[+] mmcwilliams|2 years ago|reply
It's interesting because while evaluating Whisper for an ASR task I found it to have some entertaining hallucinations when provided with silent or garbled audio.

For instance, this was added to the transcription of a silent section of audio:

> Hello everyone welcome to my channel. Today Im going to show you how to make a very simple and easy recipe. I hope you enjoy the video. If you like my recipe dont forget to subscribe to my channel

It makes me wonder how much of Whisper is trained on audio from Youtube, which was transcribed by this model.

[+] braindead_in|2 years ago|reply
Wow, training on a dataset of 12 million hours is quite impressive! I can only imagine the engineering feats required to accomplish that. To put it into perspective, Whisper was trained on 680K hours, Speechmatics' Ursa was trained on 1M hours, and AssemblyAI trained Conformer-1 on 650K hours. I hope Meta is also working on something similar!

That being said, Speaker Diarisation is still a problem that hasn't been fully solved. As of yet, AI hasn't been able to outperform humans in this area.

[+] anigbrowl|2 years ago|reply
Release the model or go home, Google. I'm really tired of their striptease approach.
[+] novaRom|2 years ago|reply
2B weights is not too big, so with some compression + sparsity + down-sampling it will run on any device offline
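A back-of-envelope check on that claim (my own arithmetic, not from the article): the memory footprint of 2B weights at various precisions, ignoring activations and runtime overhead.

```python
# Approximate weight storage for a 2-billion-parameter model at
# different precisions (weights only; activations and overhead excluded).
params = 2e9
for name, bytes_per_weight in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_weight / 1e9
    print(f"{name}: {gb:.1f} GB")  # fp32: 8.0 GB ... int4: 1.0 GB
```

So even unquantized fp16 weights fit on a modern phone's storage, and int8/int4 quantization brings it into the range where on-device offline inference is plausible.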
[+] ironfootnz|2 years ago|reply
I don't understand why Google doesn't let people use this in the wild; that's the best-case scenario for improving it. I've seen great pull requests for Whisper and lots of cool mods to help.

It feels like watching an ad for a product with no 0800 number, QR code, or web URL to go and use it. So frustrating.

[+] orwellg1984|2 years ago|reply
Is this how it's gonna be now? No more model releases, just "request our API"?
[+] MayeulC|2 years ago|reply
Slightly disappointed this is for recognition, not for synthesis (and not open source).

Is there any good quality, open source multi-language TTS setup available? Flite works, but doesn't sound very good.

[+] jerpint|2 years ago|reply
I’ve never seen such a difficult form to fill out to get access to an API; it’s Kafkaesque.
[+] abraxas|2 years ago|reply
I would not use Whisper as a good yardstick for English language transcription. I'm not sure what the hubbub is all about, but for myself, as a non-native English speaker, Whisper is not very impressive. There are engines out there that produce a far better word error rate on my speech than Whisper does.

Maybe it works well with native speakers? But since it's supposed to be so multilingual I hoped that it would work well with my accented speech... maybe that's a wrong conclusion to draw.

[+] pantalaimon|2 years ago|reply
What alternative to whisper would you recommend? I just recently looked at what’s available, and most of what I found was much worse - what did I miss?
[+] iandanforth|2 years ago|reply
This is an impressive feat. I wish auto captioning were even better, though. At least for Japanese videos I find it to be less than great. If you try to throw auto-translation on top of that (which it's impressive YouTube attempts at all), it falls pretty flat.
[+] astrange|2 years ago|reply
It would work better if you could feed speech embeddings to the translation model directly, since it has more language knowledge to choose what the original was more likely to say.

(Of course that might just lead to picking more common translated phrases.)

[+] codedokode|2 years ago|reply
> trained on 12 million hours of speech

Humans train to recognize speech on much smaller datasets. If a small human is awake 16 hrs/day, that amounts to maximum 5840 hrs/year or 58400 hrs per 10 years. Why do mathematical models use more data and produce lower quality results? Is it because they don't understand the meaning of words?

[+] wasabi991011|2 years ago|reply
>that amounts to maximum 5840 hrs/year or 58400 hrs per 10 years

But no human can understand 300+ languages, especially not at 10 years old.

Still a very approximate comparison, but if you multiply your upper bound by 300, that gives 17.5 million hours, so more than what was used to train.
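The arithmetic in this subthread checks out:

```python
# Waking hours per year and per decade for a human awake 16 hrs/day,
# then scaled by ~300 languages for a (very rough) comparison with the
# 12M-hour USM training set.
hours_per_year = 16 * 365               # 5840
hours_per_decade = hours_per_year * 10  # 58400
scaled = hours_per_decade * 300         # 17,520,000 -- above 12.1M training hours
print(hours_per_year, hours_per_decade, scaled)
```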

[+] vkazanov|2 years ago|reply
My personal explanation is that large ML models are stateless while the growing brain is a (very) stateful thing. Stateless imitation will never be able to fully replicate a stateful system.

We evolved into having these brains that learn in stages, where the early years are responsible for core functions (walking, running, talking, etc.). There's a lot of input (all the senses) that feeds learning, while in later years we mostly just take for granted whatever we learnt as children.

[+] verbify|2 years ago|reply
The dataset humans are using does not just include the audio stream but also visual context. E.g. people pointing at things while talking about them.
[+] b0afc375b5|2 years ago|reply
To be fair, humans have the benefit of millions (billions?) of years of evolution.