Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.
The 4 examples are stunningly good (speakers with heavy accents, speech in foreign languages, dynamic background noise, etc.) - this is far and away better than anything else I've seen. I'll be super curious to see other folks try it out and find out whether it's as robust as it seems, including when confronted with speech full of natural tics and uhhh's and uhmm's and everything in between.
I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.
It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).
Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.
Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise-reduction bath and come out sounding like it was recorded in a room full of soft furniture.
The French version is a little contrived. The speaker is a native speaker, but the text is obviously the result of a translation from English to French, not idiomatic French.
I will try to put the code to the test, see how it goes.
More of this is welcome; they should live up to their name and original purpose and share other models (code, weights, datasets) with the open source community as well.
It seems far from good with mixed-language content, especially English and Japanese together. The timestamps are well off, and for the more ambiguous translations that depend on the context of a word, it's nowhere close to human - far below what anyone who spoke either language would consider acceptable. Maybe it's unfair to use music, but music is the most realistic test of whether it's superior to the average human.
This is an astonishing package. Every AI voice-to-text model I've tried on "The Wire"'s famous "fuck" scene [0] has failed, because the YouTube clip's audio quality is bad and it's a scene with virtually no dialogue except breathing and "Fuck". But Whisper returned impressive results [1]

[0] https://www.youtube.com/watch?v=DS6pE88Xg3s
Hey this looks great! I like to record audio notes while driving in my car after work, to kind of decompress my thoughts from the day. But I never go back and listen as they can be long and meandering. Sometimes in the audio log I will sum up my thoughts, but this might be 20 minutes in and hard to find. I really wish I had transcriptions so I could easily scan the full contents. I have tried Mozilla DeepSpeech (I don't want a cloud solution) and I was surprised to find that I could not get DeepSpeech to reliably transcribe them. There is a bit of road noise, though I think for a human listener they are easy to understand. It looks like this one might actually do the trick!
EDIT: Tried it and it worked great! It is very easy to use: I just ran the one pip install line in the readme and was ready to go, then ran the program as "whisper my_audio.wav" and it went. Really nice job OpenAI!
I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.
Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):
                     Talon   Talon   Talon   Whisper   wav2vec 2.0
                     28M     300M    1B      Large     960h
  librispeech clean  3.21    2.52    2.40    2.7       2.7
  librispeech other  8.21    6.56    5.63    5.6       6.2
  common voice       13.88   11.65   8.86    9.5       29.9
  tedlium            7.51    6.55    5.47    4.0       10.5
I have a battery of more difficult tests on hand (including adversarial tests, and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.
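For context, the numbers in the table are word error rates (WER): the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal pure-Python implementation, for anyone wanting to score their own test sets:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program, over words rather than characters.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

For example, wer("the cat sat", "the cat sat down") is 1/3: one inserted word against three reference words.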
One of the things they point out is that the SoTA on e.g. LibriSpeech is only good at LibriSpeech, and doesn't generalise as well.
> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
Just tested this on some developer podcasts which usually fail hard given they're full of technical jargon, brand names, etc. Whisper is a revolution! It's picking up terms like Heroku, DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly - something nothing else did unless you provided a whole pile of guiding vocabulary.
Hold on, it does not only speech recognition, but also language translation, in the same model?
What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?
It just seems so odd, given the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of the problem domain). Seems so unusual to have both handled by one model!
Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.
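Per the paper, the answer is that a single decoder is conditioned on special control tokens selecting the language and the task, so transcription and translation share all of the learned acoustics. A toy pure-Python sketch of that idea (the token names mimic Whisper's prompt format; the decode function and phrase table are made up purely for illustration):

```python
# Fake demo data: stands in for what a real translation decoder would learn.
PHRASES = {"hola mundo": "hello world"}

def decode(recognized: str, language: str, task: str) -> str:
    """One 'decoder' whose behavior is switched by a task token,
    the way Whisper selects <|transcribe|> vs <|translate|>."""
    prompt = f"<|startoftranscript|><|{language}|><|{task}|>"
    if task == "transcribe":
        return prompt + recognized           # keep the source language
    if task == "translate":
        return prompt + PHRASES[recognized]  # emit English instead
    raise ValueError(f"unknown task: {task}")
```

The point of the sketch: nothing about the input audio changes between the two tasks, only the conditioning token, which is why the shared acoustic knowledge transfers.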
If you want to give it a shot, you can find the Python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime

A bit more context on how it works:
The system's default audio input is captured with Python, split into small chunks, and then fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and avoids splitting the audio buffer in those cases. Given how the model is designed this isn't the most natural approach, but I found it worth trying. It works acceptably well.
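The word-break heuristic amounts to only cutting the stream where the signal energy dips. A minimal pure-Python sketch of that idea (the window size and threshold are made-up values, and real code would operate on a live microphone buffer rather than a list of samples):

```python
def split_at_silence(samples, window=4, threshold=0.1):
    """Split a sample stream into chunks, cutting only inside
    low-energy windows so words are less likely to be bisected."""
    chunks, current = [], []
    for i in range(0, len(samples), window):
        win = samples[i:i + window]
        current.extend(win)
        # Mean absolute amplitude as a cheap energy estimate.
        energy = sum(abs(s) for s in win) / len(win)
        if energy < threshold and current:
            chunks.append(current)  # end the chunk at a quiet point
            current = []
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be handed to the transcription function on its own.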
This really makes me want to build an Amazon Echo/Google Nest/etc replacement that's open hardware, open source, and most importantly recognises voice completely offline. I find that I don't use these smart devices for much more than setting timers anyway, so this seems like an easy project.
I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.
[00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
[00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
[00:11.000 --> 00:14.500] I don't have time to eat.
[00:15.500 --> 00:18.000] I'm going to eat now.
[00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
[00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
[00:31.000 --> 00:36.000] I feel like I'm losing my女子力.
[00:36.000 --> 00:39.000] I have to go back to my original self.
[00:39.000 --> 00:44.000] I have to get ready and go to bed.
[00:44.000 --> 00:46.000] It's not good.
[00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
[00:51.000 --> 00:53.000] I have to get my nails done this fall.
[00:53.000 --> 00:54.000] Halloween nails.
[00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
[00:57.000 --> 00:59.000] I'm going to the beauty salon today.
[00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
[01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
[01:10.000 --> 01:12.000] I'm going crazy.
[01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.
It's struggling with Norwegian. Which I guess isn't shocking. The large model performs a fair bit better than the small, though neither is "good".
Though I assume the amount of Norwegian it has been exposed to is fairly limited, so in that light I'm actually impressed as well.
I tried it on a news segment from the radio[1], this is the large model output:
[00:14.000 --> 00:17.200] En skamløs krenking av FN pakten.
[00:17.200 --> 00:24.000] USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
[00:25.500 --> 00:29.400] Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
[00:29.400 --> 00:33.400] Men hvordan ville det gått, om det var motsatt?
[00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
[00:38.900 --> 00:44.900] men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
[00:45.600 --> 00:51.400] Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
[00:51.400 --> 00:59.900] Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
[00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
For reference, here's what he actually said, from the source[1] itself:
* En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
* Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
* Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
* Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis.
- Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
The translation didn't fare that well though:
[00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
[00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
[00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
[00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
[00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
[00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
[00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen Ås.
For reference, here's Google Translate's attempt, which is pretty good:
* A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
* Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
* Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
* Many electricity companies are positive about offering customers a fixed price for electricity - and for years.
- Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
This is Wednesday's Dagsnytt 18 - my name is Espen Aas.

[1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)
We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.
> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.
> https://opensource.org/osd
If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.
BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?
Yes. It's just like calling the release of compiled closed binary blobs 'open source' even when the source for reproducing the compiled output is unavailable.
> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Precisely. These 'users' lifting the model can't do it themselves. You will still be contacting OpenAI for support or to add support for another language and they will be the ones able to modify the model.
> Just don't call it open source.
That is true; it is still closed source, and already we are seeing the hype squad apologising for OpenAI as they 'open sourced' a closed model that you can't modify yourself.
OpenAI is still business as usual and nothing has changed.
You can do a lot with weights and no training data - for example you can pull the end layer off it and use it as a feature extractor.
And to modify it for Japanese speakers you'd fine-tune the existing model on additional data. If you wanted to change the model itself you can (sometimes, depending on what you want to do) modify an existing architecture by removing layers, adding replacements, and fine-tuning.
I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data, because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.
> The source code must be the preferred form in which a programmer would modify the program.
As a machine learning programmer I'd much prefer the weights to the raw data. It's not realistic for me to use that training data in any way with any compute I have access to.
Like every model I've seen there is something like this:
>>A decoder is trained to predict the corresponding text...
Prediction of expected text in the context of the previous text.
While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.
From personal experience, having given a deposition with an "AI" transcription, it will literally reverse the meanings of sentences.
This is because it produces the EXPECTED output in a context, and NOT THE ACTUAL OUTPUT.
Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript LOOKS really good.
Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively work to substitute 'expected' speech to overwrite 'surprising' speech.
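That surprisal framing can be made concrete: an outcome with probability p carries -log2(p) bits of information, so the rare ("surprising") word is exactly the one carrying the most information - and the one a predictive prior is most tempted to overwrite. A small illustration (the word probabilities here are made-up numbers for the sake of the example):

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

# Hypothetical next-word probabilities under a language-model prior:
p_expected = 0.20     # the word the model expects, e.g. "did"
p_surprising = 0.002  # the meaning-reversing word, e.g. "didn't"

print(surprisal_bits(p_expected))    # ~2.3 bits
print(surprisal_bits(p_surprising))  # ~9.0 bits
```

A decoder biased toward likely continuations is, by construction, biased toward discarding exactly the high-information word.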
The transcript I got was utter trash, multiple pages of errata I had to submit when the normal is a couple of lines. And as I said, some literally reversed the meaning in a consequential way, and yet completely silently.
This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.
I've been saying this for years. Current "AI" algorithms are fundamentally flawed because they rely on a statistical approach. This works moderately well for some use cases, but it will rarely give you 100% confidence.
Good luck with self-flying planes or self-running nuclear power plants.
Can this be used as a real-time transcription or is it too slow for that?
Curious what anyone is using these days for a real-time transcription. It doesn't have to be perfect, but just good enough.
My kids watch some YouTube videos where people make a mod that converts their speech to text, then looks for keywords and spawns a boss in Terraria if you say the wrong keyword, etc.
I made a clone of that with the .NET System.Speech.Recognition library. It... works... but my biggest problems are that (1) it waits until you are done speaking before translating to text in the callback, so there was too much of a delay for it to be fun - the point is that it should be checking a stream of chatter - and (2) the recognition is pretty crap. It's nearly good enough for my silly purpose, but it's still pretty bad.
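The keyword-trigger half of that mod reduces to scanning each partial transcript as it arrives. A sketch in Python (the transcript stream and the callback are placeholders for whatever recognizer you use; with Whisper you'd feed in the text of each transcribed chunk):

```python
KEYWORDS = {"boss", "summon", "moon"}

def watch_stream(transcripts, on_keyword):
    """Scan a stream of (partial) transcripts and fire a callback
    the first time each keyword is heard."""
    seen = set()
    for text in transcripts:
        for word in text.lower().split():
            if word in KEYWORDS and word not in seen:
                seen.add(word)
                on_keyword(word)  # e.g. spawn the boss
    return seen
```

The latency then comes entirely from how often the recognizer emits partial text, not from this loop.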
That example at the top of the page (speed talking) blew me away. He started talking, I was stunned for a minute, then realised yes, it really was English, and I just burst out laughing.
That's so, so far beyond the previous state-of-the-art, it's absurd.
How is it Apple, Google, or Microsoft are not further ahead of the game on speech recognition like this? They have the resources to hire the best ML researchers and throw tons of computing hours at it, yet Siri, Google, and Cortana continue to struggle to get anywhere near this level of comprehension.
Siri and Cortana have to run at least in real time, with reasonable compute resources. Probably faster than real time when the audio gets shipped off to the cloud and transcribed there. This model can't do that (in the "large" version, which the examples use).
Also, you are comparing Whisper's highlight reel with everyday performance of other models. Nobody shows their weaknesses in their highlight reel.
This AI has a 30 second delay on the audio processing because it needs to be able to "look into the future" to get these good results. That 30s delay would be unacceptable for Siri/Google/Cortana.
Okay this is super impressive. I just downloaded Whisper and fed it a random flac file I had handy and it did a really good job. Also impressive that it works on my weak CPU:
A 3m07s flac took 5m to transcribe:
$ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: korean
[00:00.000 --> 00:10.000] Blackpink
[00:11.000 --> 00:14.000] Kick in the door, wave in the coco
[00:14.000 --> 00:16.000] 팝콘이는 친게 껴들 생각 말고
[00:16.000 --> 00:19.000] I talk to talk, run ways I walk walk
[00:19.000 --> 00:21.000] 힘 감고 팝 팝 안 봐도 척
[00:21.000 --> 00:24.000] By one and two by two
[00:24.000 --> 00:26.000] 내 손끝 두 하나에 타면 아지은 중
[00:26.000 --> 00:30.000] 갓 자쇼 지금 화려해 T makes no sense
[00:30.000 --> 00:32.000] You couldn't get a dollar out of me
[00:33.000 --> 00:38.000] 자 오늘 밤이야 눈톱을 품고
[00:38.000 --> 00:41.000] 미혼을 뺏음 down
[00:41.000 --> 00:43.000] Look what you made us do
[00:43.000 --> 00:47.000] 천천히 널 잠재울 파이어
[00:48.000 --> 00:52.000] 잠이 날 만큼 아름다워
[00:52.000 --> 00:53.000] I bring the pain like
[00:53.000 --> 00:57.000] 디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
[00:57.000 --> 00:58.000] Get em, get em, get em
[00:58.000 --> 01:00.000] Straight till you don't like
[01:00.000 --> 01:01.000] Whoa, whoa, whoa
[01:01.000 --> 01:03.000] Straight till you don't like
[01:03.000 --> 01:04.000] Ah, ah, ah
[01:04.000 --> 01:05.000] Taste that, pink venom
[01:05.000 --> 01:06.000] Taste that, pink venom
[01:06.000 --> 01:08.000] Taste that, pink venom
[01:08.000 --> 01:09.000] Get em, get em, get em
[01:09.000 --> 01:11.000] Straight till you don't like
[01:11.000 --> 01:12.000] Whoa, whoa, whoa
[01:12.000 --> 01:13.000] Straight till you don't like
[01:13.000 --> 01:14.000] Ah, ah, ah
[01:14.000 --> 01:15.000] Blackpink and Amo
[01:15.000 --> 01:17.000] Got it by the smack ram
[01:17.000 --> 01:18.000] But rest in peace
[01:18.000 --> 01:19.000] Please light up a candle
[01:19.000 --> 01:20.000] This the knife of a vando
[01:20.000 --> 01:22.000] Messed up and I'm still in saline
…SNIP…
> About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
That's intriguing. You can just set the model to transcribe everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this might make voice interfaces much more accessible without much work.
Perhaps it will encourage people to add voice commands to their apps, which can then be sent to GPT-3.
From what I can gather:
1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.
2. Includes code: https://github.com/openai/whisper
3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE
Took "14 sperm whales wash up on a beach in Australia (September 21, 2022)" https://www.youtube.com/watch?v=bZkNIzeRBk4

Extracted the audio with youtube-dl -f bestaudio "https://www.youtube.com/watch?v=bZkNIzeRBk4"

Converted into:

[00:00.000 --> 00:13.000] オーストラリア南部の島で、真っ向くじら14棟が海岸に打ち上げられて死んでいるのが見つかり、専門家が調査のため原地入りしました。
[00:13.000 --> 00:25.000] 原地メディアによりますと、オーストラリア南部のキング棟で、19日、少なくとも14棟の真っ向くじらが海岸に打ち上げられて死んでいるのが見つかりました。
[00:25.000 --> 00:31.000] ほとんどが若いオーストを見られ、専門家が現場に重むき調査に当たっています。
[00:31.000 --> 00:41.000] くじらの死害は大きく運んだり埋めたりすることが難しいため、自然に分解されるのを待つ方針が検討されています。
[00:41.000 --> 00:52.000] また、死害を狙い、サメが海に集まる可能性があるとして、原地東局はサーファーなどに周囲に近づかないように呼びかけています。
[00:52.000 --> 01:02.000] 一方、21日にはタスマニア棟でおよそ230棟のくじらが浜辺に打ち上げられた状態で見つかりました。
[01:02.000 --> 01:07.000] およそ半数がまだ生きている模様で急助活動が進められています。
[01:07.000 --> 01:23.000] 見つかったのは、ゴンドーくじらの仲間と見られています。
Here are the exact steps to follow to get it running on Ubuntu 22.04 via WSL and yt-dlp:
Note: the large model will download a ~3 GB file.
https://salsa.debian.org/deeplearning-team/ml-policy
As for speed, to a computer we don't talk very fast - not even that guy.
I wonder if it could handle "Rap God" by Eminem... let's find out!
To be able to give it text and hear the speech - a TTS (text-to-speech).
As a language learner, the ability to create my own sentences (based on existing ones I have, changing a word here or there) would be amazing.
I wonder how long till we have this. I know I could use a cloud service to do this currently, but I'd prefer something running locally.
Hopefully someone in the OpenAI team reads this. :)