Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.
The 4 examples are stunningly good (speakers with heavy accents, speech in foreign languages, dynamic background noise, etc.) - this is far and away better than anything else I've seen. I'll be super curious to see other folks try it out and find out whether it's as robust as it seems, including when confronted with speech full of natural tics and uhhh's and uhmm's and everything in between.
I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's; what the implications of this are, I'm not sure.
It was already better. I edit a podcast and have > a decade of pro audio editing experience in the film industry, and I was already using a commercial AI transcription service to render the content to text and sometimes edit it as such (outputting edited audio).
Existing (and affordable) offerings are so good that they can cope with shitty recordings off a phone speaker and maintain ~97% accuracy over hour-long conversations. I'm sure it's been an absolute godsend for law enforcement and other people who need to gather poor-quality audio at scale, though much less great for the targets of repressive authority.
Having this fully open is a big deal though - now that level of transcription ability can be wrapped as an audio plugin and just used wherever. Given the parallel advances in resynthesis and understanding idiomatic speech, in a year or two I probably won't need to cut out all those uuh like um y'know by hand ever again, and every recording can be given a noise-reduction bath and come out sounding like it was recorded in a room full of soft furniture.
The French version is a little contrived. The speaker is a native speaker, but the text is obviously the result of a translation from English to French, not idiomatic French.
I will try to put the code to the test, see how it goes.
More of this is welcome; they should live up to their name and original purpose and share other models (code, weights, datasets) with the open source community as well.
It seems far from good with mixed-language content, especially English and Japanese together. The timestamps are well off, and for the more ambiguous translations that depend on the context of a word, it's nowhere close to human - far below what anyone who spoke either language would consider acceptable. Maybe it's unfair to use music, but music is the most realistic test of whether it's superior to the average human.
This is an astonishing package. Every AI voice-to-text model I've tried on "The Wire"'s famous "fuck" scene [0] has failed, because the YouTube clip's audio quality is bad and it's a scene with virtually no dialogue except breathing and "Fuck". But Whisper returned impressive results [1]

[0] https://www.youtube.com/watch?v=DS6pE88Xg3s
Hey this looks great! I like to record audio notes while driving in my car after work, to kind of decompress my thoughts from the day. But I never go back and listen as they can be long and meandering. Sometimes in the audio log I will sum up my thoughts, but this might be 20 minutes in and hard to find. I really wish I had transcriptions so I could easily scan the full contents. I have tried Mozilla DeepSpeech (I don't want a cloud solution) and I was surprised to find that I could not get DeepSpeech to reliably transcribe them. There is a bit of road noise, though I think for a human listener they are easy to understand. It looks like this one might actually do the trick!
EDIT: Tried it and it worked great! It is very easy to use: I just ran the one pip install line in the readme and was ready to go, then ran the program as "whisper my_audio.wav" and it went. Really nice job OpenAI!
I suspect Whisper is more robust than other "SOTA" models, but this release is likely leaving a fair bit of accuracy on the table considering the amount of resources OpenAI is capable of throwing at training it.
Comparing the readily available test sets from the paper to some of my personal robust models (for the Talon models, this is greedy decoding, no language model):
                     Talon   Talon   Talon   Whisper   wav2vec 2.0
                     28M     300M    1B      Large     960h
  librispeech clean  3.21    2.52    2.40    2.7       2.7
  librispeech other  8.21    6.56    5.63    5.6       6.2
  common voice       13.88   11.65   8.86    9.5       29.9
  tedlium            7.51    6.55    5.47    4.0       10.5
I have a battery of more difficult tests on hand (including adversarial tests, and diverse accent-specific metrics). I'll look at running these tests on each of the Whisper model sizes and following up with a larger comparison.
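For context, the numbers in the table are word error rates (WER): the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal pure-Python implementation, for anyone wanting to score their own test sets:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program, over words rather than characters.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)
```

For example, wer("the cat sat", "the cat sat down") is 1/3: one inserted word against three reference words.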
One of the things they point out is that the SoTA on e.g. LibriSpeech is only good at LibriSpeech, and doesn't generalise as well.
> Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models.
Just tested this on some developer podcasts which usually fail hard given they're full of technical jargon, brand names, etc. Whisper is a revolution! It's picking up terms like Heroku, DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly - something nothing else did unless you provided a whole pile of guiding vocabulary.
Hold on, it does not only speech recognition, but also language translation, in the same model?
What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?
It just seems so odd, given the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of the problem domain). Seems so unusual to have both handled by one model!
Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.
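Per the paper, the answer is that a single decoder is conditioned on special control tokens selecting the language and the task, so transcription and translation share all of the learned acoustics. A toy pure-Python sketch of that idea (the token names mimic Whisper's prompt format; the decode function and phrase table are made up purely for illustration):

```python
# Fake demo data: stands in for what a real translation decoder would learn.
PHRASES = {"hola mundo": "hello world"}

def decode(recognized: str, language: str, task: str) -> str:
    """One 'decoder' whose behavior is switched by a task token,
    the way Whisper selects <|transcribe|> vs <|translate|>."""
    prompt = f"<|startoftranscript|><|{language}|><|{task}|>"
    if task == "transcribe":
        return prompt + recognized           # keep the source language
    if task == "translate":
        return prompt + PHRASES[recognized]  # emit English instead
    raise ValueError(f"unknown task: {task}")
```

The point of the sketch: nothing about the input audio changes between the two tasks, only the conditioning token, which is why the shared acoustic knowledge transfers.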
If you want to give it a shot, you can find the Python script in this repo: https://github.com/tobiashuttinger/openai-whisper-realtime

A bit more context on how it works:
The system's default audio input is captured with Python, split into small chunks, and then fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and avoids splitting the audio buffer in those cases. Given how the model is designed this isn't the most natural approach, but I found it worth trying. It works acceptably well.
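The word-break heuristic amounts to only cutting the stream where the signal energy dips. A minimal pure-Python sketch of that idea (the window size and threshold are made-up values, and real code would operate on a live microphone buffer rather than a list of samples):

```python
def split_at_silence(samples, window=4, threshold=0.1):
    """Split a sample stream into chunks, cutting only inside
    low-energy windows so words are less likely to be bisected."""
    chunks, current = [], []
    for i in range(0, len(samples), window):
        win = samples[i:i + window]
        current.extend(win)
        # Mean absolute amplitude as a cheap energy estimate.
        energy = sum(abs(s) for s in win) / len(win)
        if energy < threshold and current:
            chunks.append(current)  # end the chunk at a quiet point
            current = []
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be handed to the transcription function on its own.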
This really makes me want to build an Amazon Echo/Google Nest/etc replacement that's open hardware, open source, and most importantly recognises voice completely offline. I find that I don't use these smart devices for much more than setting timers anyway, so this seems like an easy project.
I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.
[00:00.000 --> 00:06.500] Since the last one started, the number of times I've eaten has decreased.
[00:06.500 --> 00:11.000] If I get too carried away with the last one, I'll get hungry and do it.
[00:11.000 --> 00:14.500] I don't have time to eat.
[00:15.500 --> 00:18.000] I'm going to eat now.
[00:20.000 --> 00:23.000] It's going to take about 10 minutes from here.
[00:23.000 --> 00:31.000] It's been a while since I've had my last meal.
[00:31.000 --> 00:36.000] I feel like I'm losing my女子力.
[00:36.000 --> 00:39.000] I have to go back to my original self.
[00:39.000 --> 00:44.000] I have to get ready and go to bed.
[00:44.000 --> 00:46.000] It's not good.
[00:46.000 --> 00:51.000] I've been drinking a lot lately, so I'm going home.
[00:51.000 --> 00:53.000] I have to get my nails done this fall.
[00:53.000 --> 00:54.000] Halloween nails.
[00:54.000 --> 00:57.000] Halloween, Halloween, Halloween.
[00:57.000 --> 00:59.000] I'm going to the beauty salon today.
[00:59.000 --> 01:02.000] I'm going to get my nails done the day after tomorrow.
[01:02.000 --> 01:10.000] I used to look at a lot of clothes, but I stopped looking at them.
[01:10.000 --> 01:12.000] I'm going crazy.
[01:12.000 --> 01:22.000] My stomach's stopped in the middle of summer.
It's struggling with Norwegian. Which I guess isn't shocking. The large model performs a fair bit better than the small, though neither is "good".
Though I assume the amount of Norwegian it has been exposed to is fairly limited, so in that light I'm actually impressed as well.
I tried it on a news segment from the radio[1], this is the large model output:
[00:14.000 --> 00:17.200] En skamløs krenking av FN pakten.
[00:17.200 --> 00:24.000] USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
[00:25.500 --> 00:29.400] Arbeidsklær som er ment til å være til begge kjønn, har det med å være tilpasset.
[00:29.400 --> 00:33.400] Men hvordan ville det gått, om det var motsatt?
[00:34.100 --> 00:38.900] Dyrevernsorganisasjon vil ha digital merking av regnstyr,
[00:38.900 --> 00:44.900] men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
[00:45.600 --> 00:51.400] Mange strømselskaper er positive til å tilby kundene fastpris på strøm, og det årevis.
[00:51.400 --> 00:59.900] Da risikerer de å måtte betale mye i nettopp åretsvis, sier aktører som aldri tilbyr fastpris.
[00:59.900 --> 01:21.900] Dette er onsdagens Dagsnytten. Jeg heter Espen Ås.
For reference, here's what he actually said, from the source[1] itself:
* En skamløs krenking av FN-pakten. USAs president og verdensledere svarer på den russiske presidentens atomtrusler og krigsmobilisering.
* Arbeidsklær som er ment å være til begge kjønn, er som regel tilpasset ... menn. Hvordan hadde det gått om det var motsatt?
* Dyrevernsoganisasjon vil ha digital merking av reinsdyr, men næringen selv insisterer på den gamle tradisjonsrike måten med rissing av kniv.
* Mange strømselskaper er positive til å tilby kundene fastpris på strøm - og det i årevis.
- Da risikerer de å måtte betale mye i nettopp; årevis, sier aktør som aldri tilbyr fastpris
Dette er onsdagens Dagsnytt 18 - jeg heter Espen Aas.
The translation didn't fare that well though:
[00:14.000 --> 00:17.000] A shameless violation of the UN treaty.
[00:17.000 --> 00:24.000] The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
[00:24.000 --> 00:33.000] Work clothes that are meant to be for both genders have to be suitable, but how would it be if it was the other way around?
[00:34.000 --> 00:44.000] The animal welfare organization will have a digital marking of reindeer, but the industry itself insists on the old traditional way of tearing a knife.
[00:45.000 --> 00:51.000] Many electricity companies are positive in offering customers fixed electricity prices, and that is annual.
[00:51.000 --> 00:58.000] Then they risk having to pay a lot in just a year, says an actor who has never offered fixed prices.
[00:58.000 --> 01:20.000] This is Wednesday's Dagsnytt 18. My name is Espen Ås.
For reference, here's Google Translate's attempt, which is pretty good:
* A shameless violation of the UN Charter. The US president and world leaders respond to the Russian president's nuclear threats and war mobilization.
* Work clothes intended for both sexes are usually adapted to ... men. How would it have gone if it had been the other way around?
* Animal welfare organizations want digital marking of reindeer, but the industry itself insists on the old, traditional way of marking with a knife.
* Many electricity companies are positive about offering customers a fixed price for electricity - and for years.
- Then they risk having to pay a lot in precisely; for years, says a player who never offers a fixed price
This is Wednesday's Dagsnytt 18 - my name is Espen Aas.

[1]: https://radio.nrk.no/podkast/dagsnytt_atten/l_5ce3e323-97a3-... (not sure if it's available outside of Norway)
We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.
> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.
> https://opensource.org/osd
If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.
BTW, wouldn't you take the existing model and do additional Hokkaido Japanese speaker training on top of it, rather than retraining the model from scratch?
Yes. It's just like calling the release of compiled closed binary blobs 'open source' even when the source for reproducing the compiled output is unavailable.
> If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model's source code would include the 680,000 hours of audio used to train the model.
Precisely. These 'users' lifting the model can't do it themselves. You will still be contacting OpenAI for support or to add support for another language and they will be the ones able to modify the model.
> Just don't call it open source.
That is true; it is still closed source, and already we are seeing the hype squad apologising for OpenAI as they 'open sourced' a closed model that you can't modify yourself.
OpenAI is still business as usual and nothing has changed.
You can do a lot with weights and no training data - for example you can pull the end layer off it and use it as a feature extractor.
And to modify it for Japanese speakers you'd fine-tune the existing model on additional data. If you wanted to change the model itself you can (sometimes, depending on what you want to do) modify an existing architecture by removing layers, adding replacements, and fine-tuning.
I don't quite know what the right analogy for trained weights is. In many ways they are more valuable than the training data, because the compute needed to generate them is significant. In other ways it is nice to be able to inspect the data.
> The source code must be the preferred form in which a programmer would modify the program.
As a machine learning programmer I'd much prefer the weights to the raw data. It's not realistic for me to use that training data in any way with any compute I have access to.
Like every model I've seen there is something like this:
>>A decoder is trained to predict the corresponding text...
Prediction of expected text in the context of the previous text.
While this is valuable in casual transcription, it can be extremely dangerous in serious contexts.
From personal experience, having given a deposition with an "AI" transcription, it will literally reverse the meanings of sentences.
This is because it produces the EXPECTED output in a context, and NOT THE ACTUAL OUTPUT.
Like a speaker that clips the output, these types of systems 'clip' the really valuable information out of a transcription. Worse yet, this is a completely silent failure, as the transcript LOOKS really good.
Basic info theory shows that there is more information contained in 'surprising' chunks of data than in expected ones. These systems actively work to substitute 'expected' speech to overwrite 'surprising' speech.
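That surprisal framing can be made concrete: an outcome with probability p carries -log2(p) bits of information, so the rare ("surprising") word is exactly the one carrying the most information - and the one a predictive prior is most tempted to overwrite. A small illustration (the word probabilities here are made-up numbers for the sake of the example):

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

# Hypothetical next-word probabilities under a language-model prior:
p_expected = 0.20     # the word the model expects, e.g. "did"
p_surprising = 0.002  # the meaning-reversing word, e.g. "didn't"

print(surprisal_bits(p_expected))    # ~2.3 bits
print(surprisal_bits(p_surprising))  # ~9.0 bits
```

A decoder biased toward likely continuations is, by construction, biased toward discarding exactly the high-information word.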
The transcript I got was utter trash, multiple pages of errata I had to submit when the normal is a couple of lines. And as I said, some literally reversed the meaning in a consequential way, and yet completely silently.
This kind of silent active failure mode is terrifying. Unless it is solved, and I see no way to solve it without removing ALL predictive algos from the system, these types of systems must not be used in any situation of serious consequence, at least not without real redundancy and backup.
I've been saying this for years. Current "AI" algorithms are fundamentally flawed because they rely on a statistical approach. This works moderately well for some use cases, but it will rarely give you 100% confidence.
Good luck with self-flying planes or self-running nuclear power plants.
Can this be used as a real-time transcription or is it too slow for that?
Curious what anyone is using these days for a real-time transcription. It doesn't have to be perfect, but just good enough.
My kids watch some YouTube videos where people make a mod that converts their speech to text, then looks for keywords and spawns a boss in Terraria if you say the wrong keyword, etc.
I made a clone of that with the .NET System.Speech.Recognition library. It... works... but my biggest problems are that (1) it waits until you are done speaking before translating to text in the callback, so there was too much of a delay for it to be fun - the point is that it should be checking a stream of chatter - and (2) the recognition is pretty crap. It's nearly good enough for my silly purpose, but it's still pretty bad.
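The keyword-trigger half of that mod reduces to scanning each partial transcript as it arrives. A sketch in Python (the transcript stream and the callback are placeholders for whatever recognizer you use; with Whisper you'd feed in the text of each transcribed chunk):

```python
KEYWORDS = {"boss", "summon", "moon"}

def watch_stream(transcripts, on_keyword):
    """Scan a stream of (partial) transcripts and fire a callback
    the first time each keyword is heard."""
    seen = set()
    for text in transcripts:
        for word in text.lower().split():
            if word in KEYWORDS and word not in seen:
                seen.add(word)
                on_keyword(word)  # e.g. spawn the boss
    return seen
```

The latency then comes entirely from how often the recognizer emits partial text, not from this loop.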
That example at the top of the page (speed talking) blew me away. He started talking, I was stunned for a minute, then realised yes, it really was English, and I just burst out laughing.
That's so, so far beyond the previous state-of-the-art, it's absurd.
How is it Apple, Google, or Microsoft are not further ahead of the game on speech recognition like this? They have the resources to hire the best ML researchers and throw tons of computing hours at it, yet Siri, Google, and Cortana continue to struggle to get anywhere near this level of comprehension.
Siri and Cortana have to run at least in real time, with reasonable compute resources. Probably faster than real time when the audio gets shipped off to the cloud and transcribed there. This model can't do that (in the "large" version, which the examples use).
Also, you are comparing Whisper's highlight reel with everyday performance of other models. Nobody shows their weaknesses in their highlight reel.
This AI has a 30 second delay on the audio processing because it needs to be able to "look into the future" to get these good results. That 30s delay would be unacceptable for Siri/Google/Cortana.
Okay this is super impressive. I just downloaded Whisper and fed it a random flac file I had handy and it did a really good job. Also impressive that it works on my weak CPU:
A 3m07s flac took 5m to transcribe:
$ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: korean
[00:00.000 --> 00:10.000] Blackpink
[00:11.000 --> 00:14.000] Kick in the door, wave in the coco
[00:14.000 --> 00:16.000] 팝콘이는 친게 껴들 생각 말고
[00:16.000 --> 00:19.000] I talk to talk, run ways I walk walk
[00:19.000 --> 00:21.000] 힘 감고 팝 팝 안 봐도 척
[00:21.000 --> 00:24.000] By one and two by two
[00:24.000 --> 00:26.000] 내 손끝 두 하나에 타면 아지은 중
[00:26.000 --> 00:30.000] 갓 자쇼 지금 화려해 T makes no sense
[00:30.000 --> 00:32.000] You couldn't get a dollar out of me
[00:33.000 --> 00:38.000] 자 오늘 밤이야 눈톱을 품고
[00:38.000 --> 00:41.000] 미혼을 뺏음 down
[00:41.000 --> 00:43.000] Look what you made us do
[00:43.000 --> 00:47.000] 천천히 널 잠재울 파이어
[00:48.000 --> 00:52.000] 잠이 날 만큼 아름다워
[00:52.000 --> 00:53.000] I bring the pain like
[00:53.000 --> 00:57.000] 디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
[00:57.000 --> 00:58.000] Get em, get em, get em
[00:58.000 --> 01:00.000] Straight till you don't like
[01:00.000 --> 01:01.000] Whoa, whoa, whoa
[01:01.000 --> 01:03.000] Straight till you don't like
[01:03.000 --> 01:04.000] Ah, ah, ah
[01:04.000 --> 01:05.000] Taste that, pink venom
[01:05.000 --> 01:06.000] Taste that, pink venom
[01:06.000 --> 01:08.000] Taste that, pink venom
[01:08.000 --> 01:09.000] Get em, get em, get em
[01:09.000 --> 01:11.000] Straight till you don't like
[01:11.000 --> 01:12.000] Whoa, whoa, whoa
[01:12.000 --> 01:13.000] Straight till you don't like
[01:13.000 --> 01:14.000] Ah, ah, ah
[01:14.000 --> 01:15.000] Blackpink and Amo
[01:15.000 --> 01:17.000] Got it by the smack ram
[01:17.000 --> 01:18.000] But rest in peace
[01:18.000 --> 01:19.000] Please light up a candle
[01:19.000 --> 01:20.000] This the knife of a vando
[01:20.000 --> 01:22.000] Messed up and I'm still in saline
…SNIP…
> About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
That's intriguing. You can just set the model to transcribe everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this might make voice interfaces much more accessible without much work.
Perhaps it will encourage people to add voice commands to their apps, which can then be sent to GPT-3.
From what I can gather:
1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.
2. Includes code: https://github.com/openai/whisper
3. Released under MIT License: https://github.com/openai/whisper/blob/main/LICENSE
Took "14 sperm whales wash up on a beach in Australia (September 21, 2022)" https://www.youtube.com/watch?v=bZkNIzeRBk4

Extracted the audio with youtube-dl -f bestaudio "https://www.youtube.com/watch?v=bZkNIzeRBk4"

Converted into:

[00:00.000 --> 00:13.000] オーストラリア南部の島で、真っ向くじら14棟が海岸に打ち上げられて死んでいるのが見つかり、専門家が調査のため原地入りしました。
[00:13.000 --> 00:25.000] 原地メディアによりますと、オーストラリア南部のキング棟で、19日、少なくとも14棟の真っ向くじらが海岸に打ち上げられて死んでいるのが見つかりました。
[00:25.000 --> 00:31.000] ほとんどが若いオーストを見られ、専門家が現場に重むき調査に当たっています。
[00:31.000 --> 00:41.000] くじらの死害は大きく運んだり埋めたりすることが難しいため、自然に分解されるのを待つ方針が検討されています。
[00:41.000 --> 00:52.000] また、死害を狙い、サメが海に集まる可能性があるとして、原地東局はサーファーなどに周囲に近づかないように呼びかけています。
[00:52.000 --> 01:02.000] 一方、21日にはタスマニア棟でおよそ230棟のくじらが浜辺に打ち上げられた状態で見つかりました。
[01:02.000 --> 01:07.000] およそ半数がまだ生きている模様で急助活動が進められています。
[01:07.000 --> 01:23.000] 見つかったのは、ゴンドーくじらの仲間と見られています。
Here are the exact steps to follow to get it running on Ubuntu 22.04 via WSL and yt-dlp:
Note: the large model will download a ~3 GB file.
https://salsa.debian.org/deeplearning-team/ml-policy
As for speed, to a computer we don't talk very fast - not even that guy.
I wonder if it could handle "Rap God" by Eminem... let's find out!
To be able to give it text and hear the speech - a TTS (text-to-speech).
As a language learner, the ability to create my own sentences (based on existing ones I have, changing a word here or there) would be amazing.
I wonder how long till we have this. I know I could use a cloud service to do this currently, but I'd prefer something running locally.
Hopefully someone in the OpenAI team reads this. :)