
Mozilla Common Voice Dataset: More data, more languages

379 points | dabinat | 5 years ago | discourse.mozilla.org | reply

41 comments

[+] echelon|5 years ago|reply
Data in ML is critical, and this release from Mozilla is absolute gold for voice research.

This dataset will help the many independent deep learning practitioners like me who aren't working at FAANG and have only had access to datasets such as LJS [1], or to self-constructed datasets cobbled together and manually transcribed.

Despite the limited materials available, there's already some truly amazing stuff being created. We've seen a lot of visually creative work being produced in the past few years, but the artistic community is only getting started with voice and sound.

https://www.youtube.com/watch?v=3qR8I5zlMHs

https://www.youtube.com/watch?v=L69gMxdvpUM

Another really cool thing popping up is TTS systems trained on non-English speakers reading English corpora. I've heard Angela Merkel reciting copypastas, and it's quite amazing.

I've personally been dabbling in TTS as one of my "pandemic side projects" and found it to be quite fun and rewarding:

https://trumped.com

https://vo.codes

Besides TTS, one of the areas I think this dataset will really help with is Voice Conversion (VC). It'll be awesome to join Discord or TeamSpeak and talk in the voice of Gollum or Rick Sanchez. The VC field needs more data to perfect non-aligned training (where the source and target speakers aren't reciting the same temporally aligned training text), and this will be extremely helpful.

I think the future possibilities for ML techniques in art and media are nearly limitless. It's truly an exciting frontier to watch rapidly evolve and to participate in.

[1] https://keithito.com/LJ-Speech-Dataset/

[+] indogooner|5 years ago|reply
Curious to know why researchers don't use audiobooks/videos and their transcripts when data is not available. Is it because these don't capture different dialects/accents?
[+] cptwunderlich|5 years ago|reply
Oh man, I'm really interested in TTS (for rarer languages). Do you have any pointers or good resources to share?
[+] lunixbochs|5 years ago|reply
This is great! I’m always excited to see new common voice releases.

As someone actively using the data, I wish I could more easily see (and download lists for?) the older releases, as there have been 3-4 dataset updates for English now. If we don’t have access to versioned datasets, there’s no way to reproduce old whitepapers or models that use Common Voice. And at this point I don’t remember the statistics (hours, accent/gender breakdown) for each release. It would be neat to see that over time on the website.
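
In the meantime I’ve been snapshotting the stats myself whenever I grab a release. A rough sketch (the path and column names assume the Common Voice TSV layout of this era, so treat them as placeholders):

    import csv
    from collections import Counter

    # Tally the gender/accent breakdown for one release so it can be
    # compared against later releases. Path and columns are placeholders.
    genders, accents = Counter(), Counter()
    with open("cv-corpus/en/validated.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            genders[row.get("gender") or "unknown"] += 1
            accents[row.get("accent") or "unknown"] += 1
    print(genders.most_common())
    print(accents.most_common(5))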

I’m glad they’re working on single word recognition! This is something I’ve put significant effort into. It’s the biggest gap I’ve found in the existing public datasets - listening to someone read an audiobook or recite a sentence doesn’t seem to prepare the model very well for recognizing single words in isolation.

I’ve adapted my model and training process for that, though I’m still not sure of the best way to balance that sort of training. I have maybe 5 examples of each English word in isolation but 5000 examples of each number (Speech Commands), and it seems like the model prefers e.g. “eight” over “ace”, I guess due to the training imbalance.

Maybe I should be randomly sampling 50/5000 of the imbalanced words each epoch so the model still has a chance to learn from them without overtraining?
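
Something like this per-epoch subsample is what I have in mind (a rough sketch; the cap of 50 and the data layout are just placeholders):

    import random

    def balanced_epoch(examples_by_word, cap=50):
        """Build one epoch's example list, capping overrepresented words.

        examples_by_word maps each word to its list of clips. Words with
        more than `cap` clips contribute a fresh random subset each epoch,
        so all clips are eventually seen without any word dominating.
        """
        epoch = []
        for word, clips in examples_by_word.items():
            epoch.extend(random.sample(clips, cap) if len(clips) > cap else clips)
        random.shuffle(epoch)
        return epoch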

[+] scribu|5 years ago|reply
What if you first trained a classifier that told you whether the utterance is a single word vs. multiple words? Then, based on that prediction, you would use one of two separate models.

The technique you're thinking of is called oversampling, and there are many other general techniques for dealing with imbalanced datasets; it's a very common situation.
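
For instance, in PyTorch a WeightedRandomSampler oversamples rare classes by weighting each example inversely to its class count (a minimal sketch with made-up counts):

    import torch
    from torch.utils.data import WeightedRandomSampler

    # counts[i] = number of clips for the word spoken in example i (made up),
    # so rare words like "ace" get proportionally higher sampling weight.
    counts = torch.tensor([5000.0, 5000.0, 5.0, 5.0, 5.0])
    sampler = WeightedRandomSampler(1.0 / counts, num_samples=len(counts),
                                    replacement=True)
    # DataLoader(dataset, batch_size=32, sampler=sampler) then draws
    # "eight" and "ace" examples at roughly equal rates.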

[+] jointpdf|5 years ago|reply
Does this dataset include people with voice or speech disorders (or other disabilities)? I don’t see any mention of it in this announcement or the forums, though I haven’t looked thoroughly (yet).

Examples: dysphonias of various kinds, dysarthria (e.g. from ALS / cerebral palsy), vocal fold atrophy, stuttering, people with laryngectomies / voice prosthesis, and many more.

Altogether, this represents millions of people for whom current speech recognition systems do not work well. This is an especially tragic situation, since people with disabilities depend more heavily on assistive technologies like ASR. Data/ML bias is rightfully a hot topic lately, so I feel that the voices of people w/ disabilities need to be amplified as well (npi).

[+] daanzu|5 years ago|reply
Gathering, collecting, and publishing such a dataset would be great, and would certainly improve the baseline speech recognition for people with disordered speech, but it can only help so much without personalizing to a specific individual. This is true for anybody, but more so for disordered speech. This is an area where I think "generic" solutions will inevitably struggle, even if they are somewhat specialized for "generic" dysarthric speech.

However, this means that the gains to be had from personalized training are greater for disordered speech than for "average" speech. I develop kaldi-active-grammar [0], which specializes the Kaldi speech recognition engine for real-time command & control with many complex grammars. I am also working on making it easier to train personalized speech models, and to fine tune generic models with training for an individual. I have posted basic numbers on some small experiments [1]. Such personalized training can be time consuming (depending on how far one wants to take it), but as my parent comment says, disabled people may need to rely more on ASR, which means they have that much more to gain by investing the time for training.

Nevertheless, a Common Voice disordered speech dataset would be quite helpful, both for research, and for pre-training models that can still be personalized with further training. It is good to see (in my sibling comment) that it is being discussed.

[0] https://github.com/daanzu/kaldi-active-grammar

[1] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...

[+] dabinat|5 years ago|reply
It’s not in the current dataset, but offering such a disordered speech dataset has been discussed. I imagine it’s something that will probably be offered at some point in the future.
[+] totetsu|5 years ago|reply
I have heard a few people with speech disorders when validating clips. I also recall some discussion of it in the Discord or issue tracker. At the moment it is entirely up to the community to encourage people with voice or speech disorders to submit; as long as they meet the validation criteria, they will be included. I can't see a flag in the user profile for recording a disorder, so it's not likely you can filter just these recordings out of the data.
[+] sagz|5 years ago|reply
There's g.co/euphonia for those projects
[+] intopieces|5 years ago|reply
I would love to work for Mozilla on this effort full time. I have experience in voice data collection / annotation / processing at 2 FAANG companies. Anyone have an in? I'm thinking of reaching out to the person who wrote this post directly.
[+] ta17711771|5 years ago|reply
Share some cool shit in their Matrix (rooms), or start a convo/give a recommendation that leads to mentioning your experience/work?
[+] jcims|5 years ago|reply
How long do you think it will be before we have personalized language/reading coaches talking to us during our morning commute to the downstairs office?
[+] codezero|5 years ago|reply
Sounds like a good move, you may check other Mozilla threads for employees open to chatting - I did that when I was curious about a position there.
[+] Polylactic_acid|5 years ago|reply
Why on earth are they using mp3 for the dataset? It's absolutely ancient and probably the worst choice possible. Opus is widely used for voice because it gets flawless results at minuscule bitrates. And don't tell me it's because users find mp3 simpler, because if you're doing machine learning I expect you to know how to use an audio file.
[+] lunixbochs|5 years ago|reply
Probably because they're uploading (and playing back) from a webpage, and Web Audio is weird and inconsistent, so sticking to a built-in codec is probably more reliable. As someone who trains on their data, it seems usable anyway: training on 1000 hours of Common Voice makes my model better in very clear ways.

https://caniuse.com/#search=mp3

https://caniuse.com/#search=opus

I got flac working for speech.talonvoice.com with an asm codec so they could do whatever in theory, but I do get some audio artifacts sometimes.
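
Either way, the mp3s only need to be transcoded once before training. Something like this works (assuming ffmpeg is installed; paths are placeholders):

    import pathlib
    import subprocess

    # Transcode Common Voice mp3 clips to 16 kHz mono wav up front so the
    # training pipeline never touches mp3 again. Paths are placeholders.
    for mp3 in pathlib.Path("cv-corpus/en/clips").glob("*.mp3"):
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-y", "-i", str(mp3),
             "-ar", "16000", "-ac", "1", str(mp3.with_suffix(".wav"))],
            check=True,
        )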

[+] pjfin123|5 years ago|reply
They make it really easy to contribute! You don't need to make an account (you can, though) and you read/review short sentences. I just added 75 recordings and it only took ~30 minutes. Also, if you speak other languages you can contribute in them. It would really be great if there were a comprehensive public voice dataset for people to do all sorts of interesting things with.
[+] j45|5 years ago|reply
This is really encouraging to see. So nice to see languages that have more speakers than the most commonly translated languages.
[+] stergro|5 years ago|reply
The whole project is very exciting. I hope this is really a game changer that enables private individuals and startups to create new neural networks without a big investment in data collection.

I worked on the Esperanto dataset of Common Voice over the last year, and we have now collected over 80 hours of Esperanto. I hope that in a year or two we'll have collected enough data to create the first usable neural network for a constructed language, and maybe the first voice assistant in Esperanto too. I will train a first experimental model with this release soon.

[+] user764743|5 years ago|reply
This is interesting. As someone who always has tons of interview data to transcribe for academic research, what STT systems should I be looking into to help save some time? Is DeepSpeech suited to this use?
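
From what I can tell, batch transcription with DeepSpeech's Python API would look roughly like this (the model/scorer filenames are from a release of that era and purely illustrative):

    import wave
    import numpy as np
    import deepspeech

    # Load an acoustic model plus external scorer (filenames illustrative).
    model = deepspeech.Model("deepspeech-0.7.4-models.pbmm")
    model.enableExternalScorer("deepspeech-0.7.4-models.scorer")

    # DeepSpeech expects 16-bit, 16 kHz, mono PCM audio.
    with wave.open("interview.wav", "rb") as f:
        audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

    print(model.stt(audio))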
[+] villgax|5 years ago|reply
Nice, now we need CTC-based models to run offline on low-powered devices, and then pretty much all speech-to-text APIs are done for.
[+] lunixbochs|5 years ago|reply
I've been working on this. I think I can reliably hit the quality ballpark of STT APIs at the acoustic model level, but not yet at the language model level (word probabilities) in a low-powered way.

Also, non-English models are _way_ behind still.
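
For context, the acoustic model half is the easier part to run offline: a CTC model emits per-frame label probabilities, and a greedy decode just collapses repeats and drops blanks, as in the toy sketch below. The hard part is the language-model beam search you'd layer on top to pick likelier word sequences.

    import numpy as np

    BLANK = 0
    ALPHABET = "_abcdefghijklmnopqrstuvwxyz "  # index 0 is the CTC blank

    def ctc_greedy_decode(logits):
        """Greedy CTC decode of a (time, num_labels) score matrix:
        take the argmax per frame, collapse repeated labels, drop blanks.
        A real system would run a beam search with a language model here.
        """
        out, prev = [], None
        for label in logits.argmax(axis=1):
            if label != prev and label != BLANK:
                out.append(ALPHABET[label])
            prev = label
        return "".join(out)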