
Ask HN: Why don't we use subtitled films/tv to train speech recognition?

30 points | sycren | 14 years ago | reply

There are thousands of films and TV episodes that have subtitles throughout their duration, and millions of sung songs whose lyrics we can find. Would it not be possible to use this material to train speech recognition? This would then make it possible to train on the many different dialects and accents of a particular language.

Speech recognition as a technology has always appeared to move slowly, although with the rise of mobile it is becoming increasingly popular.

Is anyone doing anything like this?

34 comments

[+] tcarnell|14 years ago|reply
I used to work for a company that built speech recognition systems, and I came up with a similar/related idea: take a load of videos of Barack Obama (for example) and create an accurate 'voice print'. Once done, any video or speech could be scanned, and if Barack Obama's voice print was recognized/detected, the recognizer could be tuned to his voice print AND could apply a set of appropriate grammars/vocabulary (for example the 'politics' grammar, or 'American' grammar, or 'economics' grammar). Then you could very accurately perform speech recognition and automatically create text transcriptions. When you google for text, you could actually retrieve videos whose content exactly matches the search terms and jump directly to that part of the video.
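A voice-print lookup like that could be sketched as nearest-neighbour search over per-speaker feature vectors. Everything below (the three-dimensional 'prints', the speaker names, the 0.9 threshold) is invented toy data to show the shape of the idea, not a real speaker-identification system:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(segment, prints, threshold=0.9):
    # Return the best-matching enrolled speaker, or None if no
    # voice print scores above the threshold.
    best_name, best_score = None, threshold
    for name, vec in prints.items():
        score = cosine(segment, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical enrolled voice prints (toy 3-d vectors).
prints = {"obama": [0.9, 0.1, 0.3], "someone_else": [0.1, 0.8, 0.5]}
```

In practice the vectors would be embeddings extracted from audio, and the threshold would be tuned on held-out recordings.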

Over time you could build up a database of voice prints and grammars for not just celebrities and politicians, but also criminals (for automatic identification).

I had this idea almost 4 years ago, submitted it to the company, but it wasn't taken seriously.

If anybody is interested in this, let me know!

[+] molloye|14 years ago|reply
Google is doing something comparable with Google Voice and Search recognition transcriptions, inviting corrections both manually and by using similar techniques to spell correct in text search.

I suspect a lack of data is not the biggest challenge in improving speech recognition.

[+] amirmc|14 years ago|reply
> when you google for text, you could actually retrieve videos whose content exactly matches the search terms and jump directly to that part of the video

The search aspect of this is very interesting and I hadn't thought of it before (though in hindsight it seems like an obvious benefit).


[+] josefresco|14 years ago|reply
Execute the idea for yourself. Good ideas are a dime-a-dozen.
[+] nvictor|14 years ago|reply
i'm interested.
[+] eftpotrm|14 years ago|reply
Aside from issues with background noise on the soundtrack, subtitles are frequently abridged from the spoken word in the interest of space and/or readability, so you'd need to account for that in your algorithm.

If it were me... Project Gutenberg has free books available in both audio and text formats. You may again run into issues with the spoken and written text not matching exactly (it's not something I've looked into), but I wouldn't be surprised if the mismatch was rather less than what I've observed in subtitles, and the data is in a more easily parsed format.

[+] rcthompson|14 years ago|reply
Audio recordings of book readings are less practical than subtitles because they are not synchronized. Every subtitle in a film is associated with the sound clip that plays while it is visible, whereas for an audiobook or similar, any algorithm would have to "align" the audio and text in order to obtain usable training data, and then it would have to deal with the errors introduced by this process.
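That synchronization is what makes subtitles attractive as training data: each cue already carries start/end timestamps pointing at its audio span. A minimal sketch of extracting (start, end, text) pairs from SRT-format subtitles; the subtitle snippet itself is made up:

```python
import re

# SRT cue: index line, "HH:MM:SS,mmm --> HH:MM:SS,mmm", then text lines.
TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text):
    # Return (start_sec, end_sec, spoken_text) training pairs.
    pairs = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        m = TIME.match(lines[1])
        if not m:
            continue
        g = m.groups()
        start = to_seconds(*g[:4])
        end = to_seconds(*g[4:])
        pairs.append((start, end, " ".join(lines[2:])))
    return pairs

srt = """1
00:00:01,500 --> 00:00:03,000
Hello there.

2
00:00:04,250 --> 00:00:06,000
How are you?"""
```

Each pair then points the trainer at one audio clip and its transcript, with no alignment step needed.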
[+] sycren|14 years ago|reply
As 0x12 states further down, noise can be seen as beneficial. With such a huge dataset, perhaps it would be possible to advance speech recognition to transcribe speech in busy places, as needed in mobile applications where the user is not in a quiet room.
[+] killa_bee|14 years ago|reply
I happen to know that they do this at the Linguistics Data Consortium (http://www.ldc.upenn.edu/), at least with cable news shows. They mostly do that to obtain data for lower-resource languages, though, and for the purposes of transcription, not for speech recognition qua engineering research. The real issue, though, is that the research community is interested in increasing the accuracy of recognizers on standard datasets by developing better models, not in increasing accuracy per se. Having used more data isn't publishable. Further, in terms of real gains, the data is sparse (power-law distributed), so we need more than just a constant increase in the amount of data. This issue is general to any machine-learning scenario but is particularly pronounced in anything built on language.
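The power-law point is easy to illustrate numerically: sampling tokens from a Zipf distribution, each doubling of the corpus turns up proportionally fewer new word types. A rough simulation (the vocabulary size and exponent are arbitrary, and this is not taken from the cited papers):

```python
import random

def zipf_sample(n_tokens, vocab=10000, s=1.0, seed=0):
    # Draw tokens from a Zipf(s) distribution over `vocab` ranked types.
    rng = random.Random(seed)
    weights = [1.0 / (r ** s) for r in range(1, vocab + 1)]
    return rng.choices(range(vocab), weights=weights, k=n_tokens)

def types_seen(n_tokens):
    # Number of distinct word types observed in a sample of this size.
    return len(set(zipf_sample(n_tokens)))

# Doubling the data repeatedly yields sub-linear growth in distinct types.
counts = [types_seen(n) for n in (1000, 2000, 4000, 8000)]
```

With 8x the tokens, the sample covers far less than 8x the vocabulary: the head saturates quickly while the long tail stays mostly unseen.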

Some related papers:

Moore, R. K. 'There's no data like more data (but when will enough be enough?)', Proc. Institute of Acoustics Workshop on Innovation in Speech Processing, IoA Proceedings vol. 23, pt. 3, pp. 19-26, Stratford-upon-Avon, 2-3 April 2001.

Yang, Charles. 'Who's afraid of George Kingsley Zipf?' Ms., University of Pennsylvania. http://www.ling.upenn.edu/~ycharles/papers/zipfnew.pdf

[+] hartror|14 years ago|reply
Well I am sure they would do, though subtitles aren't the most reliable source for movie dialog. Often the dialog is altered subtly to fit the space and timing requirements.
[+] fraser|14 years ago|reply
I've had the subtitles turned on for about a year now, and it wouldn't take more than 2 hours of watching broadcast TV with subtitles to realize this isn't a good solution. I've noticed the following:

1. The audio track is censored but the subtitles are not, or vice versa.

2. Actors improvise the audio; the subtitles are based on the script.

3. English translations were done by the cheapest person possible, so there are lots of partial words because they weren't clear and the transcriber didn't understand the context.

4. A recent show (2011) seemed to have a symbol every other character; I'm not sure if this is a double-byte character issue or just a bad translation.

5. Several shows such as American Idol and America's Got Talent display song lyrics, and I would think singing would require changes to the algorithm.

I wish you well with the idea, but now you have a little more information.
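Mismatches like these are why a subtitle corpus would need aggressive filtering before training. A rough sketch of the kind of heuristics one might apply; the specific rules, thresholds, and example cues are invented for illustration:

```python
import re

def usable_cue(text):
    # Heuristic filter: keep only cues likely to match the spoken audio.
    text = text.strip()
    if not text:
        return False
    if re.search(r"\[.*?\]|\(.*?\)", text):   # sound effects: [applause], (laughs)
        return False
    if any(ord(ch) > 127 for ch in text):     # likely mojibake / symbol garbage
        return False
    if text.endswith("-") or "--" in text:    # truncated or partial words
        return False
    return True

cues = ["Hello there.", "[applause]", "I was say--", "Caf\u00e9 \ufffd garbled"]
kept = [c for c in cues if usable_cue(c)]
```

A real pipeline would also cross-check the cue against a recognizer's own output and drop cues where the two disagree too much.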

[+] tintin|14 years ago|reply
And they are translations, not speech-to-text.
[+] sycren|14 years ago|reply
How about music lyrics?
[+] drKarl|14 years ago|reply
There are two different but correlated fields: speech recognition and natural language understanding. Speech recognition is easier if the scope is minimised, that is, if the system knows which subset of keywords or orders to recognize. But recognizing an open scope, including different accents, slang, etc., is a much more difficult task.
[+] sycren|14 years ago|reply
I mean, there must be millions of times where a character has said 'hello' in a film or TV episode. Each person may say it in a slightly different way, which can then be used to build a model for speech recognition software that may no longer require the user to train it.

It may also be possible to automate the entire process as we have both the audio and the words spoken at a particular time.

Take it a step further: we have millions of sung songs with lyrics that can also be used. It's a gold mine of information that can be repurposed.

[+] tcarnell|14 years ago|reply
Very interesting. In a similar vein, automated language translation could be assisted too, because a film DVD often has different audio and subtitle languages, so it would be possible to pair up semantically similar audio and written content... and put it all into a magic computer.
[+] mooism2|14 years ago|reply
For the purposes of speech recognition, songs strike me as being particularly noisy.
[+] 0x12|14 years ago|reply
That's a plus though, on the testing front. Once it works with random sections from songs that were not part of the training set that would be a significant improvement over what we have today.

The problem would be the disproportionate weights given to the words 'I', 'love', 'you', 'baby'. Songs are probably not the best training data when it comes to getting a well rounded vocabulary.
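That skew is easy to see with a plain word count over chorus-heavy lyrics. The lyric lines below are invented toy data:

```python
from collections import Counter

def word_freqs(lines):
    # Lowercase, strip surrounding punctuation, count tokens.
    words = []
    for line in lines:
        words.extend(w.strip(".,!?'\"").lower() for w in line.split())
    return Counter(w for w in words if w)

# Hypothetical chorus-heavy lyrics: a handful of words dominate.
lyrics = [
    "I love you baby",
    "I love you baby oh I love you",
    "Baby you know I love you",
]
freqs = word_freqs(lyrics)
top = [w for w, _ in freqs.most_common(3)]
```

A recognizer trained on counts like these would see 'you', 'I', and 'love' constantly while rarely encountering the rest of the vocabulary.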

[+] uniclaude|14 years ago|reply
True, we could consider a cappellas then.
[+] adsahay|14 years ago|reply
For films and music the audio data may have too much noise, but TV programmes with low background noise (news, documentary, interview) with available Closed Captions (CC) are good training sources. CC transcripts are enforced by broadcasting regulators so they should be highly accurate.

The big problem with using these sources is the huge vocabulary. Speech recognition works better for small vocabularies than for large ones.

[+] fbnt|14 years ago|reply
http://voxforge.com has been collecting a big speech corpus over the last few years, under a GPL license. That should be the way to follow imho.

Training a speech recognition engine is quite a sophisticated process, and usually requires at least a clean (not noisy) set of samples, which you can't find in dubbed movies and surely not in music.

[+] detst|14 years ago|reply
Google had a service (6 or more years ago) that would search TV shows (that they were recording themselves) and provide back the transcripts and thumbnail images for any matches. I suspect this was used as training data.