top | item 15542669

Speech Recognition Is Not Solved

435 points | allenleee | 8 years ago | awni.github.io | reply

201 comments

[+] twothamendment|8 years ago|reply
It is one thing to hear and correctly identify the words. It is another to understand the meaning.

I've been thinking about this because my son has Auditory Processing Disorder, APD. He can hear great, even a whisper across the house. The trouble is the words don't always make sense. He can tell me the words he heard and they are correct, but assigning a meaning to them doesn't work like it does for most people.

After playing him a bunch of audio with missing words, the lady who tested him was blown away by how smart he is, explaining that he quickly fills in almost every missing word by guessing from all the possible things it could be, narrowing it down based on context, and getting it right. I guess that is a normal, all-day-long task for him.

Since then I've thought about voice recognition differently. The AI to understand the context or fill in the blanks is what will make or break it.

[+] noir_lord|8 years ago|reply
I'd never heard of APD.

Interesting. All my life I've struggled to follow a conversation in a crowded environment, so much so that I actively avoid background noise containing words: I work with silicone ear plugs in, or headphones playing music with no lyrics.

Looking at the NHS list of symptoms, it describes me as a child; I didn't learn to read until I was 8.

I nearly ended up in the remedial track but for a single awesome teacher who spotted that it wasn't that I was incapable, but that I was struggling to understand in a noisy classroom. She gave up her lunchtime for six months and taught me one on one; by the end of that year I could read much better.

[+] bagacrap|8 years ago|reply
> Since then I've thought about voice recognition differently. The AI to understand the context or fill in the blanks is what will make or break it.

Of course, and all humans rely on this as well. No one hears every word perfectly all the time; it's impossible, because the speaker doesn't pronounce every word perfectly all the time. Context clues are a huge part of speech recognition, as well as of gestural typing recognition and other forms of machine interpretation of human input. While context has always been a component of NLP, you can clearly see it in action with the Android keyboard these days: after you type two words in a row, the first may be corrected once you enter the second, based on the context the second provides.
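The revise-the-previous-word behavior described here can be sketched with a toy bigram model. The candidate lists and counts below are invented for illustration and bear no relation to any real keyboard's internals:

```python
# Toy sketch: revise an earlier word guess once the next word provides context.
# The bigram counts here are made up for illustration.

BIGRAM_COUNTS = {
    ("their", "house"): 50,
    ("there", "house"): 2,
    ("their", "is"): 1,
    ("there", "is"): 80,
}

def pick_previous(candidates, next_word):
    """Choose the previous-word candidate that best fits the word typed after it."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((w, next_word), 0))

# The keyboard first guessed "there"; once the user types "house",
# the earlier word is revised to "their".
print(pick_previous(["there", "their"], "house"))  # their
print(pick_previous(["there", "their"], "is"))     # there
```

Real keyboards use far richer language models, but the shape of the correction, re-ranking an earlier decision when later context arrives, is the same.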

[+] opportune|8 years ago|reply
This reminds me a lot of https://en.wikipedia.org/wiki/Prosopagnosia also known as "face blindness". People with the disorder can see perfectly fine, but have trouble recognizing people's faces, just as your son hears just fine, but has trouble recognizing people's speech. Similarly, I know someone who beat brain cancer, but after a certain point he became unable to properly taste/enjoy food.

Interestingly, these disorders all seem to have pretty direct neurological causes: there is something causing a difference in the person's brain between where they sense the information and where they process it, whether it's genetic (i.e. affecting the layout and growth of the brain) or due to trauma such as a head injury.

I'm curious as to whether your son is able to enjoy music, especially music with multiple different instruments playing simultaneously. I have a theory that the same neurological process that makes food pleasurable is what makes smells smell good, music sound good, certain surfaces (e.g. soft fur) feel good, and certain sights (e.g. a mountain stream) look good. I mean that not from a neurotransmitter perspective, but in terms of the actual neural processing. It seems these configurations are programmed into our brains genetically, just as our propensity for recognizing faces and speech is, which is fascinating.

[+] thaumasiotes|8 years ago|reply
> After playing him a bunch of audio with missing words, the lady who tested him was blown away by how smart he is, explaining that he quickly fills in almost every missing word by guessing from all the possible things it could be, narrowing it down based on context, and getting it right. I guess that is a normal, all-day-long task for him.

Sure it is; it's an essential part of speech recognition for all humans.

[+] wildmusings|8 years ago|reply
What you're asking for is equivalent to a general purpose intelligence, i.e. human intelligence, i.e. AI-complete. All (most?) of our ideas can be expressed in language, so in order to actually understand language, you have to actually understand any idea.
[+] p1esk|8 years ago|reply
Interesting. I wonder if your son will be smarter as a result, kind of like how neural networks generalize better when trained with dropout.
[+] d33|8 years ago|reply
Sorry to hear that your son has APD! This makes me wonder, though: do you know if this condition is language-specific? I assume you taught him English; would he have the same problems with a very foreign language, say, Chinese?
[+] johnramsden|8 years ago|reply
What I would like to see would be an open source speech recognition system that is easy to use and works. There is really no match for the proprietary solutions in the open source world right now, and that is disappointing.

That's not to say there aren't open source speech recognition systems; they just aren't as usable as the proprietary solutions. A lot of research goes into open source speech recognition; where it is lacking is in datasets and user experience.

Hopefully, Mozilla's Common Voice project https://voice.mozilla.org/ will be successful in producing an open dataset that everyone has access to and will spur on innovation.

[+] ddevault|8 years ago|reply
I agree. I don't care how good Google gets at it; this is an unsolved problem until I can do it with open source tools operating without an internet connection.
[+] LandoCalrissian|8 years ago|reply
I have recently been trying out open source solutions for voice recognition for a personal project, and you are very correct that it lags far behind the proprietary options. Pocketsphinx is still very limited, and Kaldi takes quite a bit of work to set up in a usable fashion. There were a few other options I looked at that I can't recall off the top of my head, but they were all in a similar condition.

As the article says, latency is still a problem, and it's a huge problem in current open source solutions: some of the software I was testing easily took 5 seconds. I know that can be improved with configuration, but when dealing with vocabularies of 10 words or so, that's pretty bad.

I feel like anyone who is seriously interested in this space has been scooped up by all the big companies and the open source solutions have really seemed to linger because of it. It's one of the first areas I've seen where open source alternatives are really behind the proprietary solutions. Kind of bummed me out.

[+] dsacco|8 years ago|reply
Is it an issue of open source software being inadequate, or is it the lack of sufficient training data and compute power local to your home?

I’m under the impression that Google is mostly dogfooding its open source tooling for machine learning in GCP, and actually differentiates based on trained models and compute power.

[+] drzaiusapelord|8 years ago|reply
Mozilla may not be able to make much headway until these patents expire:

https://www.quora.com/Is-anyone-working-on-an-open-source-ve...

Apparently, the business model for voice services in the past has been to snap up as many broad patents as possible to keep competitors at bay. I read an interview with a Google engineer a couple of years back claiming the same thing: they have to carefully work around a patent minefield with their own services, and these patents are holding back better voice search on mobile.

[+] adrianbg|8 years ago|reply
You are correct that the problem is data. Kaldi is hard to use, but making it easier to use isn't as hard as getting good training data. Mozilla's project is a good start for some purposes. One flaw with it is that they're having people read sentences; when people read, they tend to speak more clearly than when they're figuring out what to say on the fly. This means models trained on Mozilla's data will tend to need people to speak extra clearly, compared to, say, a model trained on conversational data.

(edits: spelling/grammar)

[+] mintplant|8 years ago|reply
Speech synthesis is in a similar situation. The FOSS options that I know of are completely primitive compared to their proprietary counterparts, particularly those locked away behind the cloud. Unfortunately this area seems to receive much less attention than speech recognition (AFAIK it's a non-goal of the Mozilla project, for example).
[+] oulipo|8 years ago|reply
I'm the cofounder of Snips.ai and we are building a 100% on-device Voice AI platform, which we want to open-source over time

You can build your voice assistants and run them for free on a Raspberry Pi 3, or Android

[+] canjobear|8 years ago|reply
It's always been a puzzle to me that published WER is so low, and yet when I use dictation (which is a lot; I use it for almost all text messages) I always have to make numerous corrections.

This sentence is crucial and suggests a way to understand how WER can apparently be lower than human levels while ASR is still obviously imperfect:

> When comparing models to humans, it’s important to check the nature of the mistakes and not just look at the WER as a conclusive number. In my own experience, human transcribers tend to make fewer and less drastic semantic errors than speech recognizers.

This suggests that humans make mistakes specifically on semantically unimportant words, while computers make mistakes more uniformly. That is, humans are able to allocate resources to correct word identification for the important words, with less resources going to the less important ones. So maybe the way to improve speech recognition is not to focus on WER, but on WER weighted by word importance, or to train speech recognition systems end-to-end with some end goal language task so that the DNN or whatever learns to recognize the important words for the task.
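As a rough illustration of the importance-weighted WER idea, here is a toy version: errors are aligned with `difflib` and weighted by the word that was lost or inserted. The weighting scheme and the stopword list are my own invention for illustration, not an established metric:

```python
# Sketch of an importance-weighted WER: errors on content words count more
# than errors on filler words. Weights and stopwords are illustrative only.
import difflib

STOPWORDS = {"a", "an", "the", "um", "uh", "of", "to"}

def weight(word):
    return 0.2 if word.lower() in STOPWORDS else 1.0

def weighted_wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    errors = 0.0
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op in ("replace", "delete"):
            errors += sum(weight(w) for w in ref[i1:i2])  # weigh the reference words lost
        elif op == "insert":
            errors += sum(weight(w) for w in hyp[j1:j2])  # spurious insertions
    return errors / sum(weight(w) for w in ref)

# Dropping "the" barely matters; mishearing "stapler" as "tractor" dominates.
print(weighted_wer("hand me the stapler", "hand me stapler"))      # 0.0625
print(weighted_wer("hand me the stapler", "hand me the tractor"))  # 0.3125
```

A plain WER would score both mistakes identically (one error out of four words); the weighted version reflects that only the second one damages the meaning.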

[+] adrianbg|8 years ago|reply
The low WER numbers you've probably seen are for conversational telephone speech with constrained subjects. ASR is much harder when the audio source is farther away from the microphone and when the topic isn't constrained.
[+] IanCal|8 years ago|reply
This is, to me, one of the major problems with many algorithmic solutions to problems. An x% increase in precision, F-measure, or any other score in no way means that the results are better.

I've repeatedly seen improvements to traditional measures that make the subjective result worse.

It's incredibly hard to measure and solve (if anyone has good ideas, please let me know). I check a lot of sample data manually when we make changes; doing that (with targeting of the important cases) is really the only way I know to do things.

[+] skywhopper|8 years ago|reply
Yes, exactly, the raw number of word errors is a very simplistic way to judge the accuracy of a transcription. Which words were incorrect and to what degree the failures changed the meaning are ultimately far more important. And while the test described in the article is a useful way to compare progress over time, it is clearly not nearly broad enough to cover the full range of scenarios humans will rightly expect automated speech recognition that "works" to be able to handle.
[+] aidenn0|8 years ago|reply
I think some people overestimate how good humans are at speech recognition. Unfamiliar accents and noisy environments wreak havoc for many people. I had a friend in high school who had learned English in India, so I was used to that accent; many of my classmates in college could not understand anything our Indian TA said freshman year.

Similarly, I had friends for whom English was a second language, who had lived in the US for years and were definitely fluent, but who had to enable subtitles for a movie in which the characters had a strong Southern accent; in general, non-rhotic accents were very troublesome for them, having only spoken English with Midwesterners.

The article mentions the Scottish accent, and I would call that the hardest accent of native English speakers for those in the US to understand.

[+] Radim|8 years ago|reply
Just the other day, I participated in a discussion about how "language identification" is a solved problem -- in fact, hasn't it been solved for a decade?

As anyone who's had to use langid in practice will testify, it's solved only as long as:

A) you want to identify 1 out of N languages (reality: a text can be in any language, outside your predefined set)

B) you assume the text is in exactly one language (reality: can be 0, can be multiple, as is common with globalized English phrases)

C) you don't need a measure of confidence (most algos give an all-or-nothing confidence score [0])

D) the text isn't too short (Twitter), too noisy (repeated sections à la web pages, boilerplate), too regional/dialectal, etc.

In other words, not solved at all.
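Points (A) and (C) above can be made concrete with a toy character-trigram identifier that refuses to answer when no language in its set is clearly ahead. The profiles below are tiny and invented; real systems use large n-gram models over many languages:

```python
# Toy langid sketch: score text against a fixed set of language profiles,
# but return "unknown" when no language is confidently ahead.
# Profiles are tiny and invented for illustration.
from collections import Counter

PROFILES = {
    "en": Counter({"the": 5, "ing": 4, "and": 4, "ion": 2}),
    "de": Counter({"der": 5, "sch": 4, "ein": 4, "und": 4}),
}

def trigrams(text):
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def identify(text, margin=1.5):
    grams = trigrams(text)
    scores = {lang: sum(grams[g] * c for g, c in prof.items())
              for lang, prof in PROFILES.items()}
    best, second = sorted(scores.values(), reverse=True)[:2]
    if best == 0 or best < margin * second:
        return "unknown"  # not confidently any language in the set
    return max(scores, key=scores.get)

print(identify("the thing is standing and singing"))  # en
print(identify("der schnelle hund und ein igel"))     # de
print(identify("xyzzy"))                              # unknown
```

Even this toy version shows why the listed assumptions matter: short or out-of-set text falls straight into the "unknown" bucket, which most off-the-shelf tools don't expose.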

In my experience, the same is true for any other ML task, once you want to use it in practice (as opposed to "write an article about").

The amount of work to get something actually working robustly is still non-trivial. In some respects, things have gotten worse in recent years due to a focus on cranking up the number of model parameters, at the expense of decent error analysis and model interpretability.

[0] https://twitter.com/RadimRehurek/status/872280794152054784

[+] notahacker|8 years ago|reply
I'm reminded of how often I see Twitter offer to translate English-language tweets containing a proper noun or two from absurdly unconnected languages. And that's with text containing mostly common and distinctive English words.
[+] theresistor|8 years ago|reply
Speech recognition for multi-lingual speakers is another pain point.

I live with a native French speaker, so my conversations naturally include a lot of French proper names, as well as sometimes switching languages mid-conversation or even mid-sentence.

Lots of recognition engines can handle English and French, but treat them as mutually exclusive. It frustrates me to no end when I know that Siri can recognize a French proper name just fine if I switch it modally into French, but will botch it horribly in English.

[+] dmreedy|8 years ago|reply
Of course it's not solved. We don't even know how to define the problem.

Speech is[1] a fundamental component of Language, which is a fundamental component of Intelligence[2]. This is addressed somewhat in the conversation around semantic error rate; that there is more to processing raw audio speech than the calculus of mapping signals to tokens; some understanding of semantics and context is required to differentiate between otherwise indistinguishable surface forms.

I find it doubtful that there's a clean interface that separates the 'intelligent' parts of the brain from the 'language' parts of the brain from the 'speech' parts of the brain. This leakiness (or richness, really) means that you can't neatly solve any one part of this chain to the level of competence that the brain solves it. That means to 'solve' speech recognition, you have to 'solve' language, and thus 'solve' general intelligence. And to 'solve' general intelligence, you have to understand it, in a theoretical sense, which we don't. Indeed, it will likely involve solving all the other modalities of sensation as well. It's definitely the case that you need to have a model for prosody to understand speech. It is entirely possible that vision is a large factor as well, in the form of body language, lip reading, eye contact, and so on.

Speech recognition is quite good for what it is. For many practical applications, especially those involving young, white, newscaster-accented English speakers who sound most like the people who develop it and the data sets used to develop it, it is good enough in the 80/20 sense. But it is nowhere near solved by even the least rigorous definition of the word.

----

[1] according to the philosophy I subscribe to, at any rate

[2] according to the philosophy I subscribe to, at any rate

[+] joemag|8 years ago|reply
Mistakes in understanding speech are common even among humans. My wife and I have to repeat and clarify ourselves fairly frequently in our day-to-day conversations. She even jokes at times that I must get my hearing checked, because I often mishear what she said while she thought she was being perfectly clear.

I think where computers fall short is in two areas:

1) The rate of errors hasn’t hit the inflection point of being comparable to day-to-day intra-human interaction, and

2) There is no good mechanism for detecting and correcting errors. At least, none that I’ve seen.

That second one is important. When I hear my wife ask me “Please, hand me a tractor”, I realize that I must’ve misheard, and ask her to clarify “what?” With speech recognition, I either have to manually re-read and modify recognized text, or cancel and repeat the entire request. Both take time, and negate some of the efficiencies of using speech recognition.

[+] pbhjpbhj|8 years ago|reply
One needs to know your wife to know that "hand me a tractor" is wrong. Perhaps she makes model farms as a hobby, perhaps you work in plant rental and that means "pass me the keys to one of the tractors", the possibilities are endless. You need vision, memory, profiling, etc., to even begin to properly contextualise day-to-day conversation.
[+] cscurmudgeon|8 years ago|reply
Just the other day I was arguing (pleasantly) with someone here on HN saying that AI has solved speech rec (among other tasks).

https://news.ycombinator.com/item?id=15429287

Wish this article was written a few days earlier.

An analogue of this article exists for most other domains claimed to have been solved.

[+] throwmenow_0140|8 years ago|reply
Sorry to bother you, but the translation of your comment is pretty good (English -> Spanish -> French -> English):

  No, it's not learning new classes of objects from a single image or a few images is very difficult. See
  http://www.sciencemag.org/content/350/6266/1332.short
  The machine translation is a joke.
  Put comments on this page by translating Google into another language and go back to English and see what you get.
  I did a little part of you. Just human level for just a simple little prayer.
  > But even if you do not make this assumption, identifying the object involves spitting the distance from the performance of the human level.

source: https://news.ycombinator.com/item?id=15429862

Not to debunk your claims; it's just interesting to see how well translation works (even if it's not human-level, I can understand what you say).

[+] ballenf|8 years ago|reply
> Latency: With latency, I mean the time from when the user is done speaking to when the transcription is complete. ... While this may sound extreme, remember that producing the transcript is usually the first step in a series of expensive computations.

For many applications, making a transcription seems like an unnecessary step and source of errors. Skipping transcription when the user doesn't need it (most cases where I use it) would seem like a way to get some gain, but perhaps at the reduction of debuggability.

> For example in voice search the actual web-scale search has to be done after the speech recognition.

That's an area where literal exact transcription is usually required. But even then, Siri/Cortana/Alexa might be better off trying to figure out the meaning of what someone's asking rather than figuring out the exact words spoken in order to return the best results.

Most people are quite bad at formulating good internet searches without a lot of trial and error. Let Google listen to a person talk about the problem or issue they have and then formulate the best results for that, instead of forcing us to come up with the exact right phrasing to get appropriate results. It would help tremendously with the synonym and homophone issues that are so annoying now.

[+] johnmcd3|8 years ago|reply
> For many applications, making a transcription seems like an unnecessary step and source of errors. Skipping transcription when the user doesn't need it (most cases where I use it) would seem like a way to get some gain, but perhaps at the reduction of debuggability.

Agree. What's really needed is research into (and development supporting) how to combine the expertise of a speech recognition layer with the next layer in a machine learning pipeline. That higher layer contains the domain-specific knowledge needed for the problem at hand, while still leveraging a speech layer focused on a broad speech data set and speech-specific learning (from Google, Microsoft, the community, etc.).

Today, how richly can information be shared? I see that with Google's speech API you can only supply a fairly short list of domain-specific expected vocabulary.

Why not have speech tools at least output sets of possible transcriptions with associated probabilities? Do any of the top tools allow this?

Then you could at least train your next level models with the knowledge of where ambiguity most exists, and what a couple of options might have been for certain words or phrases...
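One plausible shape for that interface is sketched below: the speech layer emits an n-best list of (hypothesis, probability) pairs, and a downstream domain-aware layer rescores it. The hypotheses, probabilities, and rescoring rule are all invented for illustration:

```python
# Sketch of the interface suggested above: the speech layer emits an n-best
# list with probabilities, and a domain-aware layer rescores it.
# All hypotheses, probabilities, and vocabulary here are invented.

def rescore(nbest, domain_words, boost=5.0):
    """Pick the hypothesis after boosting words the domain model expects."""
    def score(item):
        text, prob = item
        hits = sum(w in domain_words for w in text.split())
        return prob * (boost ** hits)
    return max(nbest, key=score)

# Acoustically the recognizer slightly prefers the wrong hypothesis,
# but the domain vocabulary flips the decision.
nbest = [("wreck a nice beach", 0.52), ("recognize speech", 0.48)]
print(rescore(nbest, domain_words={"recognize", "speech"})[0])  # recognize speech
```

With only the single best string exposed, as most APIs do today, that downstream correction is impossible; the ambiguity has already been discarded.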

[+] ccozan|8 years ago|reply
Speech recognition, besides transcribing phonemes and matching them against an NN of possible words, is not solved because speech is highly integrated with the human context: who is speaking, to whom they are speaking, where the speech is happening, why the speech was initiated, and so on.

My kids needed 5-6 years of continuous daily talking before I could say they understood a conversation almost completely. Every single word or phrase I directed at them was spoken in a certain context and had a certain role in communicating with them when, from the context alone, it wasn't clear what my intentions were. Of course, they had their fun, throwing words around and endlessly repeating some funny word or phrase, or terribly mispronouncing them. It is interesting that you as an adult need to learn their own pronunciation; at some point I even wrote a small dictionary.

Again, speech recognition will simply stay at recognizing phonemes/words for a very long time, until we have a true AI assistant that walks with us, sees with us, and hears with us at the same time. Then it can apply semantic and other context-related NNs.

[+] pjmlp|8 years ago|reply
As a Portuguese native speaker forced to use foreign languages to talk to devices, I can say it is not solved at all.
[+] reaperducer|8 years ago|reply
Even English doesn't work the way it should. Remember Apple had to rush out a patch shortly after Siri was released because it couldn't understand Australians.

Siri still doesn't understand the majority of my Minnesota relatives.

[+] saosebastiao|8 years ago|reply
I would argue that it can't be solved without severe privacy implications. Apple's talk-to-text, for example, should know my son's name by now, but it doesn't. And while I'm mildly frustrated at the fact that I have to go back and edit text messages on a regular basis, I'm pretty glad that Apple doesn't know my son's name. I'd hate for a company like Facebook to have access to all of the proper nouns in my everyday life.
[+] turbohedgehog|8 years ago|reply
As a deaf person, I am using Ava [0], which uses IBM's speech to text service [1] as its backend, AFAIK. I am always impressed by how it picks up on context clues to make corrections in real time and capitalize proper nouns (Incredible Pizza, for example). However, it does not work with multiple speakers on a single microphone.

[0] https://www.ava.me/

[1] https://www.ibm.com/watson/services/speech-to-text/

[+] amorphid|8 years ago|reply
I am a native English speaker from California. It is hard to get my Google Personal Assistant to understand the difference between desert and dessert. It's also hard to request music by an artist with name similar to another artist.

"Hey Google, play Mika radio" has a 50/50 chance of starting music by Meiko. The additional "problem" is that I like Meiko, too, but I feel obligated to cancel the Meiko music & re-request Mika so that Google (hopefully) learns to recognize the difference between Mika and Meiko.

Maybe our spoken language will start to transform into distinctly unique sounds so we can verbally interact with computers with relative ease. When I was in the US Army, I was trained to speak in a certain manner to make my communication clearer. [1] I don't see a reason humans and computers can't each make reasonable compromises to make verbal communication easier.

[1] https://en.wikipedia.org/wiki/Voice_procedure
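The Mika/Meiko confusion above can be made concrete with a quick edit-distance check over rough phoneme renderings. The phoneme sequences below are my own approximations, not from a real pronunciation lexicon:

```python
# Sketch of why "Mika" and "Meiko" are easy to confuse: as rough phoneme
# sequences (approximations, not a real lexicon) they differ only in two
# vowels, while a dissimilar name is many edits away.

def edit_distance(a, b):
    """Plain Levenshtein distance over sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

MIKA  = ["M", "IY", "K", "AH"]
MEIKO = ["M", "EY", "K", "OW"]
ADELE = ["AH", "D", "EH", "L"]

print(edit_distance(MIKA, MEIKO))  # 2 -- acoustically close, easy to confuse
print(edit_distance(MIKA, ADELE))  # 4 -- far apart, unlikely to be confused
```

A recognizer that only sees acoustics has little to separate two names this close; it is exactly the kind of tie that user history or domain context would need to break.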

[+] PaulHoule|8 years ago|reply
One thing I see missing is "conversational smarts".

People mishear a certain fraction of what they hear and they'll ask you to clarify what you said.

You could have superhuman performance at tasks like Switchboard and still have something embarrassingly bad in the field because the real task (having a conversation) is not trained for in the competitions with canned training sets.

[+] salqadri|8 years ago|reply
I remember using the Merlin Microsoft Agent to create speech-based "apps" in 2001, including text-to-speech as well as speech recognition. I made an SAT preparation app that would test me by presenting a word, and I had to say the meaning. I also made a Spanish learning app using Merlin, where it would read out the Spanish word (in Spanish!) and I would have to say the English one. It worked really well. It's been 16 years since then, and I have to say I had expected this area to have been completely "solved" by now.
[+] brw12|8 years ago|reply
I'm constantly surprised by the poor contextual quality of speech recognition. I think the basic audio recognition does well, but when there is ambiguity, it seems like systems that are popularly considered high-performing degrade drastically. For instance, I'm using Dragon NaturallySpeaking to dictate this right now, but if I say a certain punctuation mark at the end of a sentence, half the time it's going to say excavation mark!

Ditto with Google's Google Now assistant, or whatever the heck it's called these days. I have a Pixel 2 phone (Dragon heard "pixel to phone" -- it doesn't have up-to-date context on proper nouns in the news), but when I tried to create a calendar event using "Create calendar event... meet Bruno for pizza", it heard "MIT pronoun for pizza". It has hundreds of samples of my voice, and it already knew I was creating an event! "Meet" has to be one of the most common first words used in events.

It seems to me like there is pretty low-hanging fruit here, and that we need more focus on flexibility and resourcefulness rather than acting as though we're moving from 99.5% accuracy to 99.6%.

[+] morinted|8 years ago|reply
I'm part of the stenographer/captioner community as a hobby, and you'd be surprised how many people think that all captions are automatically computer-generated. In reality, the practical limitations of autogenerated captions are demonstrated by YouTube's caption system: it's good, but even a mistake every other sentence (95%+ accuracy) can completely obscure the meaning of a video.
[+] tyingq|8 years ago|reply
I've seen some pretty terrible human produced captions on television shows as well.
[+] xigency|8 years ago|reply
A big thing for me is the difference between offline and online speech recognition. This really ties in with the last point, advancements need to be efficient in order to be used.

I'm not sure how much work it would take to scale down Apple's voice recognition to run on the device or if it is feasible with their model, but currently it can take 5-10 seconds longer to get an answer from Siri during peak times.