
Speech Recognition Leaps Forward

77 points | Garbage | 14 years ago | research.microsoft.com | reply

33 comments

[+] brandonb|14 years ago|reply
The really great thing about deep networks isn't that they're more accurate. It's that they're radically simpler.

Current speech recognizers are basically layer upon layer of tricks discovered by researchers over the course of decades. Chop up the input signal. Then take a Fourier transform. Take the log to even the signal out. Do another transform to de-correlate different components of the audio. Add noise to the input. Project down to a subspace. Switch objective functions halfway through training to trade off different kinds of errors. Use more Gaussians here. Use fewer there. Pump it into a language model.
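(The pipeline described here is roughly the classic MFCC-style front end. A minimal sketch of the idea, assuming numpy only; function names, frame sizes, and the crude linear band pooling are my own simplifications, not from the paper:)

```python
import numpy as np

def mfcc_like(signal, frame_len=400, hop=160, n_fft=512, n_mels=26, n_ceps=13):
    """Sketch of the classic front end: frame the signal, Fourier
    transform, band pooling, log, then a DCT to de-correlate.
    Real systems add pre-emphasis, windowing, true mel spacing,
    liftering, delta features, etc."""
    # 1. Chop the input signal into overlapping frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # 2. Fourier transform -> power spectrum.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # 3. Pool into bands (linear here for brevity; real mel bands
    #    are log-spaced triangular filters).
    bands = np.array_split(np.arange(power.shape[1]), n_mels)
    fbank = np.stack([power[:, b].sum(axis=1) for b in bands], axis=1)
    # 4. Take the log to even the signal out.
    logfb = np.log(fbank + 1e-10)
    # 5. DCT to de-correlate components; keep the first n_ceps.
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1) / (2 * n_mels)))
    return logfb @ dct.T

feats = mfcc_like(np.random.randn(16000))  # one second of audio at 16 kHz
print(feats.shape)  # (98, 13): ~100 frames x 13 cepstral coefficients
```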

It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.

The nice thing is that a deep belief network can figure out a lot of this structure automatically, much closer to how the brain works.

This paper is actually incremental, not a "leap forward." They've basically replaced two of the middle layers of a speech recognizer (the Gaussian mixture model and hidden Markov model) with a modified neural network. But the exciting thing is that the neural network can start there, and slowly eat its way toward the outer layers, replacing a big stack of hacks with one simple algorithm.
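(For concreteness, the "hybrid" replacement works roughly like this: a neural net emits posteriors over HMM states, which are divided by the state priors to get scaled likelihoods that plug into the existing HMM decoder. A toy sketch with made-up dimensions and random, untrained weights; nothing here is the paper's actual architecture:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "acoustic model": a tiny feedforward net mapping a 13-dim
# feature frame to posteriors over 5 HMM states. Weights are random
# here; a real system trains them on frame-labeled speech.
W1, b1 = rng.normal(size=(13, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 5)), np.zeros(5)
state_prior = np.full(5, 1 / 5)  # prior over the toy HMM states

def dnn_posteriors(frame):
    h = np.maximum(0, frame @ W1 + b1)  # hidden layer, ReLU
    z = h @ W2 + b2
    e = np.exp(z - z.max())
    return e / e.sum()                  # softmax over states

def scaled_likelihoods(frame):
    # Hybrid trick: posteriors / priors gives likelihoods (up to a
    # constant), which is what the HMM decoder expects in place of
    # the Gaussian mixture model's output.
    return dnn_posteriors(frame) / state_prior

frame = rng.normal(size=13)
print(scaled_likelihoods(frame).shape)  # (5,)
```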

[+] exit|14 years ago|reply
> It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.

I'm not sure about this attitude. It reminds me of a quote by Dijkstra:

"The question of whether Machines Can Think... is about as relevant as the question of whether Submarines Can Swim."

Why demand that intelligence proceed from a single parsimonious gesture?

[+] Jach|14 years ago|reply
I second the earlier reply in not really liking the attitude here:

>It works, and it's a marvel of engineering, but it's not "artificial intelligence." It's pretty much a big stack of statistical hacks piled up over the years.

>The nice thing is that a deep belief network can figure out a lot of this structure automatically, much closer to how the brain works.

Really, the brain works very much like a neural net? I was under the impression it was hacked together by the statistical process known as evolution stacked over many years... I'm wondering if this idea of "this time it's not a mere math hack!" is a case of the 'Lemon Glazing Fallacy': http://lesswrong.com/lw/vv/logical_or_connectionist_ai/

I do agree with you that it's hardly a leap forward. Marketing is fun.

[+] Aron|14 years ago|reply
It's always nice when you can shed a bunch of complexity with something simple, because then you can start adding complexity again.
[+] bh42222|14 years ago|reply
> It works, and it's a marvel of engineering, but it's not "artificial intelligence."

It sounds like you don't like complex algorithms written by humans, but you do like a big bucket of "neural network"?

How is "The magical black box works somehow!" better than "We know exactly how this white box we built works."?

[+] kondro|14 years ago|reply
Is it just me or does 18% seem like a high error rate - and this is after improvement?

I've used technologies (Nuance??) that have significantly lower error rates than this, even for systems I have not trained personally. Is there something I'm missing?

[+] romanows|14 years ago|reply
The difference in error rates is in large part due to the difference between dictated speech and spontaneous, informal conversational speech.

Switchboard (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=...) is a set of telephone conversations between two people. Speakers tend to say a lot of "ums", abruptly restart an utterance in progress, talk past the telephone handset, etc. Dictated speech, especially when speakers know they're talking to a computer, has less acoustic and linguistic noise.
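(The figures being compared are word error rates. For anyone unfamiliar, WER is the word-level edit distance between the recognizer's output and a reference transcript, divided by the reference length. A minimal sketch, my own illustration with a made-up Switchboard-style utterance:)

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    between word sequences, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Disfluent "spontaneous" reference vs. a cleaned-up hypothesis:
wer = word_error_rate("um i mean we were uh going", "i mean we are going")
print(wer)  # 3 errors / 7 reference words ~= 0.43
```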

[+] braindead_in|14 years ago|reply
Nuance is speaker-dependent: you have to train the system to understand the speaker's voice. This is speaker-independent recognition, which is much harder.
[+] krmmalik|14 years ago|reply
I was going to ask the same thing. It doesn't seem like that significant an improvement, but I'm no expert. I also wonder whether the performance improvements are mostly due to GPU acceleration rather than the switch to a different software model.
[+] huffo|14 years ago|reply
Benchmarks have always fascinated me :)
[+] urlwolf|14 years ago|reply
Does anyone know if this will impact applications soon enough to matter to the typical startup that could benefit from better speech recognition?
[+] hollerith|14 years ago|reply
Probably not. This web page is from the PR department of Microsoft Research, and the probability would be low even if it had come from the researchers rather than the PR types.
[+] kd1220|14 years ago|reply
No. I worked at a small IVR systems company in 2000 and at Nuance in 2001. I also worked with the tech during my undergraduate years. My opinion on speech recognition is that it's very pie-in-the-sky and not yet ready for general applications. I don't say this because the technology itself isn't ready; it's that humans aren't ready for it.

Having stated my bias: Speech recognition systems are actually not that complex at their core. It's a blending of statistical models. Getting good data is a problem. You need a good acoustic model that's adapted to your users and the environment in which they will be using your application. Everything from the fluency of speakers, to physical environment, to the characteristics of the channel over which the speech is sent needs to be considered.

If you have a good acoustic model, now you have to worry about your language model. Are you going to try to accept all words in a language, or just restrict your users to a particular domain of language? If you have a good language model, then you need to worry about the dialog management. How do you keep context in a conversation? It's not an easy problem.
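(The restricted-domain option amounts to a small statistical language model. A toy bigram sketch, with made-up IVR-style training sentences, showing how a narrow domain scores in-vocabulary phrases and effectively prunes everything else:)

```python
from collections import defaultdict

# Toy training corpus for a restricted domain (invented examples).
corpus = [
    "check my balance",
    "check my account",
    "pay my bill",
    "check account balance",
]

# Count word-pair (bigram) occurrences, with sentence boundary markers.
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    words = ["<s>"] + sent.split() + ["</s>"]
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1

def bigram_prob(sentence):
    """P(sentence) under the bigram model; zero if any bigram is unseen."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for a, b in zip(words, words[1:]):
        total = sum(counts[a].values())
        p *= counts[a][b] / total if total else 0.0
    return p

# In-domain word sequences score well; rarer or out-of-domain ones
# score lower or zero, which is how a restricted grammar narrows
# the recognizer's search space.
print(bigram_prob("check my balance"))  # 3/4 * 2/3 * 1/3 * 1 ~= 0.167
```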

The primary problem with speech recognition systems is that human beings set their expectations of them too high. It's a psychological factor. When those expectations are not met, the user is frustrated and angry. Consider this. Whenever you call AT&T, your health insurance company, or credit card company, do you enjoy the experience of the IVR system that routes your call? Probably not. You probably don't even talk to it and resort to pressing the buttons instead. Unfortunately that's the experience most people have with speech recognition. I think it's the worst possible application of it.

If you're making a small, toy application whose vocabulary is pretty restricted and whose functionality set is small, then you're probably okay. If you venture into full dialog/anything-goes type applications, the chances are high that your app will be a bomb.

These researchers can swap out all the lower-level statistical models they want, but it won't fundamentally improve the technology. There are systems out there with word error rates very close to that of humans, but the systems higher up in the stack that interpret what is recognized are still very crude.