He is wrong about the accuracy of speech recognition. I know he is wrong because I have actually used speech recognition to get real work done. And to make it worse, I have a crap Norwegian accent.
Maybe it is as bad as he describes when you do not train the system and just start speaking. But that is like judging Emacs after using it for only ten minutes.
I wouldn't be able to code with it, though. But for producing prose in English? It. Actually. Works.
(I recognize that the problems described in the post are about recognition "in the wild", not about one person in one quiet room with one microphone, which is how I use it.)

As have I. The dictation capability built into Vista / Windows 7 is really quite amazing. (Both also have voice control.)

I'm an extremely fast typist, but dictating, even with time spent correcting mistakes, is quite a bit faster.
For getting a wordy essay or report done, voice transcription is far and away the best option. You can talk at 150-170 wpm but type at only 60-80 wpm. I have a friend who swears by Dragon and has used it to crank out more papers in less time.
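To put rough numbers on that, here is a back-of-the-envelope sketch (the word count is made up for illustration; the rates are the ones quoted above):

    # Rough time to produce a 3000-word report at the rates quoted above.
    words = 3000
    talk_wpm = 160  # middle of the 150-170 wpm range
    type_wpm = 70   # middle of the 60-80 wpm range

    print(f"dictating: {words / talk_wpm:.0f} min")  # ~19 min
    print(f"typing:    {words / type_wpm:.0f} min")  # ~43 min

Even with generous time budgeted for correcting recognition errors, that gap is hard to ignore.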
I think the reason people haven't adopted voice transcription is that (1) they don't realize that some of the programs out there have reached the 99% accuracy level; (2) we have all developed the habit of working with a computer with our fingers, and breaking habits is hard; and (3), related to 2, our brains have been trained to think as we type in ways that let us get certain work done, and if we moved to getting our work done by voice, there are neural connections we would have to re-create through training.
The promise of voice recognition has never been the actual transcription of the spoken word, although even that would be quite useful if a high degree of accuracy were achieved, but rather the idea of controlling a computer by simply talking to it in natural language. People want the computer to behave as if it were a colleague, not a machine.
Of course we're a long way from that goal. Firstly, people don't understand just how difficult it is to express clearly what it is that you want. I have a good friend who bought an iPhone while she was living abroad for six months. On returning home, she wanted to be able to add new songs to her iPhone without losing the stuff already on there. That's how she described the task to me, as her go-to person for computer problems. So I started asking a bunch of questions: do you still have the same computer that you were using abroad? Do you want to copy the music already on the iPhone onto your computer? What about other data on the iPhone: do you need to recover that too, or can I blow it away? She got very frustrated with all of these questions and finally just snapped, "Oh, you know what I mean, I just want to be able to use my iPhone normally, including adding new stuff to it!"
Sigh. She really didn't (and doesn't) understand that this just isn't enough information to work with, and that's for me, a walking, talking human being with a strong understanding of what the computer is doing. How is a computer, which struggles to transcribe the spoken words, let alone understand their underlying meaning, and which has no idea of the real-world context of the problem, supposed to figure out what she wanted?
There are some cases where this makes sense, but there are also a lot of situations where I want my computer to be just that, a computer, not a colleague. Many of the tasks I perform on it are trivial, repetitive, and easy to trigger with keystrokes and mouse clicks. It would be annoying to me and the rest of the office if I spoke each command aloud.
I worked with a coder about 10 years ago who had been looking into speech recognition for video games. He came up with some pretty interesting prototypes that he ultimately had to throw away: his company's legal department told him speech recognition was a legal minefield, with most algorithms patented and some very trigger-happy lawyers defending those patents.
In the end they went with an inferior solution from one of the big hardware manufacturers, who coincidentally had also been named by the legal team as a patent holder.

So maybe real research is being hampered?
I don't think this is an issue, at least in academia. The one case where a patent has been an issue is speech synthesis, where the PSOLA method was patented and was the state of the art for quite some time.
I am not an expert on speech recognition, but I did my PhD in a speech lab, so I have at least an idea of the issues. Almost every current speech recognizer is divided into two parts: acoustic modeling (from the audio signal to "phonemes") and language modeling (from "phonemes" to words).
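In textbook terms (the comment doesn't spell this out, but it is the standard formulation), the two parts combine in a single maximization: the recognizer searches for the word sequence \hat{W} that best explains the audio A,

    \hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \, P(A \mid W) \, P(W)

where the first factor, P(A|W), is the acoustic model and the second, P(W), is the language model (Bayes' rule, with the constant P(A) dropped).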
Most people working on acoustic modeling are interested in difficult situations (multiple speakers, noisy speech, and so on), where performance is in general nowhere near human levels.
The language modeling part has to deal with sparsity: the idea is to model the probability of a word w_n given that the previous words were w_{n-1}, w_{n-2}, .... Given that vocabularies (the number of possible words) are on the order of millions for large-vocabulary recognition, you can imagine that many (most) combinations are never seen in the training data (there are trillions of possible combinations and more), so you have to "smooth" the data, prune "impossible" combinations early, and so on. There are a lot of heuristics.
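A minimal sketch of that smoothing idea, using a bigram model with add-one (Laplace) smoothing; real systems use higher-order n-grams and better smoothing such as Kneser-Ney, and the toy corpus here is obviously made up:

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    V = len(unigrams)  # vocabulary size

    def p_bigram(w_prev, w):
        # P(w | w_prev): unseen pairs get a small nonzero probability
        # instead of zero, which is the whole point of smoothing.
        return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

    print(p_bigram("the", "cat"))  # seen pair: 3/9
    print(p_bigram("cat", "on"))   # unseen pair: 1/8, not 0

With millions of words in the vocabulary, almost every bigram (let alone trigram) is "unseen", which is exactly the sparsity problem described above.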
Also, even though the author is a bit off on the timing, he is right that the basic methods have been the same for a long time (statistical, data-driven, HMMs for the acoustic model). Sure, we now have (somewhat) speaker-independent models, so you don't have to train the model on your own voice for hours anymore, and language models can handle large vocabularies, but the basics are really the same. A researcher who had gone into hibernation 20 years ago and just woke up would be able to catch up in no time.
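For the curious, the decoding core of such an HMM system is still plain Viterbi, which is part of why a returning researcher would feel at home. A toy version (all probabilities invented for illustration; real systems use Gaussian mixtures over acoustic features rather than a small discrete table):

    import numpy as np

    def viterbi(obs, pi, A, B):
        # obs: observation indices; pi: initial state probs (S,);
        # A: state transitions (S, S); B: emission probs (S, O).
        T, S = len(obs), len(pi)
        delta = np.zeros((T, S))           # best log-prob ending in state s at time t
        psi = np.zeros((T, S), dtype=int)  # backpointers
        delta[0] = np.log(pi) + np.log(B[:, obs[0]])
        for t in range(1, T):
            scores = delta[t - 1][:, None] + np.log(A)  # (from, to)
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))
        return path[::-1]

    # Two hidden "phoneme" states, three observable frame types:
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi([0, 1, 2], pi, A, B))  # most likely state sequence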
Proposing a new system is difficult because the current systems are extremely hard to beat: the state of the art requires thousands of hours of training data, which means you are almost required to rely on tools and resources from outside your own lab. For example, almost everyone uses the same software for acoustic modeling (HTK), the same labeled data, etc. The published improvements are often tiny (less than one point, e.g. going from 79% to 80%), and I actually wonder whether those are statistically significant.
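As a quick sanity check on that worry: treating the two systems' accuracies as independent proportions (an assumption-laden sketch; a real evaluation would use a paired test such as McNemar's on the same utterances), a 79% vs. 80% difference needs a surprisingly large test set before it means anything:

    from math import sqrt, erf

    def two_prop_p_value(p1, p2, n):
        # Two-proportion z-test, two-sided, equal test-set sizes n.
        p = (p1 + p2) / 2
        z = (p2 - p1) / sqrt(2 * p * (1 - p) / n)
        return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

    for n in (1_000, 10_000, 100_000):
        print(n, round(two_prop_p_value(0.79, 0.80, n), 4))
    # 1,000 items: p ~ 0.58; 10,000 items: p ~ 0.08; only around
    # n ~ 12,500 does the difference become significant at the 5% level.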
Using things like pitch and other non-verbal cues (prosody is the term used in the literature) is often suggested, but estimating them reliably is extremely difficult. I would guess this will improve, because some of the languages for which work is currently funded require prosodic information. Chinese is the obvious example: Chinese is tonal, and pitch may radically change the meaning of a word.
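To give a feel for why pitch is hard to estimate "in the wild", here is a naive autocorrelation pitch tracker (a deliberately simplistic sketch; real prosody extraction has to survive far worse than synthetic noise):

    import numpy as np

    def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
        # Pick the autocorrelation peak within the plausible pitch range.
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        return sr / (lo + ac[lo:hi].argmax())

    sr = 16_000
    t = np.arange(int(0.04 * sr)) / sr       # one 40 ms frame
    clean = np.sin(2 * np.pi * 120 * t)      # a 120 Hz synthetic "voice"
    noisy = clean + 3.0 * np.random.randn(len(t))
    print(estimate_f0(clean, sr))  # ~120 Hz
    print(estimate_f0(noisy, sr))  # often wrong at this noise level

Even the clean toy case needs the fmin/fmax guardrails to avoid octave errors; real, noisy, multi-speaker audio is much harder.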
On the bright side, I think speech recognition was one of the first fields to push the idea of data-driven models (HMMs first appeared there at least three decades ago), and as such it has been at the forefront of the current explosion of non-trivial statistical models trained on big datasets. In that sense, I think it has made non-trivial contributions outside its own application.
I think we sometimes get the impression that, no matter what, there is always an army of people quietly working away on this stuff, and that as time passes it will just get better and better.
The article Jeff linked talks about how the performance has pretty much levelled off except for small incremental improvements, and most of the money and research has shifted away from the area. I guess the trend will continue, gaining small improvements by applying more and more training data to current methods, until someone comes up with a completely different approach to push the field forward.
Robert Fortner's article "Rest in Peas: The Unrecognized Death of Speech Recognition" was discussed recently on HN (http://news.ycombinator.com/item?id=1313679); it's a must-read for those interested in speech recognition.
So many "new interface" ideas seem to forget that a large percentage of what people use computers for is in offices, producing documentation. Using your voice, or spinning 3D interfaces around with your hands, just isn't practical in an office environment with dozens of people working, where you are essentially outputting text.
Until that basic reality changes, I don't think mainstream computing needs will change. Along the same lines, I don't think the reverse is feasible, with computing interface changes bringing about social changes. The shifts we've seen, from handwriting to typewriting to computer-based word processing, have all happened in roughly the same environment over a very long time.
Not to suggest that the exception proves the rule, but my company actually makes use of two racks of IVR servers on a daily basis, and we love them. They have sped up the inbound call process tenfold compared to 2009.
I despise these things, and I don't know of anyone who disagrees. If you're seeing an improvement, then I guess most of the world treats them differently than I do.
As soon as I hear "If you want X, please say...", I start pounding the zero button. The thing is, relying on buttons is not only faster in itself, but also doesn't force me through so many "you said 'X'; is that correct?" questions.
If it were something straightforward, I would already have helped myself on your website. The fact that I'm on the phone means it's an open-ended question requiring a real human on the other end (or that your website is useless).
What fails is general-purpose voice recognition. Constrained voice recognition, like "recognize these 10 digits and four options", works fine, though it still needs more metaphorical hand-holding than a human would. ("OK, please say a number." "Two." "... ... ... I think you said two? If so, please say yes." "Yes." "... ... ... OK, I guess you're done talking. Now let's move to the next option." Humans don't need the "... ... ...".)
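That constrained case is tractable precisely because the space of legal answers is tiny: with ten digits and a few options, the system only has to snap whatever it heard onto the nearest legal utterance. A toy illustration, using string similarity as a crude stand-in for the recognizer's acoustic scores (the vocabulary here is hypothetical):

    import difflib

    LEGAL = ["zero", "one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine", "yes", "no", "agent"]

    def constrained_recognize(hypothesis):
        # Snap a (possibly misheard) transcription onto the small grammar;
        # None means "no close match", i.e. reprompt the caller.
        match = difflib.get_close_matches(hypothesis.lower(), LEGAL, n=1, cutoff=0.6)
        return match[0] if match else None

    print(constrained_recognize("too"))   # -> "two"
    print(constrained_recognize("blue"))  # -> None, reprompt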
I think there have been numerous acquisitions of voice/speech-recognition startups by both Google and Apple (http://www.enterprisemobiletoday.com/news/article.php/387924...), because indeed it's very clear that this, together with NLP, is going to be a thing of the future.
And we're probably not far into it just because of how impenetrably tortuous it is. Almost everything related to it is variable: culture is always evolving language, slang falls out of usage over time, and there are accents to worry about. On top of these problems, there is the parallel demand for progress in NLP, where some of the most difficult challenges really lie (http://en.wikipedia.org/wiki/Natural_language_processing#Con...). With all that said, I have my money on Google. They seem to be doing a lot of work in areas that require handling exactly this kind of variability. The voice transcription feature on YouTube currently seems fairly impressive, and I've noticed in the past that Google search's NLP abilities are curiously good, certainly far ahead of any other search engine today.
Can anyone tell me why we aren't able to progress to better levels of recognition? Am I correct to assume it has nothing to do with computing power, and everything to do with (semantic/linguistic) software?
It's more than just linguistic software. Our knowledge of linguistics itself is currently very limited; it's a nascent science, and there's still a great deal of debate about how to even approach the study of language. Even leaving aside the difficulties of just transcribing speech, linguists are still a long way from any formalism of human syntax that could help software syntactically parse normal human speech.
Most commenters are focusing on relatively high-level features of decoding speech. It is important to also be aware that there is still great debate about what the acoustic correlates of linguistic events in speech are. It seems that our words are composed of subunits (usually taken to be phones out of the IPA, though there is work on alternatives), but exactly which acoustics correspond to which phones is still unsettled: lots of debate, and mediocre recognition performance.
Undoubtedly there is much room for improvement in these higher-level features, but computers are still well behind humans at large-vocabulary isolated keyword spotting: a task where one word from a very large vocabulary is spoken and the human or computer has to guess which word it was. Computers do poorly relative to humans (particularly in noise), which suggests that many of the mistakes computers make come from not being able to interpret the acoustics correctly.
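An old-school way to see what the isolated-word task involves: before HMMs, systems matched a spoken word against stored templates with dynamic time warping (DTW), and the hard part was, and remains, getting acoustic features robust enough that the right template wins. A bare-bones DTW over 1-D feature sequences (real systems compare multi-dimensional frames such as MFCCs; this is only the matching skeleton, with made-up "features"):

    import numpy as np

    def dtw(a, b):
        # Dynamic time warping distance: aligns two sequences that may be
        # spoken at different speeds, then sums the per-frame differences.
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    templates = {"yes": [1, 3, 4, 1], "no": [2, 2, 5, 5]}  # stored "features"
    spoken = [1, 3, 3, 4, 1]                               # a stretched "yes"
    print(min(templates, key=lambda w: dtw(spoken, templates[w])))  # -> yes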