This might be better described as scenario-based command recognition, where a scenario is something like "Firefox" or "Skype", with commands specific to the scenario you're in. In other words, if you're looking to do automated voice transcription, these aren't the libraries you're looking for.
This seems to be the most important/interesting part:
"There is a simple rule of thumb in speech recognition: The smaller the application domain, the better the recognition accuracy. [...] Simon can now re-configure itself on-the-fly as the current situation changes. Through "context conditions" Simon 0.4 can automatically activate and deactivate selected scenarios, microphones and even parts of your training corpus. For example: Why listen for "Close tab" when your browser isn't even open? Or why listen for anything at all when you're actually in the next room listening to music? Yes, Simon is watching you."
Narrowing by context is one of the best subconscious techniques humans use for both listening and reading, so it makes complete sense to implement it here. I do find the choice of words in the final sentence somewhat ominous, though!
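A hypothetical sketch (not Simon's actual API; the scenario names and the `state` structure are made up) of how "context conditions" can shrink the active vocabulary: each scenario declares when it should listen, and the recognizer only considers commands from currently active scenarios.

```python
# Each scenario lists its commands and a condition on the current
# desktop state; only commands from active scenarios are listened for.
SCENARIOS = {
    "firefox": {"commands": ["close tab", "new tab", "reload"],
                "condition": lambda state: "firefox" in state["open_apps"]},
    "skype":   {"commands": ["answer call", "hang up"],
                "condition": lambda state: "skype" in state["open_apps"]},
}

def active_vocabulary(state):
    """Return only the commands whose scenario condition currently holds."""
    words = []
    for scenario in SCENARIOS.values():
        if scenario["condition"](state):
            words.extend(scenario["commands"])
    return words

print(active_vocabulary({"open_apps": {"firefox"}}))
# -> ['close tab', 'new tab', 'reload']
```

A smaller active vocabulary means fewer confusable hypotheses for the recognizer to choose between, which is exactly the "smaller domain, better accuracy" rule of thumb from the quote.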
Bottom line: speech recognition in the general case (more than a few predetermined words) is only as good as its 1) acoustic model (which utterances were heard), and 2) language model (how the utterances are grouped into words).
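This is the classic noisy-channel formulation: pick the word sequence W maximizing P(W | audio) ∝ P(audio | W) · P(W). A toy illustration with invented scores (the candidate strings and probabilities are made up for the example):

```python
# word sequence -> (acoustic score P(audio|W), language model score P(W))
# A bad language model can't rescue acoustically similar hypotheses.
candidates = {
    "wreck a nice beach": (0.40, 0.001),
    "recognize speech":   (0.35, 0.020),
}

def best_hypothesis(cands):
    """Pick the hypothesis maximizing acoustic score * language score."""
    return max(cands, key=lambda w: cands[w][0] * cands[w][1])

print(best_hypothesis(candidates))  # -> recognize speech
```

Even though "wreck a nice beach" scores slightly better acoustically, the language model prefers "recognize speech", which is why both models need lots of labeled data to be any good.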
This requires massive amounts of labeled data. This is why Nuance is king and few others come close: the amount of labeled data necessary to catch up is astounding. Not to mention a patent minefield to navigate.
This is unfortunately one field in which open-source alternatives face real obstacles and won't be viable in the near future.
I don't think this is quite the sort of answer you want:
My sibling (using Debian Testing) had wrist RSI this summer, and we tried to set up Simon Listens (the previous version; from the release notes, 0.4 looks like an improvement). We are both Linux nerds. We were not able to get it to do anything useful after a few days of work, and I estimated a 50-50 chance that working harder on it would help[1]. I did not find any other FOSS speech-to-text that I could get working either. (FOSS Linux text-to-speech is much better; e.g. Orca is good.) We did not try any commercial products; Dragon NaturallySpeaking is the only one I know of with a good reputation, but it is Windows-based, and it's hard to integrate well into the Linux stack without being FOSS. A list of products we looked at: https://en.wikipedia.org/wiki/Speech_recognition_in_Linux
[1] Issues with compiling, dependencies, figuring out the conceptual model Simon Listens uses, trying to figure out whether Simon-not-doing-anything was because we miscompiled it, or audio input, or incompatible dep versions or misconfigured deps, or us just doing the wrong thing because the English documentation wasn't super thorough... Imagine setting up Apache, MySQL and PHP if there weren't a billion tutorials online, you'd never used Apache, and MySQL wasn't compatible with your GCC unless you pulled the git version and hoped you didn't get confused by dev-version-only bugs.
I've spent a good bit of time trying to get CMU Sphinx to transcribe audio with any reasonable accuracy. I was never very successful, and eventually resorted to paid third-party APIs.
I really wish there were better options out there. Hopefully Simon will help improve the landscape. The automatic contributions to Voxforge should help.
This, and the other comments about the poor state of open-source speech recognition, are very disappointing.
Until now, I'd assumed the voice input on Android was open source. If it were, though, it could clearly be taken and integrated into desktop apps like this. How does it actually work? Is it a closed-source plugin? Does it send a recording of what you say to a Google API?
I would love to see a good, quality, open source alternative to Nuance. They are pretty much a monopoly (patent- and market-wise) in this area. It also has to be open source: Nuance is known for either suing competitors into oblivion (for patent infringement) or buying them.
This is actually the sort of project that may never be open source. Not because of the patents, but because it's hard and requires thousands of hours of thankless data collection. There's no shortcut to something cool, and open source loves cool demos that are easy to build.
Does anyone know if this does full speech-to-text transcription, i.e., I speak and it fills my speech into a text box? Or is it just for controlling the desktop via speech? I tried googling, but couldn't come up with much.
On a side note: this is part of the Web Speech API (commonly lumped in with HTML5). If people are not afraid of Google/Chrome and being online is not an issue, it's as simple as this: http://jsfiddle.net/dirkk0/pGFuR/
I'm really glad somebody's working on this. When my ex got RSI a couple of years ago, it seemed like there was no option but to go back to Windows. There aren't many insurmountable issues left for Linux users, but this seemed to be one of them.
I had some success (five to ten years ago now) with having a Windows machine purely to run the dictation software, and using a remote access tool like x2x or synergy to pass the keystrokes through to a Linux box which ran my actual desktop. Obviously you lose some of the application-awareness, but for people who need the voice recognition and find Windows drives them up the wall, it's better than nothing. It also has the incidental advantage that the dictation software isn't competing with anything else for CPU and RAM.
I would like to include speech recognition in my commercial projects, but the license is GPL and it's tied to KDE.
I've also tried Sphinx before, but the recognition is kind of poor and it lacks a GUI for user/developer configuration of the grammars.
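For what it's worth, the grammars Sphinx accepts for command-style recognition are usually written in JSGF. A minimal hand-written example (the grammar and rule names here are made up for illustration):

```
#JSGF V1.0;

grammar commands;

public <command> = (open | close) (tab | window);
```

This restricts recognition to eight phrases, which is exactly the kind of domain-narrowing that makes small-vocabulary recognition tractable, but today you edit these files by hand.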
I wonder if speech recognition software can be developed with:
- A Dynamic Time Warping (DTW) algorithm for comparing utterances/words.
- A recording device for the users to record their words.
- Context separation, like Simon uses, for limiting the phrases to listen for at any time and improving accuracy.
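The comparison step in the first bullet can be sketched in a few lines. This is a minimal DTW over two 1-D sequences; a real recognizer would compare frames of acoustic features (e.g. MFCC vectors) rather than raw numbers:

```python
def dtw(a, b):
    """Return the minimal cumulative cost of aligning sequence a with b,
    allowing frames to be stretched or compressed in time."""
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of a
                                 cost[i][j - 1],      # skip a frame of b
                                 cost[i - 1][j - 1])  # match frames
    return cost[len(a)][len(b)]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # -> 0.0 (the repeated frame aligns free)
print(dtw([1, 2, 3], [4, 5, 6]))     # -> 9.0
```

At recognition time you would compute `dtw(input, template)` against each recorded template and pick the closest one, which is why the second bullet (a recorder for the user's own words) goes hand in hand with this approach.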
"A recording device for the users to record their words."
Do you mean a software device like some kind of control panel? While that's a solution that eases the software developer's job, that's not how people want their software to work. I'm a software developer myself and I don't want my speech software to require training. Or if it's going to require training, fake it for me. Maybe a wizard: "Hi, I'm Simon! I need to hear your voice a bit before we get started. Please read the following sentence: ..." or something.
Sure, this is a <1.0 release, and maybe this recorder will help the devs learn their problem domain a bit more deeply, but I sincerely hope it doesn't become an engineer's crutch. IMO, it's Not Good to expect users to adapt themselves to the technology that's supposed to be serving them.
Main site: http://userbase.kde.org/Simon
Just to be clear: it wouldn't take you from a bucket of raw sound samples to a string like "I'd like a Coke, please"?
Also, there is no Wikipedia page for them!
"If you are a packager and would like to package Simon 0.4, please do get in touch with us. Thank you."