Ask HN: Open-source voice assistants like Siri? Or can I build one on my own?
Do you guys have any tips or experience with this, and with how to get started? I expect there to be some gotchas that I am unaware of.
The voice assistant does not need to be perfect but does need to be good. It should capture at least 80% of what I say in formal non-slang English correctly. I want to be able to speak in sentences, like I do with Siri.
What would your approach be to build or integrate this? Is it even feasible?
I am willing to invest one to two months full-time in learning the required machine learning. I currently know basic neural nets (Michael Nielsen’s book), basic statistics (e.g. logistic regression) and basic machine learning (SVM, k-NN, PCA, random forests, decision trees, bag of words).
[+] [-] synesthesiam|6 years ago|reply
Rhasspy lets you describe the set of sentences you want to speak using a simple grammar with annotations for named entities (https://rhasspy.readthedocs.io/en/latest/training/#sentences...). It outputs JSON over HTTP/Websockets/MQTT, so it works well with NodeRED, Home Assistant, etc.
Disclaimer: I created and maintain Rhasspy.
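To illustrate the "JSON out" idea, here is a minimal sketch of dispatching on a recognised intent in Python. The payload shape (`intent.name`, `slots`), the intent name and the handler are assumptions made up for this example, not taken from the Rhasspy documentation, so check the docs for the real schema before relying on it.

```python
import json

# Hypothetical handler: the slot names "name" and "state" are illustrative only.
def change_light_state(slots: dict) -> str:
    return f"turning {slots.get('state', 'on')} the {slots.get('name', 'light')}"

# Map intent names (assumed, not from any real profile) to handler functions.
HANDLERS = {
    "ChangeLightState": change_light_state,
}

def handle_intent(payload: str) -> str:
    """Parse a JSON intent event and route it to the matching handler."""
    event = json.loads(payload)
    name = event.get("intent", {}).get("name", "")
    slots = event.get("slots", {})
    handler = HANDLERS.get(name)
    if handler is None:
        return f"unhandled intent: {name!r}"
    return handler(slots)

if __name__ == "__main__":
    example = '{"intent": {"name": "ChangeLightState"}, "slots": {"name": "lamp", "state": "on"}}'
    print(handle_intent(example))
```

The same function could sit behind an HTTP endpoint, a websocket callback or an MQTT subscriber; only the transport changes, the dispatch stays the same.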
[+] [-] nmstoker|6 years ago|reply
Do you have any measures of how well it recognises spoken commands?
And have you seen anyone using it with non-American accents for English? I ask because it relies on the CMU dictionary, and the tools I've seen that use it tend to struggle with other accents, understandably.
[+] [-] melling|6 years ago|reply
https://mycroft.ai/
https://github.com/MycroftAI
[+] [-] 52-6F-62|6 years ago|reply
https://snips.ai/developers/
https://github.com/snipsco
[+] [-] nmstoker|6 years ago|reply
If you roll your own, you'll probably still want to reuse existing components for wake words, ASR and TTS, simply training them for your specific needs.
One tip: aim for higher than that 80% target - the sentence where you mention it has 16 words, so at 80% word accuracy you'd expect about 3.2 errors (16 × 0.2) when you read it aloud, which will quickly get annoying (it could throw your intent recognition off completely). If you can restrict it to a narrow vocabulary, then you can train a language model on just the minimal set of words needed, and that should reduce the word error rate dramatically.
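To put numbers on that, here is a quick back-of-the-envelope sketch in Python. It assumes every word is recognised independently with the same accuracy, which is a simplification (real ASR errors tend to cluster), and the function names are just illustrative.

```python
def expected_word_errors(num_words: int, word_accuracy: float) -> float:
    """Expected number of misrecognised words in a sentence."""
    return num_words * (1.0 - word_accuracy)

def prob_sentence_all_correct(num_words: int, word_accuracy: float) -> float:
    """Chance that every word is right, assuming independent errors."""
    return word_accuracy ** num_words

if __name__ == "__main__":
    num_words = 16  # length of the sentence discussed above
    for accuracy in (0.80, 0.90, 0.95, 0.99):
        errors = expected_word_errors(num_words, accuracy)
        clean = prob_sentence_all_correct(num_words, accuracy)
        print(f"accuracy {accuracy:.0%}: ~{errors:.1f} errors, "
              f"{clean:.1%} chance of a fully correct sentence")
```

Under these assumptions, an 80% per-word accuracy leaves only a few percent of 16-word sentences completely error-free, which is why a higher target (or a narrower vocabulary) matters so much.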
[+] [-] beshrkayali|6 years ago|reply
After thinking about it, though, I found that I don't really need voice recognition at all. What I really wanted was a device that helps me do a few things well: in my case, listen to the radio, announce calendar events, and check train times (for Stockholm). So I just built that into a Raspberry Pi with a tiny screen and a few buttons. This little device is more than enough for me.