Ask HN: Non-cloud voice recognition for home use?
440 points| rs23296008n1 | 6 years ago | reply
I'd like a kind of Echo Dot-like thing running on a set of Raspberry Pi devices, each with a microphone and speaker. Ideally they'd be all over the house. I'm happy if they talk back via wifi to a server in my office for whatever real processing. The server might have 16 cores and 128 GB RAM. Might even have two of these if required.
What options do I have? What limits? I'd really prefer answers from people who have experience with the various options.
If it helps I'm happy to reduce vocabulary to a dictionary of words as long as I can add more words as necessary. Training is also ok. I've already analysed my voice conversations with an echo dot and the vocabulary isn't that large.
Please remember: home use, no off-site clouds. I'm not interested in options involving even a free cloud speech-to-text service. This eliminates Google voice recognition, Amazon, etc. They are great but out of scope.
So far I've identified CMU Sphinx as a candidate but I'm sure there are others.
Ideas?
[+] [-] romwell|6 years ago|reply
Windows 10 IoT for Raspberry Pi comes with offline speech recognition API.
At a hackathon, it wasn't hard to slap together some code that turns on a light when someone says "banana".
Sounds like exactly what you need.
>If it helps I'm happy to reduce vocabulary to a dictionary of words
You can do that with an XML grammar file for offline recognition [4].
[1]https://docs.microsoft.com/en-us/windows/iot-core/tutorials/...
[2]https://docs.microsoft.com/en-us/windows/iot-core/extend-you...
Someone's demo project:
[3]https://www.hackster.io/krvarma/rpivoice-051857
[4]https://docs.microsoft.com/en-us/windows/uwp/design/input/sp...
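For reference, the grammar format used for offline recognition on Windows is the W3C SRGS XML format; a minimal sketch (the phrases are made up, not from the linked docs) looks something like:

```xml
<!-- Minimal SRGS grammar: matches "turn on the light" / "turn off the light" -->
<grammar version="1.0" xml:lang="en-US" root="command"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="command">
    <item>turn</item>
    <one-of>
      <item>on</item>
      <item>off</item>
    </one-of>
    <item>the light</item>
  </rule>
</grammar>
```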
[+] [-] coredog64|6 years ago|reply
[0] https://github.com/spc-ofp/ObserverLengthSampler
[+] [-] lucb1e|6 years ago|reply
- The setup guide shows a Windows system building the Windows IoT image. Can't I just download an ISO and flash it to an SD card with dd? Does it need a license?
- The demo projects show C#, and while I can develop in MonoDevelop, I don't have a Windows machine to compile it with. Is a C# compiler included in Windows IoT's .NET distribution, or are there also cross-platform (interpreted) languages that run on Windows IoT (e.g. Python 3)?
[+] [-] rs23296008n1|6 years ago|reply
I'd be hoping I can also load in text-to-speech, either separately or as part of the same application. From what I've read, the Windows approach to the Pi is more like an appliance: your application takes over the whole device. This is fine as long as I can load more functionality into that application.
I need to read more about this.
Thanks for the pointers.
[+] [-] albertzeyer|6 years ago|reply
Sphinx is just for the automatic speech recognition (ASR) part. But there are better solutions for that:
Kaldi (https://kaldi-asr.org/) is probably the most comprehensive ASR solution, which yields very competitive state-of-the-art results.
RASR (https://www-i6.informatik.rwth-aachen.de/rwth-asr/) is for non-commercial use only but otherwise similar to Kaldi.
If you want a simpler ASR system, nowadays end-to-end models perform quite well. Quite a number of projects support these:
RETURNN (https://github.com/rwth-i6/returnn) is TF-based, non-commercial. (Disclaimer: I'm one of the main authors.)
Lingvo (https://github.com/tensorflow/lingvo), from Google, TF-based.
ESPnet (https://github.com/espnet/espnet), PyTorch/Chainer.
...
[+] [-] daanzu|6 years ago|reply
KaldiAG has an English model available, but other models can be trained. You can't just drop in a standard Kaldi model unmodified, but the modifications required are fairly minimal and don't require any training or changes to its acoustic model. All recognition is performed locally and offline by default, but you can also selectively choose to do some recognition in the cloud.
Kaldi generally performs at the state of the art. As a hybrid engine, its training can be more complicated, but it generally requires far less training data to achieve high accuracy compared to "end-to-end" engines.
[1] https://github.com/daanzu/kaldi-active-grammar
[+] [-] guptaneil|6 years ago|reply
What actions are you looking to handle with the assistant?
The reason I ask is that a voice assistant is a command-line interface with no auto-complete or visual feedback. It doesn't scale well as you add more devices or commands to your home, because it becomes impossible to remember all the phrases you programmed. We've found the person who sets up the voice assistant will use it for simple tasks like "turn off all lights", but nobody else benefits and it gets little use beyond timers and music. They are certainly nice to have, but they don't significantly improve the smart home experience.
If you’re looking to control individual devices, I suggest taking a look at actual occupancy sensors like Hiome (https://hiome.com), which can let you automate your home with zero interaction so it just works for everyone without learning anything (like in a sci-fi movie). Even if you’re the only user, it’s much nicer to never think about your devices again.
Happy to answer any questions about Hiome or what we’ve learned helping people with smart homes in general! -> [email protected]
[+] [-] rs23296008n1|6 years ago|reply
Around 100 people need separate profiles. Each should be able to set alarms, timers, reminders, etc. If they want a routine to create any of those, or to tell them the time, date, or temperature, they should be able to do that from any of the voice assistants in any room, or limit a routine to a particular room. They should be able to define a home device and a current device; the home device would usually be a bedroom, for those of us who need one.
I definitely don't want to have to create any of those routines for them, and nothing about them should be fixed in stone. They have to be able to self-serve. We can assume they can navigate the iOS Amazon app as a baseline level of knowledge.
Room settings include temperature, lighting, curtains, TV on/off, channel, and volume, to name a few. The voice assistant in some rooms should be able to show web pages on-screen, or even YouTube, etc., including the laptop someone plugged in on HDMI1.
...the coffee machine automation is also a requirement. It's controlled by a Flask app. The voice control should let you order a coffee: strong, black. Or a Dave#5.
We'd also like device detection to trigger when people's phones appear in certain locations.
What kinds of options exist for this?
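The coffee-machine piece can stay decoupled from whichever voice stack you pick: the intent handler just POSTs recognized slots to the Flask app. A minimal sketch, where the route, preset name settings, and slot names are all assumptions, not the actual app:

```python
# Hypothetical sketch of a coffee-ordering Flask endpoint. The /order
# route, slot names, and the "Dave#5" settings are invented for
# illustration.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Named drinks map to explicit brew settings.
PRESETS = {"dave#5": {"strength": "strong", "milk": "none", "size": "large"}}

@app.route("/order", methods=["POST"])
def order():
    intent = request.get_json()
    # Either a named preset or explicit attributes from the intent's slots.
    spec = PRESETS.get(intent.get("preset", "").lower()) or {
        "strength": intent.get("strength", "normal"),
        "milk": intent.get("milk", "none"),
        "size": intent.get("size", "medium"),
    }
    return jsonify({"status": "queued", "order": spec})
```

A voice intent handler would POST either `{"preset": "Dave#5"}` or the individual slots, e.g. `{"strength": "strong", "milk": "none"}`.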
[+] [-] DataDrivenMD|6 years ago|reply
Alternatively, you could just fork the Almond project directly and take it from there: https://github.com/stanford-oval/almond-cloud
[+] [-] rs23296008n1|6 years ago|reply
Thanks.
[+] [-] homarp|6 years ago|reply
with install scripts: https://github.com/NVIDIA/OpenSeq2Seq/blob/master/scripts/je...
[+] [-] nshm|6 years ago|reply
https://github.com/alphacep/vosk-api
Advantages are:
1) Supports 7 languages: English, German, French, Spanish, Portuguese, Chinese, Russian
2) Works offline, even on lightweight devices: Raspberry Pi, Android, iOS
3) Installs with a simple `pip install vosk`
4) Model size per language is just 50 MB
5) Provides a streaming API for the best user experience (unlike the popular speech_recognition Python package)
6) There are APIs for other languages too: Java, C#, etc.
7) Allows quick reconfiguration of the vocabulary for best accuracy
8) Supports speaker identification besides plain speech recognition
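As a sketch of 3) and 7) combined, vosk's recognizer accepts a grammar given as a JSON list of allowed phrases; the model directory and phrase list below are placeholders:

```python
# Sketch: offline recognition with vosk restricted to a small command
# vocabulary. The model path and phrases are placeholders.
import json

def make_grammar(phrases):
    """Build the grammar string KaldiRecognizer accepts: a JSON list of
    allowed phrases, plus "[unk]" so out-of-vocabulary speech is rejected."""
    return json.dumps(list(phrases) + ["[unk]"])

def transcribe(wav_path, model_dir, phrases):
    """Transcribe a 16-bit mono WAV file, allowing only the given phrases."""
    import wave
    from vosk import Model, KaldiRecognizer  # pip install vosk

    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate(),
                          make_grammar(phrases))
    pieces = []
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        if rec.AcceptWaveform(data):  # end of an utterance
            pieces.append(json.loads(rec.Result())["text"])
    pieces.append(json.loads(rec.FinalResult())["text"])
    return " ".join(p for p in pieces if p)
```

For example, `transcribe("cmd.wav", "model", ["turn on the light", "turn off the light"])` would only ever return those phrases (or nothing, via `[unk]`).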
[+] [-] notemaker|6 years ago|reply
Haven't used it, but seems very nice.
https://youtu.be/ijKTR_GqWwA
[+] [-] synesthesiam|6 years ago|reply
If you're looking for something for the command-line, check out https://voice2json.org
[+] [-] lukifer|6 years ago|reply
http://voice2json.org/
https://nodered.org/
Requires a little customization and/or coding, but it’s quite elegant, and all voice recognition happens on-device. Part of what makes the recognition much more accurate (subjectively, 99%ish) is the constrained vocabulary; the grammars are compiled from a simple user-defined markup language, and then parsed into JSON intents, containing both the full text string and appropriate keywords/variables split out into slots.
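For illustration, voice2json's grammars live in a sentences.ini file, one section per intent; a small made-up example of the markup described above:

```ini
; Each [Section] is an intent; {name} tags mark slots in the JSON output.
[ChangeLightState]
light_name = (living room | kitchen) {name}
light_state = (on | off) {state}
turn <light_state> [the] <light_name> light
```

Saying "turn on the kitchen light" would then yield a `ChangeLightState` intent with `state: on` and `name: kitchen` in its slots.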
Just finished a similar rig in my car, acting as a voice-controlled MP3 player, with thousands of artists and albums compiled into intents from iTunes XML database. Works great, and feels awesome to have a little 3-watt baby computer doing a job normally delegated to massive corporate server farms. ;)
[+] [-] jvyduna|6 years ago|reply
https://makezine.com/2020/03/17/private-by-design-free-and-p...
[+] [-] awinter-py|6 years ago|reply
I think there's a group of highly technical people who feel increasingly left behind by 'convenience tech' because of what they have to give up in order to use it
[+] [-] rs23296008n1|6 years ago|reply
Loss of internet access is not an excuse for ignoring basic voice commands in my opinion.
Privacy is also an important factor but not the primary driver for us.
[+] [-] skamoen|6 years ago|reply
[1] https://mycroft.ai/
[+] [-] reaperducer|6 years ago|reply
I know virtually nothing about voice recognition, but my spidey sense tells me that it should be possible with the hardware you specify.
A Commodore 64 with a Covox VoiceMaster could recognize voice commands and trigger X-10 switches around a house. (Usually. My setup had about a 70% success rate, but pretty good for the time!) Surely a 16 core, 128GB RAM machine should be able to do far more.
[+] [-] otodic|6 years ago|reply
We license this on a commercial basis but would be open to indie-developer-friendly licensing. We offer a trial SDK that makes testing/evaluation super easy (it works for 15 minutes at a time).
Ogi
[email protected]
[+] [-] JanisL|6 years ago|reply
https://github.com/persephone-tools
This may be a little too low-level for what you want, as there's no language model, but maybe it's helpful as part of your system.
[+] [-] gibs0ns|6 years ago|reply
Among those, I tried Mycroft, which still requires a cloud account to configure various things, and it doesn't support a multi-room setup at this time.
I've since switched to Rhasspy, which offers a larger array of config options and engines, and also multi-room (I've yet to configure multi-room though).
In the long-term I plan to "train" the voice-AI for various additions, including a custom wake word - No, I'm not calling it `Jarvis` ;)
I'm running each of these voice-AI's on a Raspberry Pi 4 (4GB model), though I'm considering switching them to Pi 3's. I'm using the `ReSpeaker 2mic Pi-Hat` on each pi for the mic input. I'm planning to configure all the satellite nodes (voice-AI in each room) to PXE boot, that way they don't require an sd-card and I can easily update their images/configs from a central location.
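For the PXE-boot plan, the usual route is dnsmasq on the central server in proxy-DHCP/TFTP mode, so the existing router keeps assigning addresses. A sketch, where the subnet and paths are assumptions:

```
# /etc/dnsmasq.conf fragment on the boot server (subnet is an assumption)
dhcp-range=192.168.1.0,proxy      # proxy mode: the router still hands out IPs
log-dhcp
enable-tftp
tftp-root=/srv/tftpboot           # one boot directory per Pi serial number
pxe-service=0,"Raspberry Pi Boot" # magic string the Pi firmware looks for
```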
[+] [-] OlympusMonds|6 years ago|reply
I'm just starting to get going with Rhasspy, integrating it with Home Assistant, and the docs miss just enough that I hit walls every time I try.
Thanks for the info you've already provided though, sounds like I want exactly what you do.
[+] [-] coryrc|6 years ago|reply
Later I worked on the Microsoft Kinect, with its 4-microphone array. With only a single microphone, it's so much harder to filter out background noise. If you don't find a system based on multiple microphones, I don't believe you can be successful when there's any ongoing noise (dishwasher, loud fans, etc.), but a system that works only in quiet conditions is possible.