top | item 22576589

Ask HN: Non-cloud voice recognition for home use?

440 points | rs23296008n1 | 6 years ago | reply

I'd like a home-based voice recognition without some off-site cloud.

I'd like a kind of Echo Dot-like thing running on a set of Raspberry Pi devices, each with a microphone and speaker. Ideally they'd be all over the house. I'm happy if they talk back via WiFi to a server in my office for whatever real processing. The server might have 16 cores and 128 GB RAM. I might even have two of these if required.
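A minimal sketch of that satellite/server split, using stdlib sockets; the recognizer here is a dummy stand-in, not a real engine, and all names are placeholders:

```python
# Sketch of the satellite/server split: a Pi satellite streams captured
# audio to the office server over the LAN and gets text back. The
# "recognizer" is a placeholder; a real server would run an ASR engine.
import socket
import threading

srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # OS picks a free port
srv.listen(1)
PORT = srv.getsockname()[1]

def recognize(audio: bytes) -> str:
    # Placeholder: the real server would decode the audio here.
    return f"recognized {len(audio)} bytes"

def serve_once():
    conn, _ = srv.accept()
    with conn:
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:          # satellite closed its write side
                break
            chunks.append(chunk)
        conn.sendall(recognize(b"".join(chunks)).encode())
    srv.close()

def satellite(audio: bytes) -> str:
    with socket.socket() as s:
        s.connect(("127.0.0.1", PORT))
        s.sendall(audio)
        s.shutdown(socket.SHUT_WR)  # tell the server the audio is done
        reply = []
        while True:
            part = s.recv(4096)
            if not part:
                break
            reply.append(part)
        return b"".join(reply).decode()

t = threading.Thread(target=serve_once)
t.start()
result = satellite(b"\x00" * 16000)  # stand-in for a second of audio
t.join()
```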

What options do I have? What limits? I'd really prefer answers from people who have experiences with the various options.

If it helps I'm happy to reduce vocabulary to a dictionary of words as long as I can add more words as necessary. Training is also ok. I've already analysed my voice conversations with an echo dot and the vocabulary isn't that large.

Please remember: home use, no off-site clouds. I'm not interested in options involving even a free voice speech-to-text cloud. This eliminates google voice recognition, amazon etc. They are great but out of scope.

So far I've identified CMU Sphinx as a candidate but I'm sure there are others.

Ideas?

127 comments

[+] romwell|6 years ago|reply
TL;DR: Win 10 IoT for RasPi does it.

-----------------

Windows 10 IoT for Raspberry Pi comes with offline speech recognition API.

At a hackathon, it was not hard to slap together some code that turns on a light when someone says "banana".

Sounds like exactly what you need.

>If it helps I'm happy to reduce vocabulary to a dictionary of words

You can do that with an XML grammar file for offline recognition [4].

[1]https://docs.microsoft.com/en-us/windows/iot-core/tutorials/...

[2]https://docs.microsoft.com/en-us/windows/iot-core/extend-you...

Someone's demo project:

[3]https://www.hackster.io/krvarma/rpivoice-051857

[4]https://docs.microsoft.com/en-us/windows/uwp/design/input/sp...

[+] coredog64|6 years ago|reply
The Microsoft offline speech recognizer is pretty good. I did some work with it many years ago [0]. The only problem we had was with accents: My French co-worker had to use his most obnoxiously over-the-top American accent for reasonable accuracy. ISTR that we could switch to Australian English for the Aussies and Kiwis.

[0] https://github.com/spc-ofp/ObserverLengthSampler

[+] lucb1e|6 years ago|reply
This is really interesting, but I have a few questions:

- The setup guide shows a Windows system making a Windows IoT image. Can't I just download an ISO and flash it to an SD card with dd? Does it need a license?

- The demo projects show C#, and while I can develop in MonoDevelop, I don't have a Windows machine to compile it with. Is a C# compiler included in Windows IoT's .NET distribution, or are there cross-platform (interpreted) languages that run on Windows IoT (e.g. Python 3)?

[+] rs23296008n1|6 years ago|reply
Sounds good to me. RasPis are solid performers for us. I'm assuming the XML would need to be updated as the dictionary changes. Sounds easy enough. The loading of languages might get fussy/impossible if I want multiple; a stretch goal is to support multiple languages from the same device.

I'm hoping I can also load in text-to-speech, either separately or as part of the same application. From what I've read, the Windows approach on the Pi is more like an appliance: your application takes over the whole device. This is fine as long as I can load more functionality into that application.

I need to read more about this.

Thanks for the pointers.

[+] driverdan|6 years ago|reply
Does the IoT version track everything you do and cram ads down your throat like the regular version of Win 10?
[+] albertzeyer|6 years ago|reply
Are you searching for a complete solution including NLP and an engine to perform actions? Some of these are already posted, like Home Assistant, and Mycroft.

Sphinx is just for the automatic speech recognition (ASR) part. But there are better solutions for that:

Kaldi (https://kaldi-asr.org/) is probably the most comprehensive ASR solution, which yields very competitive state-of-the-art results.

RASR (https://www-i6.informatik.rwth-aachen.de/rwth-asr/) is for non-commercial use only, but otherwise similar to Kaldi.

If you want a simpler ASR system, nowadays end-to-end models perform quite well. There are quite a number of projects that support these:

RETURNN (https://github.com/rwth-i6/returnn) is non-commercial, TF-based. (Disclaimer: I'm one of the main authors.)

Lingvo (https://github.com/tensorflow/lingvo), from Google, TF-based.

ESPnet (https://github.com/espnet/espnet), PyTorch/Chainer.

...

[+] daanzu|6 years ago|reply
I develop Kaldi Active Grammar [1], which is mainly intended for use with strict command grammars. Compared to normal language models, these can provide much better accuracy, assuming you can describe (and speak) your command structure exactly. (This is probably more acceptable in a voice assistant aimed at a technical audience.) The grammar can be specified by an FST, or you can use KaldiAG through Dragonfly, which allows you to specify grammars (and their resultant actions) in Python. KaldiAG can also do plain dictation if you want.

KaldiAG has an English model available, but other models could be trained. Although you can't just drop in a standard Kaldi model, the modifications required are fairly minimal and don't require any training or modification of the acoustic model. All recognition is performed locally and offline by default, but you can selectively choose to do some recognition in the cloud, too.

Kaldi generally performs at the state of the art. As a hybrid engine, although training can be more complicated, it generally requires far less training data to achieve high accuracy compared to end-to-end engines.

[1] https://github.com/daanzu/kaldi-active-grammar
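The strict-grammar idea can be sketched in plain Python. This is not the Dragonfly or KaldiAG API, just the concept: enumerate every legal utterance up front, so the recognizer only has to choose among them, and map matches to structured commands.

```python
# Stdlib-only sketch of a strict command grammar (hypothetical grammar,
# not Dragonfly/KaldiAG syntax): <action> the <device> [in the <room>]
from itertools import product

ACTIONS = ["turn on", "turn off"]
DEVICES = ["light", "fan", "heater"]
ROOMS = ["", "kitchen", "office"]   # "" means no room given

def legal_phrases():
    for action, device, room in product(ACTIONS, DEVICES, ROOMS):
        phrase = f"{action} the {device}"
        if room:
            phrase += f" in the {room}"
        yield phrase

PHRASES = set(legal_phrases())

def parse(utterance: str):
    """Return a command dict if the utterance is in the grammar, else None."""
    if utterance not in PHRASES:
        return None
    words = utterance.split()
    return {
        "device": words[3],
        "state": "on" if words[1] == "on" else "off",
        "room": words[6] if len(words) > 6 else None,
    }
```

Because the whole language is enumerable, anything outside it is rejected outright, which is where the accuracy win over an open vocabulary comes from.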

[+] guptaneil|6 years ago|reply
Disclaimer: I am the founder of Hiome, a smart home startup focused on private-by-design, local-only products.

What actions are you looking to handle with the assistant?

Reason I ask is because a voice assistant is a command line interface with no auto-complete or visual feedback. It doesn’t scale well as you add more devices or commands to your home, because it becomes impossible to remember all the phrases you programmed. We’ve found the person who sets up the voice assistant will use it for simple tasks like “turn off all lights” but nobody else benefits and it gets little use beyond timers and music. They are certainly nice to have, but they don’t significantly improve the smart home experience.

If you’re looking to control individual devices, I suggest taking a look at actual occupancy sensors like Hiome (https://hiome.com), which can let you automate your home with zero interaction so it just works for everyone without learning anything (like in a sci-fi movie). Even if you’re the only user, it’s much nicer to never think about your devices again.

Happy to answer any questions about Hiome or what we’ve learned helping people with smart homes in general! -> [email protected]

[+] rs23296008n1|6 years ago|reply
Sure. We've got a house with multiple buildings, including sheds, halls etc.

Around 100 people need separate profiles; each should be able to set alarms, timers, reminders, etc. If they want a routine to create any of those, or to tell them the time, date, or temperature, they should be able to do that from any of the voice assistants in any room. They might want such a routine in only a particular room. They should be able to define a home device and a current device. A home device would usually be in a bedroom, for those of us that need them.

I definitely don't want to have to create any of those routines for them. Nothing about these should be fixed in stone. They have to be able to self-serve. We can assume they can navigate the iOS Amazon app as a baseline level of knowledge.

Room settings include temperature, lighting, curtains, tv on/off, channel, volume to name a few. The voice assistant in some rooms should be able to show web pages on-screen, or even youtube etc. including the laptop someone plugged in on HDMI1.

...the coffee machine automation is also a requirement. It's controlled by a Flask app. The voice control should let you order a coffee, strong, black. Or a Dave#5.

We'd also like device detection to trigger when people's phones appear in certain locations.

What kinds of options exist for this?
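The coffee-machine piece, at least, is easy to sketch: once an engine has produced text, driving the Flask app is just an HTTP call. A stdlib sketch, where the endpoint, the presets, and the payload shape are all hypothetical:

```python
# Sketch: map an utterance like "coffee, strong, black" or "Dave#5" to a
# payload for the (hypothetical) Flask coffee app. Presets are made up.
import json
from urllib import request

PRESETS = {"dave#5": {"drink": "coffee", "strength": "strong", "milk": False}}

def order_payload(utterance: str) -> dict:
    """Turn a recognized coffee order into a request payload."""
    text = utterance.lower().strip()
    if text in PRESETS:
        return dict(PRESETS[text])
    parts = [p.strip() for p in text.split(",")]
    return {
        "drink": parts[0],
        "strength": "strong" if "strong" in parts else "normal",
        "milk": "black" not in parts,
    }

def send_order(payload: dict, url="http://coffee.local:5000/order"):
    # POST the order to the Flask app (hypothetical URL and endpoint).
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)
```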

[+] perturbation|6 years ago|reply
If you don't mind getting your hands dirty a bit, I think Nvidia's model [Jasper](https://arxiv.org/pdf/1904.03288.pdf) is near SOTA, and they have [pretrained models](https://ngc.nvidia.com/catalog/models/nvidia:jaspernet10x5dr) and [tutorials / scripts](https://nvidia.github.io/NeMo/asr/tutorial.html) freely available. The first is in their library "nemo", but they also have it available in [vanilla Pytorch](https://github.com/NVIDIA/DeepLearningExamples/tree/master/P...) as well.
[+] rs23296008n1|6 years ago|reply
Do you have any experience/opinions on those?
[+] nshm|6 years ago|reply
You are welcome to try Vosk

https://github.com/alphacep/vosk-api

Advantages are:

1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian

2) Works offline even on lightweight devices - Raspberry Pi, Android, iOS

3) Install it with simple `pip install vosk`

4) Model size per language is just 50 MB

5) Provides a streaming API for the best user experience (unlike the popular SpeechRecognition Python package)

6) There are APIs for different languages too - java/csharp etc.

7) Allows quick reconfiguration of vocabulary for best accuracy.

8) Supports speaker identification beside simple speech recognition
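A stdlib illustration of why point 7 matters. This is not Vosk's API (Vosk constrains the decoder directly when you pass it a phrase list); it just shows the idea that with a small vocabulary, a noisy hypothesis only has to be snapped onto a short list of allowed phrases:

```python
# With a constrained vocabulary, even a garbled hypothesis can be snapped
# onto the closest allowed phrase. Vosk does this properly inside the
# decoder; difflib here is only a toy stand-in for that idea.
from difflib import get_close_matches

VOCABULARY = ["turn on the light", "turn off the light",
              "set a timer", "what time is it"]

def snap(hypothesis: str, cutoff=0.6):
    """Map a noisy hypothesis onto the closest allowed phrase, or None."""
    matches = get_close_matches(hypothesis, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```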

[+] synesthesiam|6 years ago|reply
Rhasspy author here in case you have any questions :)

If you're looking for something for the command-line, check out https://voice2json.org

[+] rs23296008n1|6 years ago|reply
The description from a superficial read looks good. Thanks!
[+] lukifer|6 years ago|reply
I’m currently assembling an offline home assistant setup using Node-RED and voice2json, all running on Raspberry Pi’s:

http://voice2json.org/

https://nodered.org/

Requires a little customization and/or coding, but it’s quite elegant, and all voice recognition happens on-device. Part of what makes the recognition much more accurate (subjectively, 99%ish) is the constrained vocabulary; the grammars are compiled from a simple user-defined markup language, and then parsed into JSON intents, containing both the full text string and appropriate keywords/variables split out into slots.

Just finished a similar rig in my car, acting as a voice-controlled MP3 player, with thousands of artists and albums compiled into intents from the iTunes XML database. Works great, and feels awesome to have a little 3-watt baby computer doing a job normally delegated to massive corporate server farms. ;)
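The template-to-intent flow described above can be sketched in plain Python. The template syntax here is invented (voice2json's real sentences.ini format is richer), but it shows how one grammar yields both the full phrase list and the slot values:

```python
# Sketch: expand a (made-up) template syntax like
#   "turn (on|off){state} the (light|fan){device}"
# into every sentence it allows, each paired with its slot values,
# the way a grammar-compiled recognizer emits JSON intents.
import itertools
import re

TEMPLATE = "turn (on|off){state} the (light|fan){device}"

def expand(template: str):
    """Yield (sentence, slots) pairs for every phrase the template allows."""
    parts = re.split(r"\(([^)]*)\)\{(\w+)\}", template)
    # parts alternates: literal, options, slot-name, literal, ...
    fixed = parts[0::3]
    options = [p.split("|") for p in parts[1::3]]
    names = parts[2::3]
    for choice in itertools.product(*options):
        sentence = fixed[0]
        for value, literal in zip(choice, fixed[1:]):
            sentence += value + literal
        yield sentence, dict(zip(names, choice))

INTENTS = {s: {"intent": "ChangeDevice", "slots": slots}
           for s, slots in expand(TEMPLATE)}
```

Constraining recognition to exactly these sentences is what makes the subjective accuracy so high: the engine never has to consider an utterance outside the grammar.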

[+] stragies|6 years ago|reply
Hi Lukifer, thanks for chiming in! I had a setup using Snips that I'm looking to replace. Please do document your setup and your little helper scripts in a blog post or such, and ping me/us :)
[+] awinter-py|6 years ago|reply
important question

I think there's a group of highly technical people who feel increasingly left behind by 'convenience tech' because of what they have to give up in order to use it

[+] rs23296008n1|6 years ago|reply
Well I've got all the gadgets controllable now without internet as a requirement. Only the voice part requires it now for us at least. Google home/Amazon echos and phone apps can communicate with the house and surrounds without issue.

Loss of internet access is not an excuse for ignoring basic voice commands in my opinion.

Privacy is also an important factor but not the primary driver for us.

[+] skamoen|6 years ago|reply
I've read good things about Mycroft [1], though I haven't tried it myself. Ticks all the boxes though

[1] https://mycroft.ai/

[+] reaperducer|6 years ago|reply
I wish you luck with this, and more importantly, hope that it inspires many people to start building similar projects.

I know virtually nothing about voice recognition, but my spidey sense tells me that it should be possible with the hardware you specify.

A Commodore 64 with a Covox VoiceMaster could recognize voice commands and trigger X-10 switches around a house. (Usually. My setup had about a 70% success rate, but that was pretty good for the time!) Surely a 16-core, 128 GB RAM machine should be able to do far more.

[+] rs23296008n1|6 years ago|reply
It's beginning to take shape. I've already got a bunch of good candidates for experimentation next week based on the answers so far.
[+] otodic|6 years ago|reply
My company develops SDKs for on-device speech recognition on Android/iOS: https://keenresearch.com/keenasr-docs (Raspberry Pi is an option too, we'll have a GA release in Q2)

We license this on a commercial basis but would be open to indie-developer-friendly licensing. We offer a trial SDK that makes testing/evaluation super easy (it works for 15 minutes at a time).

Ogi

[email protected]

[+] winkelwagen|6 years ago|reply
I've had some good experience with https://snips.ai. Works as advertised, easy to implement. The hardest thing was getting the microphone and the Pi to get along.
[+] arendtio|6 years ago|reply
Did you visit the website lately? Doesn't seem to be an option anymore :-/
[+] JanisL|6 years ago|reply
I was one of the maintainers of the Persephone project which is an automated phonetic transcription tool. This came about from a research project that required a non-cloud solution. This project is open source and can be found on GitHub:

https://github.com/persephone-tools

This may be a little too low-level for what you need, as there's no language model, but maybe it's helpful as part of your system.

[+] gibs0ns|6 years ago|reply
I was in the process of planning my multi-room voice-AI setup based on Snips (to be integrated with Home Assistant) when it was announced they were bought by Sonos, which killed their open source project. Since then I have been trying various projects to find one that meets my needs.

Among those, I tried Mycroft, which still requires a cloud account to configure various things, and it doesn't support a multi-room setup at this time.

I've since switched to Rhasspy, which offers a larger array of config options and engines, and also multi-room (I'm yet to config multi-room tho)

In the long-term I plan to "train" the voice-AI for various additions, including a custom wake word - No, I'm not calling it `Jarvis` ;)

I'm running each of these voice AIs on a Raspberry Pi 4 (4 GB model), though I'm considering switching them to Pi 3s. I'm using the `ReSpeaker 2mic Pi-Hat` on each Pi for the mic input. I'm planning to configure all the satellite nodes (the voice AI in each room) to PXE boot; that way they don't require an SD card and I can easily update their images/configs from a central location.

[+] OlympusMonds|6 years ago|reply
Is it possible to make your config available, e.g., GitHub?

I'm just starting to get going with Rhasspy, integrating with Home Assistant, and the docs miss just enough that I hit walls every time I try.

Thanks for the info you've already provided though, sounds like I want exactly what you do.

[+] villgax|6 years ago|reply
Google has papers on on-device speech recognition; these models are used in the keyboard & for Live Caption on Pixel devices.
[+] teapourer|6 years ago|reply
They are trained on a ton of non-public data though, and I’m not sure if pre-trained models are around.
[+] coryrc|6 years ago|reply
I tried to use Julius for this. I may have misconfigured it, but it would always match something to what it was hearing. I encoded some sounds in my grammar to error terms that it would detect in quiet noise (like 'aa' and 'hh'), but it would still occasionally match words when nothing was going on.

Later I worked on the Microsoft Kinect with its 4-microphone array. With only a single microphone, it's so much harder to filter out background noise. If you don't find a system based on multiple microphones, I don't believe you can be successful if there's any ongoing noise (dishwasher, loud fans, etc), but a system that works in only quiet conditions is possible.