The era of open voice assistants

931 points| _Microft | 1 year ago |home-assistant.io

278 comments

[+] Jarwain|1 year ago|reply
I'm actually really excited for this!

I noticed recently there weren't any good open source hardware projects for voice assistants with a focus on privacy. There's another project I've been thinking about where I think the privacy aspect is Important, and figuring out a good hardware stack has been a Process. The project I want to work on isn't exactly a voice assistant, but same ultimate hardware requirements

Something I'm kinda curious about: it sounds like they're planning on a sorta batch manufacturing by resellers type of model. Which I guess is pretty standard for hardware sales. But why not do a sorta "group buy" approach? I guess there's nothing stopping it from happening in conjunction

I've had an idea floating around for a site that enables group buys for open source hardware (or 3d printed items), that also acts like or integrates with github wrt forking/remixing

[+] pimeys|1 year ago|reply
I'm also very excited. I've had some ESP32 microphones before, but they were not really able to understand the wake word, sometimes even when it was quiet and you were sitting next to the mic.

This one looks like it can recognize your voice very well, even when music is playing.

Because... when it works, it's amazing. You get that Star Trek wake word (KHUM-PUTER!), you can connect your favorite LLM to it (ChatGPT, Claude Sonnet, Ollama), you can control your home automation with it and it's as private as you want.

I ordered two of these, if they are great, I will order two more. I've been waiting for this product for years, it's hopefully finally here.

[+] IgorPartola|1 year ago|reply
A group buy for an existing product makes sense. Want to buy a 24TB Western Digital hard drive? It’s $350. But if you and your 1000 closest friends get together the price can be $275.
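
To put rough numbers on that example, here is a small sketch. The two prices come from the paragraph above; the group size of 1,000 buyers and the function name are assumptions made for illustration.

```python
def group_buy_savings(retail_price: float, group_price: float, buyers: int) -> dict:
    """Return per-unit, percentage, and total savings for a group buy."""
    per_unit = retail_price - group_price
    return {
        "per_unit": per_unit,
        "percent": round(100 * per_unit / retail_price, 1),
        "total": per_unit * buyers,
    }


# $350 retail vs. $275 group price, assuming 1,000 buyers.
savings = group_buy_savings(retail_price=350, group_price=275, buyers=1000)
print(savings)
```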

But for a first time unknown product? You get a lot fewer interested parties. Lots of people want to wait for tech reviews and blog posts before committing to it. And group buys being the only way to get them means availability will be inconsistent for the foreseeable future. I don’t want one voice assistant. I want 5-20, one for every space in my house. But I am not prepared to commit to 20 devices of a first run and I am not prepared to buy one and hope I’ll get the opportunity to buy more later if it doesn’t flop. Stability of the supply chain is an important signal to consumers that the device won’t be abandoned.

[+] Brendinooo|1 year ago|reply
I invested in Mycroft and it flopped. Here’s hoping some others can go where they couldn’t.
[+] interludead|1 year ago|reply
Your idea about group buys is really intriguing. I wonder if the community might organically set something like that up once there’s enough interest
[+] choffee|1 year ago|reply
Not really sure what the benefit of a group buy would be here. Nabu Casa, the company that supports the development of Home Assistant and developed this product, already has a few products they sell. They had this stocked all over the world for the announcement and it sold out. I assume they had already made a few thousand. They will get more stock now and it will sell just like the other things they make. Any profit from this will go back into development of Home Assistant.
[+] thumbsup-_-|1 year ago|reply
We need more projects like Home Assistant. I started using it recently and was amazed. They sell their own hardware but the whole setup is designed to work on any other hardware. There are detailed docs for installation on your own hardware. And it works amazingly well.

Same for their voice assistant. You can buy their hardware and get started right away, or you can place your own mics and speakers around your home and it will still work. You can buy your own beefy hardware and run your own LLM.

The possibilities with home assistant are endless. Thanks to this community for breaking the barriers created by big tech

[+] mkagenius|1 year ago|reply
I am working on automation of phones (open source) - https://github.com/BandarLabs/clickclickclick

I haven't been able to quite get the Llama vision models working, but I suppose with new releases in the future, it should work as well as Gemini in finding bounding boxes of UI elements.

[+] lokar|1 year ago|reply
It’s a great project overall, but I’ve been frustrated by how anti-engineer it has been trending.
[+] interludead|1 year ago|reply
Completely agree! Home Assistant feels like a breath of fresh air in a space dominated by big tech's walled gardens.
[+] joshstrange|1 year ago|reply
It's too bad it's sold out everywhere. I've tried the ESP32 projects (little cube guy) for voice assistants in HA but its mic/speaker weren't good enough. When it did hear me (and I heard it) it did an amazing job. For the first time I talked to a voice assistant that understood "Turn off office lights" to mean "Turn off all the lights in the office" without me giving it any special grouping (like I have to do in Alexa and then it randomly breaks). It handled a ton of requests that are easy for any human but that Alexa/Siri trip up on.

I cannot wait to buy 5 or more of these to replace Alexa. HA is the brain of my house and up till now Alexa provided the best hardware to interact with HA (IMHO) but I'd love something first-party.

[+] moffkalast|1 year ago|reply
I'm definitely buying one for robotics, having a dedicated unit for both STT and TTS that actually works and integrates well would make a lot of social robots more usable and far easier to set up and maintain. Hopefully there's a ROS driver for it eventually too.
[+] bdavbdav|1 year ago|reply
How did you find it for music tasks?
[+] steelframe|1 year ago|reply
If it's possible for the hardware to facilitate a use case, the employees working on the product will try to push the limits as far as they possibly can in order to manufacture interesting and challenging problems that will get them higher performance ratings and promotions. They will rationalize away privacy violations by appealing to their "good intentions" and their amazing ability to protect information from nefarious actors. In their minds they are working for "the good guys" who will surely "do the right thing."

At various times in the past, the teams involved in such projects have at least prototyped extremely invasive features with those in-home devices. For example, one engineer I've visited with from a well-known in-home device manufacturer worked on classifiers that could distinguish between two people having sex and one person attacking another in audio captured passively by the microphones.

As the corporate culture and leadership shifts over time I have marginal confidence that these prototypes will perpetually remain undeveloped or on-device only. Apple, for instance, has decided to send a significant amount of personal data to their "Private Cloud" and is taking the tactic of opening "enough" of its infrastructure for third-party audit to make an argument that the data they collect will only be used in a way that the user is aware and approves of. Maybe Apple can get something like that to a good enough state, at least for a time. However, they're inevitably normalizing the practice. I wonder how many competitors will be equally disciplined in their implementations.

So my takeaway is this: If there exists a pathway between a microphone and the Internet that you are not in 100% control over, it's not at all unreasonable to expect that anything and everything that microphone picks up at any time will be captured and stored by someone else. What happens with that audio will -- in general -- be kept out of your knowledge and control so long as there is insufficient regulatory oversight.

[+] jfim|1 year ago|reply
That's a pretty timely release considering Alexa and the Google assistant devices seem to have plateaued or are on the decline.
[+] IgorPartola|1 year ago|reply
Curious what you mean by that.
[+] frognumber|1 year ago|reply
I don't fully understand the cloud upsell. I have a beefy GPU. I would like to run the "more advanced" models locally.

By "I don't fully understand," I mean just that. There's a lot of marketing copy, but there's a lot I'd like to understand better before plopping down $$$ for a unit. The answers might be reasonable.

Ideally, I'd be able to experiment with a headset first, and if it works well, upgrade to the $59 unit.

I'd love to just have a README, with a getting started tutorial, play, and then upgrade if it does what I want.

Again: None of this is a complaint. I assume much of this is coming once we're past the preview edition, or is perhaps there and my search skills are failing me.

[+] antonyt|1 year ago|reply
You can do exactly that - set up an Assist pipeline that glues together services running wherever you want, including a GPU node for faster-whisper. The HA interface even has a screen where you can test your pipeline with your computer’s microphone.

It’s not exactly batteries-included, and doesn’t exercise the on-device wake word detection that satellite hardware would provide, but it’s doable.

But I don’t know that the unit will be an “upgrade” over most headsets. These devices are designed to be cheap, low-power, and have to function in tougher scenarios than speaking directly into a boom mic.

[+] trb|1 year ago|reply
Finding microphones that look nice, can pick up voice at high enough quality to extract commands and that cover an entire room is surprisingly hard.

If this device delivers on audio quality it's totally worth it at $59.

[+] choffee|1 year ago|reply
This device is just the mic/speaker/wakeword part. It connects to Home Assistant to do the decoding and automation. You can test it right now by downloading Home Assistant and running it on a Pi or a VM. You can run all the voice assist stuff locally if you want. There are services for the voice to text, text to voice and what they call intents, which are simple things like "turn off the lights in the office". The cloud offering from Nabu Casa not only funds the development of Home Assistant but also gives you remote access if you want it. As part of that you can choose to offload some of the voice/text services to their cloud so that if you are just running it on a Pi it will still be fast.
[+] Jarwain|1 year ago|reply
I can't speak to home assistant specifically, but the last time I looked at voice models, supporting multiple languages and doing it Really Well just happens to require a model with a massive amount of RAM, especially to run at anything resembling real-time.

It'd be awesome if they open sourced that model though, or published what models they're using. But I think it unlikely to happen because Home Assistant is a sorta funnel to Nabu Casa

That said, from what I can find, it sounds like Assist can be run without the hardware, either with or without the cloud upgrade. So you could definitely use your own hardware, headset, speakers, etc. to play with Assist

[+] nickthegreek|1 year ago|reply
The cloud sale is easy if you are an HA user already. If you don’t use Home Assistant right now, you probably aren’t the target audience. I purchase the yearly cloud service as it’s an easy way to support HA development. It also gives you remote access to your system without having to do any setup. It provides an https connection which allows you to program ESP32 devices through Chrome. And now they added the ability to do TTS and STT on someone else’s hardware. HA even allows you to set up a local LLM for house control commands but route other queries directly to the cloud.
[+] Havoc|1 year ago|reply
Had to laugh a bit at the caveat about powerful hardware. Was bracing myself for GPU and then it says N100 lol
[+] moooo99|1 year ago|reply
I mean, comparatively many people are hosting their Home Assistant on a Raspberry Pi, so it is relatively powerful :D
[+] amluto|1 year ago|reply
One thing that makes me nervous: Home Assistant has an extremely weak security model. There is recent support for admin users, and that’s about it. I’m sort of okay with the users on an installation having effectively unrestricted access to all entities and actions. I’m much less okay with an LLM having this sort of access.

An actually good product in this space IMO needs to be able to define specific sets of actions and allow agents to perform only the permitted actions.
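
As a rough sketch of what such a permission layer could look like: the classes below are hypothetical and not part of Home Assistant's API, and the entity IDs are made up. The idea is a per-agent allow-list where anything not explicitly listed is denied.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """A requested action, e.g. light.turn_off on light.office."""
    domain: str
    service: str
    entity_id: str


@dataclass
class AgentPolicy:
    """Hypothetical per-agent allow-list; anything not listed is denied."""
    allowed_services: set = field(default_factory=set)  # {("light", "turn_off"), ...}
    allowed_entities: set = field(default_factory=set)  # {"light.office", ...}

    def permits(self, action: Action) -> bool:
        return (
            (action.domain, action.service) in self.allowed_services
            and action.entity_id in self.allowed_entities
        )


def execute(policy: AgentPolicy, action: Action) -> str:
    """Run the action only if the policy allows it (execution itself is stubbed)."""
    if not policy.permits(action):
        raise PermissionError(
            f"agent may not call {action.domain}.{action.service} on {action.entity_id}"
        )
    return f"ok: {action.domain}.{action.service} -> {action.entity_id}"


# An LLM agent that may only switch one light on and off.
llm_policy = AgentPolicy(
    allowed_services={("light", "turn_on"), ("light", "turn_off")},
    allowed_entities={"light.office"},
)
print(execute(llm_policy, Action("light", "turn_off", "light.office")))
```

With this policy, a request such as `Action("lock", "unlock", "lock.front_door")` raises `PermissionError` instead of reaching the device.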

[+] Ey7NFZ3P0nzAe|1 year ago|reply
You can already choose which entity to expose to the LLMs
[+] fons|1 year ago|reply
I wonder how this compares to the Respeaker 2 https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/

The respeaker has 4 mics and can easily cancel out the noise introduced by a custom external speaker

[+] robotfelix|1 year ago|reply
It's worth noting that product is listed in the "Discontinued Products" section of the linked wiki.

Both of the ReSpeaker products in the non-discontinued section (ReSpeaker Lite, ReSpeaker 2-Mics Pi HAT) have only 2 mics, so it appears that things are converging in that direction.

[+] stavros|1 year ago|reply
I don't just want the hardware, I want the software too. I want something that will do STT on my speech, send the text to an API endpoint I control, and be able to either speak the text I give it, or live stream an audio response to the speakers.

That's the part I can't do on my own, and then I'll take care of the LLMs myself.
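
For the "API endpoint I control" half, a minimal sketch using only the Python standard library might look like this. The JSON shape (`{"text": ...}` in, `{"speech": ...}` out), the port, and the function names are assumptions for illustration, not something the device or Home Assistant defines.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def reply_to(transcript: str) -> str:
    """Placeholder for your own logic (for example, a call to an LLM)."""
    return f"You said: {transcript.strip()}"


class TranscriptHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON with a transcript and returns text to be spoken."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            body = json.dumps({"speech": reply_to(payload["text"])}).encode()
            status = 200
        except (ValueError, KeyError, TypeError, AttributeError):
            body = json.dumps({"error": "expected a JSON object with a 'text' string"}).encode()
            status = 400
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8099) -> None:
    """Listen on localhost only; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), TranscriptHandler).serve_forever()
```

Calling `serve()` starts the listener. Whatever performs the STT step would POST the transcript there and pass the returned `speech` string to a TTS engine.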

[+] IshKebab|1 year ago|reply
Looks great! The biggest issue I see is music. 90% of my use is "play some music" but none of the major streaming music providers offer APIs for obvious reasons. I'm not sure how you can get around that really.
[+] hamilyon2|1 year ago|reply
I had great trouble simply connecting a Bluetooth speaker to use it as voice input and for sound output. The overall state of the sound subsystem for DIY voice assistants feels third-class at best.
[+] shaklee3|1 year ago|reply
As someone not that familiar with HA, can someone explain why there's not a clear path to replace Alexa or Google home? I considered using HA recently to get a GPT-like response after being frustrated with Google home, but it seems this is a complete mess. Is there a way to get this yet?
[+] joshstrange|1 year ago|reply
> explain why there's not a clear path to replace Alexa or Google home?

There is. I've used HA with their default assist pipeline (Cloud HA STT, Cloud HA LLM, Cloud HA TTS) and I've also plugged in different providers at each step (both remote and local for each part: STT/LLM/TTS) and it's super cool. Their default LLM isn't great but it works; plugging in OpenAI made it work way better. My local models weren't great in speed but I don't have hardware dedicated to this purpose (currently); seeing an entire local pipeline was amazing for the promise of it in the future. It's too slow (on my hardware) but we are so close to local models (STT/TTS could be improved as well but they are much easier to do locally already).
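
The "plug in a different provider at each step" idea can be sketched as three swappable callables. This is an illustration of the shape only, not Home Assistant's actual pipeline code; the stage functions below are stand-ins so it runs without any models or network access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Three swappable stages; each could be backed by a local or remote service."""
    stt: Callable[[bytes], str]    # audio -> text
    agent: Callable[[str], str]    # text -> response text
    tts: Callable[[str], bytes]    # response text -> audio

    def run(self, audio: bytes) -> bytes:
        text = self.stt(audio)
        response = self.agent(text)
        return self.tts(response)


# Stand-in stages: "audio" here is just UTF-8 text so the sketch is runnable.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def simple_agent(text: str) -> str:
    if "office lights" in text.lower():
        return "Turning off the office lights."
    return "Sorry, I didn't catch that."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(stt=fake_stt, agent=simple_agent, tts=fake_tts)
print(pipeline.run(b"Turn off office lights"))
```

Swapping a provider then means passing a different callable for that one stage and leaving the other two alone.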

If this new HA hardware comes even close to performing as well as the Echos in my house (low bar) I'll replace them all.

[+] fx1994|1 year ago|reply
What I don't like is that most voice assistants perform really badly in my native language, so I don't use them at all. For English speakers yes, but for all others not so much. I guess it will get better.
[+] ryukoposting|1 year ago|reply
My wife and I have been very happy with Home Assistant so far. The one thing we're missing is voice control, and until now it seemed like there just wasn't a clean solution for HA voice control. You were stuck doing some hobbyist shenanigans and hand-writing boatloads of YAML, or you were hooking up a HomeKit/Alexa which defeats the purpose of HA. This is a game-changer.

They recommend an N100 in the blog post, but I might buy one anyway to see if my HA box's Celeron J3455 will do the job.

[+] jauntywundrkind|1 year ago|reply
Not super convinced the XMOS audio processing chip is really gonna buy a lot. Trying to do audio input processing feels like a dynamic task, requiring adaptation. XMOS is the most well known audio processor and a beast, but not sure it's really gonna help here!

I really hope we see some open-source machine-learned systems emerge.

I saw Insta360 announce their video conferencing solution today. Optics looks pretty medium, nothing wild, but Insta360 is so good at video that I expect it'll be great. But there's a huge 14 microphone array on it, and that's the hard job; figuring out how to get good audio from speakers in a variety of locations around a room. It really made me wish for more open source footing here, some promising start, be it the conference room or open living space. I've given all of 60s to look through this, and was kinda hopeful because heck yeah Home Assistant, but my initial read isn't super promising, isn't that this is starting the proper software base needed to listen well to the world.

https://petapixel.com/2024/12/17/the-insta360-connect-is-a-2...

[+] hoppp|1 year ago|reply
If it runs fully on premise that would be great. I'm still not comfortable buying a device that records everything I say and uploads it to a cloud
[+] haddonist|1 year ago|reply
Fully on-prem can be done if you've got the LLM compute power in place.
[+] nailer|1 year ago|reply
You should talk to Sonos about partnering with them. They currently have a very limited Sonos voice assist, plus Google Voice and Alexa, but the latter two are limited pre-LLM assistants.

I’m assuming they eventually want to create their own LLM and something privacy focused would be a good match for their customers. I don’t know how they feel about open source though

[+] bradly|1 year ago|reply
Are there any macOS software versions of this? I've been looking for open-source wake-word detection for a "Hey Siri"-like integration, but I'm very apprehensive of anything, malicious or not, monitoring the sound input for a specific word in an efficient way.
[+] ragmondo|1 year ago|reply
My plea / request: Make a home assistant a DROP IN replacement for a standard light switch. It has power, it adds functionality from the get-go (smart lighting), it’s placed in a convenient position for the room and no extra wires etc. are required.
[+] mkagenius|1 year ago|reply
Though separate hardware helps, I believe voice and automation can be integrated more seamlessly into our existing devices (phones/laptops) with high compute built in.

Llama and whisper are already public so that should help innovation in this area.