top | item 40346995

Show HN: Pi-C.A.R.D, a Raspberry Pi Voice Assistant

344 points| nkaz123 | 1 year ago |github.com

Pi-card is an AI powered voice assistant running locally on a Raspberry Pi. It is capable of doing anything a standard LLM (like ChatGPT) can do in a conversational setting. In addition, if there is a camera equipped, you can also ask Pi-card to take a photo, describe what it sees, and then ask questions about that.

It uses distributed models so latency is something I'm working on, but I am curious on where this could go, if anywhere.

Very much a WIP. Feedback welcome :-)

92 comments

order

rkagerer|1 year ago

I wanted to create a voice assistant that is completely offline and doesn't require any internet connection. This is because I wanted to ensure that the user's privacy is protected and that the user's data is not being sent to any third party servers.

Props, and thank you for this.

pyaamb|1 year ago

I would love for Apple/Google to introduce some tech that would make it provable/verifiable that the camera/mic on the device can only be captured when the indicator is on and that it isn't possible for apps or even higher layers of OS to spoof this

squarefoot|1 year ago

+1. That's the #1 feature I want in any "assistant".

A question: does it run only on the Pi5 or other (also non Raspberry Pi) boards?

rob74|1 year ago

Extra kudos for the name - and extra extra for using the good old "Picard facepalm" meme.

But seriously - the name got my attention, then I read the introduction and thought "hey, Alexa without uploading everything you say to Amazon? This might actually be something for me!".

> The default wake word is "hey assistant" - I would suggest "Computer" :) And of course it should have a voice that sounds like https://en.wikipedia.org/wiki/Majel_Barrett

pawelduda|1 year ago

All I need is a voice assistant that: - RPi 4 can handle, - I can integrate with HomeAssistant, - is offline only, and doesn't send my data anywhere.

This project seems to be ticking most, if not all of the boxes, compared to anything else I've seen. Good job!

While at it, can someone drop a recommendation for a Rpi-compatible mic for Alexa-like usecase?

baobun|1 year ago

Check out Rhasspy.

You won't get anything practically useful running LLMs on the 4B but you also don't strictly need LLM-based models.

In the Rhasspy community, a common pattern is to do (cheap and lightweight) wake-word detection locally on mic-attached satellites (here 4B should be sufficient) and then stream the actual recording (more computational resources for better results) over the local network to a central hub.

yangikan|1 year ago

I have got good results with Playstation 3 and Playstation 4 cameras (which also have a mic). They are available for $15-20 on ebay.

jhbruhn|1 year ago

Have you checked out the Voice Assistant functionalities integrated into HA? https://www.home-assistant.io/voice_control/

NabuCasa employed the Rhasspy main dev to work on these functionalities and they are progressing with every update.

eddieroger|1 year ago

> Why Pi-card? > Raspberry Pi - Camera Audio Recognition Device.

Missed opportunity for LCARS - LLM Camera Audio Recognition Service, responding to the keyword "computer," naturally. I guess if this ran elsewhere from a Pi, it could be LCARS.

rkagerer|1 year ago

Pi-C.A.R.D is perfect. Read it 100% as Picard, and more recognizable that LCARS.

bdcravens|1 year ago

Or LLM Offline Camera, User Trained Understanding Speech

LOCUTUS

layer8|1 year ago

It should be really something like Beneficial Audio Realtime Recognition Electronic Transformer.

ethagnawl|1 year ago

I'm looking forward to trying this. Hopefully this gains traction, as (AFAIK) an open, reliable, flexible, privacy focused voice assistant is still sorely needed.

About a year ago, my family was really keen on getting an Alexa. I don't want Bezos spy devices in our home, so I convinced them to let me try making our own. I went with Mycroft on a Pi 4 and it did not go well. The wake word detection was inconsistent, the integrations were lacking and I think it'd been effectively abandoned by that point. I'd intended to contribute to the project and some of the integrations I was struggling with but life intervened and I never got back to it. Also, thankfully, my family forgot about the Alexa.

genewitch|1 year ago

some "maker" products that were sold at target had a cardboard box with an arcade RGB-LED button on top, a speaker, and 4 microphones on a "hat" for a rpi... nano? pico? whatever the one that is roughly the size of a sodimm. It didn't have a wakeword, you'd push the white-lit button, it would change color twice, once to acknowledge the press, and again to indicate it was listening. It would change color once you finished speaking and then it would speak back to you.

It used some google thing on the backend, and it was really frustrating to get set up and keep working - but it did work.

i have two of those devices, so i've been waiting for something to come that would let me self-host something similar.

nkaz123|1 year ago

I'm really inspired reading this and hope it can help! I'm planning to put more work into this. I have a few rough demos of it in action on youtube (https://www.youtube.com/watch?v=OryGVbh5JZE) which should give you an idea of the quality of it at the moment.

knodi123|1 year ago

Is it possible to run this on a generic linux box? Or if not, are you aware of a similar project that can?

I've googled it before, but the space is crowded and the caveats are subtle.

CaptainOfCoit|1 year ago

Raspberry-pi is very much like a generic Linux box, biggest difference being that it's ARM rather than Intel/AMD CPU which makes things slightly less widely supported.

But overall, Pi-C.A.R.D seems to be using Python and cpp so shouldn't be any issues whatsoever to run this on whatever Python and cpp can be run/compiled on.

MH15|1 year ago

I tried to build this on an early gen RPI 4 about three years ago- but ran in to limitations in the hardware (and in my own knowledge). Super cool to see it happening now!

robbyiq999|1 year ago

It would be cool to see some raspi hats you could plug a GPU into, though unsure of how practical or feasible that would be. Todays graphics cards are tomorrows e-waste, perhaps they could get a second life beefing up a diy raspi project like this

piltdownman|1 year ago

Most of the USP, outside of the ecosystem developed around the single platform, is to do with form and power draw. Adding in the GPU/adapter/PSU to leverage cheap CUDA cores probably works out worse power/price/form wise than going for a better SoC or x86 NUC solution.

genewitch|1 year ago

for crypto-mining you'd convert a single PCIe slot into 4 x1 PCIe slots - or just have a board with 12+ x1 PCIe slots. I'm not sure what magic goes in to PCIe, but i do know that at least one commodity board had "exposed" PCIe interface - the Atomic Pi.

anyhow the GPU would sit on a small PCB that would connect to an even smaller PCB in the PCIe slot on the motherboard, via USB3 cable. My point here is merely that whatever PCIe is, it can be transported to a GPU to do work via USB3 cables.

harwoodr|1 year ago

I see that a speaker is in the hardware list - does this speak back?

nkaz123|1 year ago

Yes! I'm currently using https://espeak.sourceforge.net/, so it isn't especially fun to listen to though.

Additionally, since I'm streaming the LLM response, it won't take long to get your reply. Since it does it a chunk at a time, there's occasionally only parts of words that are said momentarily. Also of course depends on what model you use or what the context size is for how long you need to wait.

stereosteve|1 year ago

Why does Picard always have to specify temperature preference for his Earl Gray tea? Shouldn’t the super smart AI have learned his preference by now?

dragonwriter|1 year ago

> Why does Picard always have to specify temperature preference for his Earl Gray tea?

Completely OT, but, most likely, he doesn’t. Lots of people in the show direct with the replicators more fluidly; “Tea, Earl Grey, Hot” seems like a Picard quirk, possibly developed with more primitive food/beverage units than the replicators on the Enterprise-D.

lttlrck|1 year ago

Perhaps he must be specific to override a hard, lawsuit proof, default that is too tepid for his tastes.

Will there still be lawsuits in the post-scarcity world? Probably.

TeMPOraL|1 year ago

Force of habit?

Most of Starfleet folks seem to not know how to use their replicators well anyway. For all the smarts they have, they use it like a mundane appliance they never bothered to read the manual for, miss 90% of its functionality, and then complain that replicated food tastes bad.

pkaye|1 year ago

Well the one time he just said "Tea, Earl Grey" the computer assumed, "Tea, Earl Grey, luke warm".

aci_12|1 year ago

how does wake word work? Does it keep listening and ignore if the last few seconds does not have the wake word/phrase?

knodi123|1 year ago

that's the general idea, yes. Or rather, store several chunks of audio, and discard the oldest. aka "Rolling window".

dasl|1 year ago

What latency do you get? I'd be interested in seeing a demo video.

nkaz123|1 year ago

Fully depends on the model, how much conversational context you provide, but if you keep things to a bare minimum, ~< 5 seconds from message received to starting the response using Llama 3 8B. I'm also using a vision language model, https://moondream.ai/, but that takes around 45 seconds so the next idea is to take a more basic image captioning model and insert it's output into context and try to cut that time down even more.

I also tried using Vulkan, which is supposedly faster, but the times were a bit slower than normal CPU for Llama CPP.

zenkalia|1 year ago

How hard is it to make an offline model like this learn?

The readme mentions a memory that lasts as long as each conversation which seems like such a hard limitation to live with.

cmcconomy|1 year ago

Funny, I just picked up a device for use with https://heywillow.io for similar reasons

knodi123|1 year ago

me too, but I bricked mine when flashing the bios. just a fluke, nothing to be done about it.

8mobile|1 year ago

I really like having a voice assistant that focuses on privacy first. I will definitely try it. always add a video to show how it works. Thank you

ghnws|1 year ago

I didn't see a mention of languages in the readme. Does this understand languages other than english?

nkaz123|1 year ago

There are two models at use here, whisper tiny for transcribing audio, and then llama 3 for responding.

Whisper tiny is multi lingual (though I am using the english specific variant) and I believe llama 3 is technically capable of multi-lingual, but not sure of any benchmarks.

I think it could be made better, but for now focus is english. I'll add this to the readme though. Thanks!

timendum|1 year ago

The suggested model for vision capabilities is english only.

alexalx666|1 year ago

Can't wait until I can buy a licensed lieutenant Data voice pack inside chatGPT

nl|1 year ago

> It uses distributed models so latency is something I'm working on,

I think it uses local models, right?

kazinator|1 year ago

I would have hidden the name behind a bit of indirection by calling it Jean Luc or something.

ddingus|1 year ago

Thank you! This is a high value effort!