top | item 38611700

Show HN: Open-source macOS AI copilot using vision and voice

430 points | ralfelfving | 2 years ago | github.com | reply

Heeey! I built a macOS copilot that has been useful to me, so I open sourced it in case others would find it useful too.

It's pretty simple:

- Use a keyboard shortcut to take a screenshot of your active macOS window and start recording the microphone.

- Speak your question, then press the keyboard shortcut again to send your question + screenshot off to OpenAI Vision.

- The Vision response is presented in context, overlaid on the active window, and spoken to you as audio.

- The app keeps running in the background, only taking a screenshot/listening when activated by keyboard shortcut.

It's built with NodeJS/Electron, and uses OpenAI Whisper, Vision and TTS APIs under the hood (BYO API key).
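The screenshot + question step can be pictured as building one chat request with both a text part and an image part. A minimal sketch in Node.js of what that payload might look like; the function name, model string, and `max_tokens` value are assumptions for illustration, not taken from the repo (check the repo and the OpenAI docs for the actual call):

```javascript
// Illustrative sketch: combine the transcribed question and the
// base64-encoded screenshot into a single Vision chat request body.
// buildVisionPayload and the model name are assumptions, not the repo's code.
function buildVisionPayload(screenshotBase64, transcribedQuestion) {
  return {
    model: "gpt-4-vision-preview", // assumed model name at the time
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: transcribedQuestion },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshotBase64}` },
          },
        ],
      },
    ],
  };
}
```

This body would then be POSTed to the chat completions endpoint with your own API key (BYO key, as noted above), and the text of the reply handed to the TTS endpoint for the spoken response.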

There's a simple demo and a longer walk-through in the GH readme https://github.com/elfvingralf/macOSpilot-ai-assistant, and I also posted a different demo on Twitter: https://twitter.com/ralfelfving/status/1732044723630805212

159 comments

[+] e28eta|2 years ago|reply
Did you find that calling it “OSX” in the prompt worked better than macOS? Or was that just an early choice that you didn’t spend much time on?

I was skimming through the video you posted, and was curious.

https://www.youtube.com/watch?v=1IdCWqTZLyA&t=32s

code link: https://github.com/elfvingralf/macOSpilot-ai-assistant/blob/...

[+] ralfelfving|2 years ago|reply
No, this is an oversight by me. To be completely honest, up until the other day I thought it was still called OSX. So the project was literally called macOSXpilot, but at some point I double-checked and realized it's been called macOS for many years. Updated the project, but apparently not the code :)

I suspect OSX vs macOS has marginal impact on the outcome :)

[+] hot_gril|2 years ago|reply
Heh. I remember calling it Mac OS back in the day and getting corrected that it's actually OS X, as in "OS ten," and hasn't been called Mac OS since Mac OS 9. Glad Apple finally saw it my way (except it's cased macOS).
[+] jondwillis|2 years ago|reply
You should add an option for streaming text as the response instead of TTS. And maybe text in place of the voice command as well. I have been tire-kicking a similar kind of copilot for a while, hit me up on discord @jonwilldoit
[+] ralfelfving|2 years ago|reply
There are definitely some improvements to be made to how the data is shuttled between interface<->API; all of that was done in a few hours on day 1, and there are a few things I decided to fix later.

I prefer speaking over typing, and I sit alone, so probably won't add a text input anytime soon. But I'll hit you up on Discord in a bit and share notes.

[+] tomComb|2 years ago|reply
> text in place of the voice command as well

That would be great for people with a Mac mini who don't have a mic.

[+] faceless3|2 years ago|reply
Wrote some similar scripts for my Linux setup, that I bind with XFCE keyboard shortcuts:

https://github.com/samoylenkodmitry/Linux-AI-Assistant-scrip...

F1 - ask ChatGPT API about current clipboard content

F5 - same, but opens editor before asking

num+ - starts/stops recording microphone, then passes to Whisper (locally installed), copies result to clipboard

I find myself rarely using them however.

[+] Art9681|2 years ago|reply
Make sure to set OpenAI API spend limits when using this or you'll quickly find yourself learning the difference between the cost of the text models and vision models.

EDIT: I checked again and it seems the pricing is comparable. Good stuff.

[+] ralfelfving|2 years ago|reply
I think a prompt cost estimator might be a nifty thing to add to the UI.

Right now there's also a daily limit on the Vision API that kicks in before it gets really bad: 100+ requests, depending on what your max spend limit is.

[+] hackncheese|2 years ago|reply
Love it! Will definitely use this when a quick screenshot will help specify what I am confused about. Is there a way to hide the window when I am not using it? i.e. I hit cmd+shift+' and it shows the window, then when the response finishes reading, it hides again?
[+] ralfelfving|2 years ago|reply
There's a way for sure, it's just not implemented. Allowing for more configurability of the window(s) is on my list, because it annoys me too! :)
[+] poorman|2 years ago|reply
Currently imagining my productivity while waiting 10 seconds for the results of the `ls` command.
[+] ralfelfving|2 years ago|reply
It's a basic demo to show people how it works. I think you can imagine many other examples where it'll save you a lot of time.
[+] thomashop|2 years ago|reply
Just used it with the digital audio workstation Ableton Live. It is amazing! Its tips were spot-on.

I can see how much time it will save me when I'm working with a software or domain I don't know very well.

Here is the video of my interaction: https://www.youtube.com/watch?v=ikVdjom5t0E&feature=youtu.be

Weird these negative comments. Did people actually try it?

[+] ralfelfving|2 years ago|reply
So glad when I saw this, thanks for sharing! Music production in Ableton was exactly the spark that lit this idea in my head the other week. I tried to explain to a friend who doesn't use GPT much that with Vision, you can speed up your music production and learn how to use advanced tools like Ableton more quickly. He didn't believe me. So I grabbed an Ableton screenshot off Google and used ChatGPT -- then I felt there had to be a better way, realized I have my own use-cases, and it all evolved into this.

I sent him your video, hopefully he'll believe me now :)

[+] mikey_p|2 years ago|reply
Is it just me or is it incredibly useless?

"Here's a list of effects. Here's a list of things that make a song. Is it good? Yes. What about my drum effects? Yes here's the name of the two effects you are using on your drum channel"

None of this is really helpful and I can't get over how much it sounds like Eliza.

[+] pelorat|2 years ago|reply
I mean it does send a screenshot of your screen off to a 3rd party, and that screenshot will most likely be used in future AI training sets.

So... beware when you use it.

[+] rchaves|2 years ago|reply
Hey, I was working on something to allow GPT-V to actually do stuff on the screen, click around and type, I tested on my Mac and it’s working pretty well, do you think it would be cool to integrate? https://github.com/rogeriochaves/driver
[+] ralfelfving|2 years ago|reply
Yes. I think you commented this somewhere else, and I like it. I was considering doing something similar to have it execute keyboard commands, but decided it would have to wait for a future version. I think click + type + performing other actions would be powerful, especially if it can do it fast and accurately. Then it's less about "How do I do X?", and more "Can you do X for me?".
[+] zmmmmm|2 years ago|reply
I've been wanting to build something like this by integrating into the terminal itself. Seems very straightforward and avoids the screenshotting. So you would just type a comment in the right format and it would recognise it:

    $ ls 
    a.txt b.txt c.txt

    $ # AI: concatenate these files and sort the result on the third column
    $ #....
    $ # cat a.txt b.txt c.txt | sort -k 3
This already works brilliantly by just pasting into CodeLLaMa, so it's purely terminal integration to make it work. All I need is for the rest of life to stop being so annoyingly busy.
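The comment-trigger convention in the example above is simple to detect. A minimal matcher sketch (hypothetical, in the project's own Node.js; the real integration would have to hook into the shell or terminal emulator):

```javascript
// Recognise lines following the "# AI: <request>" convention described
// above and pull out the request text. Illustrative only; a real
// integration would intercept lines from the shell session.
const AI_COMMENT = /^\s*#\s*AI:\s*(.+)$/;

function extractAiRequest(line) {
  const m = line.match(AI_COMMENT);
  return m ? m[1].trim() : null; // null = not an AI request, pass through
}
```

Lines that don't match pass through untouched, so the shell behaves normally except when the trigger comment appears.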
[+] paulmedwards|2 years ago|reply
I wrote a simple command-line app to let me quickly ask a question in the terminal - https://github.com/edwardsp/qq. It outputs the command I need and puts it in the paste buffer. I use it all the time now, e.g.

    $ qq concatenate all files in the current directory and sort the result on the third column
    cat * | sort -k3
[+] ukuina|2 years ago|reply
This is very cool! Thank you for working on it and sharing it with us.
[+] qup|2 years ago|reply
I have a tangential question: my dad is old. I would love to be able to have this feature, or any voice access to an LLM, available to him via an easy-to-press external button. Kind of like the big "Easy Button" from Staples. Is there anything like that, that can be made to trigger a keypress perhaps?
[+] ralfelfving|2 years ago|reply
I personally have no experience with configuring or triggering keyboard shortcuts beyond what I learned and implemented in this project. But with that said, I'm very confident that what you're describing is not only possible but fairly easy.
[+] behat|2 years ago|reply
Nice! Built something similar earlier to get fixes from chatgpt for error messages on screen. No voice input because I don't like speaking. My approach then was Apple Computer Vision Kit for OCR + chatgpt. This reminds me to test out OpenAI's Vision API as a replacement.

Thanks for sharing!

[+] ralfelfving|2 years ago|reply
Thanks! You could probably grab what I have, and tweak it a bit. Try checking if you can screenshot just the error message and check what the value of the window.owner is. It should be the name of the application, so you could just append `Can you help me with this error I get in ${window.owner}?` to the Vision API call.
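The prepend-the-app-name idea above is a one-liner; a sketch (the function name is made up, and the exact property holding the window owner depends on the screenshot library, so treat `ownerName` as an assumption):

```javascript
// Sketch: prepend the active application's name (e.g. from the
// screenshot library's window metadata) to the question sent to the
// Vision API. buildErrorPrompt is illustrative, not from the repo.
function buildErrorPrompt(ownerName, userQuestion) {
  const context = ownerName
    ? `Can you help me with this error I get in ${ownerName}? `
    : "";
  return context + userQuestion;
}
```

Falling back to the bare question when no owner is available keeps the call working even if the metadata lookup fails.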
[+] I_am_tiberius|2 years ago|reply
I would love to have something like this but using an open source model and without any network requests.
[+] trenchgun|2 years ago|reply
Probably in three months, approximately.
[+] dekhn|2 years ago|reply
I misread the title and thought this was an app you run on a laptop as you drive around... which if you think about it, would be pretty useful. A combined vision/hearing/language model with access to maps, local info, etc.
[+] ralfelfving|2 years ago|reply
It would be really cool, and I think we're not very far away from this being something you have on your phone.

The pilot name comes from Microsoft's use of "Copilot" for their AI assistant products, and I tried to play on it with macOSpilot which is maco(s)pilot. I think that naming has completely flown over everyone's heads :D

[+] smcleod|2 years ago|reply
Nice project, any plans to make it work with local LLMs rather than "open"AI?
[+] ralfelfving|2 years ago|reply
Thanks. Had no plans, but might give it a try at some point. For me, personally, using OpenAI for this isn't an issue.
[+] hmottestad|2 years ago|reply
I think that LM Studio has an OpenAI "compliant" API, so if there is something similar that supports vision+text then it would be easy enough to make the base URL configurable and then point it to localhost.
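Making the base URL configurable, as suggested, is a small change; a sketch (the function name is hypothetical, and the `localhost:1234` default used in the comment below is my understanding of LM Studio's local server default, so verify against its docs):

```javascript
// Sketch: resolve the chat-completions endpoint from a configurable
// base URL, defaulting to OpenAI. Pointing the base at e.g.
// http://localhost:1234/v1 (LM Studio's assumed default) would route
// requests to a local OpenAI-compatible server instead.
function resolveEndpoint(baseUrl) {
  const base = (baseUrl || "https://api.openai.com/v1").replace(/\/+$/, "");
  return `${base}/chat/completions`;
}
```

Trailing slashes are stripped so user-supplied config values like `http://localhost:1234/v1/` still produce a clean URL.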

Do you know of a simple setup that I can run locally with support for both images and text?

[+] kssreeram|2 years ago|reply
People reading this should check out Iris[1]. I’ve been using it for about a month, and it’s the best macOS GPT client I’ve found.

[1]: https://iris.fun/

[+] LeoPanthera|2 years ago|reply
Oof, $20/month is a lot, when I already have my own OpenAI API key.
[+] mdrzn|2 years ago|reply
I wish there was something like this for Windows!
[+] d4rkp4ttern|2 years ago|reply
I’ve been looking for a simple way to use voice input on the main ChatGPT website, since it gets tiresome to type a lot of text into it. Anyone have recommendations? The challenge is getting technical words right.
[+] ralfelfving|2 years ago|reply
If you're ok with it, you can use the mobile app -- it supports voice. Then you just have the same chat/thread open on your computer in case you need to copy/paste something.
[+] quinncom|2 years ago|reply
I’d love to see a version of this that uses text input/output instead of voice. I often have someone sleeping in the room with me and don’t want to speak.
[+] ralfelfving|2 years ago|reply
You're not the first to request it. Might add it, can't promise tho.