top | item 46629727

(no title)

blutoot | 1 month ago

I have dystonia which often stiffens my arms in a way that makes it impossible for me to type on a keyboard. TTS apps like SuperWhisper have proven to be very helpful for me in such situations. I am hoping to get a similar experience out of "Handy" (very apt maming from my perspective).

I do, however, wonder if there is a way all these TTS tools can get to the next level. The generated text should not be just a verbatim copy of what I just said, but depending on the context, it should elaborate. For example, if my cursor is actively inside an editor/IDE with some code, my coding-related verbal prompts should actually generate the right/desired code in that IDE.

Perhaps this is a bit of combining TTS with computer-use.

discuss

mritchie712|1 month ago

I made something called `ultraplan`. It's is a CLI tool that records multi-modal context (audio transcription via local Whisper, screenshots, clipboard content, etc.) into a timeline that AI agents like Claude Code can consume.

I have a claude skill `/record` that runs the CLI which starts a new recording. I debug, research, etc., then say "finito" (or choose your own stopword). It outputs a markdown file with your transcribed speech interleaved with screenshots and text that you copied. You can say other keywords like "marco" and it will take a screenshot hands-free.

When the session ends, claude reads the timeline (e.g. looks at screenshots) and gets to work.

I can clean it up and push to github if anyone would get use out of it.

mritchie712|1 month ago

https://github.com/definite-app/ultraplan

heliostatic|1 month ago

Definitely interested in that!

wanderingmind|1 month ago

Sounds interesting I would love to use it if you get a chance to push to github

sipjca|1 month ago

I totally agree with you and largely what you’re describing is one of the reasons I made Handy open source. I really want to see something like this and see someone go experiment with making it happen. I did hear some people playing with using some small local models (moondream, qwen) to get some more context of the computer itself

I initially had a ton of keyboard shortcuts in handy for myself when I had a broken finger and was in a cast. It let me play with the simplest form of this contextual thing, as shortcuts could effectively be mapped to certain apps with very clear uses cases

eddyg|1 month ago

There’s lots of existing work on “coding by voice” long before LLMs were a thing. For example (from 2013): http://xahlee.info/emacs/emacs/using_voice_to_code.html and the associated HN discussion (“Using Voice to Code Faster than Keyboard”): https://news.ycombinator.com/item?id=6203805

There’s also more recent-ish research, like https://dl.acm.org/doi/fullHtml/10.1145/3571884.3597130

hasperdi|1 month ago

What you said is possible by feeding the output of speech-to-text tools into an LLM. You can prompt the LLM to make sense of what you're trying to achieve and create sets of actions. With a CLI it’s trivial, you can have your verbal command translated into working shell commands. With a GUI it’s slightly more complicated because the LLM agent needs to know what you see on the screen, etc.

That CLI bit I mentioned earlier is already possible. For instance, on macOS there’s an app called MacWhisper that can send dictation output to an OpenAI‑compatible endpoint.

sipjca|1 month ago

Handy can post process with LLMs too! It’s just currently hidden behind a debug menu as an alpha feature (ctrl/cmd+shift+d)

ryanshrott|1 month ago

[deleted]