I built something similar for Linux (yapyap — push-to-talk with whisper.cpp). The "local is too slow" argument doesn't hold up anymore if you have any GPU at all. whisper large-v3-turbo with CUDA on an RTX card transcribes a full paragraph in under a second. Even on CPU, Parakeet is near-instant for short utterances.

The "deep context" feature is clever, but screenshotting and sending to a cloud LLM feels like massive overkill for fixing name spelling. The accessibility API approach someone mentioned upthread is the right call — grab the focused field's content, nearby labels, and the window title. That's a tiny text prompt a 3B local model handles in milliseconds. No screenshots, no cloud, no latency.

The real question with Groq-dependent tools: what happens when the free tier goes away? We've seen this movie before. Building on local models is slower today but doesn't have a rug-pull failure mode.
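That accessibility-based approach is easy to sketch: the window title and focused-field text become a small text prompt for a local model. A minimal illustration in Python, assuming a local Ollama server and a 3B model name (the title/field extraction itself would come from the platform's accessibility API; none of this is any particular tool's actual implementation):

```python
import json
import urllib.request

def build_correction_prompt(transcript: str, window_title: str, field_text: str) -> str:
    """Assemble a tiny text-only prompt: no screenshots, just the context
    an accessibility API would hand back (window title, focused field)."""
    return (
        "Fix the spelling of names and technical terms in the transcript, "
        "using the context. Return only the corrected transcript.\n"
        f"Window title: {window_title}\n"
        f"Focused field contents: {field_text}\n"
        f"Transcript: {transcript}"
    )

def correct_locally(prompt: str, model: str = "llama3.2:3b") -> str:
    """Send the prompt to a locally running Ollama server (assumed to be
    on the default port); a small model is plenty for spelling fixes."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

The prompt stays a few hundred bytes, which is why this path avoids both the screenshot and the cloud round-trip.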
Yeah, local works really well. I tried this other tool: https://github.com/KoljaB/RealtimeVoiceChat which lets you live-chat with a (local) LLM. With local Whisper and a local LLM (8B Llama in my case) it works phenomenally, and it responds so quickly that it feels like it's interrupting me.
Too bad that tool no longer seems to be developed. Looking for something similar. But it's really nice to see what's possible with local models.
FWIW whisper.cpp with the default model works at 6x realtime transcription speed on my four-core ~2.4GHz laptop, and doesn't really stress CPU or memory. This is for batch transcribing podcasts.
The downside is that I couldn't get it to segment for different speakers. The consensus seemed to be to use a separate tool.
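For reference, the usual separate-tool recipe runs a diarizer (e.g. pyannote) alongside the transcription and merges the two by timestamp overlap. A minimal sketch of just the merge step, assuming transcript segments and diarization turns are already available as tuples:

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """segments: [(start, end, text)]; turns: [(start, end, speaker)].
    Label each transcript segment with the speaker whose diarization
    turn overlaps it the most ("UNKNOWN" if nothing overlaps)."""
    labeled = []
    for s_start, s_end, text in segments:
        best, best_ov = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            ov = overlap(s_start, s_end, t_start, t_end)
            if ov > best_ov:
                best, best_ov = speaker, ov
        labeled.append((best, text))
    return labeled
```

Max-overlap assignment is crude (it ignores mid-segment speaker changes) but works surprisingly well for podcast-style turn-taking.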
Big fan of Handy, and it's cross-platform as well. Parakeet V3 gives the best experience, with very fast and accurate-enough transcriptions when talking to AIs that can read between the lines. It does have stuttering issues, though. My primary use of these is when talking to coding agents.
But a few weeks ago someone on HN pointed me to Hex, which also supports Parakeet V3 and, incredibly enough, is even faster than Handy because it's a native macOS-only app that leverages CoreML/Neural Engine for extremely quick transcriptions. Long ramblings transcribed in under a second!
I just learned about Handy in this thread and it looks great!
I think the biggest difference between FreeFlow and Handy is that FreeFlow implements what Monologue calls "deep context", where it post-processes the raw transcription with context from your currently open window.
This fixes misspelled names if you're replying to an email / makes sure technical terms are spelled right / etc.
The original hope for FreeFlow was for it to use all local models like Handy does, but with the post-processing step the pipeline took 5-10 seconds instead of <1 second with Groq.
Yes, I also use Handy. It supports local transcription via Nvidia Parakeet TDT2, which is extremely fast and accurate. I also use Gemini 2.5 Flash Lite for post-processing via the free AI Studio API (post-processing is optional and can also use a locally hosted LM).
I didn't try Handy, but I've been using Whisper-Key. It's super simple, gets out of your way, all local, and a single-file executable (portable, so zero install too) -- that's for Windows; I don't know about a Mac version.
Handy rocks. I recently had minor surgery on my shoulder that required me to be in a sling for about a month, and I thought I'd give Handy a try for dictating notes and so on. It works phenomenally well for most speech-to-text use cases - homonyms included.
Handy's great! I find the latency to be just a bit too much for my taste. Like half the people in this thread, I built my own, but with a bit more emphasis on speed.
Not sure if it's just me, but Handy crashes on my Arch setup, no matter which version I run. Could be something with Wayland or PipeWire, but I didn't see anything obvious in the logs.
I used to use VoiceInk, but I found Spokenly [0] to be easier to use for post-processing the output, and more stable overall (local version with Parakeet or whisper is free).
I found a Linux version with a similar workflow and forked it to build the Mac version. It took less than 15 minutes to ask Claude to modify it to my needs.
Yeah, it's really that simple. I have tried various applications as well and keep coming back to my custom script because when a new voice model drops on HuggingFace it becomes possible to customize it immediately - rather than wait for that application developer to support that new model.
Okay starting point, but those last two only work on X11. Considering it's 2026, I really don't think a guide for someone wanting to make a speech-to-text app should be recommending X11.
Since many are asking about apps with similar capabilities: I'm very happy with MacWhisper. Has Parakeet, near-instant transcription of my lengthy monologues. All local.
Edit: Ah, but Parakeet I think isn't available for free. A very worthwhile single-purchase app nonetheless!
I actually got MacWhisper originally for speech to text so I could talk to my machine like a crazy person. I realized I didn't like doing that but the actual killer feature for buying it that I really enjoy is the fully local transcription of meetings, with a nice little button to start recording that pops up when you launch zoom, teams, etc. It means I can safely record meetings and encrypt them locally and keep internal notes without handing off all of that to some nebulous cloud platform.
I had previously used Hyprnote to record meetings in this way - and indeed I still use that as a backup, it's a great free option - but the meeting prompting to record and better transcription offered by Macwhisper is a much better experience.
Sounds like there's plenty of interest in these kinds of tools. I'm not a huge fan of API transcription given the great local models.
I built https://github.com/bwarzecha/Axii to keep EVERYTHING local and be fully open source - it can easily be used at any company. No data sent anywhere.
Do any of these solutions work reliably for non-English languages? I've had a lot of issues trying to transcribe Swedish with all the products I've used so far.
Try ottex with Gemini 3 flash as a transcription model. I'm bilingual as well and frequently switch between languages - Gemini handles this perfectly and even the case when I speak two languages in one transcription.
for me it strikes the balance of good, fast, and cheap for everyday transcription. macwhisper is overkill, superwhisper too clever, and handy too buggy. hex fits just right for me (so far)
Tried to use it, installed, enabled permissions, downloaded the parakeet model for English and then it crashed every time I released the button after dictating. Completely unusable.
Plain speech-to-text seems largely solved on pretty much every compute platform. However, I have found a huge gap between independent words being transcribed and formatted text ready for an editor or further processing.
If you look at how authors dictate their works (which they have done for millennia), just getting the words written down is only the first step, and it's by far the easiest. I have been helping build a tool, https://bookscribe.ai, that not only does the transcription but can then post-process it to make it actually usable for longer-form content.
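As a toy illustration of that post-processing gap (not bookscribe's actual pipeline; the filler list and heuristics here are made up), even a rule-based cleanup pass shows the shape of the problem:

```python
import re

def clean_dictation(raw: str) -> str:
    """Drop common filler words, then capitalize sentence starts --
    a crude stand-in for an LLM post-processing pass."""
    fillers = {"um", "uh", "erm"}
    words = [w for w in raw.split() if w.lower().strip(",.") not in fillers]
    text = " ".join(words)
    # Capitalize the first letter of each sentence.
    return re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)
```

Rules like these get you readable sentences; paragraphing, restructuring, and tone are where an LLM pass earns its keep.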
I just vibe coded my own NaturalReader replacement. The subscription was $110/year... and I just canceled it.
Chatterbox TTS (from Resemble AI) does the voice generation, WhisperX gives word-level timestamps so you can click any word to jump, and FastAPI ties it all together with SSE streaming so audio starts playing before the whole thing is done generating.
There's a ~5s buffer up front while the first chunk generates, but after that each chunk streams in faster than realtime. So playback rarely stalls.
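That buffer-then-stream behavior boils down to a generator that yields audio per chunk; a sketch where `synthesize` stands in for the Chatterbox call (the chunking heuristic is my own, not the app's):

```python
from typing import Callable, Iterator, List

def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack sentences into chunks so each TTS call stays short
    (short first chunks keep the time-to-first-audio low)."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, cur = [], ""
    for s in sentences:
        if cur and len(cur) + len(s) + 1 > max_chars:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip()
    if cur:
        chunks.append(cur)
    return chunks

def stream_audio(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield synthesized audio chunk-by-chunk; a web layer (e.g. SSE)
    can forward each chunk to the client as soon as it arrives."""
    for chunk in split_into_chunks(text):
        yield synthesize(chunk)
```

As long as each chunk synthesizes faster than it plays, the only wait the listener feels is that first ~5s chunk.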
For those using something like this daily: what key combinations do you use to record and cancel? I'm using my Caps Lock right now but was curious about others.
Someone told me the other day I should use a foot pedal, and then I remembered I already had an Elgato one under my desk connected with my Stream Deck. I got it very cheap used on eBay. So, that's an option too.
Scroll Lock is a really good key for that, in my opinion. If your keyboard doesn't have it exposed, you can use a remapping program like https://github.com/jtroo/kanata
I have a Stream Deck and made a dedicated button for this. So I tap the button, speak, and then tap it again, and it pastes into wherever my cursor was.
And then I set the button right below that as the Enter key, so it feels mostly hands-off the keyboard.
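The tap-to-start, tap-again-to-paste flow is just a two-state toggle; a minimal sketch with placeholder capture hooks (the real hooks would be your recorder's start/stop calls):

```python
class PushToTalkToggle:
    """Tap once to start recording, tap again to stop and emit the take.
    start_fn/stop_fn are placeholders for real audio-capture hooks."""
    def __init__(self, start_fn, stop_fn):
        self.recording = False
        self.start_fn = start_fn
        self.stop_fn = stop_fn   # expected to return the captured audio

    def tap(self):
        """Handle one button press; returns audio only when a take finishes."""
        if not self.recording:
            self.recording = True
            self.start_fn()
            return None
        self.recording = False
        return self.stop_fn()
```

Hold-to-talk works the same way, except the two transitions are driven by key-down and key-up events instead of two taps.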
I'm building in the same space, working on https://ottex.ai - it's a free STT app with local models and BYOK support (OpenRouter, Groq, Mistral, and more).
The top feature is the per-app custom settings - you can pick different models and instructions for different apps and websites.
- I use the Parakeet fast model when working with Claude Code (VS Code app).
- And I use a smarter one when I draft notes in Obsidian. I have a prompt to clean up my rambling and format the result with proper Markdown - very convenient.
One more cool thing is that it allows me to use LLMs with audio input modalities directly (not as text post-processing). e.g. It sends the audio to Gemini and prompts it to transcribe, format, etc., in one run. I find it a bit slow to work with CC, but it is the absolute best model in terms of accuracy, understanding, and formatting. It is the only model I trust to understand what I meant and produce the correct result, even when I use multiple languages, tech terms, etc.
Interesting, but I quickly uninstalled it after (1) it asked for permission to record keystrokes across all applications, and (2) it registered the global keyboard shortcut Option+Space without asking me.
Could you make it use Parakeet? That's an offline model that runs very quickly even without a GPU, so you could get much lower latency than using an API.
I love this idea, and originally planned to build it using local models, but to have post-processing (that's where you get correctly spelled names when replying to emails / etc), you need to have a local LLM too.
If you do that, the total pipeline takes too long for the UX to be good (5-10 seconds per transcription instead of <1s). I also had concerns around battery life.
I installed Whisper+ through FDroid and it works well for my basic needs. Only 30s at a time but you can append multiple recordings to the same transcript: https://github.com/woheller69/whisperIMEplus
I have been using VoiceFlow. It works incredibly well and uses Groq to transcribe using the Whisper V3 Turbo model. You can also use it in an offline scenario with an on-device model, but I am mostly connected to the internet whenever I am transcribing.
Mine was only tested on an Arc GPU (the acceleration works nicely through Vulkan). It hooks into Win32 API and simulates key presses so it works in various non-obvious contexts.
I created Voibe which takes a slightly different direction and uses gpt-4o-transcribe with a configurable custom prompt to achieve maximum accuracy (much better than Whisper). Requires your own OpenAI API key.
The moat here is local inference. whisper.cpp + Metal gives you <500ms latency on an M1 with the small model. No API costs, no privacy concerns. Ship that and you've got something the paid tools can't match. The UI is already solid; the edge is in going fully offline.
Does anyone know of any macOS transcription apps that let you do speech-to-text live? E.g., the text outputs as you are talking? Older tech like macOS dictation as well as Dragon does this, but it seems like there's nothing available that uses the new, better models.
I don't understand who this is for, honestly. Unless you don't have hands, why would you want to talk to your computer? Maybe I'm just autistic, but I would always prefer text over speaking out loud and having that translated to text.
Some of us have hands (and wrists and arms) that are dealing with RSI. Keyboard use reduction is very important in these cases.
Greg Priest-Dorman [0][1] had other physical issues such that he had to regularly switch between sitting, standing, and walking during his workday. His solutions included (in part) some very specialized keypads, but STT might well have been another solution for someone with similar needs.
Another fellow on my team refuses to write/type anything other than pure code to solve issues at work, but will absolutely talk for hours on end about designs, considerations, issues, what-have-you, so we're actively trying to get him to adopt an STT-based workflow for knowledge transfer, writing tickets/bugs, etc.
Handy appears to keep the audio clips. There is a section in the settings to limit how many it keeps; there doesn't appear to be an upper limit, but it does have to be set manually. (I set mine to 99,999.)
It would be nice if it accepted -1 as an option to keep all recordings.
Quick glance: FreeFlow already saves WAV recordings for every transcript to ~/Lib../App../FreeFlow/audio/, with UUIDs linking them to pipeline history entries in Core Data. Audio files are automatically deleted, though, when their associated history entries are deleted. Should be a quick fix. I recently did the same for hyprvoice, for debugging and auditing.
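The UUID-keyed retention scheme described here is straightforward to replicate; a minimal sketch (not FreeFlow's actual code):

```python
import uuid
from pathlib import Path

class AudioStore:
    """Keep one WAV per transcription, keyed by the same UUID the history
    database uses, so deleting a history entry can delete its audio too."""
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, wav_bytes: bytes) -> str:
        """Write the recording and return the UUID to store on the history entry."""
        entry_id = str(uuid.uuid4())
        (self.root / f"{entry_id}.wav").write_bytes(wav_bytes)
        return entry_id

    def delete(self, entry_id: str) -> None:
        """Remove the WAV when its history entry is deleted."""
        (self.root / f"{entry_id}.wav").unlink(missing_ok=True)
```

Making "keep forever" an option is then just a matter of never calling `delete` from the history-pruning path.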
Quick question: what's the state of vibe coding with Xcode? I remember there were some issues months ago trying to get a seamless integration working. Has it improved?
I just vibe coded a small app in 2 hours (mostly due to making additional adjustments). Claude Code used the Xcode CLI. No issues, besides the fact that it's not notarized and you have to trust it through Gatekeeper.
Why do people feel the need to market something as a "free alternative to xyz" when it's a basic utility? I take it as an instant signal that the dev is a copycat, mostly interested in getting stars and eyeballs rather than making a genuinely useful, high-quality product.
Really good to know Handy exists; it's the first I'm hearing about it. I use a speech-to-text app that I built for myself, and I know at least one co-worker pays $10 a month for (I think) Wispr. I think it's possible there was no intention to market, and the creator simply didn't know about Handy, just like me.
Another free option: Mellon (voice.mellon.chat) — fully local on Mac, no cloud, BYOK. Custom dictionary + phonetic corrections so it actually gets your technical terms right.
Also has an OpenClaw integration if anyone's using that for AI agents.
By "any GPU" you mean a physical, dedicated GPU card, right?
That's not a small requirement, especially on Macs.
It’s now my favorite fully local STT for MacOS:
https://github.com/kitlangton/Hex
[1] https://github.com/PinW/whisper-key-local
Surprisingly, it produced a better output (at least I liked its version) than the recommended but heavy model (Parakeet V3 @ 478 MB).
https://usetalkie.com
https://github.com/Beingpax/VoiceInk
[0]: https://spokenly.app/
just bought the one-time licence. this is the future of AI pricing - local models and one-time fee.
F12 -> sox for recording -> temp.wav -> faster-whisper -> pbcopy -> notify-send to know what’s happening
https://github.com/sathish316/soupawhisper
F12 Press → arecord (ALSA) → temp.wav → faster-whisper → xclip + xdotool
https://github.com/ksred/soupawhisper
Thanks to faster-whisper and quantized local models, I use it everywhere I was previously using Superwhisper: Docs, Terminal, etc.
If you are willing to use a service for transcriptions, Mistral (which is also European) works rather nicely if they support your language https://docs.mistral.ai/capabilities/audio_transcription#tra...
My take for X11 Linux systems. Small and low dependency except for the model download.
[0] https://github.com/EpicenterHQ/epicenter
https://blazingbanana.com/work/whistle
It took about 4 hours today... wild.
https://mistral.ai/news/voxtral-transcribe-2
Some day!
It’s free and offline
https://github.com/PawelAdamczuk/blah
https://github.com/corlinp/voibe
I do see the name has since been taken by a paid service... shame.
native app uses Parakeet (v2 or V3) on iOS
[0]: https://computerhistory.org/profile/greg-priest-dorman/ [1]: https://www.cs.vassar.edu/people/priestdo/wearables/top
https://handy.computer/
Just use handy: https://github.com/cjpais/Handy
Full disclosure: I built it.
Won't be free when xAI starts charging.