
Show HN: I got frustrated with macOS transcription apps so I built my own

2 points | Neolio | 12 days ago | whisnap.com

Every local speech-to-text app I tried had the same problems: files getting stuck mid-transcription with no way to retry, retranscription that was either paywalled or buggy, and no fallback when something failed - just a silent failure, and your recording (I tend to talk for 5-10 minutes) is gone.

So I built Whisnap. Hold a hotkey, talk, release - text just appears where your cursor is. Local Whisper with Metal acceleration on Apple Silicon; nothing leaves your machine if you don't want it to.

I built fallbacks on top of fallbacks. If a model can't process your audio, it tries a different one. You can always retranscribe a recording. Even the optional cloud mode has its own fallback chain: WebSocket streaming falls back to batch upload, which falls back to local Whisper. Something always works.
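The fallback idea can be sketched as a chain of backends tried in order until one succeeds. This is a minimal illustration, not Whisnap's actual internals; the backend names and signatures are invented for the example.

```rust
// Try each transcription backend in order; return the first success,
// or the last error if every backend fails.
fn transcribe_with_fallbacks(
    audio: &[f32],
    backends: &[(&str, fn(&[f32]) -> Result<String, String>)],
) -> Result<String, String> {
    let mut last_err = String::from("no backends configured");
    for (name, run) in backends {
        match run(audio) {
            Ok(text) => return Ok(text),
            Err(e) => last_err = format!("{name}: {e}"), // remember why it failed
        }
    }
    Err(last_err)
}

fn main() {
    // Simulated backends: the first "model" fails, the second succeeds.
    let failing: fn(&[f32]) -> Result<String, String> = |_| Err("decode error".into());
    let working: fn(&[f32]) -> Result<String, String> = |_| Ok("hello world".into());
    let backends: &[(&str, fn(&[f32]) -> Result<String, String>)] =
        &[("large-v3", failing), ("base.en", working)];
    println!("{:?}", transcribe_with_fallbacks(&[0.0; 16000], backends));
}
```

The key design point is that a failure is recorded but never terminal until the whole chain is exhausted, so a flaky model can't silently eat a recording.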

One thing I spent a bunch of time on: a post-processing pipeline for Whisper's hallucination problem. Anyone who's worked with Whisper knows it hallucinates "Thanks for watching, don't forget to like and subscribe" from silent audio, or loops the same phrase endlessly. The filter handles bracketed artifacts, known hallucination phrases, word repetition, sentence loops, and cross-text deduplication. Not perfect, but catches most of it.
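The kinds of heuristics described can be sketched roughly like this. The phrase list and matching rules here are illustrative assumptions, not Whisnap's actual filter.

```rust
// Illustrative phrase list; a real filter would carry many more entries.
const KNOWN_HALLUCINATIONS: &[&str] = &[
    "thanks for watching",
    "don't forget to like and subscribe",
];

// Drop bracketed artifacts, known hallucination phrases, and
// immediately-repeated sentences from a transcript.
fn filter_hallucinations(text: &str) -> String {
    let mut out: Vec<String> = Vec::new();
    for raw in text.split('.') {
        let s = raw.trim();
        if s.is_empty() {
            continue;
        }
        // Bracketed artifacts like "[Music]" or "(applause)".
        if (s.starts_with('[') && s.ends_with(']'))
            || (s.starts_with('(') && s.ends_with(')'))
        {
            continue;
        }
        // Known hallucination phrases (case-insensitive substring match).
        let lower = s.to_lowercase();
        if KNOWN_HALLUCINATIONS.iter().any(|p| lower.contains(p)) {
            continue;
        }
        // Cross-sentence dedup: skip a sentence identical to the previous one.
        if out.last().map(|prev| prev.to_lowercase()) == Some(lower) {
            continue;
        }
        out.push(s.to_string());
    }
    out.join(". ")
}

fn main() {
    let noisy = "Hello there. [Music]. Thanks for watching. Hello there. Hello there. Real content.";
    println!("{}", filter_hallucinations(noisy));
}
```

A production filter also needs loop detection across longer windows (Whisper can repeat a phrase dozens of times), which a simple previous-sentence comparison like this won't catch.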

The same binary also works as a CLI: "whisnap recording.wav" just works. I run an AI agent (OpenClaw) on the same Mac, and instead of paying for ElevenLabs or other cloud transcription APIs, it just calls Whisnap's CLI and gets clean text back. Same models, no extra setup.
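Wiring an agent to the CLI can look like the sketch below. It assumes, as the post says, that `whisnap <file>` prints the transcript to stdout; the helper name and error handling are invented for the example.

```rust
use std::process::Command;

// Run the Whisnap CLI on a WAV file and return the transcript,
// assuming the transcript is printed to stdout.
fn transcribe(path: &str) -> Result<String, String> {
    let out = Command::new("whisnap")
        .arg(path)
        .output()
        .map_err(|e| e.to_string())?; // binary missing or failed to spawn
    if !out.status.success() {
        return Err(String::from_utf8_lossy(&out.stderr).into_owned());
    }
    Ok(String::from_utf8_lossy(&out.stdout).trim().to_owned())
}

fn main() {
    match transcribe("recording.wav") {
        Ok(text) => println!("{text}"),
        Err(e) => eprintln!("transcription failed: {e}"),
    }
}
```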

Stack: Tauri v2, whisper-rs, RNNoise for denoising, SIMD audio mixing, rubato resampling.

It's free, Mac only for now. Would love to know if the hallucination filter holds up for anyone else's use cases. https://whisnap.com/

3 comments


Leftium|12 days ago

> just a silent failure and your recording (I tend to talk for 5-10mins) is gone

One of the reasons for my streaming transcription app: https://rift-transcription.vercel.app

- You see results in less than a second as you talk.

My app also supports multimodal input: interleave talking with typing. (Click the "Replay" button to see a color-coded demo.)

Supports local models (with a little setup: https://rift-transcription.vercel.app/local-setup)

Neolio|6 days ago

Since it's on macOS, I use optimized builds of both Whisper and Parakeet, depending on whether you need accuracy. Parakeet is almost real-time.