Show HN: Utter, a local-first dictation app for Mac and iPhone
So I built Utter.
The main idea is simple: it should work as both a fast dictation tool and a longer-form voice note / meeting capture tool, while still giving the user control over where processing happens and what the output looks like.
A few things it does today:
- global dictation with customizable shortcuts
- saved modes for different workflows, with different prompts/models
- remembers the last mode used per app
- meeting recording with speaker-labeled transcripts, summaries, and action items
- file transcription for audio/video
- saved audio/transcripts with export options
- prompt-based post-processing for turning raw speech into notes, messages, summaries, etc.
- built-in note editor
- iPhone app with dictation keyboard
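To make the "modes" idea concrete: a mode is essentially a saved prompt/model pair, and the app restores whichever one you last used in the frontmost app. Here's a minimal sketch of that flow — all names, model ids, and prompts are hypothetical, not Utter's actual internals:

```python
from dataclasses import dataclass

@dataclass
class Mode:
    name: str
    model: str   # local or cloud model id (hypothetical)
    prompt: str  # post-processing instructions

# Hypothetical saved modes, one per workflow
MODES = {
    "slack": Mode("slack", "local-llm", "Rewrite as a casual chat message."),
    "notes": Mode("notes", "gpt-4o", "Turn this into structured markdown notes."),
}

# Last-used mode remembered per app
LAST_MODE_FOR_APP = {"Slack": "slack", "Obsidian": "notes"}

def pick_mode(frontmost_app: str, default: str = "notes") -> Mode:
    """Restore whichever mode was last used in this app."""
    return MODES[LAST_MODE_FOR_APP.get(frontmost_app, default)]

def build_postprocess_request(mode: Mode, transcript: str) -> dict:
    """Assemble the LLM request that turns raw speech into final text."""
    return {
        "model": mode.model,
        "messages": [
            {"role": "system", "content": mode.prompt},
            {"role": "user", "content": transcript},
        ],
    }
```

The per-app memory is what keeps a quick Slack dictation casual while the same shortcut in a notes app produces structured markdown.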
A big motivation was being able to use it locally. It supports local transcription, optional local post-processing, BYOK, or cloud providers depending on the workflow. I also wanted phone-to-desktop capture to feel simple, so it syncs through iCloud and doesn't require an account.
Curious to hear from people who use dictation heavily, especially on:
- where current dictation tools still fall short
- whether the “modes” idea makes sense in practice
Leftium | 2 days ago
This one is unique in that it supports iPhone. I haven't seen mobile support very often.
Despite all these apps, there are two things holding me back from using a dictation app on a regular basis:
- streaming transcription: see words in realtime
- multimodal input: mix voice with keyboard
So I started prototyping this type of realtime multimodal dictation UX: https://rift-transcription.vercel.app
This HN comment captures why streaming is important for transcription: https://hw.leftium.com/#/item/47149479
hubab | 2 days ago
On multimodal input, the UX you’re prototyping where you switch between dictating and typing while composing is interesting. I haven’t really seen that approach before.
The direction I took is a bit different. Instead of mixing modalities mid-composition, dictation becomes context-aware during post-processing. Selected/Copied text or surrounding field content can be inserted into the post-processing prompt so the spoken input is interpreted relative to what’s already on screen.
r0fl | 3 days ago
How is this different from the built-in, free speech-to-text on my iPhone or Mac? I can also talk into Voice Memos and get a full transcript, even from really long files.
Thanks
hubab | 3 days ago
Utter uses GPT-4o Transcribe by default for cloud transcription, and in my experience it’s best in class. The gap is most obvious on names, niche terminology, and technical vocabulary. I use it a lot for prompting coding agents, and I've found Apple’s built-in dictation and most other apps don't come close in terms of accuracy.
It also adds a custom post-processing step. So instead of ending up with a raw transcript, you can record a long, messy voice note and have it turned into clean, structured markdown notes.
If you want to test the accuracy difference yourself, try dictating this with both Apple dictation and ChatGPT on the web (which uses the same model) and compare the output:
“My FastAPI service uses Pydantic, Celery, Redis, and SQLAlchemy, but the async worker is deadlocking when a background task retries after a Postgres connection pool timeout.”