Show HN: VoxConvo – "X but it's only voice messages"
10 points | siim | 3 months ago | voxconvo.com
I saw this tweet, "Hear me out: X but it's only voice messages (with AI transcriptions)", and couldn't stop thinking about it.
So I built VoxConvo.
Why this exists:
AI-generated content is drowning social media. ChatGPT replies, bot threads, AI slop everywhere.
When you hear someone's actual voice (their tone, hesitation, excitement), you know it's real. That authenticity is what we're losing.
So I built a simple platform where voice is the ONLY option.
The experience:
Every post is voice + transcript with word-level timestamps:
- Read mode: scan the transcript like normal text.
- Listen mode: hit play and the words highlight in real time.
You get the emotion of voice with the scannability of text.
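Listen mode boils down to mapping the current playback position to a word in the transcript. A minimal sketch, assuming each word carries start/end times in seconds and the list is sorted by start time (`Word` and `TranscriptHighlighter` are hypothetical names, not VoxConvo's code):

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the clip
    end: float


class TranscriptHighlighter:
    """Maps a playback position to the transcript word that should be highlighted."""

    def __init__(self, words: List[Word]) -> None:
        # Assumption: words are sorted by start time and do not overlap.
        self.words = words
        self._starts = [w.start for w in words]

    def active_index(self, t: float) -> Optional[int]:
        """Index of the word being spoken at time t, or None during a pause."""
        # Binary search for the last word whose start is <= t.
        i = bisect_right(self._starts, t) - 1
        if i >= 0 and t < self.words[i].end:
            return i
        return None


words = [Word("hear", 0.0, 0.4), Word("me", 0.4, 0.6), Word("out", 0.9, 1.3)]
hl = TranscriptHighlighter(words)
print([hl.active_index(t) for t in (0.2, 0.5, 0.7, 1.0)])  # [0, 1, None, 2]
```

The player would call `active_index` on each time update; the binary search keeps that cheap even for long posts.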
Key features:
- Voice shorts
- Real-time transcription
- Visual voice editing: clicking a word in the transcript deletes that audio segment, so you can remove filler words, mistakes, and pauses
- Word-level timestamp sync
- No LLM content generation
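Because every word has timestamps, click-to-delete editing can be expressed as interval arithmetic: collect the spans of the deleted words and keep everything between them. A minimal sketch, assuming the same start/end-in-seconds word format (`Word` and `kept_segments` are hypothetical; actually splicing the audio is left to whatever tool the backend uses):

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float


def kept_segments(
    words: List[Word], deleted: Iterable[int], duration: float
) -> List[Tuple[float, float]]:
    """Return the (start, end) audio spans that survive after deleting the given words."""
    # Time spans of the clicked (deleted) words, sorted so we can sweep left to right.
    cuts = sorted((words[i].start, words[i].end) for i in set(deleted))
    keep: List[Tuple[float, float]] = []
    cursor = 0.0  # end of the audio already accounted for
    for start, end in cuts:
        if start > cursor:
            keep.append((cursor, start))  # audio before this cut is kept
        cursor = max(cursor, end)  # jump past the cut (also handles overlapping cuts)
    if cursor < duration:
        keep.append((cursor, duration))  # tail after the last cut
    return keep


words = [
    Word("so", 0.0, 0.3),
    Word("um", 0.3, 0.8),
    Word("hello", 0.8, 1.2),
    Word("uh", 1.5, 1.9),
    Word("world", 1.9, 2.4),
]
print(kept_segments(words, deleted={1, 3}, duration=2.5))
# [(0.0, 0.3), (0.8, 1.5), (1.9, 2.5)]
```

Storing the edit as a list of kept spans (rather than re-encoding immediately) keeps it non-destructive: undoing a deletion is just recomputing from the original audio.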
Technical details:
Backend running on a Mac Mini M1:
- TypeGraphQL + Apollo Server
- MongoDB + Atlas Search (MongoDB Community + mongot)
- Redis pub/sub for GraphQL subscriptions
- Docker containerization, ready to scale
Transcription:
- VOSK real-time GigaSpeech model, which uses about 7 GB of RAM
- WebSocket streaming for real-time partial results
- Word-level timestamp extraction plus a punctuation model
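On the client side, the WebSocket stream carries two kinds of messages: partial hypotheses while the user is still talking, and final results with per-word timings. A minimal parsing sketch, assuming the JSON shape VOSK's `KaldiRecognizer` produces with word output enabled (`SetWords(True)`): `{"partial": "..."}` for partials, and `{"result": [{"word", "start", "end", "conf"}, ...], "text": "..."}` for finals. The type names are hypothetical, and the punctuation step is not shown:

```python
import json
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class TimedWord:
    word: str
    start: float  # seconds
    end: float


@dataclass(frozen=True)
class Partial:
    text: str  # hypothesis so far; may still change


@dataclass(frozen=True)
class Final:
    text: str
    words: List[TimedWord]


def parse_vosk_message(raw: str) -> Union[Partial, Final]:
    """Turn one VOSK JSON result into a typed partial or final transcript."""
    msg = json.loads(raw)
    if "partial" in msg:
        return Partial(msg["partial"])
    # Final results carry per-word timings under "result" when word output is enabled.
    words = [
        TimedWord(w["word"], float(w["start"]), float(w["end"]))
        for w in msg.get("result", [])
    ]
    return Final(msg.get("text", ""), words)


print(parse_vosk_message('{"partial": "hear me"}'))
final = parse_vosk_message(
    '{"result": [{"conf": 1.0, "start": 0.0, "end": 0.42, "word": "hear"},'
    ' {"conf": 0.98, "start": 0.42, "end": 0.6, "word": "me"}], "text": "hear me"}'
)
print([(w.word, w.start, w.end) for w in final.words])
```

Partials would only update the live text; the word timings from finals are what the listen-mode highlighting and click-to-delete editing depend on.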
Storage:
- Audio files are stored in AWS S3
- Everything else is local
Why a Mac Mini for the MVP? Validation first, scaling later. The architecture is containerized and ready to migrate. But I'd rather prove demand on gigabit fiber than burn cloud budget.
teunlao|3 months ago
Clubhouse lost 93% of its users from peak. WhatsApp sends 7 billion voice messages daily, but those are DMs, not feeds.
The math doesn't work: reading is 50-80% faster than listening. You can skim 50 text posts in 100 seconds. 50 voice posts? 15 minutes.
Voice works async 1-to-1. You built Twitter where every tweet is a 30-second voicemail nobody has time to listen to.
The transcription proves it: users will read, not listen. Which makes this a "text feed with worse UX".
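Taking this comment's figures at face value, the per-post times they imply work out as follows (text: 50 posts in 100 seconds; voice: 50 posts in 15 minutes; and the 30-second post length mentioned above):

```latex
\begin{aligned}
t_{\text{text}}  &= \frac{100\ \text{s}}{50} = 2\ \text{s per post} \\
t_{\text{voice}} &= \frac{15 \times 60\ \text{s}}{50} = 18\ \text{s per post} \\
50 \times 30\ \text{s} &= 1500\ \text{s} = 25\ \text{min}
\end{aligned}
```

So the 15-minute figure corresponds to 18-second clips; at a full 30 seconds each, the same 50 posts would take 25 minutes.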
siim|3 months ago
Reading > listening for consumption.
Talk to create, read to consume.
zahlman|3 months ago
> Why this exists: AI-generated content is drowning social media.
> Real-time transcription
... So you want to filter out AI content by requiring users to produce audio (not really any harder for AI than text), and you add AI content afterward (the transcriptions) anyway?
I really think you should think this through more.
The "authenticity" problem is fundamentally about how users discover each other. You get flooded with AI slop because the algorithm is pushing it in front of you. And that algorithm is easily gamed, and all the existing competitors are financially incentivized to implement such an algorithm and not care about the slop.
Also, I looked at the page source and it gives a strong impression that you are using AI to code the project and also that your client fundamentally works by querying an LLM on the server. It really doesn't convey the attitude supposedly motivating the project.
Nice tech demo though, I guess.
siim|3 months ago
To clarify:
1. Transcription is local VOSK speech-to-text over a WebSocket.
2. Live transcript post-processing has an optional Gemini Flash-Lite pass turned on, which tries to fix obvious transcription mistakes and nothing else. The real fix here is a more accurate transcriber.
3. Backend: TypeGraphQL + MongoDB + Redis.
The anti-AI stance isn't "zero AI anywhere", it's about requiring human input.
AI-generated audio is either too bad or too perfect. Real recorded voice has human imperfections.
esafak|3 months ago
I feel like you are making your users jump through hoops to do bot and slop detection, when you ought to be investing in technology to do the same. Here is a focusing question: would you still demand audio recordings if you had that technology?
Maybe you will court an interesting set of users when you do this? I just know I will not be one of them; ain't got time for that. Good luck.
jagged-chisel|3 months ago
:grimace:
Sorry, but I have to pass.