Nice write-up — turn-taking is the whole game.

Two things that bit us building production voice agents:

1) “Barge‑in” feels broken unless you can cancel TTS + LLM immediately (sub‑second) and you treat partial STT hypotheses as first-class signals, not just final transcripts. A simple trick: trigger cancel on any sustained non-silence above a low threshold, then re-enable once you’ve seen N ms of silence.

2) Echo / duplex audio: if you don’t subtract your own TTS audio (or at least gate VAD while TTS is playing), you’ll get false user-starts. Even a crude “TTS playing → raise VAD threshold” helps.
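
A minimal Python sketch of both tricks, in case it helps make them concrete. Everything here is an assumption for illustration, not our production code: the `BargeInDetector` name, the 20 ms frame size, the RMS thresholds, and the 100 ms / 400 ms windows are all made up, and "re-enable" is read as "let the agent speak again". It takes one energy value per audio frame, fires "cancel" after sustained non-silence, and fires "resume" after N ms of silence, with a higher threshold while TTS is playing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BargeInDetector:
    # All values below are illustrative assumptions, not tuned numbers.
    frame_ms: int = 20             # duration of one audio frame
    base_threshold: float = 0.02   # RMS level counted as non-silence
    tts_threshold: float = 0.08    # raised level while our own TTS plays
    min_speech_ms: int = 100       # "sustained" non-silence before cancel
    min_silence_ms: int = 400      # N ms of silence before re-enabling
    speech_ms: int = 0
    silence_ms: int = 0
    user_speaking: bool = False

    def process(self, rms: float, tts_playing: bool) -> Optional[str]:
        """Feed one frame; returns "cancel", "resume", or None."""
        # Crude echo handling: TTS playing -> raise the VAD threshold.
        threshold = self.tts_threshold if tts_playing else self.base_threshold
        if rms >= threshold:
            self.speech_ms += self.frame_ms
            self.silence_ms = 0
            if not self.user_speaking and self.speech_ms >= self.min_speech_ms:
                self.user_speaking = True
                return "cancel"    # caller should stop TTS + LLM now
        else:
            self.silence_ms += self.frame_ms
            self.speech_ms = 0
            if self.user_speaking and self.silence_ms >= self.min_silence_ms:
                self.user_speaking = False
                return "resume"    # caller may let the agent speak again
        return None
```

In a real pipeline you would call `process` once per incoming frame and wire "cancel" to whatever aborts your TTS stream and LLM request. A real system would likely combine this with partial STT hypotheses and proper echo cancellation rather than rely on energy alone.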

We’re building eboo.ai (voice agents w/ fast barge‑in + streaming orchestration) and ended up with a very similar architecture (telephony + STT + TTS co-located, everything streaming). If you’re curious, happy to compare notes on jitter buffers / geo placement and what’s worked in the wild.
