PranayKumarJain | 6 days ago
A couple questions / thoughts from building voice agents in production:
- How do you handle barge‑in / interruptions? With <Gather input="speech"> + polling, it’s hard to do true full‑duplex + partial ASR. Have you considered a hybrid mode where you keep the TwiML simplicity for setup, but optionally switch to <Stream> (Media Streams) when people want sub‑second turn-taking?
- Twilio’s built-in speech recog is convenient, but in my experience it can be the first thing teams outgrow (accuracy, language coverage, costs, and lack of token-level partials). Do you expose an interface so people can swap STT later without reworking the call control?
- For long agent responses: do you chunk <Say> / keep call alive with <Pause>? Any gotchas around Twilio timeouts while the agent is “thinking”?
We’ve run into the same infra-vs-latency tradeoff at eboo.ai (real-time voice agents / telephony + WebRTC). If you ever want a sanity check on the lowest-latency Twilio path (Media Streams + incremental STT + barge-in), happy to compare notes.
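For reference, the barge-in piece on the Media Streams path can be sketched roughly like this. It assumes a bidirectional <Stream>, where sending a `clear` message with the `streamSid` asks Twilio to drop audio it has buffered for playback. The class name, the `send` callback, and the `min_chars` threshold are hypothetical, and partial transcripts are assumed to arrive from whatever incremental STT is plugged in.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class BargeInController:
    """Stops agent playback when the caller starts talking over it."""

    stream_sid: str
    # Hypothetical callback that writes one text frame to the
    # Media Streams WebSocket for this call.
    send: Callable[[str], None]
    # Ignore very short partials, which are often noise or backchannel.
    min_chars: int = 3
    agent_speaking: bool = False
    interruptions: int = 0

    def on_agent_audio_start(self) -> None:
        self.agent_speaking = True

    def on_agent_audio_done(self) -> None:
        self.agent_speaking = False

    def on_partial_transcript(self, text: str) -> bool:
        """Return True if this partial transcript triggered a barge-in."""
        if not self.agent_speaking or len(text.strip()) < self.min_chars:
            return False
        # Ask Twilio to discard any agent audio still queued for playback.
        self.send(json.dumps({"event": "clear", "streamSid": self.stream_sid}))
        self.agent_speaking = False
        self.interruptions += 1
        return True
```

Because the controller only sees text partials and a `send` callback, the STT vendor can be swapped without touching the call control, which is the separation I was asking about above.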
ranacseruet|5 days ago
On barge-in/interruptions: we have partial support. See the codebase and the docs at https://github.com/ranacseruet/clawphone/tree/main/docs for the architecture, the research behind it, and what we plan to address next. Feel free to engage on the repo by opening issues or suggestions.
Hope that helps. Thanks!