dani-lokutor|20 days ago
We’re open-sourcing the Go orchestrator we built at Lokutor (https://github.com/lokutor-ai/lokutor-orchestrator).
Building a voice agent that feels like a human is 20% model quality and 80% orchestration. The "standard" approach—daisy-chaining STT, LLM, and TTS APIs—usually results in a 2-3 second delay that kills the conversation. We also found that implementing "barge-in" (the ability to interrupt the bot) is surprisingly tricky to get right across multiple streaming providers.
We chose Go because voice orchestration is essentially a high-concurrency plumbing problem: you’re managing several bidirectional streams (WebSockets/gRPC) while calculating RMS for VAD (Voice Activity Detection) and driving a state machine that must respond within milliseconds when it detects user speech.
What’s inside:
Full-Duplex: Capture and playback occur simultaneously without audio feedback loops.
Native Barge-in: When the user starts speaking, the orchestrator immediately kills the LLM generation and clears the TTS audio buffers.
Built-in RMS VAD: Thread-safe voice activity detection out of the box.
Provider Agnostic: Swap between Groq, OpenAI, Deepgram, Anthropic, and our own Versa engine.
Minimal Latency: Designed to add <10ms of overhead on top of the provider latencies.
We've used this to build agents that achieve sub-500ms end-to-end response times. We’d love to hear your feedback on the architecture, especially on how we handle the ManagedStream state machine.
GitHub: https://github.com/lokutor-ai/lokutor-orchestrator
Docs: https://pkg.go.dev/github.com/lokutor-ai/lokutor-orchestrato...
PranayKumarJain|18 days ago
Great work on open-sourcing the orchestrator. Full-duplex and barge-in are definitely the hardest parts to nail—getting those audio buffers cleared and the LLM stream killed in sub-500ms makes or breaks the "human" feel.
Curious how you’re handling VAD in noisy environments—do you find the RMS-based approach holds up for telephony, or are you considering a more robust model-based VAD (like Silero) in the future?
We’re tackling similar low-latency orchestration challenges at eboo.ai. It’s great to see more Go-based tools in this space. Subscribed to the repo!
unknown|13 days ago
[deleted]
dani-lokutor|13 days ago
Barge-in is a total nightmare. Clearing those buffers fast enough to kill the "ghost audio" without the LLM stuttering is exactly what we’re fighting right now.
You’re spot on about VAD, too. RMS is our "MVP debt": it’s fine for clean mics, but we’re definitely looking at a Silero bridge for telephony and noisy environments.
Also, we actually built this because we run Lokutor (ultra-low-latency TTS). If you guys at eboo.ai are hunting for faster inference, hit me up—would love to get you a key to play with.