nicktikhonov
|
1 day ago
|
on: Nvidia PersonaPlex 7B on Apple Silicon: Full-Duplex Speech-to-Speech in Swift
From what I've seen, it's really easy to get PersonaPlex stuck in a death spiral - talking to itself, stuttering, and descending deeper and deeper into total nonsense - which makes it useless for any production use case right now. But I think this kind of end-to-end model is what's needed to correctly model conversations. An STT/TTS pipeline compresses a lot of information - tone, timing, emotion - out of the input before it ever reaches the model, so it seems obvious that the results will always be somewhat robotic. Excited to see the next iteration of these models!
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
Yep. Caching more broadly seems like something worth exploring next if I ever do a pt. 2.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
Yep. I've been learning Chinese for the past 3 months, so the name folds in inspiration from my other hobby :)
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
Glad to hear! I built my blog on top of Next.js - it basically just renders .mdx files with Contentlayer. One of the things I discovered is that you can easily vibe-code these explainer widgets. It seems like a perfect use case for vibe coding - each widget is a simple React component, and I can keep iterating until it works just the way I like. They're also super easy to interleave with content. This seems like an obvious feature addition for all the blogging platforms.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
You're probably right - at least at scale, this could help.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
One thing you can do is give the LLM a "skip turn" tool, which basically triggers the system to wait without saying anything. Then it just takes clever prompting to get the desired result.
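Roughly, a sketch of what that could look like with an OpenAI-style function-calling schema (the tool name and handler here are hypothetical, not from any actual system):

```python
# Hypothetical sketch of a "skip turn" tool. The schema follows the
# OpenAI-style function-calling format; all names are made up here.
SKIP_TURN_TOOL = {
    "type": "function",
    "function": {
        "name": "skip_turn",
        "description": (
            "Call this when no reply is needed, e.g. the user is "
            "still mid-thought. The agent stays silent and listens."
        ),
        "parameters": {"type": "object", "properties": {}},
    },
}

def handle_llm_output(message):
    """Return text to speak, or None to skip the turn entirely."""
    for call in message.get("tool_calls") or []:
        if call["function"]["name"] == "skip_turn":
            return None  # don't synthesize any audio; keep listening
    return message.get("content")
```

The prompting side then just has to describe when staying silent is the right move.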
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
I feel like you could get pretty far with a Raspberry Pi and a microphone/speaker. I think the hard part is running a model on-device that can detect a "Hey agent" wake word, so that it can run 24/7 and hand off to the orchestrator when it catches a real question/query.
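The control flow is basically a two-state gate: a cheap always-on detector, and an expensive cloud path it wakes up. A toy sketch - the trivial text stub below stands in for a real on-device wake-word model (e.g. openWakeWord or Porcupine, which score raw audio frames instead):

```python
def detect_wake_word(chunk: str) -> bool:
    # Placeholder for a real on-device wake-word model; a real
    # system scores audio frames, not transcribed text.
    return "hey agent" in chunk.lower()

def run(audio_chunks, orchestrator):
    """Gate the expensive pipeline behind the cheap detector."""
    awake = False
    for chunk in audio_chunks:
        if not awake:
            awake = detect_wake_word(chunk)  # cheap, always-on path
        else:
            orchestrator(chunk)              # expensive cloud path
            awake = False                    # go back to sleep after one query
```

The always-on path has to be cheap enough to run continuously on the Pi; everything else only spins up after the wake word.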
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
This is fascinating, thanks for sharing! I wonder why Amazon/Google/Apple didn't hop on the voice assistant/agent train in the last few years. All three have existing products with existing users and could pretty much define and capture the category with a single over-the-air update.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
I'd say it was a collaboration. I had to hand-hold Claude quite a bit in the early stages, especially with architecture and with finding the right services to get the outcome I wanted. But if you care most about where the code came from, it was probably 85-90% LLM - and that's fantastic given that the result is as performant as anything you'll find out of the box.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
100% - I thought about that shortly after writing this up. One way to make it work is to have a tiny, lower-latency model pick that first reply from a set of options, then aggressively cache the TTS responses to get latency super low. Responses like "Hmm, let me think about that..." would be served within milliseconds.
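A minimal sketch of that filler cache, assuming a `synthesize(text) -> bytes` TTS call (hypothetical name): pay the TTS cost once at startup, then serve pre-rendered audio from memory.

```python
# Canned acknowledgements worth pre-rendering; illustrative examples.
FILLERS = [
    "Hmm, let me think about that...",
    "Good question, one moment...",
]

class FillerCache:
    def __init__(self, synthesize):
        # Pay the TTS round trip once, up front, for every phrase.
        self._audio = {text: synthesize(text) for text in FILLERS}

    def get(self, text):
        # Hit: pre-rendered audio, effectively free.
        # Miss: caller falls back to a live TTS round trip.
        return self._audio.get(text)
```

The constraint is that the fast model has to pick from exactly these phrases, otherwise every reply is a cache miss.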
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
Gross
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
Very cool! Starred and added to my reading list. Would love to chat and share notes, if you'd like.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
I'm sure LiveKit or similar would be best in production - these libraries handle a lot of edge cases, or at least let you configure things quite well out of the box. Though maybe that argument will become less and less potent over time. The results I got were genuinely impressive, and of course most of the credit goes to the LLM. Still, I think it's worth building this stuff from scratch, just so you can be sure you understand what you'll actually be running. I now know how every piece works and can configure/tune things much more confidently.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
I was using Twilio, and as far as I'm aware they handle any echoes that may arise. I'm actually not sure where in the telephony stack this is handled, but luckily I didn't see any issues or have to solve the problem myself.
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
I didn't try Soniox, but I made a note to check it out! I chose Flux because I was already using Deepgram for STT and just happened to discover it when I was doing research. It would definitely be a good follow-up to try out all the different endpointing solutions to see what would shave off additional latency and feel most natural.
Another good follow-up would be to try PersonaPlex, Nvidia's new model that would completely replace this architecture with a single model that does everything:
https://research.nvidia.com/labs/adlr/personaplex/
nicktikhonov
|
3 days ago
|
on: Show HN: I built a sub-500ms latency voice agent from scratch
If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.
nicktikhonov
|
25 days ago
|
on: Show HN: I built a platform that connects people with travelers to carry items
I'd do this for friends, but at scale this is unfortunately customs fraud
nicktikhonov
|
25 days ago
|
on: Building an AI voice agent from scratch
I spent a day (~$100 in API credits) rebuilding the core orchestration loop of a real-time AI voice agent from scratch instead of using an all-in-one SDK. The hard part isn’t STT, LLMs, or TTS in isolation, but turn-taking: detecting when the user starts and stops speaking, cancelling in-flight generation instantly, and pipelining everything to minimize time-to-first-audio.
The write-up covers why VAD alone fails for real turn detection, how voice agents reduce to a minimal speaking/listening loop, why STT → LLM → TTS must be streaming rather than sequential, why TTFT matters more than model quality in voice, and why geography dominates latency. By colocating Twilio, Deepgram, ElevenLabs, and the orchestration layer, I reached ~790ms end-to-end latency, slightly faster than an equivalent Vapi setup.
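The core speaking/listening loop with barge-in can be sketched in a few lines of asyncio. This is a simplified stand-in, not the actual implementation - the `reply_stream` callback bundles the streaming LLM+TTS path, and the event handlers would be driven by the endpointing layer:

```python
import asyncio

class Agent:
    def __init__(self, reply_stream, play_chunk):
        self._reply_stream = reply_stream  # async gen: transcript -> audio chunks
        self._play = play_chunk            # sends one chunk to the caller
        self._speaking = None              # in-flight reply task, if any

    async def on_user_stopped_speaking(self, transcript):
        # Endpoint detected: start streaming a reply immediately.
        self._speaking = asyncio.create_task(self._speak(transcript))

    def on_user_started_speaking(self):
        # Barge-in: cancel generation and playback instantly.
        if self._speaking and not self._speaking.done():
            self._speaking.cancel()

    async def _speak(self, transcript):
        # Pipelined: play each chunk as it arrives, never wait for
        # the full response before starting audio.
        async for chunk in self._reply_stream(transcript):
            await self._play(chunk)
```

Cancelling the task tears down the whole in-flight STT→LLM→TTS chain at its current await point, which is what keeps interruptions feeling instant.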