top | item 40076787

(no title)

TonyHaenn | 1 year ago

Nice writeup! Super interesting that we both took different paths, but ended up with similar latencies.

I built a real-time conversation platform in Elixir. I used the Membrane framework to coordinate amongst the STT, LLM and TTS steps. I also ended up with latency in the ~1300 ms range.

I found research that says the typical human response time is 250 to 300 ms [0] in a conversation, so I think that should be the goal.

For my solution, some of the things we did to get latency as low as possible: 1. We stream the audio to the TTS endpoint. If you're transcribing as the audio comes in, then all you care about is the tail latency (the time between when the audio ends and the final transcript arrives). That helped a bunch for us. Google is around 200 ms with this approach.

2. Gpt 3.5 still has a time to first token of ~350 to ~400 ms. I couldn't find a way around that. But you can stream those tokens to ElevenLabs and start getting audio faster which helps.

3. ElevenLabs eats us most of the latency budget. Even with their turbo model their latency is 600-800 ms according to my timings. Again, streaming the words in (not tokens) and calling flush seemed to help.

The key I found was to cover up the latency. We respond immediately with some filler audio. The trick was getting the LLM to be aware of the filler audio text and continue naturally from that point

[0] https://journalofcognition.org/articles/10.5334/joc.268#

discuss

order

nojs|1 year ago

This matches my experience doing it with Elixir/OpenAI/ElevenLabs as well.

Depending on the application it’s also possible to fire the whole thing off pre-emptively, and then use the early response unless later context explicitly invalidates it.

Another cool trick to get around TTS latency is to maintain an audio cache keyed by semantic meaning, and get the LLM to choose from the cache. This saves high TTS API costs too.

Dowwie|1 year ago

appointment scheduling seems like an ideal consumer of cached audio responses, but how can segments be concatenated into a naturally sounded response?

theflyinghorse|1 year ago

1.3s imo is a fine time frame to start actually speaking. Humans, well most of us anyway, don’t start speaking informative words right away. Instead we add in “umm”s, inhales, “mhm”s, “yeah…”s and so on. I think your approach is a good one. I’m now wondering for these filler sounds, do you contextualize them somehow? That is make filler feel more natural.

TonyHaenn|1 year ago

Depends on what you're aiming for. For my use case, I'm aiming for the feeling of talking to another human. I built an iOS app for little kids to call Santa. Low latency was important. Now I'm working on a mock interview experience; same deal, needs to feel like the real thing.

Re: contextualizing the filler. No, but it's a good idea :) This thread made me think there's a way to generate one on the fly based on the first part of what the person has said. The challenge though is it seems to me that filler phrases usually relate to what the person said last, not first.

abrookewood|1 year ago

Slightly off-topic, but there isn't anyway to tag other HN users is there? Interested to see whether Sean could use any of your methods to improve his own approach.