TonyHaenn|1 year ago
I built a real-time conversation platform in Elixir. I used the Membrane framework to coordinate among the STT, LLM, and TTS steps. I ended up with end-to-end latency in the ~1300 ms range.
I found research saying that the typical human response time in conversation is 250 to 300 ms [0], so I think that should be the goal.
Some of the things we did to get latency as low as possible: 1. We stream the audio to the STT endpoint. If you're transcribing as the audio comes in, then all you care about is the tail latency (the time between when the audio ends and when the final transcript arrives). That helped a bunch for us; Google is around 200 ms with this approach.
2. GPT-3.5 still has a time to first token of ~350 to ~400 ms. I couldn't find a way around that. But you can stream those tokens to ElevenLabs and start getting audio back sooner, which helps.
3. ElevenLabs eats up most of the latency budget. Even with their turbo model, their latency is 600-800 ms according to my timings. Again, streaming words in (not tokens) and calling flush seemed to help.
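To make points 2 and 3 concrete, here's a minimal sketch of buffering streamed LLM tokens into whole words before handing them to a TTS stream. The TTS call is stubbed out and the token strings and function names are hypothetical; real streaming APIs yield tokens over a websocket or SSE connection.

```python
def llm_tokens():
    # Stub for a streaming LLM response; real APIs yield sub-word tokens.
    yield from ["Sure", ",", " let", " me", " check", " that", "."]

def words_from_tokens(tokens):
    """Buffer sub-word tokens into whole words before sending to TTS.

    Streaming TTS tends to produce better results when it receives
    complete words rather than raw token fragments.
    """
    buf = ""
    for tok in tokens:
        if tok.startswith(" ") and buf:
            yield buf
            buf = tok.lstrip()
        else:
            buf += tok
    if buf:
        yield buf  # final flush: emit whatever is left when the stream ends

def speak(words, tts_send):
    # Forward each word to the TTS stream as soon as it is complete,
    # instead of waiting for the full LLM response.
    for w in words:
        tts_send(w)

sent = []
speak(words_from_tokens(llm_tokens()), sent.append)
print(sent)
```

The point is that TTS starts receiving input after the first word boundary, not after the full completion, so the LLM's time-to-first-token and the TTS latency overlap instead of adding up.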
The key I found was to cover up the latency. We respond immediately with some filler audio. The trick was getting the LLM to be aware of the filler text and continue naturally from that point.
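One way to make the LLM aware of the filler is to seed the assistant turn with the filler text it has already "said" aloud. A minimal sketch, assuming a chat-completion-style message layout (the system prompt wording and message shape here are hypothetical, and whether an API continues a pre-filled assistant message varies by provider):

```python
FILLER = "Let me think about that for a second."

def build_messages(user_text, filler=FILLER):
    """Prime the LLM with the filler it has already spoken, so the
    generated reply continues naturally instead of restarting."""
    return [
        {"role": "system",
         "content": "You are a voice assistant. Your reply is appended "
                    "after the filler phrase already spoken aloud, so "
                    "continue from it without repeating it."},
        {"role": "user", "content": user_text},
        # Seed the assistant turn with the filler text; some APIs let you
        # pre-fill the assistant message and have the model continue it.
        {"role": "assistant", "content": filler},
    ]

msgs = build_messages("What's the weather in Boston tomorrow?")
print(msgs[-1]["content"])
```

The filler audio can be pre-rendered once, so it plays with near-zero latency while the real completion is still being generated.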
[0] https://journalofcognition.org/articles/10.5334/joc.268#
nojs|1 year ago
Depending on the application it’s also possible to fire the whole thing off pre-emptively, and then use the early response unless later context explicitly invalidates it.
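That pre-emptive pattern can be sketched as: generate from the early context immediately, then keep the result unless the context that arrived afterwards invalidates it. All names here are hypothetical; `generate` and `invalidated` stand in for the LLM call and whatever invalidation check fits the application:

```python
def speculative_reply(early_ctx, final_ctx, generate, invalidated):
    """Fire generation pre-emptively on the early context; discard the
    result only if the later context invalidates it."""
    early = generate(early_ctx)         # fired before the turn is complete
    if invalidated(early, final_ctx):
        return generate(final_ctx)      # regenerate with the full context
    return early                        # speculative answer still holds

# Toy usage: invalidate only if the user added "actually" afterwards.
calls = []
def gen(ctx):
    calls.append(ctx)
    return f"reply to: {ctx}"

out = speculative_reply("book a table", "book a table",
                        gen, lambda reply, ctx: "actually" in ctx)
```

On the happy path the expensive call has already finished by the time the user stops speaking, so the perceived latency is near zero; the cost is an occasional wasted generation.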
Another cool trick to get around TTS latency is to maintain an audio cache keyed by semantic meaning, and get the LLM to choose from the cache. This also saves on TTS API costs.
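A minimal sketch of such a cache. A real version would key on embeddings for genuinely semantic matching; this toy uses exact normalized phrases, and the byte strings merely stand in for pre-rendered TTS audio:

```python
# Pre-rendered audio clips keyed by canonical phrase.
AUDIO_CACHE = {
    "one moment please": b"<audio:one-moment>",
    "i can help with that": b"<audio:can-help>",
}

def normalize(text):
    # Lowercase and strip punctuation so near-identical phrasings
    # ("One moment, please!") hit the same cache entry.
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c == " ")
    return " ".join(cleaned.split())

def cached_audio(llm_choice):
    """Return pre-rendered audio if the LLM picked a cached phrase,
    else None (caller falls back to a live TTS request)."""
    return AUDIO_CACHE.get(normalize(llm_choice))

hit = cached_audio("One moment, please.")
miss = cached_audio("Tell me a joke")
```

The LLM can be prompted with the list of cache keys and asked to pick one when it fits, falling back to free-form text (and a live TTS call) otherwise.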
TonyHaenn|1 year ago
Re: contextualizing the filler. No, but it's a good idea :) This thread made me think there's a way to generate one on the fly based on the first part of what the person has said. The challenge, though, is that filler phrases usually seem to relate to what the person said last, not first.
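One way that observation could be sketched: pick the filler by matching keywords against only the *tail* of the utterance. The keyword table here is a hypothetical toy stand-in for a fast classifier or a small LLM call:

```python
# Map trailing keywords of the user's utterance to filler phrases.
TAIL_FILLERS = [
    ("weather", "Let me check the forecast..."),
    ("book", "Okay, looking at availability..."),
]
DEFAULT_FILLER = "One sec..."

def pick_filler(utterance, window=5):
    """Choose filler from the last few words, since filler phrases
    tend to relate to what the speaker said most recently."""
    tail = " ".join(utterance.lower().split()[-window:])
    for keyword, filler in TAIL_FILLERS:
        if keyword in tail:
            return filler
    return DEFAULT_FILLER

print(pick_filler("By the way, what's the weather tomorrow?"))
```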