I don't think many people know how amazing Elixir has become at machine learning. If you want to learn more I can't recommend Seam Moriarity's book Machine Learning in Elixir enough. Concepts are explained in extremely straight forward language and there's loads of examples!
Nice writeup! Super interesting that we both took different paths, but ended up with similar latencies.
I built a real-time conversation platform in Elixir. I used the Membrane framework to coordinate amongst the STT, LLM and TTS steps. I also ended up with latency in the ~1300 ms range.
I found research that says the typical human response time is 250 to 300 ms [0] in a conversation, so I think that should be the goal.
For my solution, some of the things we did to get latency as low as possible:
1. We stream the audio to the TTS endpoint. If you're transcribing as the audio comes in, then all you care about is the tail latency (the time between when the audio ends and the final transcript arrives). That helped a bunch for us. Google is around 200 ms with this approach.
2. Gpt 3.5 still has a time to first token of ~350 to ~400 ms. I couldn't find a way around that. But you can stream those tokens to ElevenLabs and start getting audio faster which helps.
3. ElevenLabs eats us most of the latency budget. Even with their turbo model their latency is 600-800 ms according to my timings. Again, streaming the words in (not tokens) and calling flush seemed to help.
The key I found was to cover up the latency. We respond immediately with some filler audio. The trick was getting the LLM to be aware of the filler audio text and continue naturally from that point
This matches my experience doing it with Elixir/OpenAI/ElevenLabs as well.
Depending on the application it’s also possible to fire the whole thing off pre-emptively, and then use the early response unless later context explicitly invalidates it.
Another cool trick to get around TTS latency is to maintain an audio cache keyed by semantic meaning, and get the LLM to choose from the cache. This saves high TTS API costs too.
1.3s imo is a fine time frame to start actually speaking. Humans, well most of us anyway, don’t start speaking informative words right away. Instead we add in “umm”s, inhales, “mhm”s, “yeah…”s and so on.
I think your approach is a good one. I’m now wondering for these filler sounds, do you contextualize them somehow? That is make filler feel more natural.
Slightly off-topic, but there isn't anyway to tag other HN users is there? Interested to see whether Sean could use any of your methods to improve his own approach.
> Now, if you’re wondering if I spent $99 to save some milliseconds for a meaningless demo, the answer is absolutely yes I did.
Godspeed, soldier.
> I was very excited for this problem in particular because it’s literally the perfect application of Elixir and Phoenix. If you are building conversational agents, you should seriously consider giving Elixir a try. A large part of how quick this demo was to put together is because of how productive Elixir is.
Back in the pre-GPT era I built a chatbot with LiveView, it is a a fantastic fit for assistants.
TLDR: The problem domain (telecom) fits Elixir perfectly
If you’re talking about scaling this, Elixir is built on the BEAM VM which was originally made for Ericsson and is tailor made for telecom systems.
Its whole paradigm is built around the concept of Let It Fail, which is basically about achieving fault tolerance through isolation and supervision.
So aside from the fact that Elixir+Phoenix is a productive framework that allowed the author to build this in a few days, it also means that it will scale very well with minimal code changes.
For reference, one of the solutions you might use to distribute this in Python is Celery, which is built on RabbitMQ which is built on Erlang, which is the predecessor of Elixir.
andy_ppp|1 year ago
https://pragprog.com/titles/smelixir/machine-learning-in-eli...
jatins|1 year ago
For example can I load or fine tune a model pre-trained in pytorch/JAX in Axon? Or does everything need to be written from ground up in Elixir?
thibaut_barrere|1 year ago
That coupled with LiveView + (quite easy scaling in general) results into interesting opportunities.
jonvk|1 year ago
enraged_camel|1 year ago
TonyHaenn|1 year ago
I built a real-time conversation platform in Elixir. I used the Membrane framework to coordinate amongst the STT, LLM and TTS steps. I also ended up with latency in the ~1300 ms range.
I found research that says the typical human response time is 250 to 300 ms [0] in a conversation, so I think that should be the goal.
For my solution, some of the things we did to get latency as low as possible: 1. We stream the audio to the TTS endpoint. If you're transcribing as the audio comes in, then all you care about is the tail latency (the time between when the audio ends and the final transcript arrives). That helped a bunch for us. Google is around 200 ms with this approach.
2. Gpt 3.5 still has a time to first token of ~350 to ~400 ms. I couldn't find a way around that. But you can stream those tokens to ElevenLabs and start getting audio faster which helps.
3. ElevenLabs eats us most of the latency budget. Even with their turbo model their latency is 600-800 ms according to my timings. Again, streaming the words in (not tokens) and calling flush seemed to help.
The key I found was to cover up the latency. We respond immediately with some filler audio. The trick was getting the LLM to be aware of the filler audio text and continue naturally from that point
[0] https://journalofcognition.org/articles/10.5334/joc.268#
nojs|1 year ago
Depending on the application it’s also possible to fire the whole thing off pre-emptively, and then use the early response unless later context explicitly invalidates it.
Another cool trick to get around TTS latency is to maintain an audio cache keyed by semantic meaning, and get the LLM to choose from the cache. This saves high TTS API costs too.
theflyinghorse|1 year ago
abrookewood|1 year ago
birracerveza|1 year ago
> Now, if you’re wondering if I spent $99 to save some milliseconds for a meaningless demo, the answer is absolutely yes I did.
Godspeed, soldier.
> I was very excited for this problem in particular because it’s literally the perfect application of Elixir and Phoenix. If you are building conversational agents, you should seriously consider giving Elixir a try. A large part of how quick this demo was to put together is because of how productive Elixir is.
Back in the pre-GPT era I built a chatbot with LiveView, it is a a fantastic fit for assistants.
I might pick it up again, it was pretty fun.
recurser|1 year ago
jasonjmcghee|1 year ago
Also, have you checked out XTTSv2 over StyleTTS2?
meatyapp|1 year ago
ac_alejos|1 year ago
If you’re talking about scaling this, Elixir is built on the BEAM VM which was originally made for Ericsson and is tailor made for telecom systems.
Its whole paradigm is built around the concept of Let It Fail, which is basically about achieving fault tolerance through isolation and supervision.
So aside from the fact that Elixir+Phoenix is a productive framework that allowed the author to build this in a few days, it also means that it will scale very well with minimal code changes.
For reference, one of the solutions you might use to distribute this in Python is Celery, which is built on RabbitMQ which is built on Erlang, which is the predecessor of Elixir.