Implementing Natural Conversational Agents with Elixir

andy_ppp|1 year ago

I don't think many people know how amazing Elixir has become at machine learning. If you want to learn more, I can't recommend Sean Moriarity's book Machine Learning in Elixir enough. Concepts are explained in extremely straightforward language and there are loads of examples!

https://pragprog.com/titles/smelixir/machine-learning-in-eli...

jatins|1 year ago

Do the Elixir ML ecosystem libs (Nx, Axon) provide some sort of interop with the Python ecosystem?

For example, can I load or fine-tune a model pre-trained in PyTorch/JAX in Axon? Or does everything need to be written from the ground up in Elixir?

thibaut_barrere|1 year ago

Yes, this is getting quite exciting. There is cross-pollination of concepts going on (e.g. https://www.youtube.com/watch?v=RABXu7zqnT0, which shows a port of Python's Instructor library to https://github.com/thmsmlr/instructor_ex, https://hexdocs.pm/scholar/Scholar.html, etc.!).

That, coupled with LiveView and quite easy scaling in general, results in interesting opportunities.

jonvk|1 year ago

I'm guessing you mean that you would recommend it. You say you can't.

enraged_camel|1 year ago

It says the book is in beta. How complete/finished is it?

TonyHaenn|1 year ago

Nice writeup! Super interesting that we both took different paths, but ended up with similar latencies.

I built a real-time conversation platform in Elixir. I used the Membrane framework to coordinate amongst the STT, LLM and TTS steps. I also ended up with latency in the ~1300 ms range.

I found research that says the typical human response time is 250 to 300 ms [0] in a conversation, so I think that should be the goal.

For my solution, some of the things we did to get latency as low as possible:

1. We stream the audio to the STT endpoint. If you're transcribing as the audio comes in, then all you care about is the tail latency (the time between when the audio ends and the final transcript arrives). That helped a bunch for us. Google is around 200 ms with this approach.

2. GPT-3.5 still has a time to first token of ~350 to ~400 ms. I couldn't find a way around that. But you can stream those tokens to ElevenLabs and start getting audio faster, which helps.

3. ElevenLabs eats up most of the latency budget. Even with their turbo model, their latency is 600-800 ms according to my timings. Again, streaming the words in (not tokens) and calling flush seemed to help.
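To make the "words, not tokens" point concrete, here is a minimal sketch in Python (chosen only for illustration; the platform described here is Elixir). It regroups streamed token fragments into whole words before they would be handed to a TTS stream. The function name `words_from_tokens` and the token list are hypothetical, and no real LLM or ElevenLabs call is made.

```python
import re
from typing import Iterable, Iterator


def words_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Regroup streamed token fragments into whole words.

    A word is only emitted once whitespace shows it is complete, so the
    TTS side never receives half a word.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        # Everything before the last whitespace run is made of complete
        # words; the trailing piece may still grow with the next token.
        *complete, buffer = re.split(r"\s+", buffer)
        for word in complete:
            if word:
                yield word
    # End of stream: emit whatever is left (the analogue of a final flush).
    if buffer:
        yield buffer


# Hypothetical token stream, split the way a tokenizer might split it.
example_tokens = ["Hel", "lo", " wor", "ld", "!", " How", " are", " you", "?"]
example_words = list(words_from_tokens(example_tokens))
```

In a real pipeline each yielded word would be sent to the TTS connection as it appears, with a flush once the LLM stream ends.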

The key I found was to cover up the latency. We respond immediately with some filler audio. The trick was getting the LLM to be aware of the filler audio text and continue naturally from that point.
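As a rough back-of-the-envelope check on those numbers, the sketch below (Python, for illustration only) adds up the per-stage timings quoted above and compares them with the 250 to 300 ms human target. It assumes the three stages run strictly one after another with no extra network overhead, which is a simplification; the names `STAGES_MS`, `total_range`, and `filler_gap` are made up for this example.

```python
# Per-stage latency ranges in milliseconds, taken from the timings above.
STAGES_MS = {
    "stt_tail": (200, 200),         # streaming STT tail latency
    "llm_first_token": (350, 400),  # time to first token
    "tts_first_audio": (600, 800),  # TTS latency
}
HUMAN_TARGET_MS = (250, 300)  # typical human response time in conversation


def total_range(stages):
    """Sum the low and high ends, assuming the stages run sequentially."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def filler_gap(total, target):
    """Silence the filler audio has to cover, in the best and worst case."""
    best = total[0] - target[1]
    worst = total[1] - target[0]
    return best, worst


total = total_range(STAGES_MS)             # (1150, 1400)
gap = filler_gap(total, HUMAN_TARGET_MS)   # (850, 1150)
```

The 1150 to 1400 ms total brackets the ~1300 ms figure, and the roughly 850 to 1150 ms gap is what the filler audio has to hide.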

[0] https://journalofcognition.org/articles/10.5334/joc.268#

nojs|1 year ago

This matches my experience doing it with Elixir/OpenAI/ElevenLabs as well.

Depending on the application it’s also possible to fire the whole thing off pre-emptively, and then use the early response unless later context explicitly invalidates it.

Another cool trick to get around TTS latency is to maintain an audio cache keyed by semantic meaning, and get the LLM to choose from the cache. This saves high TTS API costs too.
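A minimal sketch of the cache idea, in Python for illustration only: clips are rendered once and stored under a short meaning key, the LLM is shown the list of keys, and a reply that matches a key is served from the cache instead of calling TTS again. The `AudioCache` class, its method names, and the `synthesize` callback are all hypothetical; a real version would plug in an actual TTS client and probably a fuzzier matching step.

```python
from typing import Callable, Dict


class AudioCache:
    """Pre-rendered audio clips keyed by a short label for their meaning."""

    def __init__(self, synthesize: Callable[[str], bytes]):
        # `synthesize` stands in for a real TTS call (text -> audio bytes).
        self._synthesize = synthesize
        self._texts: Dict[str, str] = {}
        self._clips: Dict[str, bytes] = {}

    def register(self, key: str, text: str) -> None:
        """Render a clip once, up front, and store it under `key`."""
        self._texts[key] = text
        self._clips[key] = self._synthesize(text)

    def prompt_menu(self) -> str:
        """Text to include in the LLM prompt so it can pick a cached clip."""
        lines = [f"- {key}: {text}" for key, text in self._texts.items()]
        header = "If one of these fits, reply with only its key:"
        return "\n".join([header, *lines])

    def resolve(self, llm_reply: str) -> bytes:
        """Serve a cached clip if the reply is a known key, else call TTS."""
        key = llm_reply.strip()
        if key in self._clips:
            return self._clips[key]
        return self._synthesize(llm_reply)
```

A cache hit costs no TTS call at all, which is where both the latency and the API cost savings come from.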

theflyinghorse|1 year ago

1.3s imo is a fine time frame to start actually speaking. Humans, well most of us anyway, don’t start speaking informative words right away. Instead we add in “umm”s, inhales, “mhm”s, “yeah…”s and so on. I think your approach is a good one. I’m now wondering: for these filler sounds, do you contextualize them somehow? That is, to make the filler feel more natural.

abrookewood|1 year ago

Slightly off-topic, but there isn't any way to tag other HN users, is there? Interested to see whether Sean could use any of your methods to improve his own approach.

birracerveza|1 year ago

Excellent article.

> Now, if you’re wondering if I spent $99 to save some milliseconds for a meaningless demo, the answer is absolutely yes I did.

Godspeed, soldier.

> I was very excited for this problem in particular because it’s literally the perfect application of Elixir and Phoenix. If you are building conversational agents, you should seriously consider giving Elixir a try. A large part of how quick this demo was to put together is because of how productive Elixir is.

Back in the pre-GPT era I built a chatbot with LiveView; it is a fantastic fit for assistants.

I might pick it up again, it was pretty fun.

recurser|1 year ago

Great write-up! I'm really interested in this area but have minimal experience, and I learnt a lot from this.

jasonjmcghee|1 year ago

Is ElevenLabs Turbo v2 faster than streaming OpenAI TTS?

Also, have you checked out XTTSv2 over StyleTTS2?

meatyapp|1 year ago

How does Elixir+Phoenix help with this sort of use case compared to just using Python or JavaScript? Thanks for any info!

ac_alejos|1 year ago

TLDR: The problem domain (telecom) fits Elixir perfectly

If you’re talking about scaling this, Elixir is built on the BEAM VM, which was originally made at Ericsson and is tailor-made for telecom systems.

Its whole paradigm is built around the concept of Let It Crash, which is basically about achieving fault tolerance through isolation and supervision.

So aside from the fact that Elixir+Phoenix is a productive framework that allowed the author to build this in a few days, it also means that it will scale very well with minimal code changes.

For reference, one of the solutions you might use to distribute this in Python is Celery, which commonly uses RabbitMQ as its broker. RabbitMQ is itself written in Erlang, the language whose VM Elixir runs on.

23 comments