Show HN: Open source framework OpenAI uses for Advanced Voice
266 points | russ | 1 year ago | github.com
The goal is to give everyone access to the same stack that underpins Advanced Voice in the ChatGPT app.
Under the hood it works like this:
- A user's speech is captured by a LiveKit client SDK in the ChatGPT app
- Their speech is streamed using WebRTC to OpenAI's voice agent
- The agent relays the speech prompt over websocket to GPT-4o
- GPT-4o runs inference and streams speech packets (over websocket) back to the agent
- The agent relays generated speech using WebRTC back to the user's device
The Realtime API that OpenAI launched is the websocket interface to GPT-4o. This backend framework covers the voice agent portion. Besides having additional logic like function calling, the agent fundamentally proxies WebRTC to websocket.
The reason for this is that websocket isn't the best choice for client-server communication. The vast majority of packet loss occurs on the last hop between the server and the client device, and websocket provides no programmatic control or intervention in lossy network environments like WiFi or cellular. Packet loss leads to higher latency and choppy or garbled audio.
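To make the packet-loss point concrete, here's a toy jitter-buffer sketch in Python. It is not LiveKit's or WebRTC's implementation (real stacks add NACK, FEC, and adaptive buffering); the frame size and class names are made up for illustration. It shows the basic mechanism a media transport gives you that a raw websocket doesn't: reordering late packets and concealing missing ones instead of stalling.

```python
# Toy jitter buffer: reorder out-of-order audio packets and conceal
# gaps with silence. Frame size and names are illustrative only.

SILENCE = b"\x00" * 160  # one frame of 16-bit mono silence (hypothetical size)

class JitterBuffer:
    def __init__(self) -> None:
        self.pending = {}   # sequence number -> payload
        self.next_seq = 0   # next sequence number due for playback

    def push(self, seq: int, payload: bytes) -> None:
        """Accept a packet from the network; drop it if it arrived too late."""
        if seq >= self.next_seq:
            self.pending[seq] = payload

    def pop(self) -> bytes:
        """Return the next frame for playback, substituting silence for loss."""
        frame = self.pending.pop(self.next_seq, SILENCE)
        self.next_seq += 1
        return frame

buf = JitterBuffer()
buf.push(0, b"frame0")
buf.push(2, b"frame2")                     # packet 1 lost or still in flight
out = [buf.pop() for _ in range(3)]
# out[1] is a silence frame: playback continues instead of stalling
```

Over a plain websocket (TCP), that lost packet would instead block everything behind it until retransmission, which is where the added latency comes from.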
racecar789|1 year ago
Or, have the app call a pharmacy every month to refill prescriptions. For some drugs, the pharmacy requires a manual phone call to refill which gets very annoying.
So many use cases for this.
TZubiri|1 year ago
The IRS is notorious for resistance to tech change, don't be surprised if they unplug the phones and force you to walk in to ask your question.
What is the value add here? Saving some time for technocrats and techno-adjacents for a whole 3 years before the victims of spam adapt?
Also, this has been solved already: just mail your question like the rest of us mortals.
beeboobaa3|1 year ago
The point of phone lines is to waste the client's time. Not to have the client waste their time.
fosheezy|1 year ago
throw14082020|1 year ago
OpenAI hired the ex-fractional CTO of LiveKit, who created Pion, a popular WebRTC library/tool.
I'd expect OpenAI to migrate off of LiveKit within 6 months. LiveKit is too expensive. Also, WebRTC is hard, and OpenAI now being a less open company will want to keep improvements to itself.
Not affiliated with any competitors, but I did work at a PaaS company similar to LiveKit that used WebSockets instead.
fidotron|1 year ago
Most of it is open source, especially the clients, although they do feel quite ad hoc, hacked together (a possible side effect of WebRTC's evolution).
Would totally agree on OpenAI moving away. The description of the agent here sounds like a big hack to get around the fact that, for now, the model server expects audio over websockets instead.
russ|1 year ago
Fractional CTO sounds like a disaster lol
pj_mukh|1 year ago
Does the pricing break down to be the same as having an OpenAI Advanced Voice socket open the whole time? It's like $9/hr!
It would theoretically be cheaper to use this without keeping the Advanced Voice socket open the whole time: just use the GPT-4o streaming service [1] whenever inference is needed (pay per token) and use LiveKit's other components to do the rest (TTS, VAD, etc.).
What's the trade off here?
[1]: https://platform.openai.com/docs/api-reference/streaming
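The modular pipeline the parent describes can be sketched as below. The stub functions are placeholders standing in for real STT/LLM/TTS providers, not LiveKit's or OpenAI's APIs; the point is only the shape of the flow, where the (pay-per-token) LLM call happens once per turn rather than over a continuously open speech socket.

```python
# Sketch of a modular voice pipeline: STT -> LLM -> TTS.
# All three components below are stubs for illustration; swap in real
# providers in practice.

def stt(audio: bytes) -> str:
    """Stub speech-to-text: pretend the audio decodes to a transcript."""
    return audio.decode()

def llm(prompt: str) -> str:
    """Stub LLM call: the pay-per-token inference happens only here."""
    return f"echo: {prompt}"

def tts(text: str) -> bytes:
    """Stub text-to-speech: pretend to synthesize reply audio."""
    return text.encode()

def handle_turn(user_audio: bytes) -> bytes:
    """One conversational turn: inference runs only when speech arrives,
    instead of billing for an always-open speech-to-speech socket."""
    transcript = stt(user_audio)
    reply = llm(transcript)
    return tts(reply)

print(handle_turn(b"what's the weather?"))
```

The trade-off versus end-to-end speech-to-speech is latency (three sequential hops) and the loss of prosody/emotion the speech-native model captures.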
davidz|1 year ago
However, we are working on turn detection within the framework, so you won't have to send silence to the model when the user isn't talking. It's a fairly straightforward path to cutting the cost by ~50%.
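The idea above can be illustrated with a toy energy-threshold gate: only frames with speech-like energy get forwarded to (and billed by) the model. This is not the framework's actual turn-detection implementation; the threshold and frame representation are made up for the demo.

```python
# Toy voice-activity gate: forward a frame to the model only when its
# RMS energy exceeds a threshold, so pure silence is never sent or billed.
# Threshold and frame format are arbitrary, for illustration only.

import math

THRESHOLD = 500  # RMS cutoff on 16-bit samples (arbitrary demo value)

def rms(frame: list[int]) -> float:
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def frames_to_send(frames: list[list[int]]) -> list[list[int]]:
    """Keep only frames whose energy suggests the user is talking."""
    return [f for f in frames if rms(f) > THRESHOLD]

silence = [0] * 160      # a frame of pure silence
speech = [3000] * 160    # a loud, speech-like frame
sent = frames_to_send([silence, speech, silence, speech])
# Half the frames are dropped here, matching the rough ~50% saving
# when half of a conversation is the user listening, not talking.
```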
npace12|1 year ago
solarkraft|1 year ago
By the way: The cerebras voice demo also uses LiveKit for this: https://cerebras.vercel.app/
russ|1 year ago
FanaHOVA|1 year ago
russ|1 year ago
shayps|1 year ago
spuz|1 year ago
Ey7NFZ3P0nzAe|1 year ago
There's also llama-omni and a few others. None of them are even close to 4o from an LLM standpoint. But Moshi is billed as a "foundational" model, and I'm hopeful it will be enhanced. There's also no support for these yet in most backends like llama.cpp / ollama, etc. So I'd say we're in a trough, but we'll get there.
russ|1 year ago
Their model builds a speech-to-speech layer into Llama. Last I checked they have the audio-in part working and they’re working on the audio-out piece.
0x1ceb00da|1 year ago
mycall|1 year ago
davidz|1 year ago
0x1ceb00da|1 year ago
But when I asked Advanced Voice mode, it said the exact opposite: that it receives input as audio and generates text as output.
mbrock|1 year ago
meiraleal|1 year ago
unknown|1 year ago
[deleted]
gastonmorixe|1 year ago
There's a general consensus that the new Realtime API is not actually using the same Advanced Voice model/engine (or however it works), since at least the TTS part doesn't seem to be as capable as the one shipped with the official OpenAI app.
Any idea on this?
Source: https://github.com/openai/openai-realtime-api-beta/issues/2
russ|1 year ago
One thing to note: there is no separate TTS phase here. It happens internally within GPT-4o, in both the Realtime API and Advanced Voice.
lolpanda|1 year ago
willsmith72|1 year ago
russ|1 year ago