Spiwux | 2 years ago
The user speaks, and speech-to-text starts streaming text while the user is still speaking. That text stream is piped into an LLM, which also streams its output text. That output is streamed to text-to-speech, which likewise generates audio in a streaming manner.
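A rough sketch of that three-stage streaming chain, using plain Python generators. Every name here (`microphone`, `speech_to_text`, `llm`, `text_to_speech`, `run`, `events`) is a hypothetical stand-in, not a real speech or model API, and the toy "LLM" just uppercases each word as it arrives. It only illustrates the plumbing: each stage yields output as soon as it has input, so playback can begin before the speaker is done.

```python
from typing import Iterable, Iterator, List

# Records the order in which work happens, to make the interleaving visible.
events: List[str] = []


def microphone(words: List[str]) -> Iterator[bytes]:
    """Hypothetical audio source: one chunk per spoken word."""
    for word in words:
        events.append(f"mic:{word}")
        yield word.encode("utf-8")


def speech_to_text(audio: Iterable[bytes]) -> Iterator[str]:
    """Stand-in for a streaming recognizer: emits text per chunk."""
    for chunk in audio:
        yield chunk.decode("utf-8")


def llm(text: Iterable[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: emits a token per incoming word."""
    for word in text:
        yield word.upper()


def text_to_speech(tokens: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for streaming synthesis: emits an audio clip per token."""
    for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")


def run(words: List[str]) -> List[bytes]:
    """Wire the stages together and 'play' clips as they arrive."""
    events.clear()
    played: List[bytes] = []
    stream = text_to_speech(llm(speech_to_text(microphone(words))))
    for clip in stream:
        events.append(f"play:{clip.decode('utf-8')}")
        played.append(clip)
    return played


if __name__ == "__main__":
    run(["hello", "world"])
    print(events)
```

Because generators are lazy, the first clip is played before the second microphone chunk is even read. A real LLM would not respond word for word like this, but the same pull-based wiring applies.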
modeless|2 years ago
The speech recognition part needs work for sure, but when it works you can see the potential. It's very different from the way it feels to talk to Siri or even ChatGPT's voice mode. It won't be long before we are having real conversations with our computers.
bjelkeman-again|2 years ago
3abiton|2 years ago
evilantnie|2 years ago
everforward|2 years ago
It wasn't really streamed, though. Audio input was buffered and fully transcribed to a string, which was then fed into the LLM, and the LLM's full text output was converted back to audio.
The Whisper speech-to-text was pretty close to real-time; the LLM was not. I was barely scraping by on hardware specs, though.
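For illustration, the buffered flow looked roughly like the sketch below. The function names (`transcribe`, `generate`, `synthesize`, `respond`) are hypothetical placeholders rather than Whisper or any LLM library's actual API, and the bodies are toy stand-ins. The point is only that each step waits for the complete output of the previous one.

```python
from typing import Iterable, List


def transcribe(audio_chunks: List[bytes]) -> str:
    """Stand-in for speech-to-text over the whole recording."""
    return " ".join(chunk.decode("utf-8") for chunk in audio_chunks)


def generate(prompt: str) -> str:
    """Stand-in for an LLM call that returns the complete reply at once."""
    return prompt.upper()


def synthesize(text: str) -> bytes:
    """Stand-in for text-to-speech over the complete reply."""
    return f"<audio:{text}>".encode("utf-8")


def respond(audio_chunks: Iterable[bytes]) -> bytes:
    """Buffered pipeline: nothing starts until the prior step has finished."""
    buffered = list(audio_chunks)   # wait for the speaker to stop
    text = transcribe(buffered)     # full transcript as one string
    reply = generate(text)          # full LLM reply as one string
    return synthesize(reply)        # one audio clip for the whole reply


if __name__ == "__main__":
    print(respond([b"hello", b"world"]))
```

With this shape, total latency is the sum of all three steps, so a slow LLM dominates the wait even when transcription keeps up.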
canadiantim|2 years ago
zaptrem|2 years ago
adroitboss|2 years ago
fudged71|2 years ago
WiSaGaN|2 years ago