tylerneylon | 1 year ago
It's a fair question, and I don't have all the answers. But for this question, there might be training data available from everyday human conversations. For example, we could use a speech-to-text model that's able to distinguish speakers, and look for points where one person decided to start speaking (that would be training data for when to switch modes). Ideally, the speech-to-text model would be able to include text even when both people spoke at once (this would provide more realistic and complete training data).
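A minimal sketch of how that mining step might look. Everything here is hypothetical: the segment format `(speaker, start, end, text)` is an assumed output shape for a diarizing speech-to-text model, not any real API. The idea is just to find speaker-change points and record how quickly the new speaker jumped in.

```python
def turn_switch_examples(segments):
    """Given time-ordered (speaker, start, end, text) segments,
    return (context_text, gap_seconds) pairs at each speaker change.
    A small or negative gap means the new speaker started quickly
    (or overlapped) -- a natural positive example for "now is a
    good time to start talking."
    """
    examples = []
    for prev, cur in zip(segments, segments[1:]):
        if cur[0] != prev[0]:  # speaker changed
            gap = cur[1] - prev[2]  # new speaker's start minus previous end
            examples.append((prev[3], gap))
    return examples

# Toy diarized transcript; speaker B briefly overlaps speaker A.
transcript = [
    ("A", 0.0, 2.1, "So I was thinking we could"),
    ("B", 2.0, 3.5, "Oh, about the demo?"),
    ("A", 4.0, 6.0, "Yes, exactly."),
]
print(turn_switch_examples(transcript))
```

With overlap-aware transcription, the negative gap on the first switch is exactly the "both people spoke at once" signal mentioned above.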
I've noticed that the audio mode in ChatGPT's app is good at noticing when I'm done speaking to it, and it reacts accurately enough that I suspect it's more sophisticated than "wait for silence." If there is a "notice the end of speaking" model - which is not a crazy assumption - then I can imagine a slightly more complicated model that notices a combination of "now is a good time to talk + I have something to say."
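One toy way to picture that combination, assuming the two signals come out as probabilities (the function and threshold here are invented for illustration, not how ChatGPT actually works):

```python
def should_speak(p_good_time, p_have_content, threshold=0.5):
    """Decide whether to start talking.

    p_good_time:    model's estimate that now is a good moment to speak
    p_have_content: model's estimate that it has something worth saying

    Treating the two signals as independent, require their joint
    probability to clear a threshold before switching to speaking.
    """
    return p_good_time * p_have_content > threshold

print(should_speak(0.9, 0.8))  # confident on both: speak
print(should_speak(0.9, 0.3))  # good moment, but nothing to say: stay quiet
```

In practice the two estimates would come from one jointly trained model rather than a literal product of probabilities, but the AND-like structure is the point.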