tylerneylon | 1 year ago
It's a fair question, and I don't have all the answers. But for this question, there might be training data available from everyday human conversations. For example, we could use a speech-to-text model that's able to distinguish speakers, and look for points where one person decided to start speaking (that would be training data for when to switch modes). Ideally, the speech-to-text model would be able to include text even when both people spoke at once (this would provide more realistic and complete training data).
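A minimal sketch of how that mining step might look. Everything here is hypothetical: the segment format `(speaker, start, end, text)` is an assumed output shape for a diarizing speech-to-text model, not any real API. The idea is just to find speaker-change points and record how quickly the new speaker jumped in.

```python
def turn_switch_examples(segments):
    """Given time-ordered (speaker, start, end, text) segments,
    return (context_text, gap_seconds) pairs at each speaker change.
    A small or negative gap means the new speaker started quickly
    (or overlapped) -- a natural positive example for "now is a
    good time to start talking."
    """
    examples = []
    for prev, cur in zip(segments, segments[1:]):
        if cur[0] != prev[0]:  # speaker changed
            gap = cur[1] - prev[2]  # new speaker's start minus previous end
            examples.append((prev[3], gap))
    return examples

# Toy diarized transcript; speaker B briefly overlaps speaker A.
transcript = [
    ("A", 0.0, 2.1, "So I was thinking we could"),
    ("B", 2.0, 3.5, "Oh, about the demo?"),
    ("A", 4.0, 6.0, "Yes, exactly."),
]
print(turn_switch_examples(transcript))
```

With overlap-aware transcription, the negative gap on the first switch is exactly the "both people spoke at once" signal mentioned above.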
I've noticed that the audio mode in ChatGPT's app is good at noticing when I'm done speaking to it, and it reacts accurately enough that I suspect it's more sophisticated than "wait for silence." If there is a "notice the end of speaking" model - which is not a crazy assumption - then I can imagine a slightly more complicated model that notices a combination of "now is a good time to talk + I have something to say."
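One toy way to picture that combination, assuming the two signals come out as probabilities (the function and threshold here are invented for illustration, not how ChatGPT actually works):

```python
def should_speak(p_good_time, p_have_content, threshold=0.5):
    """Decide whether to start talking.

    p_good_time:    model's estimate that now is a good moment to speak
    p_have_content: model's estimate that it has something worth saying

    Treating the two signals as independent, require their joint
    probability to clear a threshold before switching to speaking.
    """
    return p_good_time * p_have_content > threshold

print(should_speak(0.9, 0.8))  # confident on both: speak
print(should_speak(0.9, 0.3))  # good moment, but nothing to say: stay quiet
```

In practice the two estimates would come from one jointly trained model rather than a literal product of probabilities, but the AND-like structure is the point.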