pyryt | 2 years ago

Knowing when to speak is actually a prediction task in itself. See, e.g., https://arxiv.org/abs/2010.10874

It would indeed be great to get something like this integrated with Whisper, an LLM, and TTS.

It's hard for me to imagine that this could be solved in text space. I think the prediction task needs to be done on the audio.

stiffler01 | 2 years ago

We thought about doing this in Whisper itself, since it's already working in the audio space.

stiffler01 | 2 years ago

Yes, this is something we want to look into in more detail. We really appreciate you sharing the research.