TTS and STT models have decent support for streaming in chunks, but accuracy drops as the chunk size shrinks. Current LLMs are pretty limited in their ability to handle streaming inputs due to attention window constraints. There is some emerging research into attention sinks and caching initial tokens that looks promising, but I don't think we're quite there yet.
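For anyone curious, the attention-sink idea (from the StreamingLLM line of work) boils down to a KV-cache eviction policy: keep the first few tokens permanently as "sinks" plus a sliding window of recent tokens, and drop everything in between. Here's a minimal sketch of just that eviction rule; the function name and parameters are my own for illustration, not any real library's API:

```python
# Sketch of an attention-sink eviction policy: retain the first
# `num_sinks` token positions forever, plus a sliding `window` of
# the most recent positions, evicting everything in the middle.

def evict(positions, num_sinks=4, window=8):
    """Return the token positions kept in the KV cache."""
    if len(positions) <= num_sinks + window:
        return list(positions)  # cache still fits, keep everything
    return list(positions[:num_sinks]) + list(positions[-window:])

# After 20 streamed tokens, the cache holds the 4 sink tokens
# plus the 8 most recent ones:
print(evict(list(range(20))))
```

The surprising empirical result is that keeping those initial tokens around stabilizes generation far beyond the trained context length, because softmax attention tends to dump "excess" probability mass onto the earliest positions.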