opprobium | 1 year ago

You are missing the speech decoding part. I can't speak to why the clients you were working on were doing what they were doing. For a different reference point, see the cloud streaming API.

This is a good public reference: https://research.google/blog/an-all-neural-on-device-speech-...

Possible confusions from that doc: "RNN-T" is entirely orthogonal to RNNs (and it is not the only streamable model). Attention is also orthogonal to streaming: a chunked or sliding-window attention can stream, while a bi-directional RNN cannot. Streaming also means different things for the encoder and for the decoder.
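To make the chunked-attention point concrete, here's a minimal sketch (my own illustration, not from any of the linked systems) of the attention mask that makes a self-attention encoder streamable: each position may attend to its own chunk plus a bounded number of past chunks, and never to the future, so output for a chunk can be emitted as soon as that chunk's frames arrive.

```python
def chunked_attention_mask(seq_len, chunk_size, left_chunks=1):
    """Boolean mask: mask[q][k] is True where query position q may attend
    to key position k. Each position sees its own chunk plus `left_chunks`
    chunks of history, and no future chunk -- a full bidirectional mask
    (all True) would force waiting for the whole utterance instead."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(seq_len):
            diff = q // chunk_size - k // chunk_size
            mask[q][k] = 0 <= diff <= left_chunks
    return mask

mask = chunked_attention_mask(8, chunk_size=2, left_chunks=1)
# position 5 (chunk 2) may see chunk 1 and chunk 2, but nothing in chunks 0 or 3
```

Larger `chunk_size` or more `left_chunks` improves accuracy but directly adds to the latency the comment describes below.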

At a practical level, if a model is fast enough and VAD is doing an adequate job, you can get something that looks like "streaming" with a non-streaming model. Conversely, if a streaming model has tons of look-ahead or a very large input chunk size, its latency may not feel much better.
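The VAD-gated pattern above can be sketched as follows. This is a toy energy-based VAD of my own (a stand-in for a real one such as WebRTC's); the point is structural: nothing reaches the recognizer until a long-enough gap closes a segment, which is exactly why this only *looks* like streaming.

```python
import math

def vad_segments(samples, frame_len=160, threshold=0.01, max_silence_frames=25):
    """Toy energy VAD: group consecutive voiced frames into segments and
    close a segment after `max_silence_frames` quiet frames. Each closed
    segment would then be handed to a non-streaming recognizer."""
    segments, start, silence = [], None, 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            if start is None:
                start = i * frame_len   # open a new speech segment
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= max_silence_frames:  # gap long enough: close it
                segments.append((start, i * frame_len))
                start, silence = None, 0
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments

# synthetic input: 0.5 s tone, 0.5 s silence, 0.5 s tone at 16 kHz
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 2)]
segs = vad_segments(tone + [0.0] * (sr // 2) + tone)
```

With `max_silence_frames=25` (0.25 s here), the recognizer sees nothing until each gap elapses; when speakers don't leave clean gaps, the segmenter either cuts mid-sentence or never fires, which is the failure mode described below.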

Where the difference is sharp is where VAD is not adequate: users speak in continuous streams of audio, leave unusual gaps within sentences, and run sentences together. A non-streaming system either hurts quality, because sentences (or even words) get broken up that shouldn't be, or has to wait forever and never gets a chance to run, when a streaming system would already have been producing output.

And to your points about echo cancellation and interference: there are many text-only operations that benefit from being able to start early in the audio stream rather than late.

I just went through the process of helping someone stand up an interactive system with whisper etc., and the lack of an open-source, whisper-quality streaming system is such a bummer, because the result is so much laggier than it has to be.
