I understand that, though I think significant speedups can be useful at multiple levels. For example, I'm using either the base or small model with beam size 1 in faster-whisper for real-time dictation on a laptop CPU (Ryzen 4500U). The recognition time is just that bit too high with a larger beam size, and much too high with the medium model. So if these models offer a decent speedup, it means I can either increase the beam size or go up a model size, which should give a good improvement in accuracy. With real-time dictation I find even small errors quite annoying to deal with, so any improvement in accuracy is really useful.

At a larger scale, an exercise to transcribe a back catalogue of audio might need a $1000 GPU at current model speeds to get the job done in a reasonable time. With models that run 6x faster, a $200 GPU might be sufficient. That could be quite a significant saving for a small company or charity etc.
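To make the hardware argument concrete, here's a back-of-envelope sketch. All the numbers (catalogue size, real-time factors for the two hypothetical GPUs) are illustrative assumptions, not benchmarks; the point is just how a 6x model speedup moves a cheap card inside a fixed wall-clock budget.

```python
# Back-of-envelope: how a 6x speedup changes the GPU needed to
# transcribe a back catalogue within a fixed wall-clock budget.
# All numbers below are illustrative assumptions, not measurements.

def gpu_hours_needed(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock hours to transcribe `audio_hours` of audio, given a
    real-time factor (seconds of audio processed per wall second)."""
    return audio_hours / realtime_factor

catalogue_hours = 5000            # assumed back-catalogue size
budget_hours = 24 * 14            # two weeks of wall-clock time

fast_gpu_rtf = 20.0               # assumed RTF of a ~$1000 card, original model
cheap_gpu_rtf = 4.0               # assumed RTF of a ~$200 card, original model

# Original model: only the expensive card fits the budget.
print(gpu_hours_needed(catalogue_hours, fast_gpu_rtf))      # 250 h  (fits)
print(gpu_hours_needed(catalogue_hours, cheap_gpu_rtf))     # 1250 h (doesn't)

# 6x-faster distilled model: the cheap card now fits too.
print(gpu_hours_needed(catalogue_hours, cheap_gpu_rtf * 6))  # ~208 h (fits)
```

Same shape of argument applies to the dictation case: a 6x speedup buys headroom that can be spent on a wider beam or a bigger model instead of raw latency.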
regularfry|2 years ago
That being said, even with this distillation there's still the issue that Whisper isn't really designed for streaming: it's fairly simplistic and always works on fixed 30-second windows. I was expecting there to be some sort of useful transform you could apply to the model to avoid quite so much reprocessing per frame, but other than https://github.com/mit-han-lab/streaming-llm (which I'm not even sure directly helps) I haven't seen anything out there.
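A minimal sketch of the reprocessing problem being described: because the model only accepts fixed 30 s inputs, a naive streaming loop has to re-run the encoder over the whole (zero-padded) window every time a little new audio arrives. The `transcribe` function here is a hypothetical stand-in that just counts samples pushed through the model, not a real Whisper call.

```python
# Why naive "streaming" with a fixed-window model is wasteful.
# transcribe() is a stub standing in for one Whisper forward pass;
# it only counts how many samples the encoder would have to process.

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 30 * SAMPLE_RATE   # Whisper's fixed 30 s input window

processed_samples = 0

def transcribe(window):
    """Hypothetical stand-in for one forward pass over a 30 s window."""
    global processed_samples
    assert len(window) == WINDOW_SAMPLES
    processed_samples += len(window)

def naive_stream(total_seconds: int, step_seconds: int) -> int:
    """Feed audio in `step_seconds` increments, re-running the model on
    the zero-padded last 30 s each time. Returns samples processed."""
    global processed_samples
    processed_samples = 0
    buffer = []
    for _ in range(total_seconds // step_seconds):
        buffer.extend([0.0] * (step_seconds * SAMPLE_RATE))  # new audio
        window = buffer[-WINDOW_SAMPLES:]                    # last 30 s
        window = window + [0.0] * (WINDOW_SAMPLES - len(window))  # pad
        transcribe(window)
    return processed_samples

# 60 s of audio fed in 1 s steps: 60 full 30 s forward passes,
# i.e. 30x more samples through the encoder than actually arrived.
work = naive_stream(60, 1)
print(work // (60 * SAMPLE_RATE))  # 30
```

A KV-cache-style trick like streaming-llm would aim to reuse computation between those overlapping windows instead of redoing it, but it isn't obvious it maps cleanly onto Whisper's fixed-window encoder.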