I understand that, though I think significant speedups can be useful at multiple levels. For example, I'm using either the base or small model with beam size 1 in faster-whisper for real-time dictation on a laptop CPU (Ryzen 4500U). The recognition time is just that bit too high with a larger beam size, and much too high with the medium model. So if these models offer a decent speedup, it means I can either increase the beam size or go up a model size, which should give a good improvement in accuracy. With real-time dictation I find even small errors quite annoying to deal with, so any improvement in accuracy is really useful.

At a larger scale, an exercise to transcribe a back catalogue of audio might need a $1000 GPU at current model speeds to get the job done in a reasonable time. With models that run 6x faster, a $200 GPU might be sufficient. That could be quite a significant saving for a small company or charity etc.
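To make the hardware argument concrete, here's a back-of-envelope sketch. All the numbers (catalogue size, real-time factors for the two hypothetical GPUs) are illustrative assumptions, not benchmarks; the point is just how a 6x model speedup moves a cheap card inside a fixed wall-clock budget.

```python
# Back-of-envelope: how a 6x speedup changes the GPU needed to
# transcribe a back catalogue within a fixed wall-clock budget.
# All numbers below are illustrative assumptions, not measurements.

def gpu_hours_needed(audio_hours: float, realtime_factor: float) -> float:
    """Wall-clock hours to transcribe `audio_hours` of audio, given a
    real-time factor (seconds of audio processed per wall second)."""
    return audio_hours / realtime_factor

catalogue_hours = 5000            # assumed back-catalogue size
budget_hours = 24 * 14            # two weeks of wall-clock time

fast_gpu_rtf = 20.0               # assumed RTF of a ~$1000 card, original model
cheap_gpu_rtf = 4.0               # assumed RTF of a ~$200 card, original model

# Original model: only the expensive card fits the budget.
print(gpu_hours_needed(catalogue_hours, fast_gpu_rtf))      # 250 h  (fits)
print(gpu_hours_needed(catalogue_hours, cheap_gpu_rtf))     # 1250 h (doesn't)

# 6x-faster distilled model: the cheap card now fits too.
print(gpu_hours_needed(catalogue_hours, cheap_gpu_rtf * 6))  # ~208 h (fits)
```

Same shape of argument applies to the dictation case: a 6x speedup buys headroom that can be spent on a wider beam or a bigger model instead of raw latency.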
regularfry|2 years ago
That being said, even with this distillation there's still the issue that Whisper isn't really designed for streaming: it's fairly simplistic and always works on fixed 30-second windows. I was expecting there to be some sort of useful transform you could apply to the model to avoid quite so much reprocessing per frame, but other than https://github.com/mit-han-lab/streaming-llm (which I'm not even sure directly helps) I haven't seen anything out there.
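A minimal sketch of the reprocessing problem being described: because the model only accepts fixed 30 s inputs, a naive streaming loop has to re-run the encoder over the whole (zero-padded) window every time a little new audio arrives. The `transcribe` function here is a hypothetical stand-in that just counts samples pushed through the model, not a real Whisper call.

```python
# Why naive "streaming" with a fixed-window model is wasteful.
# transcribe() is a stub standing in for one Whisper forward pass;
# it only counts how many samples the encoder would have to process.

SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 30 * SAMPLE_RATE   # Whisper's fixed 30 s input window

processed_samples = 0

def transcribe(window):
    """Hypothetical stand-in for one forward pass over a 30 s window."""
    global processed_samples
    assert len(window) == WINDOW_SAMPLES
    processed_samples += len(window)

def naive_stream(total_seconds: int, step_seconds: int) -> int:
    """Feed audio in `step_seconds` increments, re-running the model on
    the zero-padded last 30 s each time. Returns samples processed."""
    global processed_samples
    processed_samples = 0
    buffer = []
    for _ in range(total_seconds // step_seconds):
        buffer.extend([0.0] * (step_seconds * SAMPLE_RATE))  # new audio
        window = buffer[-WINDOW_SAMPLES:]                    # last 30 s
        window = window + [0.0] * (WINDOW_SAMPLES - len(window))  # pad
        transcribe(window)
    return processed_samples

# 60 s of audio fed in 1 s steps: 60 full 30 s forward passes,
# i.e. 30x more samples through the encoder than actually arrived.
work = naive_stream(60, 1)
print(work // (60 * SAMPLE_RATE))  # 30
```

A KV-cache-style trick like streaming-llm would aim to reuse computation between those overlapping windows instead of redoing it, but it isn't obvious it maps cleanly onto Whisper's fixed-window encoder.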