Yes, it's not clear to me either which test sets produce a 10% error rate. In my use (native English dictation or native English podcast transcription), the small or medium original Whisper models have what I'll call a "discrepancy" rate of, say, 1-2%, which is mostly punctuation and whether "umms/errs" are included or not. The actual "error" rate is below 1% in my experience, and if you exclude surnames, brands and place names that I don't know how to spell either, the remaining errors tend to be minor (a missed plural, etc.).

So I infer that these data sets are deliberately difficult audio: call centre recordings with lots of background noise, phone-line quality audio, etc. Maybe non-native speakers. If I only heard that sort of audio once, I might also have an error rate of 10%.
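To make the "discrepancy" vs "error" distinction concrete, here is a rough sketch of how a word error rate could be scored with and without normalizing away punctuation, casing and fillers. This is only an illustration under my own assumptions: the `FILLERS` list, the `tokens` and `wer` helpers, and the example sentences are all made up here, and it is not how any particular benchmark actually scores its transcripts.

```python
import re

# Hypothetical filler list; real evaluations use their own normalizers.
FILLERS = {"um", "umm", "uh", "er", "err", "erm"}

def tokens(text, normalize=False):
    """Split text into words, optionally lowercasing, stripping
    punctuation and dropping filler words."""
    if not normalize:
        return text.split()
    words = re.sub(r"[^\w\s']", " ", text.lower()).split()
    return [w for w in words if w not in FILLERS]

def wer(reference, hypothesis, normalize=False):
    """Word error rate: word-level edit distance divided by the
    number of reference words."""
    ref = tokens(reference, normalize)
    hyp = tokens(hypothesis, normalize)
    if not ref:
        raise ValueError("reference has no words")
    # Classic Levenshtein distance, computed row by row over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# Made-up example: one real mistake ("models" -> "model"); everything
# else is punctuation, casing or a dropped "um".
reference = "Um, so we shipped the new models on Friday."
hypothesis = "So we shipped the new model on Friday"

raw_rate = wer(reference, hypothesis)                    # 4 of 9 words differ
clean_rate = wer(reference, hypothesis, normalize=True)  # 1 of 8 words differs
```

On this toy pair the raw score is about 44% while the normalized score is 12.5%, so the scoring convention alone can move the number a lot before audio difficulty even comes into it.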