HMMs haven't been state of the art in speech recognition for decades (i.e. since it actually got good). It's all end-to-end DNNs now. Basically raw input -> DNN -> text.
IshKebab|1 year ago
Well, almost anyway: last I checked, they feed a mel spectrogram into the model rather than raw audio samples.
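For anyone wondering what that front end looks like, here is a rough NumPy-only sketch of a log-mel spectrogram. The 16 kHz sample rate, 25 ms window, 10 ms hop, and 80 mel bands are assumed values for illustration, not settings taken from any specific model, and the function names are made up for this example.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel formula; other variants exist.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(samples, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
    # Assumed defaults: 25 ms Hann window, 10 ms hop, 80 mel bands.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(samples) - n_fft) // hop
    frames = np.stack([samples[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10)  # small floor avoids log(0)

# One second of a 440 Hz tone as stand-in audio.
t = np.arange(16000) / 16000.0
features = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

With these assumed settings, one second of audio becomes a (98, 80) array of frames by mel bands, and that array, not the 16,000 raw samples, is what would go into the network.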
pcwelder|1 year ago
> state of the art in speech recognition for decades
"Decades" doesn't sound right. Around 2019, the Jasper model was SOTA among e2e models but was still slightly behind a non-e2e model with an HMM component: https://arxiv.org/pdf/1904.03288