eginhard | 2 years ago

STT training data deliberately includes all kinds of "noisy" speech so that the model learns to recognise speech under any conditions. TTS training data, by contrast, needs to be as clean as possible so that the model doesn't reproduce artefacts in its output, and such high-quality data is much harder to get. Simply reusing an STT dataset in the opposite direction to train TTS is therefore not really feasible, or at least requires filtering out much of the data.
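
To give a rough idea of what that filtering might look like, here is a minimal sketch that keeps only clips whose estimated signal-to-noise ratio clears a threshold. Everything in it is an assumption for illustration: the function names, the crude energy-based SNR estimate (quietest frames treated as background, loudest as speech), the 400-sample frames and the 30 dB cut-off are all hypothetical, and a real pipeline would likely need stricter checks (reverb, clipping, music, overlapping speakers).

```python
import math


def frame_energies(samples, frame_len=400):
    """Mean power of each non-overlapping frame of `frame_len` samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def estimate_snr_db(samples, frame_len=400, quantile=0.1):
    """Crude SNR estimate (an assumption, not a standard method):
    the quietest frames approximate the noise floor, the loudest approximate speech."""
    energies = sorted(frame_energies(samples, frame_len))
    if not energies:
        raise ValueError("clip is shorter than one frame")
    k = max(1, int(len(energies) * quantile))
    noise = sum(energies[:k]) / k
    speech = sum(energies[-k:]) / k
    eps = 1e-12  # avoids log(0) and division by zero on digital silence
    return 10.0 * math.log10((speech + eps) / (noise + eps))


def filter_for_tts(clips, min_snr_db=30.0):
    """Keep only clips whose estimated SNR is at least `min_snr_db` (hypothetical threshold)."""
    return {name: s for name, s in clips.items() if estimate_snr_db(s) >= min_snr_db}


def synthetic_clip(noise_amp, n=16000):
    """Toy clip: background only in the first half, a 220 Hz tone plus background in the second."""
    clip = []
    for i in range(n):
        noise = noise_amp if i % 2 == 0 else -noise_amp
        tone = 0.5 * math.sin(2 * math.pi * 220 * i / 16000) if i >= n // 2 else 0.0
        clip.append(tone + noise)
    return clip


clips = {"studio": synthetic_clip(0.001), "street": synthetic_clip(0.2)}
kept = filter_for_tts(clips)
print(sorted(kept))
```

On this toy input only the low-noise clip survives, which mirrors the point above: a threshold strict enough for TTS throws away most of what an STT corpus is happy to include.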
