nshm | 2 years ago

Err, I deeply respect the Amazon TTS team, but this paper and its synthesis samples are... You publish a paper in 2024 and include YourTTS in your baselines just to look better? Come on! XTTS2 is out there!

The voice sounds robotic and flat. Most likely the training data has a lot of audiobooks and little conversational speech. And dropping diffusion was not a great idea: the voice is no longer crystal clear and sounds more like a telephony recording.

thorum | 2 years ago

XTTS2 is great, but this model looks more consistent in its output and seems to have a better grasp of meaning in long texts.