jsjmch | 2 years ago

See my previous comment about this point. ElevenLabs is based on Tortoise-TTS, which was already pre-trained on millions of hours of data, whereas this one was trained only on LibriTTS, which is 500 hours at best. XTTS was probably also trained on millions of speakers in more than 20 languages.

If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

lossolo | 2 years ago

> It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

It's really not that difficult; they are trained mostly on audiobooks and high-quality audio from YouTube videos. If we talk about the EV model, then we are talking about around 500k hours of audio, but Tortoise-TTS is only around 50k hours from what I remember.

wczekalski | 2 years ago

What's your basis for the claim that they are based on TorToiSe? I have seen this claim made (and rebutted) many times.

jsjmch | 2 years ago

Very similar features, quite slow inference speed, and various rumors.