gropo | 1 month ago

Kokoro is better for TTS by far.

For voice cloning, PocketTTS is walled off, so I can't tell.

echelon|1 month ago

What are the advantages of PocketTTS over Kokoro?

It seems like Kokoro is the smaller model, also runs on CPU in real time, and is more open and fine-tunable. It has more scripts, extensions, etc., whereas this is new and doesn't have any fine-tuning code yet.

I couldn't tell an audio quality difference.

hexaga|1 month ago

Kokoro is fine-tunable? Speaking as someone who went down the rabbit hole... it's really not. There's no training code available (as of the last time I checked), so you need to reverse engineer everything. Beyond that, the model is not good at doing voices outside the existing voicepacks: simply put, it isn't a foundation model trained on internet-scale data. It is made from a relatively small set of focused, synthetic voice data. So, a very narrow distribution to work with. Going OOD (out of distribution) immediately tanks perceptual quality.

There's a bunch of inference stuff though, which is cool I guess. And it really is a quite nice little model in its niche. But let's not pretend there aren't huge tradeoffs in the design: synthetic data, phonemization, lack of train code, sharp boundary effects, etc.

jamilton|1 month ago

Being able to voice clone with PocketTTS seems major; it doesn't look like there's any support for that with Kokoro.

jhatemyjob|1 month ago

Less licensing headache, it seems. Kokoro says it's Apache licensed. But it has eSpeak-NG as a dependency, which is GPL, which calls into question whether Kokoro is actually GPL. PocketTTS doesn't have eSpeak-NG as a dependency, so you don't need to worry about all that BS.

Btw, I would love to hear from someone (who knows what they're talking about) to clear this up for me. Dealing with potential GPL contamination is a nightmare.

seunosewa|1 month ago

Chatterbox-turbo is really good too. It has a version that uses Apple's GPU.