This generic answer from Wikipedia is not very helpful in this context. Zero-shot voice cloning in TTS usually means that no data from the target speaker (the voice you want the generated speech to sound like) needs to be included in the training data of the TTS model. In other words, you can provide an audio sample of the target speaker together with the text to be spoken, and the model generates audio that sounds as if it were spoken by that speaker.
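A minimal sketch of the inference interface described above, assuming a hypothetical `ZeroShotTTS` class (these names and the stub body are illustrative, not a real library):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZeroShotTTS:
    """Stand-in for a TTS model trained on a multi-speaker corpus."""
    training_speakers: frozenset  # speakers seen during training

    def clone(self, reference_audio: bytes, text: str) -> str:
        # A real model would extract a speaker embedding from the
        # reference audio and condition synthesis on it; here we only
        # show what inference consumes: one clip plus the text.
        return f"[speech in reference voice ({len(reference_audio)} bytes): {text!r}]"

model = ZeroShotTTS(training_speakers=frozenset({"alice", "bob"}))
# "carol" never appeared in training; a single sample at inference suffices.
output = model.clone(reference_audio=b"carol_clip", text="Hello there")
```

The point is in the call signature: the target speaker shows up only as an argument to `clone`, never in `training_speakers`.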
coder543|1 month ago
ben_w|1 month ago
As with other replies, yes this is a silly name.
nateb2022|1 month ago
If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
oofbey|1 month ago
woodson|1 month ago
geocar|1 month ago
If you didn't do that (because you have 100 hours of other people talking), that's zero-shot, no?
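The distinction these two comments draw can be sketched in a few lines; everything here (the dict "model", the toy update, the averaged "embedding") is a stand-in, not any real API:

```python
import copy

def one_shot_adapt(model: dict, clip: list) -> dict:
    # One-shot: actually update the weights using the single clip
    # (a stand-in for a fine-tuning step on that clip).
    adapted = copy.deepcopy(model)
    adapted["weights"] = [w + 0.01 * sum(clip) for w in adapted["weights"]]
    return adapted

def zero_shot_condition(model: dict, clip: list) -> dict:
    # Zero-shot: weights are untouched; the clip is only turned into
    # conditioning input (a stand-in for a speaker embedding) at inference.
    embedding = sum(clip) / len(clip)
    return {"weights": model["weights"], "conditioning": embedding}

model = {"weights": [0.5, -0.2]}
clip = [0.1, 0.3, 0.2]

one_shot = one_shot_adapt(model, clip)        # weights change
zero_shot = zero_shot_condition(model, clip)  # weights identical
```

Whether the weights move on the reference clip is the whole difference: fine-tune on it and it's one-shot; merely condition on it and it's zero-shot.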
nateb2022|1 month ago
Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.
Your explanation just rephrases the very definition you dismissed.
woodson|1 month ago
> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.
That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.