top | item 46547296


woodson|1 month ago

This generic answer from Wikipedia is not very helpful in this context. In TTS, zero-shot voice cloning usually means that no audio of the target speaker (the voice you want the generated speech to sound like) needs to be included in the data used to train the TTS model. In other words, you provide an audio sample of the target speaker together with the text to be spoken, and the model generates audio that sounds as if it were spoken by that speaker.
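The flow described above can be sketched in a few lines. Everything here is a toy stand-in (the function names, the two-number "embedding"): real systems use learned neural speaker encoders and acoustic models, but the data flow is the same: reference clip → speaker embedding → synthesis conditioned on that embedding, with no retraining.

```python
def speaker_embedding(reference_audio: list[float]) -> list[float]:
    """Stand-in for a learned speaker encoder: maps a reference clip
    to a fixed-size vector characterizing the voice."""
    n = len(reference_audio)
    mean = sum(reference_audio) / n
    energy = sum(x * x for x in reference_audio) / n
    return [mean, energy]  # real embeddings are e.g. 256-dimensional

def synthesize(text: str, embedding: list[float]) -> dict:
    """Stand-in for the frozen TTS model: the target voice enters only
    through the embedding; the model weights are never modified."""
    return {"text": text, "voice": embedding}

# The target speaker was never in the training data; a single clip
# supplied at inference time is all the model sees of that voice.
ref_clip = [0.1, -0.2, 0.05, 0.3]
emb = speaker_embedding(ref_clip)
audio = synthesize("Hello there", emb)
```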



coder543|1 month ago

Why wouldn’t that be one-shot voice cloning? Calling it zero-shot doesn’t really make sense to me.

ben_w|1 month ago

Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?"

As other replies have said: yes, it's a silly name.

nateb2022|1 month ago

Providing inference-time context (in this case, audio) is no different from giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt: you're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
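The distinction drawn here can be made concrete with a toy model (the class name and the single-parameter "weights" are invented for illustration): prompting steers the output without touching the weights, while a fine-tuning step actually changes them.

```python
class TinyModel:
    """Toy model whose entire set of weights is one float."""

    def __init__(self):
        self.weight = 1.0

    def generate(self, prompt: str, context: str = "") -> float:
        # Zero-shot use: the context (audio clip, AGENTS.md, ...)
        # steers the output, but self.weight is untouched.
        return self.weight * len(prompt) + len(context)

    def finetune_step(self, example: float, lr: float = 0.1) -> None:
        # One-shot fine-tuning: a gradient-style update that
        # actually modifies the weights.
        self.weight -= lr * example

model = TinyModel()
w_before = model.weight
_ = model.generate("hello", context="clip")  # prompting
assert model.weight == w_before              # weights unchanged

model.finetune_step(example=2.0)             # fine-tuning
assert model.weight != w_before              # weights changed
```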

oofbey|1 month ago

It’s nonsensical to call it “zero shot” when a sample of the voice is provided. The term “zero shot cloning” implies you have some representation of the voice from another domain - e.g. a text description of the voice. What they’re doing is ABSOLUTELY one shot cloning. I don’t care if lots of TTS folks use the term this way, they’re wrong.

woodson|1 month ago

I don't disagree, but that's what people started calling it. A truly zero-shot setting doesn't make sense anyway: how would the model know what the voice should sound like, unless it's a celebrity voice (or similar) already present in the training data, where specifying a name would be enough?

geocar|1 month ago

So if you get your target to record (say) 1 hour of audio, that's a one-shot.

If you didn't do that (because you trained on 100 hours of other people talking), that's zero-shot, no?

nateb2022|1 month ago

> This generic answer from Wikipedia is not very helpful in this context.

Actually, the general definition fits this context perfectly. In machine learning terms, a specific 'speaker' is simply a 'class.' Therefore, a model generating audio for a speaker it never saw during training is the exact definition of the Zero-Shot Learning problem setup: "a learner observes samples from classes which were not observed during training," as I quoted.

Your explanation just rephrases the very definition you dismissed.

woodson|1 month ago

From your definition:

> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.

That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.