I don't disagree, but that's what people started calling it. Zero-shot doesn't make sense anyway: how would the model know what voice it should sound like (unless it's a celebrity voice or similar that's included in the training data, where specifying a name is enough)?
nateb2022|1 month ago
It makes perfect sense; you are simply confusing training samples with inference context. "Zero-shot" refers to zero gradient updates (retraining) required to handle a new class. It does not mean "zero input information."
> how would the model know what voice it should sound like
It uses the reference audio just like a text-based model uses a prompt.
> unless it's a celebrity voice or similar included in the training data where it's enough to specify a name
If the voice is in the training data, that is literally the opposite of zero-shot. The entire point of zero-shot is that the model has never encountered the speaker before.
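To make the distinction concrete, here's a toy sketch. Everything in it is hypothetical: `ToyVoiceModel`, `zero_shot_clone`, and `one_shot_finetune` are made-up names, and a single float stands in for the trained weights, so this is not how any real TTS system is implemented. It only shows that conditioning on a reference clip leaves the parameters alone, while learning from one example changes them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyVoiceModel:
    # Hypothetical stand-in for all trained parameters of a TTS model.
    weight: float

    def synthesize(self, text_feature: float, reference: float) -> float:
        # The reference clip is an inference-time input, like a prompt.
        return self.weight * text_feature + reference


def zero_shot_clone(model: ToyVoiceModel, text_feature: float, reference: float):
    """Condition on an unseen speaker's reference; no gradient update."""
    return model, model.synthesize(text_feature, reference)


def one_shot_finetune(model: ToyVoiceModel, text_feature: float,
                      reference: float, target: float, lr: float = 0.1) -> ToyVoiceModel:
    """Take one gradient step on a single example; the weights change."""
    pred = model.synthesize(text_feature, reference)
    grad = 2 * (pred - target) * text_feature  # d/dw of squared error
    return ToyVoiceModel(weight=model.weight - lr * grad)


model = ToyVoiceModel(weight=1.0)
same_model, audio = zero_shot_clone(model, text_feature=2.0, reference=0.5)
tuned_model = one_shot_finetune(model, text_feature=2.0, reference=0.5, target=3.5)
```

In the zero-shot path `same_model` is the original object with identical weights; in the one-shot path `tuned_model` has a different `weight`.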
magicalhippo|1 month ago
Thus if you feed the model a target voice, i.e. an example of the desired output voice, it sure seems like it should be classified as one-shot.
However, it seems the "zero-shot" in voice cloning is relative to learning, in contrast to one-shot learning[1].
So it's a somewhat overloaded term, which causes confusion, from what I can gather.
[1]: https://en.wikipedia.org/wiki/One-shot_learning_(computer_vi...