modeless|1 year ago
Not a very good demo page. It's difficult to judge real-world quality with such unenthusiastic reading, unrealistic sentences, and unfamiliar voices. Typical of speech papers. It would be much better if celebrities were used as target voices, as we all know what they sound like and can therefore judge quality better. But I suppose that would be too controversial for Google.
In general I think it is silly that voice cloning research has focused so much (exclusively?) on cloning voices from just a few seconds of audio. It puts a pretty low ceiling on quality. Many nuances of a person's communication style will not be contained in such a small amount of data. Sure, you can match their pitch and timbre, but voice cloning should be more than that.
refulgentis|1 year ago
> But I suppose that would be too controversial for Google.
You don't have to suppose anything: it is actually settled law that it's bad to just willy-nilly use people's voices if you feel like it, even if it's just a sound-alike!
ascorbic|1 year ago
For those confused as I was: it's not trying to match the accent of the target speech in those samples, just the timbre. To quote the paper:
> Voice conversion refers to altering the style of a speech signal while preserving its linguistic content. While style encompasses many aspects of speech, such as emotion, prosody, accent, and whispering, in this work we focus on the conversion of speaker timbre only while keeping the linguistic and para-linguistic information unchanged.