Thanks for your interest in NaturalSpeech and NaturalSpeech 2!
NaturalSpeech focuses on synthesizing human-level, high-quality speech by training on a single-speaker recording-studio dataset.
NaturalSpeech 2 is trained on 44K hours of multi-speaker in-the-wild data covering more than 5K speakers, and focuses on synthesizing any speaker's voice in a zero-shot way, given only a short speech prompt. If the speech prompt contains background noise, NaturalSpeech 2 will mimic that noise as well. If you want a clean voice, simply provide a clean speech prompt.
See more discussion on Reddit as well: https://www.reddit.com/r/singularity/comments/12rubq4/latent...