It's hard to say! We don't know exactly how many parameters or minutes of audio are needed to fully describe someone's voice and speaking patterns. Maybe one or two, maybe much more.

hulahoof|9 years ago

Do you expect derivatives of this to surpass Adobe's effort with VoCo? From my untrained perspective they appear quite similar in functionality.

PieSquared|9 years ago

I don't quite know what VoCo does, but it seems like a concatenative system that they've tuned a huge amount. I'm a little skeptical that it works as well and as reliably in real life as it does in demos. But even so, parametric models tend to be much smaller in size and more flexible, so there may be applications where WaveNet-style systems work and concatenative systems can't (high-quality on-device TTS, emotive TTS, speaker synthesis for new unheard speakers, etc.).