It's hard to say! We don't quite know exactly how many parameters or minutes of audio are needed to describe fully someone's voice and speaking patterns. Maybe one or two, maybe much more.
I don't quite know what VoCo does, but it seems like a concatenative system that they've tuned a huge amount. I'm a little skeptical that it works as well and as reliably in real life as it does in demos. But, even so, there parametric models tend to be much smaller in size and more flexible, so there may be applications where WaveNet-style systems are applicable in ways concatenative systems can't handle (high quality on-device TTS, emotive TTS, speaker synthesis for new unheard speakers, etc).
hulahoof|9 years ago
PieSquared|9 years ago