Are you folks planning on extending this to speech? I've always been disappointed that speech vocoder networks aren't built with any strong inductive biases for waveform generation (besides very long receptive fields), and I've desperately wanted something like this tuned for speech. It'd be great if a DSP-based architecture could be shown to outperform WaveNet / Parallel WaveNet / WaveRNN / WaveFlow / etc., and I'd love to use that in our own work. (There have been some attempts based on source-filter models, like the "neural source filter (NSF) network", but nothing has caught on as far as I can tell.)