top | item 20823193

PieSquared | 6 years ago

First of all, it's important to note that Tacotron and WaveNet are responsible for different parts of the speech synthesis pipeline, so the comparison here isn't quite accurate. Specifically, Tacotron takes a representation of the text (characters, phonemes, etc.) and converts it into a frame-level acoustic representation (spectrograms, log mel spectrograms, etc., spaced every 5-25ms). WaveNet takes a frame-level representation of the audio (for example, the output of Tacotron, or phonemes with frame-level timing information) and converts it into a waveform.
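The division of labor above can be sketched at the level of array shapes. Everything here is a hypothetical stand-in (the stub functions, the 22050 Hz sample rate, the 12.5 ms frame shift, the 80 mel bins are illustrative choices, not the actual models or their exact configurations):

```python
# Toy sketch of the two-stage pipeline, shapes only:
# stage 1 (Tacotron-like) maps text to frame-level acoustic features,
# stage 2 (WaveNet-like) maps those frames to waveform samples.

SAMPLE_RATE = 22050    # Hz; a common choice for TTS corpora (assumption)
FRAME_SHIFT_MS = 12.5  # frame spacing, within the 5-25 ms range above
N_MELS = 80            # mel bins per acoustic frame (assumption)

def tacotron_stub(text):
    """Text -> log mel spectrogram frames (n_frames x N_MELS).

    Emits one frame per character so the shapes are concrete;
    a real model predicts the frame count itself."""
    return [[0.0] * N_MELS for _ in text]

def wavenet_stub(mel_frames):
    """Frame-level acoustic features -> waveform samples."""
    samples_per_frame = int(SAMPLE_RATE * FRAME_SHIFT_MS / 1000)
    return [0.0] * (len(mel_frames) * samples_per_frame)

mel = tacotron_stub("hello world")   # 11 frames x 80 mel bins
audio = wavenet_stub(mel)            # 11 * 275 = 3025 samples
```

The point of the sketch is the interface: the first stage's output (a time-by-frequency grid at ~10 ms granularity) is exactly the second stage's conditioning input, which is why swapping in a different vocoder or a different acoustic model is possible.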

Second, I don't see any reason why there shouldn't be an open-source Tacotron or WaveNet implementation that's as good as Google's model implementations. Implementing and training these models is expensive but not prohibitively so (nowadays, you could probably do it for $5,000-$10,000, including experimentation costs).

That said, the quality of text-to-speech systems is only partially determined by the quality of these models -- much, if not most, of the work of building a high-quality TTS system goes into things like data collection pipelines, good data annotations, text normalization and NLP tailored to the domain of the system, multi-language support, optimized inference implementations for server or mobile platforms, etc.
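To make the text-normalization point concrete, here is a minimal sketch of the kind of front-end work involved: expanding non-standard words (digits, abbreviations) into speakable form before the model ever sees the text. The lookup table and the rules here are toy assumptions; production systems use large, domain-specific rule sets or learned normalizers:

```python
import re

# Hypothetical, tiny normalization table -- real systems have far more
# entries plus context-sensitive rules (e.g. "Dr." as doctor vs. drive).
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def spell_digits(match):
    # Read a digit string out digit by digit ("221" -> "two two one").
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text):
    # Expand known abbreviations, then spell out any remaining digits.
    words = [ABBREVIATIONS.get(w, w) for w in text.lower().split()]
    return re.sub(r"\d+", spell_digits, " ".join(words))

print(normalize("Dr. Smith lives at 221 Baker St."))
# -> doctor smith lives at two two one baker street
```

Getting this layer right for a given domain (addresses, currencies, dates, product names) is tedious, language-specific work, which is part of why a good open-source model alone doesn't make a good open-source TTS system.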
