PieSquared | 6 years ago
The main difference between something like Festival and what we have now is the amount of domain-specific engineering. (This is generally the promise of deep learning -- replace hand-engineered features with simple-to-understand features and a deep model.) If you go and read the Festival manual, you'll find tons of domain-specific rules, heuristics, and subroutines; for example, there's a page on writing letter-to-sound rules as a grammar [2]. Nowadays, we may have a pipeline that resembles Festival at a high level, but each step of the pipeline is learned from data as a deep model rather than carefully hand-engineered by many people over the course of years. This yields much more fluid speech, as well as much faster iteration and experimentation, which in turn leads to faster progress.
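To make "hand-engineered rules" a bit more concrete, here is a toy sketch of what ordered, context-dependent letter-to-sound rules can look like. Everything in it is an assumption for illustration: the rule format, the phone symbols, the tiny rule set, and the names `RULES` and `letters_to_phones` are made up and are not Festival's actual syntax or API.

```python
# Toy hand-written letter-to-sound rules (hypothetical format, not Festival's).
# Each rule is (left_context, letters, right_context, phones).
# An empty context matches anything; rules are tried in order, first match wins.
RULES = [
    ("", "ch", "", ["ch"]),
    ("", "sh", "", ["sh"]),
    ("", "ph", "", ["f"]),
    ("", "c", "e", ["s"]),   # "c" before "e" is soft
    ("", "c", "i", ["s"]),   # "c" before "i" is soft
    ("", "c", "", ["k"]),    # otherwise "c" is hard
    ("", "q", "", ["k"]),
    ("q", "u", "", ["w"]),   # "u" right after "q"
]


def letters_to_phones(word, rules=RULES):
    """Rewrite a word into a list of phone symbols using ordered rules."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for left, letters, right, out in rules:
            end = i + len(letters)
            if (word.startswith(letters, i)
                    and word[:i].endswith(left)
                    and word.startswith(right, end)):
                phones.extend(out)
                i = end
                break
        else:
            # No rule matched: pass the letter through unchanged.
            phones.append(word[i])
            i += 1
    return phones


for w in ["cat", "cell", "chip", "quit"]:
    print(w, letters_to_phones(w))
```

Even this toy needs ordering tricks (the "ch" rule has to come before the plain "c" rules), and a real rule set would need many more cases and exceptions. That is the kind of per-language effort a learned model replaces with training data.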
[0] https://arxiv.org/abs/1811.11913
[1] https://people.xiph.org/~jm/demo/lpcnet/
[2] http://www.festvox.org/docs/manual-2.4.0/festival_13.html#Le...