I'm glad that it doesn't. A lot of us use these voices as an accessibility tool in our screen readers. They need to perform well and be understandable at very high rates, and they need to be very responsive. eSpeak is one of the most responsive speech synths out there, so for a screen reader the latency from key press to speech output is extremely low. Adding AI would make it a lot slower and less predictable, and unusable for daily work, at least right now.
This is anecdotal, but part of what makes a synth work well at high speech rates is predictability. I know exactly how a speech synth is going to say something. This lets me put more focus on the task at hand rather than trying to decipher what the synth is saying. Neural TTS always varies in how it says things, and at times those differences can be large enough to trip me up. Then I'm focusing on the speech again and not on what I'm doing. But eSpeak is very predictable, so I can let my brain do the pattern matching and focus actively on something else.