top | item 31419412


narrationbox | 3 years ago

Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.


Arbortheus|3 years ago

I had a lot of success using FastSpeech2 + MB MelGAN via TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS. There are demos for iOS and Android which will allow you to run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.
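The pipeline mentioned above is two models chained together: an acoustic model (FastSpeech2) turns phoneme IDs into a mel spectrogram, and a vocoder (MB-MelGAN) upsamples each mel frame into raw audio samples. A minimal shape-level sketch, with made-up stand-in functions and assumed constants (80 mel bins, 256-sample hop size are common FastSpeech2/MelGAN config values, not taken from the linked repo):

```python
# Hypothetical stand-ins for the two stages of a FastSpeech2 + MB-MelGAN
# pipeline; only the tensor shapes are meant to be illustrative.
N_MELS = 80    # mel bins per frame (typical config, assumption)
HOP_SIZE = 256 # audio samples produced per mel frame (assumption)

def fake_acoustic_model(phoneme_ids):
    # Real FastSpeech2 predicts a duration per phoneme; here we just
    # pretend every phoneme lasts 5 mel frames.
    n_frames = len(phoneme_ids) * 5
    return [[0.0] * N_MELS for _ in range(n_frames)]

def fake_vocoder(mel):
    # MB-MelGAN upsamples each mel frame into HOP_SIZE audio samples.
    return [0.0] * (len(mel) * HOP_SIZE)

phonemes = [12, 47, 3, 29]           # made-up phoneme IDs
mel = fake_acoustic_model(phonemes)  # 20 frames x 80 mel bins
audio = fake_vocoder(mel)            # 20 * 256 = 5120 samples
print(len(mel), len(mel[0]), len(audio))
```

The split matters for latency: the acoustic model is cheap, and the vocoder (the expensive half in older systems like WaveNet) is a small non-autoregressive GAN here, which is what makes sub-second on-device synthesis plausible.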

kevin_thibedeau|3 years ago

Dr. Sbaitso ran on a modest 386. Mobile device processors generally eclipse that and could definitely generate better quality TTS.

ben_w|3 years ago

Not only is state of the art TTS much more demanding (and much much higher quality) than Dr. Sbaitso[0], but so are the not-quite-so-good TTS engines in both Android and iOS.

That said, having only skimmed the paper, I didn't notice a discussion of the compute requirements for inference (just training), but it did say it was a 28.7-million-parameter model, so I reckon this could be used in real time on a phone.

[0] judging by the videos of Dr. Sbaitso on YouTube, it was only one step up from the intro to Impossible Mission on the Commodore 64.

ccbccccbbcccbb|3 years ago

The parent didn't mention real-time as a requirement; offline rendering would suffice.

SemanticStrengh|3 years ago

28.7 million parameters is nothing for inference.
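For a sense of scale, the weight storage alone is easy to estimate (back-of-the-envelope only; activation memory and runtime overhead are ignored):

```python
# Weight storage for a 28.7M-parameter model at common precisions.
params = 28_700_000
fp32_mb = params * 4 / 1e6  # 4 bytes/weight -> 114.8 MB
fp16_mb = params * 2 / 1e6  # 2 bytes/weight -> 57.4 MB
int8_mb = params * 1 / 1e6  # 1 byte/weight  -> 28.7 MB (quantized)
print(fp32_mb, fp16_mb, int8_mb)
```

Even at full fp32 precision that fits comfortably in a modern phone's RAM, and int8 quantization (standard for mobile deployment) brings it under 30 MB.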

snek_case|3 years ago

Often you can prune parameters as well. You might be able to cut that down by a factor of 10 without any noticeable loss in accuracy.
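The usual way to get that kind of cut is magnitude pruning: zero out the weights with the smallest absolute values and keep only the largest ones. A toy sketch of the idea (real frameworks prune per-layer or with fine-tuning; this just shows the selection rule):

```python
# Global magnitude pruning: keep only the top keep_fraction of weights
# by absolute value, zeroing the rest (a 10x cut = keep_fraction 0.1).
def prune(weights, keep_fraction=0.1):
    n_keep = max(1, int(len(weights) * keep_fraction))
    threshold = sorted(abs(w) for w in weights)[-n_keep]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

w = [0.01, -0.5, 0.03, 2.0, -0.002, 0.9, 0.04, -1.1, 0.005, 0.2]
pruned = prune(w)  # only the single largest-magnitude weight survives
print(pruned)
```

The zeroed weights can then be stored in a sparse format, shrinking both the download size and (with a sparse-aware runtime) the inference cost.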