Open-source tortoise-TTS, which is based on the same theory as DALL-E, has been able to do this for 6+ months now. From playing with Tortoise a bit over the last couple of weeks, it seems like the issue is not so much accuracy anymore, but rather how GPU-intensive it is to generate speech of any meaningful duration. Tortoise takes ~5 seconds on a $1000 GPU (P5000) to generate one second of spoken text. There are cloud options (Colab, Paperspace, RunPod) but still https://github.com/neonbjb/tortoise-tts
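To put that ratio in perspective, here is a rough back-of-the-envelope sketch. It uses only the ~5 seconds of compute per second of speech figure quoted above, and it assumes generation time scales linearly with clip length, which may not hold in practice. The function name `generation_seconds` is made up for illustration and is not part of tortoise-tts.

```python
def generation_seconds(audio_seconds: float, compute_per_audio_second: float = 5.0) -> float:
    """Estimate compute time for a clip of the given length.

    compute_per_audio_second defaults to the ~5 s figure quoted for a P5000.
    Assumes linear scaling with clip length (an assumption, not a measurement).
    """
    if audio_seconds < 0 or compute_per_audio_second <= 0:
        raise ValueError("audio length must be >= 0 and the ratio must be > 0")
    return audio_seconds * compute_per_audio_second


# A one-minute clip and a one-hour audiobook chapter at the quoted ratio.
one_minute = generation_seconds(60)           # 300.0 seconds, i.e. 5 minutes
one_hour = generation_seconds(3600) / 3600    # 5.0 hours of GPU time

print(f"1 min of speech: ~{one_minute / 60:.0f} min of compute")
print(f"1 h of speech:   ~{one_hour:.0f} h of compute")
```

So under these assumptions, an hour of narration works out to roughly five hours on that card, which is why duration rather than accuracy becomes the bottleneck.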
ShamelessC|3 years ago
I agree, though. IIRC Tortoise TTS did a lot of similar work, and it was built by a single person on their own multi-GPU setup. Really impressive effort. Did they get a citation? They deserve one.
edit: reading other comments it seems there is a misconception that the model takes 3 seconds to run? That isn't the case - it requires "just" 3 seconds of example audio to successfully clone a voice (for some definition of success).
morrisjm|3 years ago
RTX 5000 maybe, but I'm not sure how much of a value improvement there is.
jordibc|3 years ago
morrisjm|3 years ago
NayamAmarshe|3 years ago