A big caveat here is that the author is only looking at a ranking of open models. That's buried in a single sentence, but it makes a big difference to model quality. Kokoro sits at #15 in the overall rankings, so if #15 is what you consider the "best model", you need to be cognizant that you're leaving performance on the table.
I've heard a lot of Substacks voiced by Eleven Labs models and they seem fine (with the occasional weirdness around a proper noun). Not a bad article, but I think more examples of TTS usage would be useful.
I guess the takeaway is that open-weight TTS models are only okay and could be a lot better?
Yeah, from my experience the more helpful conclusion is "TTS is not commoditized yet". At some point in the next 5 years, convincing TTS will be table stakes. But for now, paying for TTS gets you better results.
The paid models are still too expensive for personal long-form use cases. For example: if I want to generate an audiobook from a web novel, the price can run into the thousands of dollars. If I'm just a regular reader (not the author), that's prohibitively expensive for someone who just wants to enjoy the story in a different medium.
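For a sense of scale, here's a back-of-envelope estimate; both numbers are hypothetical round figures for illustration, not any vendor's actual pricing:

```python
# Hypothetical numbers: a long-running web novel and an illustrative paid-TTS rate.
novel_chars = 5_000_000      # ~1M words is not unusual for a long web novel (assumed)
usd_per_1k_chars = 0.30      # assumed rate; real pricing varies by vendor and tier

cost = novel_chars / 1000 * usd_per_1k_chars
print(f"${cost:,.0f}")  # $1,500
```

Even at a modest assumed rate, a single long novel lands in the thousands once you account for retries and regenerated sections.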
Yup, ElevenLabs still pretty much rules this space. It's especially hard to find anything good if you're looking for non-English models, although the latest Chatterbox[1] now supports 23 languages.
>However, like many models in this leaderboard - I can’t use it - since it doesn’t support voice cloning.
That's such a strange requirement. A TTS is just that: it takes text and speaks it out loud. The user generally doesn't care whose voice it is, and personally I think TTSes sharing the same voice is a good thing for authenticity, since it lets users know that it's a TTS reading the script and not a real person.
You want your voice to be reading the script, but you don't want to personally record yourself reading the text? As far as I'm concerned that's an edge case. No wonder TTSes can't do that properly, since most people don't need it in the first place.
For local TTS for a podcast I'd try the quantized .gguf versions of Microsoft VibeVoice Large in ComfyUI: clone my voice from a ~30 second speech sample, then apply it to marked-up text of the desired podcast. But it'd be nowhere near real time and would require dedicating a $300 GPU to it. And the quantized version often goes off the rails and loses consistency in voice tone or accent, so one run often isn't enough and you have to piece the good parts of many separate runs together. It's not set-and-forget.
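Piecing the good takes together at least needs no special tooling. A minimal sketch in stdlib Python (the filenames are hypothetical), assuming every take was exported with the same sample rate, sample width, and channel count:

```python
import wave

def concat_wavs(take_paths, out_path):
    """Concatenate the usable takes into a single WAV file.

    All inputs must share the same format (channels, sample width, rate).
    """
    with wave.open(out_path, "wb") as out:
        params_set = False
        for path in take_paths:
            with wave.open(path, "rb") as take:
                if not params_set:
                    # The wave module fixes up the frame count on close.
                    out.setparams(take.getparams())
                    params_set = True
                out.writeframes(take.readframes(take.getnframes()))

# e.g. concat_wavs(["take1_good.wav", "take3_good.wav"], "episode.wav")
```

Anything fancier (crossfades, loudness matching between takes) would need something like ffmpeg, but for plain splicing this is enough.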
I do a lot of desktop screen-reader and pdf/doc/epub/etc text to speech every single day. It's been 20 years and I still use Festival 1.96 TTS with the voice_nitech_us_slt_arctic_hts voice because it's so computationally cheap and just a step above the usual festival/espeak/mbrola-type TTS quality, enough to be clear and tolerable. For this local "do screenreader stuff really fast" use case I've tried modern TTS like VibeVoice, Kokoro TTS, sherpa-onnx, Piper TTS, Orpheus TTS, etc. They all have consistency issues, many are way too slow even with a $300 GPU dedicated to them, and most output weird garbled noises at unpredictable times along with the good output.
>I do a lot of desktop screen-reader and pdf/doc/epub/etc text to speech every single day.
I've been working on a product called WithAudio (https://with.audio). Are you open to me reaching out and giving you a free license so you can use it and let me know what you think? I should say it only supports Windows and Mac (ARM).
I'm looking for people who have used similar products to get their feedback.
It looks like the inline Apple Podcasts player is causing that, though it's not clear why, since it's loading unencrypted MP3s directly from the author's S3 bucket. I guess their player eagerly sets up DRM playback at startup rather than waiting until it's needed, or they're using EME for something else (fingerprinting?).
The Gemini models, Eleven V3, and whatever internal audio model Sora 2 uses are about neck and neck, converging in performance. They have some unexplainable flavor to them, though. Especially Sora.
While it sounds like this blogger doesn't want to bother (and perhaps experimenting with AI is itself the appeal), I personally appreciate when authors read their posts instead of delegating the task to AI.
If this demo video[1] is indicative of what you can expect, I'm not particularly impressed. For me, every single one of the recordings fell all the way to the bottom of the uncanny valley.
It's very interesting to see that there are actually people who want to automatically create a "podcast" from their blog using their cloned voice. Is this just what tech bro culture does to someone? Or is it about hustling and grinding while getting your very important word out there? I mean, over time one would certainly save up to 20 minutes per article...
Superhuman TTS is well within the capabilities of the big AI labs. Even Google had voices indistinguishable from human back in 2017, but they deliberately kneecapped them because of the potential for misuse. Boomers and older folks are not culturally or mentally equipped to handle it; even the crappy open-source voice cloning we had in 2019 got used to scam people into buying gift cards.
Because of the potential for abuse, nobody wants to release a truly good, general model; it makes lawyers lose sleep. A few more generations of hardware, though, and there will be enough open data and DIY scaffolding out there to produce a superhuman model, and someone will release it.
Deepfake video is already indistinguishable from real video (not oneshot prompt video generation, but deliberate skilled craft using AI tools).
Higgsfield and other tools allow for spectacular voice results, but it takes craft and care. The oneshot stuff is deliberately underpowered. OpenAI doesn't want to be responsible for a viral pitch-perfect campaign ad, or fake scandal video sinking a politician, for example.
Once the lawyers calm down, or we get a decent digital bill of rights that establishes clear accountability on the user of the tool, and not the toolmaker, things should get better. Until then, look for the rogue YOLO boutique services or the ambitious open source crew to be the first to superhuman, widely available TTS.
> Boomers and older folks are not culturally or mentally equipped to handle it
I think a lot of younger people are also not mentally equipped to handle it. Outside of the Hacker News sphere of influence, people are really bad at spotting AI slop (and also really bad at caring about it).
> Boomers and older folks are not culturally or mentally equipped to handle it
I'm glad you mentioned this because the "Grandma - I was arrested and you need to send bail" scams are already ridiculously effective to run. Better TTS will make voice communication without some additional verification completely untrustworthy.
But, also, I don't want better TTS. I can understand the words current robotic TTS is saying so it's doing the job it needs to do. Right now there are useful ways to use TTS that provide real value to society - better TTS would just enable better cloaking of TTS and allow actors to more effectively waste human time. I would be perfectly happy if TTS remained at the level it is today.
[1]: https://github.com/resemble-ai/chatterbox
That's a good rule.
> You must enable DRM to play some audio or video on this page.
Looks like `embed.podcasts.apple.com` isn't in the same spirit.
There are flashes of brilliance, but most of it is noticeably computer generated.
Totally agree on the pain points - I covered similar thoughts in my post: https://lielvilla.com/blog/death-of-demo/
1. https://github.com/user-attachments/assets/0fd73fad-097f-48a...
Also, I suspect these AI-podcast blogs are probably just generated with AI too, so it's likely safe to skip the whole mess.
Maybe that's not so important?
So just feed it batches smaller than 1000 characters? It's not like TTS requires maintaining large contexts at a time.
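A sketch of what that batching might look like; the 1,000-character cap and the sentence-splitting regex are assumptions for illustration, not any particular API's rules:

```python
import re

def batch_text(text, max_chars=1000):
    """Split text into batches under max_chars, breaking at sentence ends.

    A single sentence longer than max_chars is kept whole rather than cut
    mid-sentence; a real pipeline might fall back to splitting on commas.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    batches, current = [], ""
    for sentence in sentences:
        # Start a new batch if adding this sentence would exceed the limit.
        if current and len(current) + 1 + len(sentence) > max_chars:
            batches.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        batches.append(current)
    return batches

print(batch_text("First sentence. Second sentence. Third one!", max_chars=35))
```

Feed each batch to the TTS in turn and concatenate the resulting audio; since TTS doesn't need long-range context, the seams are only audible if a batch boundary lands mid-sentence.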