Launch HN: Play.ht (YC W23) – Generate and clone voices from 20 seconds of audio
Today, we are excited to share beta access to our latest model, Parrot, that is capable of cloning any voice with a few seconds of audio and generating expressive speech from text.
You can try it out here: https://playground.play.ht. And there are demo videos at https://www.youtube.com/watch?v=aL_hmxTLHiM and https://www.youtube.com/watch?v=fdEEoODd6Kk.
The model also captures accents well and is able to speak in all English accents. Even more interesting, it can make non-English speakers speak English while preserving their original accent. Just upload a non-English speaker clip and try it yourself.
Existing text-to-speech models lack expressiveness, control, or directability of the voice: for example, making a voice speak in a specific way, or emphasizing a certain word or part of the speech. Our goal is to solve these across all languages. Since the voices are built on LLMs, they are able to express emotions based on the context of the text.
Our previous speech model, Peregrine, which we released last September, is able to laugh, scream and express other emotions: https://play.ht/blog/introducing-truly-realistic-text-to-spe.... We posted it to HN here: https://news.ycombinator.com/item?id=32945504.
With Parrot, we've taken a slightly different approach and trained it on a much larger data set. Both Parrot and Peregrine only speak English at the moment but we are working on other languages and are seeing impressive early results that we plan to share soon.
Content creators of all kinds (gaming, media production, e-learning) spend a lot of time and effort recording and editing high-quality audio. We solve that and make it as simple as writing and editing text. Our users range from individual creators looking to voice their videos, podcasts, etc., to teams at various companies creating dynamic audio content.
We initially built this product for ourselves to listen to books and articles online, and found that the quality of TTS was very low, so we kept working on it until eventually we trained our own models and built a business around it. There are many robotic TTS services out there, but ours allows people to generate truly human-level expressive speech and allows anyone to clone voices instantly with strong resemblance. We initially used existing TTS models and APIs, but when we started talking to our customers in gaming, media production, and other fields, people didn't like the monotone, robotic TTS style. So we doubled down on training a new model based on the emerging architectures that use transformers and self-supervised learning.
On our platform, we offer two types of voice cloning: high-fidelity and zero-shot. High-fidelity voice cloning requires around 20 minutes of audio data and creates an expressive voice that is more robust and captures the accent of the target voice with all its nuances. Zero-shot clones the voice with only a few seconds of audio and captures most of the accent and tone, but isn’t as nuanced because it has less data to work with. We also offer a diverse library of over a hundred voices for various use cases.
We offer two ways to use these models on the platform: (1) our text-to-voice editor, which allows users to create and manage their audio files in projects, etc.; and (2) our API - https://docs.play.ht/reference/api-getting-started. The API supports streaming and polling, and we are working on reducing the latency to make it real time. We have a free plan and transparent pricing available for anyone to upgrade.
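As a rough illustration of the polling pattern mentioned above, here is a minimal, generic sketch. It does not use the real API: `fetch_status` is a hypothetical callable standing in for whatever status request the documentation describes, and the `"done"` key, the interval, and the timeout are all assumptions made for the example.

```python
import time


def poll_until_done(fetch_status, interval=1.0, timeout=60.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports completion or the timeout elapses.

    fetch_status is a hypothetical callable returning a dict such as
    {"done": True, "url": "..."}; it is a placeholder, not a real API call.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("done"):
            return status
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        # Wait before asking again so the server is not hammered.
        sleep(interval)
```

A caller would pass in a function that performs the actual status request and then read the result from the returned dict. Streaming avoids this wait loop entirely, which is presumably why it matters for the latency goal.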
We are thrilled to be sharing our new model, and look forward to feedback!
[+] [-] h1fra|3 years ago|reply
Just a few notes on the UX:
- The screen for recording your own voice should provide a script too; that could help increase the quality of the sample, because I struggled to say anything relevant.
- On recording again: there is no timer, so it's hard to tell when it's okay to stop.
- You enforce the checkbox "not [...] to generate any sexual content" yet you have a filter to display only NSFW.
- It doesn't work at all with non-English voices; maybe you can add a warning or a way to fine-tune depending on the language?
- There is no way to delete a voice or an account; that's a huge red flag, especially when dealing with PII like this.
- Another person has said it already, but generated voices are identified by an auto-increment ID, making it easy to access another person's PII. I would recommend at the very least a random string or a UUID.
- All generated voices are public and there is no way to delete them.
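To illustrate the suggestion about identifiers: a minimal sketch of generating unguessable IDs with Python's standard library, instead of sequential integers. The function names are made up for the example, and nothing here reflects how Play.ht actually stores its voices.

```python
import secrets
import uuid


def new_public_id() -> str:
    """Return a random URL-safe token built from 16 random bytes (128 bits)."""
    return secrets.token_urlsafe(16)


def new_uuid() -> str:
    """Return a random (version 4) UUID as a string."""
    return str(uuid.uuid4())
```

Either value can replace the number in a URL like `/listen/1079`. Random IDs only stop enumeration; content that should be private still needs an access check.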
[+] [-] jeroenhd|3 years ago|reply
Going to the demo page and hearing a random snippet of Musk-worship was pretty weird. Out of all audio tracks to place at the top of your demos, you chose this?
[+] [-] nico|3 years ago|reply
Warning to others wanting to click on the link: damn that was creepy.
[+] [-] mugr|3 years ago|reply
[+] [-] yreg|3 years ago|reply
[+] [-] bongobingo1|3 years ago|reply
[+] [-] WakoMan12|3 years ago|reply
[+] [-] delgaudm|3 years ago|reply
What is your process for verifying consent?
[+] [-] ros86|3 years ago|reply
[+] [-] 1xdevloper|3 years ago|reply
[+] [-] mikecoles|3 years ago|reply
[+] [-] anigbrowl|3 years ago|reply
Are you though? You might just be computer-generated.
While I'm very impressed with this technically (and as a pro-audio person I feel validated to see my predictions of a few years back coming true so dramatically), I don't see anything about risk management in here. Your tech absolutely will get used by scammers, given the overabundance of voice data on the open internet. How are you going to hedge against that?
[+] [-] mahmoudfelfel|3 years ago|reply
[+] [-] mmkos|3 years ago|reply
[+] [-] Natfan|3 years ago|reply
https://playground.play.ht/listen/1079 (https://archive.ph/HKjue)
How exactly do you expect to combat this type of content?
[+] [-] hammadh|3 years ago|reply
[+] [-] bradleysz|3 years ago|reply
[+] [-] thisOtterBeGood|3 years ago|reply
[+] [-] spokeonawheel|3 years ago|reply
Just stop it before they can generate it.
However, it's just a matter of time, so I wouldn't put it on the author to stop this kind of stuff. The only defense is education.
[+] [-] MuffinFlavored|3 years ago|reply
Pops a modal: Try Voice Cloning for Free!
Enter a credit card for $0.00/mo with no other information on screen
Bounce.
Why not let me play around with it a little without asking for a credit card?
[+] [-] joshmn|3 years ago|reply
My mom passed away a few years ago. I always let her calls go to my voicemail so I could have them. I was using Google Voice at the time so this worked wonderfully. Unfortunately, I will not listen to many of them — she was an alcoholic and I can't bear to listen to her while drunk. The few I have of her when she's sober I listen to occasionally.
Having said that, this is really nice.
[+] [-] testmasterflex|3 years ago|reply
[+] [-] gwerbret|3 years ago|reply
[+] [-] nsxwolf|3 years ago|reply
[+] [-] tanepiper|3 years ago|reply
You've just launched in beta, so how can you claim this? I'm always very suspicious of this (I say this as a tech lead at a multi-billion-euro retailer whose logo you'll never be able to use).
Is this one developer? A team? Or is this just marketing bullshit for VCs who somehow don't verify if this is true or not?
[+] [-] nanis|3 years ago|reply
[+] [-] JohnFen|3 years ago|reply
Get a panicky call from "me" in the middle of the night? If I don't include my safe word, that call isn't from me.
[+] [-] gus_massa|3 years ago|reply
It's not very important that the voice is similar to the supposed victim's. Usually the person on the call is weeping and it's very difficult to recognize the voice. Moreover, a confusing voice at 2am may be interpreted as any of your relatives or friends, but an exact voice can be interpreted as only one person, and it's easier to check that that person is safe.
[+] [-] mahmoudfelfel|3 years ago|reply
[+] [-] TheUndead96|3 years ago|reply
[+] [-] mlboss|3 years ago|reply
[+] [-] jascii|3 years ago|reply
[+] [-] selflesssieve|3 years ago|reply
[+] [-] barking_biscuit|3 years ago|reply
[+] [-] Dowwie|3 years ago|reply
People here are talking about taking this service offline but I think everyone needs to be thinking about countermeasures, working on those services next. The genie is already out of the bottle. The degree of effort to put this together is low enough that it will be replicated around the world.
[+] [-] KaiserPro|3 years ago|reply
> "Hi Mom, I need some help. Some guys hit me over the head and put me in a van, and they're saying they'll kill me if you don't wire money to this bank account."
top class.
EDIT this was about one page down on the "see what people are generating" page
[+] [-] braingenious|3 years ago|reply
> I recommend you immediately add identity verification (state-issued identification verification)
and
> The genie is already out of the bottle. The degree of effort to put this together is low enough that it will be replicated around the world.
are thoughts that end up in the same post?
If the genie is out of the bottle, is it your proposed solution that everybody who runs a model like this implements bank-style KYC?
What do you propose should happen when this sort of software becomes freely available for everyone? When (not if) that happens, what will your suggestion have accomplished?
[+] [-] twodave|3 years ago|reply
We're in for some really strange times.
[+] [-] abirch|3 years ago|reply
Play.ht will want to have as much information as possible about their users.
[+] [-] digitallyfree|3 years ago|reply
[+] [-] perlwle|3 years ago|reply
https://www.businessinsider.com/couple-canada-reportedly-los...
[+] [-] hammadh|3 years ago|reply
[+] [-] gsich|3 years ago|reply
[+] [-] chatmasta|3 years ago|reply
The solution to scams is to educate people on scams, as quickly as you can do so in the changing environment, by publishing information about what's possible with the latest technology. The solution is not to require onerous identity verification for every software product that could be used by scammers, because they'll just move to the next product that doesn't require it, or they'll simply provide fraudulent documents. Or you'll get "resellers" who provide their own fraudulent KYC documents and then sell access to their account to other criminals on the black market, making it even more difficult to monitor for abuse.
If you want a startup offering such tools to protect people from scams, they can do it by collecting data on what the tools are used for - it should be pretty obvious based on transcripts who is using it to scam people.
[+] [-] godDLL|3 years ago|reply
Got out all kinds. W's, v's, whatnot.
https://playground.play.ht/listen/18373
[+] [-] devmunchies|3 years ago|reply
I also like that the cloud provider supports SSML and I can explicitly configure the emotion, whereas Play.ht dynamically changes the emotion based on the context of the text.
[+] [-] ajani|3 years ago|reply
- I used "create voice", and the page refused to allow me to create anything because the big green button at the bottom was disabled. Only 1 checkbox shows up out of the 3 labels, and I was unable to check even the box that was visible. I used console tools to remove the disabled property from the element and it worked. (I'm using Safari so maybe it doesn't work there properly)
- The generated voice did not sound like me (I used my own voice). It did have some familiar tones, but not really.
- I fiddled with top-p, temperature and voice guidance, but the improvement was minuscule.
- Also recording the voice did not work (did record, but couldn't replay it to verify). So I recorded it on my computer and uploaded a file and that did work.
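For readers unfamiliar with the knobs mentioned above, here is a minimal sketch of how temperature and top-p are commonly applied when sampling from a model's output scores. It is a generic illustration of the technique, not Play.ht's implementation; the function name, defaults, and plain-list input are assumptions for the example.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample an index from logits using temperature scaling and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide the logits before the softmax.
    # Values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the most likely indices until their cumulative
    # probability reaches top_p, and discard the rest.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the kept indices in proportion to their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Both settings only change how random the choice among candidate outputs is. Neither adds information about the speaker, which may explain why adjusting them barely improved the resemblance.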
[+] [-] MattRix|3 years ago|reply
[+] [-] calvinmorrison|3 years ago|reply
[+] [-] girthbrooks|3 years ago|reply