
Launch HN: Play.ht (YC W23) – Generate and clone voices from 20 seconds of audio

459 points | hammadh | 3 years ago

Hey HN, we are Mahmoud and Hammad, co-founders of Play.ht, a text-to-speech synthesis platform. We're building Large Language Speech Models across all languages with a focus on voice expressiveness and control.

Today, we are excited to share beta access to our latest model, Parrot, which can clone any voice from a few seconds of audio and generate expressive speech from text.

You can try it out here: https://playground.play.ht. And there are demo videos at https://www.youtube.com/watch?v=aL_hmxTLHiM and https://www.youtube.com/watch?v=fdEEoODd6Kk.

The model also captures accents well and is able to speak in all English accents. Even more interesting, it can make non-English speakers speak English while preserving their original accent. Just upload a clip of a non-English speaker and try it yourself.

Existing text-to-speech models lack expressiveness, control, or directability of the voice: for example, making a voice speak in a specific way, or emphasizing a certain word or part of the speech. Our goal is to solve these problems across all languages. Since the voices are built on LLMs, they can express emotions based on the context of the text.

Our previous speech model, Peregrine, which we released last September, is able to laugh, scream and express other emotions: https://play.ht/blog/introducing-truly-realistic-text-to-spe.... We posted it to HN here: https://news.ycombinator.com/item?id=32945504.

With Parrot, we've taken a slightly different approach and trained it on a much larger data set. Both Parrot and Peregrine only speak English at the moment but we are working on other languages and are seeing impressive early results that we plan to share soon.

Content creators of all kinds (gaming, media production, elearning) spend a lot of time and effort recording and editing high-quality audio. We solve that and make it as simple as writing and editing text. Our users range from individual creators looking to voice their videos, podcasts, and more, to teams at various companies creating dynamic audio content.

We initially built this product for ourselves, to listen to books and articles online, and found the quality of existing TTS to be very low. We kept working on the product until, eventually, we trained our own models and built a business around it. There are many robotic TTS services out there, but ours lets people generate truly human-level expressive speech and lets anyone clone voices instantly with strong resemblance. We initially used existing TTS models and APIs, but when we started talking to our customers in gaming, media production, and other fields, people didn't like the monotone, robotic TTS style. So we doubled down on training a new model based on the newly emerging architectures using transformers and self-supervised learning.

On our platform, we offer two types of voice cloning: high-fidelity and zero-shot. High-fidelity voice cloning requires around 20 minutes of audio data and creates an expressive voice that is more robust and captures the accent of the target voice with all its nuances. Zero-shot clones the voice with only a few seconds of audio and captures most of the accent and tone, but isn’t as nuanced because it has less data to work with. We also offer a diverse library of over a hundred voices for various use cases.

We offer two ways to use these models on the platform: (1) our text-to-voice editor, which allows users to create and manage their audio files in projects; and (2) our API - https://docs.play.ht/reference/api-getting-started. The API supports streaming and polling, and we are working on reducing latency to make it real-time. We have a free plan and transparent pricing available for anyone to upgrade.
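For the polling mode mentioned above, a client typically submits a job and then checks its status with backoff until the audio is ready. A minimal sketch of that pattern, assuming a hypothetical status payload with a `state` field (the field names and values here are illustrative, not Play.ht's actual response schema, which is documented at the link above):

```python
import time

def poll_until_done(fetch_status, timeout_s=30.0, initial_delay_s=0.5, max_delay_s=4.0):
    """Poll a job-status callable with exponential backoff until it reports
    a terminal state or the timeout elapses. Returns the final status dict."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status()
        if status.get("state") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("TTS job did not finish in time")
        time.sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off to avoid hammering the API

# Fake status source standing in for the real call (e.g. an HTTP GET
# against a job URL returned by the synthesis endpoint):
_states = iter([{"state": "pending"}, {"state": "processing"},
                {"state": "completed", "url": "https://example.com/audio.mp3"}])
result = poll_until_done(lambda: next(_states), initial_delay_s=0.01)
print(result["state"])  # completed
```

Streaming avoids this loop entirely by delivering audio chunks as they are generated, which is the lower-latency option when it is available.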

We are thrilled to be sharing our new model, and look forward to feedback!

458 comments

[+] h1fra|3 years ago|reply
Congrats on launching. People have already given a lot of feedback on the product itself, so I'll keep mine short.

Just a few notes on the UX:

- The record-your-own-voice flow should include a script to read too; that could help improve the quality of the sample, because I struggled to say anything relevant.

- Also on recording: there is no timer, so it's hard to tell when it's okay to stop.

- You enforce the checkbox "not [...] to generate any sexual content", yet you have a filter to display only NSFW content.

- It doesn't work at all with non-English voices; maybe you could add a warning, or a way to fine-tune depending on the language?

- There is no way to delete a voice or an account; that's a huge red flag, especially when dealing with PII like this.

- Another person has said it already, but generated voices are identified by an auto-incrementing ID, making it easy to access another person's PII. I would recommend, at the very least, a random string or a UUID.

- All generated voices are public, with no way to delete them.
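The enumeration concern in the list above is concrete: sequential integer IDs let anyone walk every resource just by incrementing a number in the URL, while random identifiers make guessing a neighbor infeasible. A minimal sketch of the difference (the URL pattern is illustrative, not Play.ht's actual scheme):

```python
import uuid

# Sequential IDs: knowing one URL reveals all the others.
sequential_urls = [f"https://example.com/listen/{i}" for i in range(188, 191)]
print(sequential_urls[1])  # https://example.com/listen/189

# Random v4 UUIDs: ~2^122 possibilities, so adjacent IDs can't be guessed.
random_id = uuid.uuid4()
random_url = f"https://example.com/listen/{random_id}"
print(len(str(random_id)))  # 36 (hex digits plus hyphens)
```

Note that unguessable URLs are only an access-obscurity measure; real per-user authorization checks are still needed on top.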

[+] jeroenhd|3 years ago|reply
Listening to the demos I'm not entirely convinced by this (https://playground.play.ht/listen/189 was pretty funny). I wonder if this company will end up taking down (and subsequently pricing out most people using this tech for fun) arbitrary voice generation just like its competitors have so far.

Going to the demo page and hearing a random snippet of Musk-worship was pretty weird. Out of all audio tracks to place at the top of your demos, you chose this?

[+] mugr|3 years ago|reply
Wow, I appeal to the team behind this: I really STRONGLY think you should at least implement some sort of URL obfuscation (non-guessable IDs). I'm not a web-security expert, but it reminds me of a talk where some company made medical records 'public' exactly like this.
[+] yreg|3 years ago|reply
The demo page says 'Recently generated'; you've just listened to the last snippet someone made.
[+] bongobingo1|3 years ago|reply
I see a bright future for play.ht in the "pre-event" audiolog generation market. Somebody get ubisoft on the phone.
[+] WakoMan12|3 years ago|reply
AI can now generate youtube poops
[+] delgaudm|3 years ago|reply
How do you assert that the cloned voice has been truly permitted by the voice owner? I've had my voice cloned without my consent by other people using Descript and Eleven Labs.

What is your process for verifying consent?

[+] ros86|3 years ago|reply
When I tried this service previously, you had to read (out loud) something saying that you were giving consent.
[+] 1xdevloper|3 years ago|reply
It's mentioned in the second demo video that they have a strict process to prevent cases like yours. I think Descript started asking for identity verification after its service was abused. This one probably has a similar process too.
[+] mikecoles|3 years ago|reply
TIL, the Booth Junkie is on HN. Love your work, sir.
[+] anigbrowl|3 years ago|reply
Hey HN, we are Mahmoud and Hammad

Are you though? You might just be computer-generated.

While I'm very impressed with this technically (and as a pro-audio person I feel validated to see my predictions of a few years back coming true so dramatically), I don't see anything about risk management in here. Your tech absolutely will get used by scammers, given the overabundance of voice data on the open internet. How are you going to hedge against that?

[+] mmkos|3 years ago|reply
Wow, I haven't even thought of that. Imagine this being used together with a chatgpt equivalent. Scam rates are going to go through the roof.
[+] Natfan|3 years ago|reply
This is already being used for scams.

https://playground.play.ht/listen/1079 (https://archive.ph/HKjue)

How exactly do you expect to combat this type of content?

[+] hammadh|3 years ago|reply
The intention of this playground was to let people try the model. We actually have auto-moderation on the user-facing platform (https://play.ht/): malicious text gets blocked and the user gets flagged.
[+] bradleysz|3 years ago|reply
This is not a full solution, just spitballing, but I wonder how effective it would be to have a flagging system built with GPT-4, where the prompt was some form of "This is text submitted to a text-to-voice model. Determine the probability that this is being used maliciously." Then manually review anything that returns >X%.
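The routing half of that idea is straightforward to sketch. Below, a stub scorer stands in for the LLM/moderation call the commenter proposes (the marker list, threshold, and function names are all hypothetical, chosen for illustration):

```python
REVIEW_THRESHOLD = 0.5  # the ">X%" from the comment; would be tuned in practice

def score_malice(text):
    """Stub scorer standing in for an LLM or moderation-API call that
    returns an estimated probability the text is a scam script."""
    scam_markers = ("wire money", "kidnap", "gift card", "don't tell anyone")
    hits = sum(marker in text.lower() for marker in scam_markers)
    return min(1.0, 0.4 * hits)

def route(text):
    """Send high-scoring requests to a human review queue; pass the rest."""
    return "manual_review" if score_malice(text) >= REVIEW_THRESHOLD else "allow"

print(route("Hi mom, wire money to this account or they'll kidnap me"))  # manual_review
print(route("Welcome to chapter three of our audiobook."))               # allow
```

The hard part is not the plumbing but the scorer's false-positive/false-negative tradeoff, which sets how much human review volume the threshold produces.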
[+] spokeonawheel|3 years ago|reply
Sounds like old-school AI; something very similar to the spam problem Google solved could easily take care of this.

Just stop it before they can generate it.

However, it's just a matter of time, so I wouldn't put it on the author to stop this kind of stuff. The only defense is education.

[+] MuffinFlavored|3 years ago|reply
https://play.ht/app/voice-cloning > Clone a voice now

Pops a modal: Try Voice Cloning for Free!

Enter a credit card for $0.00/mo with no other information on screen

Bounce.

Why not let me play around with it a little without asking for a credit card?

[+] joshmn|3 years ago|reply
Playing with this now, wow.

My mom passed away a few years ago. I always let her calls go to my voicemail so I could have them. I was using Google Voice at the time so this worked wonderfully. Unfortunately, I will not listen to many of them — she was an alcoholic and I can't bear to listen to her while drunk. The few I have of her when she's sober I listen to occasionally.

Having said that, this is really nice.

[+] gwerbret|3 years ago|reply
Given the very (very, VERY) obvious concerns associated with malicious deployment of this tech, and the minimal/largely ineffectual countermeasures deployed by the founders, what surprises me the most is that YC gave this startup its stamp of approval. It used to be that they offered at least a basic sanity check to anything they funded. Is this now getting lost as they scale up their funding operations?
[+] nsxwolf|3 years ago|reply
This is going to be the shortest gold rush in history. Make your money now because in a couple years you'll be able to build and deploy your own Play.ht for free with a single ChatGPT prompt.
[+] tanepiper|3 years ago|reply
"Trusted by 7000+ users and teams of all sizes" [posts a bunch of company logos]

You've just launched in beta; how can you claim this? I'm always very suspicious of this (I say this from the position of being a tech lead at a multi-billion-euro retailer whose logo you'll never be able to use).

Is this one developer? A team? Or is this just marketing bullshit for VCs who somehow don't verify if this is true or not?

[+] nanis|3 years ago|reply
What are the "legitimate" use cases for this kind of service where they would expect to make money from individuals who want their voices cloned? Dubbing movies? Audiobooks?
[+] JohnFen|3 years ago|reply
This is a good reminder that we all need to have a "safe word" that we can use to verify to the important people in our life that the voice they may be hearing on the phone or elsewhere is really us.

Get a panicky call from "me" in the middle of the night? If I don't include my safe word, that call isn't from me.

[+] gus_massa|3 years ago|reply
That scam was popular here in Argentina a few years ago. We call it "virtual kidnapping": https://www.fbi.gov/news/stories/virtual-kidnapping. Nobody is kidnapped; it's just a scam using a phone call.

It's not very important that the voice be similar to the supposed victim's. Usually the person on the call is weeping, and it's very difficult to recognize the voice. Moreover, a confusing voice at 2am might be interpreted as any of your relatives or friends, but an exact voice can only be one person, which makes it easier to confirm that that person is actually safe.

[+] mahmoudfelfel|3 years ago|reply
Society definitely needs to adapt to this new norm; we are trying to roll this out as safely as possible, but others are not as careful, and this technology will just become more ubiquitous over time.
[+] TheUndead96|3 years ago|reply
It is frightening that we have gotten to this point already.
[+] mlboss|3 years ago|reply
Very good suggestion
[+] jascii|3 years ago|reply
I'm having a hard time coming up with a non-nefarious use case for this.
[+] selflesssieve|3 years ago|reply
I can’t wait for spoofed messages from my loved ones.
[+] barking_biscuit|3 years ago|reply
What we really need is something on par with this or Eleven Labs that's open source. Then the real fun will begin. At this point I think it's just a matter of time.
[+] Dowwie|3 years ago|reply
I recommend you immediately add identity verification (state-issued ID verification), set up an appropriate secrets store for PII, and audit-trail EVERYTHING your users are doing, storing the contents in a secure location. Yesterday. This service will be used to harm others, shortly. I do think there are exciting, honest things that can be done with this service, but you need to add some friction to its use. Know-your-customer rules are going to apply to this category in short order.

People here are talking about taking this service offline but I think everyone needs to be thinking about countermeasures, working on those services next. The genie is already out of the bottle. The degree of effort to put this together is low enough that it will be replicated around the world.

[+] KaiserPro|3 years ago|reply
Like this example here: https://playground.play.ht/listen/1554 which says:

> "Hi Mom, I need some help. Some guys hit me over the head and put me in a van, and they're saying they'll kill me if you don't wire money to this bank account."

top class.

EDIT: this was about one page down on the "see what people are generating" page

[+] braingenious|3 years ago|reply
How is it that

> I recommend you immediately add identity verification (state-issued identification verification)

and

> The genie is already out of the bottle. The degree of effort to put this together is low enough that it will be replicated around the world.

are thoughts that end up in the same post?

If the genie is out of the bottle, it’s your proposed solution that everybody that runs a model like this implements bank-style KYC?

What do you propose should happen when this sort of software becomes freely available for everyone? When (not if) that happens, what will your suggestion have accomplished?

[+] twodave|3 years ago|reply
While I agree with you, the problem is far bigger than any one company in my opinion. These tools are already accessible enough to individuals that no audio or video is trustworthy, regardless of its source. I suspect we can still detect whether most faked audio/video is authentic or not algorithmically, but that's going to turn into an arms race eventually. And IMO none of the "answers" are ones that you really want to see made real, either.

We're in for some really strange times.

[+] abirch|3 years ago|reply
I'm imagining the legal implications, though I'm not a lawyer. If Granny gets ripped off by someone impersonating me with this site, it seems like Granny could sue Play.ht.

Play.ht will want to have as much information as possible about their users.

[+] digitallyfree|3 years ago|reply
While verification could be done for a cloud service like this one, what's more concerning is that locally run models with this tech will be coming soon (think of LLAMA and Stable Diffusion). KYC is merely a stopgap and honestly we'll need effective solutions for detecting vocal cloning impersonation in the future.
[+] hammadh|3 years ago|reply
Couldn't agree more with your comment. We are working on countermeasures like manual voice verification, a classifier to detect cloned speech, etc. As of now we have auto-moderation in place that detects and blocks hateful/harmful speech.
[+] gsich|3 years ago|reply
Or it will be used for memes.
[+] chatmasta|3 years ago|reply
Gasp! Yawn. HN has become so pearl-clutchingly alarmist recently. Everybody relax.

The solution to scams is to educate people on scams, as quickly as you can do so in the changing environment, by publishing information about what's possible with the latest technology. The solution is not to require onerous identity verification for every software product that could be used by scammers, because they'll just move to the next product that doesn't require it, or they'll simply provide fraudulent documents. Or you'll get "resellers" who provide their own fraudulent KYC documents and then sell access to their account to other criminals on the black market, making it even more difficult to monitor for abuse.

If you want a startup offering such tools to protect people from scams, they can do it by collecting data on what the tools are used for - it should be pretty obvious based on transcripts who is using it to scam people.

[+] devmunchies|3 years ago|reply
How is the latency for real-time TTS? I remember kicking the tires several months back but went with one of the big 3 cloud providers since they had lower latency.

I also like that the cloud provider supports SSML so I can explicitly configure the emotion, whereas Play.ht dynamically changes the emotion based on the context of the text.
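The explicit control the commenter is describing looks like this in practice: the client wraps text in SSML markup (the `speak`, `prosody`, and `emphasis` elements below come from the W3C SSML specification, which the major cloud TTS providers support to varying degrees). A small sketch that composes such a request body; the helper function and its parameters are illustrative, not any provider's actual SDK:

```python
def build_ssml(text, rate="medium", pitch="+0st", stressed_word=None):
    """Compose a minimal SSML document with explicit prosody control,
    optionally wrapping one word in an <emphasis> tag."""
    if stressed_word and stressed_word in text:
        # Emphasize only the first occurrence of the chosen word.
        text = text.replace(
            stressed_word, f'<emphasis level="strong">{stressed_word}</emphasis>', 1)
    return (f'<speak><prosody rate="{rate}" pitch="{pitch}">'
            f'{text}</prosody></speak>')

ssml = build_ssml("I really need this today.", rate="slow", stressed_word="really")
print(ssml)
```

This is the tradeoff the comment points at: SSML is deterministic but manual, while context-driven models infer delivery from the text with no per-word dial to turn.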

[+] ajani|3 years ago|reply
I was excited to see this. My results were not convincing.

- I used "create voice", and the page refused to let me create anything because the big green button at the bottom was disabled. Only one checkbox showed up for the three labels, and I was unable to check even the one that was visible. I used the console tools to remove the disabled property from the element, and then it worked. (I'm using Safari, so maybe it doesn't work properly there.)

- The generated voice did not sound like me (I used my own voice). It did have some familiar tones, but not really.

- I fiddled with top-p, temperature and voice guidance, but the improvement was minuscule.

- Also, recording the voice did not work (it did record, but I couldn't replay it to verify). So I recorded it on my computer, uploaded the file, and that did work.

[+] MattRix|3 years ago|reply
I had this too, but the checkbox was just really small. My assumption was that they made it small so that people had to actually pay attention and read the text? But maybe it's just a bug haha
[+] calvinmorrison|3 years ago|reply
I had a weirder issue. When selecting the different tuning options there was no playback so I went to the final step and the voices joined in a cacophony all at once
[+] girthbrooks|3 years ago|reply
You should do the right thing and eradicate this immediately.