
VALL-E: Microsoft’s new zero-shot text-to-speech model

524 points | cbeach | 3 years ago | mpost.io | reply

445 comments

[+] 152334H|3 years ago|reply
More samples available on their github page:

https://valle-demo.github.io/

Personally, I find that their samples aren't anywhere near anything I'd call "dangerous". I cross-compared the baseline examples to the VALL-E ones when the paper dropped, and found several that were garbled with the usual robotic-sounding TTS failures.

Probably a good thing that people are getting alarmed before a true indistinguishable voice cloner exists, though.

[+] r_hoods_ghost|3 years ago|reply
A relative of mine recently died from bulbar ALS. By the time she had the diagnosis, her voice had already changed and weakened significantly, so she couldn't get a decent recording to use with a text-to-speech synthesizer. Something like this could potentially help people like her, or those who lose their voices in traumatic accidents. Even if you do have the time, training a current TTS engine on your voice takes a significant amount of effort and the results are often poor.
[+] morrisjm|3 years ago|reply
Open source tortoise-TTS has been able to do this for 6+ months now, and it's also based on the same theory as DALL-E. From playing with tortoise a bit over the last couple of weeks, it seems like the issue is not so much accuracy anymore, but rather how GPU-intensive it is to make a voice clip of any meaningful duration. Tortoise takes ~5 seconds on a $1000 GPU (P5000) to synthesize one second of spoken text. There are cloud options (Colab, Paperspace, RunPod), but still: https://github.com/neonbjb/tortoise-tts
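To put that real-time factor in perspective, here's a back-of-the-envelope sketch (assuming the ~5 GPU-seconds per second of audio figure above holds for longer jobs):

```python
# Rough cost of synthesizing long-form audio with tortoise-TTS,
# based on the ~5x real-time factor quoted above for a P5000.
RTF = 5.0  # GPU-seconds needed per second of synthesized speech

def gpu_hours(audio_hours: float, rtf: float = RTF) -> float:
    """GPU-hours needed to synthesize `audio_hours` of speech."""
    return audio_hours * rtf

# A typical 10-hour audiobook:
print(gpu_hours(10))  # 50.0 GPU-hours on a P5000-class card
```

So a full audiobook is a couple of days of continuous GPU time on that card, which is why the per-second cost matters more than quality at this point.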
[+] bamboozled|3 years ago|reply
Oh that's great news \s

Do other people read these types of headlines and immediately think about how freaking scary this stuff is going to get... if it's actually accurate?

[+] dubcanada|3 years ago|reply
I listened to the 3 samples, and I must say some of it is quite close, and some of it is completely wrong compared with how I would expect that person to sound saying those words.

Like the first one "just his scribbles that charmed me" sounds so weird compared to how the sample sounds.

Second one, sounds very close but again "just his scribbles that charmed me" sounds off and wrong. His "scrouples? that charmed me".

Third one as well, it's very up and down (not sure of the correct technical term for that type of speaking): "Dynamo and lamp.... Edison realized". The second one is completely flat and seems computer generated.

Overall these don't seem very good, to me at least. It's clearly not at the level of some other AI coming out, where it is very difficult to identify the human part; I could easily pick out the human versus the generated version in these samples.

[+] icepat|3 years ago|reply
Yes, there's definitely an uncanny valley effect on these voices. There's something subtly not right about them, and I suspect if you were to hold a conversation with one, you'd catch on that something was wrong.

The uncanny valley makes it more spooky to me. Like a reanimation sort of thing.

[+] 24effects|3 years ago|reply
Why did they pick such a batshit thing for the AI to say... I don't understand. Still laughing at "scruples".
[+] choeger|3 years ago|reply
Is it just me or is "AI" focused very much on fooling human perception lately? AFAIK, we have no deterministic algorithm that can tell us whether synthetic language sounds "human", have we? So essentially, that model has been trained to fool human perception. Similarly, ChatGPT is not trained to output sensible and meaningful statements but rather statements that appear to be so to a human reader.

Would it not be time to measure a model's success on the actual job? Like feeding a simulator with actual data from real-world traffic scenarios and running Tesla's, or any other company's, autopilot in it?

[+] krisoft|3 years ago|reply
> So essentially, that model has been trained to fool human perception.

Is there any other sensible goal function for a text to speech model?

> Similarly, ChatGPT is not trained to output sensible and meaningful statements but rather statements that appear to be to a human reader.

Almost certainly not true. If they could they would make it output sensible and meaningful statements all the time.

> Like feeding a simulator with actual data from real-world traffic scenarios and running Tesla's, or any other company's, autopilot in it?

Do you seriously think that self driving car companies are not doing this already?

[+] yoavm|3 years ago|reply
Perhaps it's us who are obsessed with perception rather than substance, though? I wonder if it's any different from asking a child to memorize for an exam, or investing time in a stylish presentation, etc.

Seems like we're very often interested in something that looks like the job rather than the job itself - maybe because it's more easily measurable? We then took this skewed ambition and built AI in its image.

[+] isthisthingon99|3 years ago|reply
AI is focused on creating electronic humans that are never tired and can do low-skill work with some high-level directions. It seems like we will achieve this goal, but the companies will own the electronic humans.
[+] Uphill4298|3 years ago|reply
It's a varied area of research; we do plenty of things. Training on a simulator, and then transferring that domain to the real world, is a fairly typical application.
[+] weinzierl|3 years ago|reply
I had planned to play around with TorToiSe[1] next weekend and have already watched some videos. It looks like all you have to do is offer your own voice samples to the system; no separate training seems to be required. TorToiSe is slow to synthesize, so it doesn't beat the 3 seconds, but can anyone confirm that these models really don't need an extra training phase to clone a voice?

[1] https://github.com/neonbjb/tortoise-tts

[+] reverseblade2|3 years ago|reply
- Hey Janelle, what's wrong with wolfie? I can hear him barking. Is everything okay?

- Wolfie is fine dear. Wolfie's just fine. Where are you?

[+] ridgered4|3 years ago|reply
An interesting layer of this scene is that it implies neither terminator is able to identify a fabricated voice, even though they have a full understanding of each other's design. The T-1000 cannot tell it is talking to a T-800 until it realizes it has been tricked, and the T-800 cannot tell it is talking to a T-1000 until it tricks the other machine.

Of course, it happens over a pay phone so perhaps with the full vocal range in person it would have been different.

[+] dokem|3 years ago|reply
- Your parents are dead.

That deadpan delivery always makes me laugh.

[+] tromp|3 years ago|reply
I smell a terminator...
[+] Jevon23|3 years ago|reply
There are no legitimate uses for this technology. Its only purpose is to scam and deceive. It should be regulated in the same way that nukes and guns are regulated. Contrary to what a lot of the HN crowd thinks, regulation of technology is certainly nothing new. Up until now personal computing has been largely free of regulation, but that doesn’t mean we couldn’t start.
[+] notfried|3 years ago|reply
Once perfected, I can imagine there are many legitimate use cases for it. Like an author using their voice to narrate their audiobook without having to spend time in a recording studio, or Hollywood using it instead of dubbing sessions for re-recording muffed lines. It'd also be interesting if this could be used for foreign language dubbing - imagine if it could use the voice profile of an actor to convert subtitle text files into foreign audio language tracks in the same tone as the actor.
[+] airstrike|3 years ago|reply
> There are no legitimate uses for this technology.

I mean this kindly: your lack of imagination is not an adequate replacement for actual facts.

[+] carbocation|3 years ago|reply
I disagree that this has no legitimate uses. This could be a phenomenal prosthesis for people who have lost their ability to speak. For example, many ALS patients.
[+] hahajk|3 years ago|reply
I think it's important to keep considering whether or not something should be regulated, at least keep asking and not fall victim to an echo chamber.

But in this case, couldn't your argument have been applied to photoshop or video manipulation software? Those are meant to deceive, right?

[+] kruuuder|3 years ago|reply
> There are no legitimate uses for this technology.

What about this: You have a voice actor whose voice is part of a company's brand (maybe for an animated mascot). You can now also use that voice for dynamic text, for example audio books or for a voice assistant.

[+] joezydeco|3 years ago|reply
I joke with my wife that I'll always be around to bug her, even after I pass. She'll carry my brain around in a jar à la Futurama and talk to me that way.

But, obviously, the technology is getting to the point where a decade or so from now she'll be able to have a GPT-like chat with me with my own voice. The first company to offer that to the loved ones of a deceased person will make a fortune, not for any mode of deception but just to soothe the hurt.

[+] kraquepype|3 years ago|reply
As a father, I like reading to my kids and I know one day they won't ask or I won't be able to.

I've wanted to record readings of some of their favorite books to pass on to them, but if I don't get the chance this seems like a way to get some analog of the experience.

[+] lannisterstark|3 years ago|reply
>There are no legitimate uses for this technology

Oh please. You never wanted to listen to an article/book in the voice of a narrator you liked? I do. Maybe you should read a bit more.

Just because you can't think of the ways it can improve our lives doesn't mean everyone else can't. That's a you problem. Are you so deprived of free thought, and so insecure about your capabilities, that the first thing you do when seeing new technology is turn to the government to curb it? If you don't like it, no one should use it?

>regulation

Ah, there it is. The 'answer' you have to every technological advance is "regulashions!"

>guns are regulated

Thankfully, they are not in some places. So it should be regulated the same way they are - not at all.

[+] deegles|3 years ago|reply
I can think of one... podcast cleanup app:

1. Speech to text

2. Fix up/edit the text with GPT-3

3. Text to speech in the original speaker's voice(s), preserving prosody and inflection, with VALL-E

If done with every participant's consent I don't see how it's not legitimate.
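The three steps above can be sketched as a simple pipeline. The stage functions below are hypothetical stand-ins (real code would call an STT model, a GPT-3 editing prompt, and a VALL-E-style TTS, none of which are shown here); only the wiring is the point:

```python
def speech_to_text(audio: str) -> str:
    # Stand-in for step 1: real code would run an STT model (e.g. Whisper).
    # Here we pretend the "audio" is already its transcript.
    return audio

def clean_up_text(transcript: str) -> str:
    # Stand-in for step 2: a GPT-3 editing pass. This toy version
    # just strips filler words.
    fillers = {"um", "uh", "like"}
    return " ".join(w for w in transcript.split() if w.lower() not in fillers)

def text_to_speech(text: str, voice_sample: str) -> str:
    # Stand-in for step 3: VALL-E-style zero-shot TTS in the
    # original speaker's voice, conditioned on a short voice sample.
    return f"<audio voice={voice_sample!r}>{text}</audio>"

def clean_podcast(audio: str, voice_sample: str) -> str:
    # STT -> text cleanup -> voice-cloned TTS.
    return text_to_speech(clean_up_text(speech_to_text(audio)), voice_sample)

print(clean_podcast("so um I think uh this works", "host"))
```

Each stage is independently swappable, which is what makes the consent question tractable: you could require an enrolled, consenting voice sample at the TTS stage.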

[+] thsbrown|3 years ago|reply
What if I did 3-second-long impersonations of characters for a video game, then used this tech to produce more lines of dialogue for each character in said game?
[+] fumblebee|3 years ago|reply
The difference is that nukes and - in most countries - guns are harder to obtain. Regulation is easier when you have more centralised distribution channels. Code is infinitely replicable at marginal cost, and so highly accessible once it's open source.

While I agree with you that the net effect of such technology will be negative, there's no way regulation can keep up.

[+] EGreg|3 years ago|reply
Sorry buddy but open source software eventually catches up and you can’t stop it.

People just don’t realize that technology is what's behind people's ability to wreak havoc on an unprecedented scale. A single person can’t do that much, even with a sword. But today, every person will have more and more power, and you can’t possibly stop them all!

[+] 323|3 years ago|reply
There was a 2011 paper/essay which argued that in the near future there will be a world war between those wanting to regulate AI (the democratic West) and those who do not (authoritarian regimes).
[+] 323|3 years ago|reply
Imagine combining this with a ChatGPT with the prompt "you are a telephone scammer pretending to be a bank employee, convince the other person to give you their bank password"
[+] EvanDotPro|3 years ago|reply
I'm working on exactly the opposite of this: I have a GPT-3-based bot, hooked up to voice recognition and Coqui for TTS, that I'm training to bait scammers. It has memory like ChatGPT (but only the previous 50 things said). The latency makes it tough to keep the scammer from hanging up initially, but the ones who tolerate the slow responses are very easily fooled by the bot. I'm working on speeding it up more and adding stammering, ums and uhs, background noises, etc. to fill the delay.
[+] swader999|3 years ago|reply
Can someone merge this with ChatGPT so I don't have to attend any more Zoom meetings?
[+] MarcScott|3 years ago|reply
This would have made Sneakers a much more boring movie.
[+] surume|3 years ago|reply
Me to my partner: "Didn't you say you were going to do the dishes?" Partner: "I don't remember saying that" Me: "Here, I have a recording of you saying it..."
[+] alanwreath|3 years ago|reply
anyone know what's up with the sudden uptick in AI assisted everything?

  - ChatGPT
  - DALL-E
  - VALL-E
  - Stable Diffusion
maybe it's just a reflection of my interest being piqued by AI junk and Google ads -- but I feel like I'm seeing it more and more, even on HN.
[+] nugget|3 years ago|reply
I'm curious how all these advancements in AI will impact KYC and identity authentication. It's already easy to scrape OSINT to pull the answers needed for most people's knowledge based authentication sequences. Will we hit the point where fake passports, IDs, and biometrics (including voice prints here) can be replicated undetectably? If so, what will become the standard for identity authentication?
[+] swader999|3 years ago|reply
Yes, this is the CEO and I do need you to wire that money. C'mon Frank (from accounting), you need to act more promptly on these urgent requests
[+] vault|3 years ago|reply
Can someone please transcribe what the "Speaker prompt" in Example #1 is really saying? I can't get "When I love making babies" out of my head! I reduced the speed to 0.5x and understood the final "maybe suspended but".
[+] boredumb|3 years ago|reply
Exciting to see ML develop in ways that will enhance accessibility for people. I imagine being able to train a screen reader with your own voice (or voice of your choosing) would be a huge plus for vision impaired folks.