AI speech generator 'reaches human parity' – but it's too dangerous to release

jd115|1 year ago

I'm old enough to remember some number of months ago when GPT2 was described as "too dangerous to release".

Remember when the PlayStation 2 was "technically a supercomputer" and taking LSD a certain number of times made you insane? Great moments in marketing history

Paul-Craft|1 year ago

Huh. Today, a college kid without even any underlying knowledge of the math can train their own GPT-2-level language model as a semester project.

devmor|1 year ago

I would say that this take was correct, just not in the way the detractors at the time intended. The danger was to the usefulness of the internet.

I have yet to see any benefit to society from GPT's improvements, but I do see the internet quickly becoming more and more unusable due to the inundation of machine-generated spam on nearly every communications platform.

rmorey|1 year ago

“some number” would be 65. things change

thevillagechief|1 year ago

Ah, the old "it's too dangerous to release" marketing move. Why even tell us about it?

cuddlyogre|1 year ago

For the extra dollars they'll obtain when it is no longer too dangerous to release.

HeatrayEnjoyer|1 year ago

Not everything is a conspiracy. It is very plausible they're speaking genuinely.

htrp|1 year ago

https://arxiv.org/pdf/2406.05370

The model in question is Microsoft Vall-E2 without the click bait headline.

rentonl|1 year ago

Of course, this technology must only stay in the hands of our trusted corporate overlords.

devmor|1 year ago

Unless of course, you're willing to pay a monthly subscription fee for access to it. Then the danger has been mitigated.

carapace|1 year ago

https://arxiv.org/abs/2406.05370

> Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases

https://www.microsoft.com/en-us/research/project/vall-e-x/va...

> This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.

If you go back and look at older cities they almost all have the same pattern: walls and gates.

I figure now that the Internet is a badlands roamed by robots pretending to be people as they attempt to rob you for their masters, we'll see the formation of cryptologically-secured enclaves. Maybe? Who knows?

At this point I'm pretty much going to restrict online communication to encrypted authenticated channels. (Heck, I should sign this comment, eh? If only as a performance?) Hopefully it remains difficult to build an AI that can guess large numbers. ;P

chx|1 year ago

Things are progressing just as https://youtu.be/xoVJKj8lcNQ predicted.

> so 2024 will be the last human election and what we mean by that is not that it's just going to be an AI running as president in 2028 but that will really be although maybe um it will be you know humans as figureheads but it'll be Whoever greater compute power will win

We saw already AI voices influencing elections in India https://restofworld.org/2023/ai-voice-modi-singing-politics/

> AI-generated songs, like the ones featuring Prime Minister Narendra Modi, are gaining traction ahead of India’s upcoming elections. [...] Earlier this month, an Instagram video of Modi “singing” a Telugu love song had over 2 million views, while a similar Tamil-language song had more than 2.7 million. A Punjabi song racked up more than 17 million views.

BadHumans|1 year ago

Unfortunately in the US there is a political party that is attacking education and doesn't want people to learn critical thinking skills at a time when critical thinking is sorely needed. They happen to really "love the poorly educated" for some reason.

Liquix|1 year ago

classified tech is generally at least 10 years ahead of anything the public has access to. judging by how bizarre and polarizing the previous two US elections have been, i wouldn't be surprised if this prediction had already played out and we just didn't know it yet

bitshiftfaced|1 year ago

Too dangerous for PR reasons at least until after November.

AnimalMuppet|1 year ago

If November is the reason, well, this isn't the last November with that concern...

dspillett|1 year ago

People saying “too dangerous to release” usually means one (or more) of three things:

1. “… but if you and your big rich company were to acquihire us you'd get access…” — though as this is MS it probably isn't that!

2. That is only works as well as claimed in specific circumstances, or has significant flaws, so they don't want people looking at it too closely just yet. The wordage “in benchmarks used by Microsoft” might point to this.

3. That a competitor is getting close to releasing something similar, or has just done so, and they don't want to look like they were in second place or too far behind.

abroadwin|1 year ago

Relevant research post from Microsoft: https://www.microsoft.com/en-us/research/project/vall-e-x/va...

digitalsushi|1 year ago

weakest link, if they dont release this, someone else will release one. every time someone noble invents another gen ai toy/weapon, they lock it down with post filters so it cant be used for evil, and then a second person forks it, pops the safeties off, and tells the world to go nuts.

social solutions take too long to use against the tech, but tech solutions are fallible. to be defeatist about it, there's going to be a golden window of time here where some really nasty scams have no impedance.

cuddlyogre|1 year ago

If you can't trust the mail, the phone, or your computer, make them come to your house with the sheriff.

It's almost as if a consumer protection agency should be created and funded to protect consumers.

spywaregorilla|1 year ago

I really want something that can do a voice change and match the emotion and articulation of a voice clip that I provide. I don't care (or want) it to be based off a real person and the manners in which they would tend to articulate a sentence. Are there any decent open models out there?

woodson|1 year ago

Try StyleTTS2. You will still have to experiment with the settings a little to get the right level of adherence to the reference speaker’s voice and the emotion content.

pphysch|1 year ago

Speech generation has gotten really good, but there's simply no way to faithfully recreate someone's vocal idiosyncracies and cadence with just "a few seconds" of real audio. That's where the models tend to fall short.

michaelbuckbee|1 year ago

This was my thought as well, but someone pointed out to me that regional accent identification captures a large percentage of cadence and inflection differences (specific word choices and turns of phrase obviously would still not be there).

Melomololotolo|1 year ago

I don't think it's hard to get more than a few seconds of voice from many people.

'hi, sry to call you I'm Cindy and I'm from your insurance. I'm calling regarding your car crash ...'.

criddell|1 year ago

Few seconds means less than a minute. That’s not nothing. Look at a clock and talk for a minute — it’s longer than you might think.

Do you think you could give a recording of a minute of someone talking to a talented impressionist and they could impersonate that person to some degree? It doesn’t seem that far fetched to me.

ChrisArchitect|1 year ago

Project page: https://www.microsoft.com/en-us/research/project/vall-e-x/va...

> "This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public."

exe34|1 year ago

I too have an agi in my basement, but it's too dangerous to release! wanna give me some cash?

coeneedell|1 year ago

These samples are terrible when compared to commercially released models like from eleven labs or playht. This is an extension of an interesting architecture but currently those more traditionally based models are way more convincing.

BoredPositron|1 year ago

I can't wait until the free base models get better. The floods on tiktok, shorts and stories with the standard eleven labs voice is getting nauseating.

mensetmanusman|1 year ago

A gun can help rob a bank.

A speech generator can help rob 1000 banks.

exe34|1 year ago

it might help if the dumb dumbs at the bank would stop trying to make me say "my voice is my password". I've been careful to only say "no fcuk off you fcuking numpty who came up with this idea after voice cloning hit the mainstream".

AlexDragusin|1 year ago

"Too dangerous to release" - I could use that line to promote my services :)

mcpar-land|1 year ago

I can believe a speech generator too good to release, but not even a perfect algorithm can get every one of your inflections and verbal tics with just a few seconds of sample material. Makes me think the whole thing is bs. I instantly see any "ooh our thing we are making on purpose is so dangerous oohhh" as an attempt at regulatory capture until I see proof of the danger.

hi_dang_|1 year ago

The classic Steven Seagal “my hands are weapons and I need a license for them” rhetoric. What a crock of shit.

unknown|1 year ago

[deleted]

thedonkeycometh|1 year ago

[deleted]

Slyfox33|1 year ago

Riiiight

zazazache|1 year ago

What is the point of them trying to create this? That something like this would mostly be used to create disinformation and create chaos is easily understood before making something like this.

Truly irresponsible

rhdunn|1 year ago

There are legitimate uses of this tech, such as preserving voices of people losing them such as Stephen Hawking, or making it better for blind/low vision people to follow text and interact with devices. For that latter case having a more natural voice that is also accurate is a good thing.

I use TTS to listen to articles and stories that don't have access to an audiobook narrator. I've used some of the voices based on MBROLA tech, but those can grate after a while.

The more recent voice models are a lot higher quality and emotive (without the jarring pitch transitions of things like Cepstral) so are better to listen to. However, the recent models can clip/skip text, have prolonged silence, have long/warped/garbled words, etc. that make them harder to use longer term.

Paul-Craft|1 year ago

You're right, of course. Unfortunately, however, we're all just actors in a giant, multi-player, iterated Prisoner's Dilemma here. If I decide not to pursue human-level automated speech generation, or I end up developing it and don't release because it's "too dangerous," someone else will just come in behind me and take all that market share I could have captured.

It's like we're stuck in some movie that came out in 1994[0], or something. Except, in this version, everything is gonna up sooner or later, anyway. Might as well profit from it along the way, right?

Le sigh.

---

[0]: https://www.imdb.com/title/tt0111257/

ninjanomnom|1 year ago

At least one good use is for video games where the text of some dialogue is determined when you run the game. For example in a game I work on player chat is local and voiced by tts configured by the player for their character.

HeatrayEnjoyer|1 year ago

Move fast and break things (including organized society).

I can't even think of non malicious uses that are anything more than novelty or small conveniences. Meanwhile the malicious use cases are innumerable.

In a just world building this would be a severe felony, punished with prison and destruction of all of the direct and indirect source material.

mensetmanusman|1 year ago

Cancer that takes someone’s voice.

pragma_x|1 year ago

Agreed.

On the one hand, I would love this kind of tech to be available for entertainment purposes. An RPG with convincing NPCs that are able to provide a novel experience for every player? Sounds great.

On the other: this is fraught with ethical problems, not to mention an ideal tool for fraud. At worst, it could be used as a weapon for total asymmetrical warfare on concepts like media integrity and an ideal tool for character assassination; disinformation, propaganda, etc.

I would happy welcome a world where this stuff is nerfed across the board, where videogames and porn are just chock full of AI voice-acting artifacts. We'll adjust and accept that as just a part of the experience, as we have with low fidelity media of the past. But my more cynical side tells me that's not what people in power are concerned about.

ryandrake|1 year ago

This is what happens when you have an industry full of people "looking for challenging problems to solve" without an ethical foundation to warn them that just because you can build something doesn't mean you should.

unraveller|1 year ago

The point is to spawn a new medium, you'll have to imagine harder how positive that could be as people with lots of ideas are not going to give them to you for free.

Perfecting the tech for wide-spread use has trade offs; need for caller id, ease of slandering until trust in voice uniqueness recalibrates, all of which is going to change soon anyway but giving only rich/bad actors the tech at first has its own set of trade offs. Head in the sand is the irresponsible way.

64 comments