Remember when the PlayStation 2 was "technically a supercomputer" and taking LSD a certain number of times made you insane? Great moments in marketing history
I would say that this take was correct, just not in the way the detractors at the time intended. The danger was to the usefulness of the internet.
I have yet to see any benefit to society from GPT's improvements, but I do see the internet quickly becoming more and more unusable due to the inundation of machine-generated spam on nearly every communications platform.
> Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases.
> This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public.
If you go back and look at older cities they almost all have the same pattern: walls and gates.
I figure now that the Internet is a badlands roamed by robots pretending to be people as they attempt to rob you for their masters, we'll see the formation of cryptologically-secured enclaves. Maybe? Who knows?
At this point I'm pretty much going to restrict online communication to encrypted authenticated channels. (Heck, I should sign this comment, eh? If only as a performance?) Hopefully it remains difficult to build an AI that can guess large numbers. ;P
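The idea of signing a comment can be sketched with a toy example. This uses an HMAC (a symmetric construction) purely for illustration; a real signed comment would use a public-key signature scheme such as Ed25519, and the secret here is made up:

```python
import hmac
import hashlib

# Toy illustration only: HMAC is symmetric, not the public-key signature
# the comment has in mind, but it shows the shape of the idea — authenticity
# rests on a secret that is infeasible for anyone (or any AI) to guess.
SECRET = b"a-large-random-number-only-I-know"  # hypothetical secret

def sign(message: bytes) -> str:
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()

def verify(message: bytes, tag: str) -> bool:
    # compare_digest avoids leaking information via timing differences
    return hmac.compare_digest(sign(message), tag)

comment = b"restrict online communication to encrypted authenticated channels"
tag = sign(comment)
assert verify(comment, tag)
assert not verify(b"tampered comment", tag)
```

Anyone holding the secret can verify the tag; an impostor who can clone your voice but not your key still can't forge the signature.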
> so 2024 will be the last human election and what we mean by that is not that it's just going to be an AI running as president in 2028 but that will really be although maybe um it will be you know humans as figureheads but it'll be Whoever greater compute power will win
> AI-generated songs, like the ones featuring Prime Minister Narendra Modi, are gaining traction ahead of India’s upcoming elections. [...] Earlier this month, an Instagram video of Modi “singing” a Telugu love song had over 2 million views, while a similar Tamil-language song had more than 2.7 million. A Punjabi song racked up more than 17 million views.
Unfortunately in the US there is a political party that is attacking education and doesn't want people to learn critical thinking skills at a time when critical thinking is sorely needed. They happen to really "love the poorly educated" for some reason.
classified tech is generally at least 10 years ahead of anything the public has access to. judging by how bizarre and polarizing the previous two US elections have been, i wouldn't be surprised if this prediction had already played out and we just didn't know it yet
People saying “too dangerous to release” usually means one (or more) of three things:
1. “… but if you and your big rich company were to acquihire us you'd get access…” — though as this is MS it probably isn't that!
2. That it only works as well as claimed in specific circumstances, or has significant flaws, so they don't want people looking at it too closely just yet. The wording “in benchmarks used by Microsoft” might point to this.
3. That a competitor is getting close to releasing something similar, or has just done so, and they don't want to look like they were in second place or too far behind.
Weakest link: if they don't release this, someone else will release one. Every time someone noble invents another gen-AI toy/weapon, they lock it down with post filters so it can't be used for evil, and then a second person forks it, pops the safeties off, and tells the world to go nuts.
Social solutions take too long to deploy against the tech, but tech solutions are fallible. To be defeatist about it: there's going to be a golden window of time here where some really nasty scams have no impediment.
I really want something that can do a voice change and match the emotion and articulation of a voice clip that I provide. I don't want (or care for) it to be based on a real person and the manner in which they would tend to articulate a sentence. Are there any decent open models out there?
Try StyleTTS2. You will still have to experiment with the settings a little to get the right level of adherence to the reference speaker’s voice and the emotion content.
Speech generation has gotten really good, but there's simply no way to faithfully recreate someone's vocal idiosyncrasies and cadence with just "a few seconds" of real audio. That's where the models tend to fall short.
This was my thought as well, but someone pointed out to me that regional accent identification captures a large percentage of cadence and inflection differences (specific word choices and turns of phrase obviously would still not be there).
Few seconds means less than a minute. That’s not nothing. Look at a clock and talk for a minute — it’s longer than you might think.
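For a sense of scale, a minute of speech is a lot of raw material. A rough back-of-envelope (assuming a 16 kHz mono sample rate and ~150 words per minute, both typical figures but assumptions here):

```python
# Back-of-envelope: how much data does one minute of speech carry?
SAMPLE_RATE_HZ = 16_000   # common rate for speech models (assumption)
SECONDS = 60
WORDS_PER_MINUTE = 150    # typical conversational speaking rate (assumption)

samples = SAMPLE_RATE_HZ * SECONDS
print(samples)             # 960000 raw audio samples
print(WORDS_PER_MINUTE)    # on the order of 150 words of phonetic material
```

Nearly a million samples covering a hundred-plus words is plenty of exposure to a speaker's pitch range, pacing, and common phoneme realizations.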
Do you think you could give a recording of a minute of someone talking to a talented impressionist and they could impersonate that person to some degree? It doesn’t seem that far fetched to me.
> "This page is for research demonstration purposes only. Currently, we have no plans to incorporate VALL-E 2 into a product or expand access to the public."
These samples are terrible compared to commercially released models like those from ElevenLabs or PlayHT. This is an extension of an interesting architecture, but currently those more traditionally based models are far more convincing.
I can't wait until the free base models get better. The flood of TikToks, Shorts, and Stories with the standard ElevenLabs voice is getting nauseating.
it might help if the dumb dumbs at the bank would stop trying to make me say "my voice is my password". I've been careful to only say "no fcuk off you fcuking numpty who came up with this idea after voice cloning hit the mainstream".
I can believe a speech generator too good to release, but not even a perfect algorithm can get every one of your inflections and verbal tics with just a few seconds of sample material. Makes me think the whole thing is bs. I instantly see any "ooh our thing we are making on purpose is so dangerous oohhh" as an attempt at regulatory capture until I see proof of the danger.
What is the point of trying to create this? It's easy to understand, before building something like this, that it would mostly be used to create disinformation and sow chaos.
There are legitimate uses of this tech, such as preserving the voices of people who are losing them, as in Stephen Hawking's case, or making it easier for blind/low-vision people to follow text and interact with devices. For the latter case, having a more natural voice that is also accurate is a good thing.
I use TTS to listen to articles and stories that don't have access to an audiobook narrator. I've used some of the voices based on MBROLA tech, but those can grate after a while.
The more recent voice models are a lot higher quality and emotive (without the jarring pitch transitions of things like Cepstral) so are better to listen to. However, the recent models can clip/skip text, have prolonged silence, have long/warped/garbled words, etc. that make them harder to use longer term.
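One common mitigation for the clipping/skipping problem (my own sketch, not something the comment proposes) is to split long articles into sentence-sized chunks before synthesis, so a glitch only garbles one sentence rather than derailing a long passage:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive sentence splitter: break after ., !, or ? followed by whitespace.
    # Real pipelines would use a proper tokenizer; this only shows the shape.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

article = "First sentence. Second one! A third? Done."
chunks = split_sentences(article)
print(chunks)  # ['First sentence.', 'Second one!', 'A third?', 'Done.']
```

Each chunk is then synthesized independently, and a failed chunk can be retried without regenerating the whole article.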
You're right, of course. Unfortunately, however, we're all just actors in a giant, multi-player, iterated Prisoner's Dilemma here. If I decide not to pursue human-level automated speech generation, or I end up developing it and don't release because it's "too dangerous," someone else will just come in behind me and take all that market share I could have captured.
It's like we're stuck in some movie that came out in 1994[0], or something. Except, in this version, everything is gonna blow up sooner or later, anyway. Might as well profit from it along the way, right?
At least one good use is video games where the text of some dialogue is only determined at runtime. For example, in a game I work on, player chat is local and voiced by TTS configured by the player for their character.
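A per-character voice configuration like the one described might look something like this (all field names are hypothetical illustrations, not taken from any real game):

```python
from dataclasses import dataclass

@dataclass
class VoiceConfig:
    # Hypothetical per-character TTS settings a player might choose.
    voice_id: str         # which base voice the player picked
    pitch_shift: float    # semitones up/down from the base voice
    speaking_rate: float  # 1.0 = normal speed

def describe(cfg: VoiceConfig) -> str:
    return f"{cfg.voice_id} (pitch {cfg.pitch_shift:+.1f}, rate {cfg.speaking_rate:.2f})"

gruff_dwarf = VoiceConfig(voice_id="baritone_a", pitch_shift=-3.0, speaking_rate=0.9)
print(describe(gruff_dwarf))  # baritone_a (pitch -3.0, rate 0.90)
```

The engine would feed each incoming chat line plus the speaker's `VoiceConfig` into whatever TTS backend it uses, keeping every character's voice consistent without recording a single line.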
On the one hand, I would love this kind of tech to be available for entertainment purposes. An RPG with convincing NPCs that are able to provide a novel experience for every player? Sounds great.
On the other: this is fraught with ethical problems, not to mention an ideal tool for fraud. At worst, it could be used as a weapon for total asymmetrical warfare on concepts like media integrity and an ideal tool for character assassination; disinformation, propaganda, etc.
I would happily welcome a world where this stuff is nerfed across the board, where video games and porn are just chock full of AI voice-acting artifacts. We'll adjust and accept that as just part of the experience, as we have with low-fidelity media of the past. But my more cynical side tells me that's not what the people in power are concerned about.
This is what happens when you have an industry full of people "looking for challenging problems to solve" without an ethical foundation to warn them that just because you can build something doesn't mean you should.
The point is to spawn a new medium. You'll have to imagine harder how positive that could be; people with lots of ideas are not going to give them to you for free.
Perfecting the tech for widespread use has trade-offs: the need for caller ID, the ease of slander until trust in voice uniqueness recalibrates. All of that is going to change soon anyway, but giving only rich/bad actors the tech at first has its own set of trade-offs. Head in the sand is the irresponsible way.
The model in question is Microsoft's VALL-E 2, without the clickbait headline.
https://www.microsoft.com/en-us/research/project/vall-e-x/va...
We've already seen AI voices influencing elections in India: https://restofworld.org/2023/ai-voice-modi-singing-politics/
It's almost as if a consumer protection agency should be created and funded to protect consumers.
"Hi, sorry to call you, I'm Cindy and I'm from your insurance. I'm calling regarding your car crash ...".
A speech generator can help rob 1000 banks.
Truly irresponsible
Le sigh.
---
[0]: https://www.imdb.com/title/tt0111257/
I can't even think of non-malicious uses that are anything more than novelties or small conveniences. Meanwhile, the malicious use cases are innumerable.
In a just world building this would be a severe felony, punished with prison and destruction of all of the direct and indirect source material.