Weird A.I. Yankovic: a cursed deep dive into the world of voice cloning

328 points| waxpancake | 2 years ago |waxy.org

198 comments

mecredis|2 years ago

It's kind of wild that these tools just transfer a copy of these models every time they're spun up (whether it's to a Google Colab notebook or a local machine.)

This must mean Hugging Face's bandwidth bill is crazy, or am I missing something? (Maybe they have a peering agreement, or are heavily caching things?)

satertek|2 years ago

Their Python module caches the downloads and checks that cache before downloading them again... but you're probably not wrong about the crazy bandwidth bill. Looks like they have crazy VC money though, considering the current climate.

civilitty|2 years ago

Unmetered 10+ gigabit connections were on the order of $1/mbit/mo wholesale over a decade ago, when I priced out a custom CDN. So for the cost of 100 TB of data transfer out of AWS, you could get a 24/7 sustained 10 Gbit/s link (>3 PB per month at 100% utilization).

Bandwidth has always been crazy cheap.
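
A back-of-the-envelope check of those figures, as a minimal sketch. The link speed and the $1/mbit/mo price come from the comment above; the 30-day month and the function names are assumptions made for illustration.

```python
# Sanity check of the "10 Gbit/s is >3 PB per month" claim.
# Assumptions: a 30-day month and decimal units (1 PB = 1e15 bytes).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def monthly_transfer_pb(link_gbit_per_s: float, utilization: float = 1.0) -> float:
    """Petabytes moved in a month over a link at the given utilization."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8  # bits -> bytes
    return bytes_per_second * SECONDS_PER_MONTH * utilization / 1e15


def monthly_cost_usd(link_gbit_per_s: float, price_per_mbit_month: float) -> float:
    """Monthly cost of a link billed per Mbit/s of capacity."""
    return link_gbit_per_s * 1000 * price_per_mbit_month


if __name__ == "__main__":
    print(f"{monthly_transfer_pb(10):.2f} PB/month at 100% utilization")
    print(f"${monthly_cost_usd(10, 1.0):,.0f}/month at $1/mbit/mo")
```

At the quoted wholesale price, the 10 Gbit/s link works out to roughly $10,000 a month for about 3.24 PB of potential transfer.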

anonylizard|2 years ago

Hugging Face has a strategic partnership with AWS.

1. AWS is far behind Azure and GCP in AI, so they gotta partner up to gain credibility.

2. Hugging Face probably does face insane bills compared to GitHub. But AWS can probably develop some optimizations to save bandwidth costs. There's 100% some sort of generalized differential storage method being developed for AI models.

toddmorey|2 years ago

Is Hugging Face just a model repository, the way GitHub is a code repository? It seems you can rent compute (both CPU and GPU), but you are right that most models seem to be run elsewhere.

pdntspa|2 years ago

I really wish I could configure this crap to cache somewhere other than my C: drive

Or better yet, how about asking me where I want to store my models?
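
For what it's worth, the cache location appears to be configurable through environment variables. A minimal sketch, assuming the lookup order documented for the `huggingface_hub` library (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`, then `~/.cache/huggingface`). The helper `resolve_hub_cache` is hypothetical: it only mirrors that order for illustration and is not the library's own function.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where model downloads would be cached.

    Hypothetical helper mirroring the documented precedence:
    HF_HUB_CACHE > HF_HOME/hub > XDG_CACHE_HOME/huggingface/hub > ~/.cache/huggingface/hub
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    if env.get("XDG_CACHE_HOME"):
        return Path(env["XDG_CACHE_HOME"]) / "huggingface" / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


if __name__ == "__main__":
    # Shows where downloads would land with the current environment.
    print(resolve_hub_cache())
```

Setting `HF_HOME` to a folder on another drive before launching a tool should move the downloads there, provided the tool relies on the standard library cache. Many loaders also accept a `cache_dir` argument for the same purpose.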

minimaxir|2 years ago

This article only covers the musical aspects of AI voice cloning, but there's another dynamic to AI voice cloning that's more complicated: replacing general voice actors in movies/video games/anime (example: https://www.axios.com/2023/07/24/ai-voice-actors-victoria-at... )

Unlike musicians who can't be replaced without significant postprocessing, have enough money to not be impacted by competition, and have legal muscle, voice over artists:

- Can be reproduced with good-enough results from out-of-the-box voice cloning settings on ElevenLabs or an open source equivalent (Bark, VALL-E X)

- Are already underpaid for their work as-is

- Have no legal ownership of their voice since they are contractors, and their voicework is owned by their clients who may not be as incentivised to protect the VO.

I want to write a blog post about it but I suspect most people on Hacker News won't be interested in a treatise on the cultural impacts of the voicework in Persona 5 and Genshin Impact.

sumtechguy|2 years ago

What I find interesting is that eventually these companies will hire some college kids who need a couple thousand bucks and a free pizza, have them read the right scripts and sign the right 'give everything away' contract, and then just use their voice forever. Or do it sneakily: have a voice assistant, and put 'we can use a copy of your voice for anything' in your ToS.

The existing voice actors will be just out of work. There will be a small cadre of groups that want real voice. But for some projects that will not be that important.

It's going to get crazy.

supriyo-biswas|2 years ago

HN isn't the only community to write for. While most people here seem to be unsympathetic to such job concerns, unconventional articles do hit the front page from time to time.

I'd like to read it, in any case.

ImprobableTruth|2 years ago

Voices are uncopyrightable, but impersonation isn't legal (see Midler v. Ford, for a notable case), so I don't think the situation is totally clear.

GuB-42|2 years ago

Interesting note: many Vocaloids (most notably Hatsune Miku) are sampled from voice actors rather than singers.

Singers didn't want software clones, but voice actors are fair game.

sublinear|2 years ago

I have a different take on this.

AI voice is cheaper, but it's also a more boring and generic performance. There is zero progress made towards any sort of creative AI that produces good unique work.

The market for this then is small businesses who can't afford a professional voice actor. AI is opening up new markets, not killing the jobs of the truly talented.

zerojames|2 years ago

I am interested! You should write about what you find interesting; never worry if it will interest a particular group.

foobarian|2 years ago

It saddens me because of how much impact they had on my family as we played through the story line in Genshin and immersed ourselves in the world. At some point we met a few of the voice actors at a convention and they were like stars to us, while I'm sure their circumstances are as you describe.

raytopia|2 years ago

I'd be interested.

Most likely you'd see a lot of people saying that somehow getting rid of voice actors is good for "progress". Whatever that means.

Random aside: someone really needs to make a Hacker News that focuses more on game development and other arts, so blog posts like the one you're talking about would have a proper community to discuss them with.

dylan604|2 years ago

> and their voicework is owned by their clients who may not be as incentivised in protecting the VO.

The work product produced by their voice for fulfilling the contract is owned. No corp owns someone else's voice.

EGreg|2 years ago

Please do. Some of us critique capitalism

rcarr|2 years ago

It's sad if the only way voice actors are going to be able to make a living is by doing stuff like Critical Role on Youtube. I love Critical Role but it likely wouldn't be the same if those guys hadn't spent years honing their craft. Watching people play RPGs online has replaced a lot of my streaming viewing now, but the market is much smaller and I imagine it can only sustain a much smaller pool of creatives than the current voice over market can.

RecycledEle|2 years ago

Wow. I just realized any one of us could redo Weird Al's songs with his lyrics, but with the original singer's voice. We could be listening to Michael Jackson singing "Just Eat It" by lunchtime.

I am constantly amazed at how the new AI tech can be used.

Of course this would be illegal under most countries' copyright laws.

unnah|2 years ago

There's also a Weird Al piece "I think I'm a clone now", for which an AI clone voice performance would definitely be fitting. (The original song was "I think we're alone now" by Tommy James and the Shondells, but it seems Weird Al was parodying the cover by Tiffany in the 1980s.)

While Weird Al himself asks for permission, it's well established that parody is not copyright infringement. There should be room for parody performances by AI voices as well, especially if argued by a good lawyer.

greenhearth|2 years ago

How would this be amazing? It just sounds stupid and a waste of time.

mckirk|2 years ago

My absolute favorite application of this tech so far is The Beach Boys singing 'Hurt'. It's the first time I seriously didn't notice any artifacts, and it just works so well even though it really shouldn't.

Enjoy: https://youtu.be/gmNSFqyg_Z8

dwringer|2 years ago

I don't know what I was expecting but that isn't Hurt, it's Surfin' USA with Hurt's lyrics that sound extremely jittery and grainy.

I'm curious though if some AI soon could in fact synthesize the Beach Boys' style with the actual chords and melody from the NIN song, possibly with some of the pathos of Johnny Cash as well.

code_runner|2 years ago

This account is one of the absolute top tier creators for weird music mixes. The recent deep faking stuff has been shockingly good. I think this is a good example of an "acceptable" use of AI, as long as the rights of artists, composers, etc. are all settled.

It's always more fun when it's a real group of talented people being silly, but I'd listen to an album of weird mashups like this for sure.

hinkley|2 years ago

The graininess of the recording covers over a lot of potential problems. But given that this attempt keeps the Beach Boys’ tempo and enunciation, I think this technique, whatever it is, would make a much more compelling version of Michael Jackson covering Eat It.

nsbk|2 years ago

That hurt

distantsounds|2 years ago

The sampled voices sound neither like Michael Jackson nor Weird Al. A good effort, but a professional impersonator could likely do better on either front.

nemo44x|2 years ago

It sounds like Weird Al trying to be Michael Jackson trying to be Weird Al.

hinkley|2 years ago

The best Michael Jackson interpreter in a town of 50,000 could do better than this. It’s… this is bad.

code_runner|2 years ago

I know what you mean. It's more noticeable (imo) on the Michael one... but it's definitely in there. I think the pitch correction is to blame for a bit of the weirdness.

simonw|2 years ago

I did not know about this: "The center of the A.I. cover songs community is a massive 500,000+ member Discord called A.I. Hub, where members trade new tips, tools, techniques, and links to their original and cover songs."

codetrotter|2 years ago

Me neither. That’s what’s so weird about the internet.

Imagine half a million people out in the streets together. You’d definitely notice that. Meanwhile, we can have these massive online communities and you’d never know unless you accidentally stumbled across it or someone told you about it.

joenot443|2 years ago

Something I think we're slowly coming to terms with is that the current generation of techies (the ones who can afford to spend hours upon hours tweaking models and sharing results) really prefer Discord over our Web 2.0 forum type communities like this one. Even on Reddit, which is lagging in popularity among Gen Z when compared to Discord or TikTok, you can immediately tell upon reading /r/LocalLLMs that a really big chunk of the community is underage. To be clear, I think this is a good thing!

There was a generation that preferred mailing lists. There was a generation that preferred IRC and BBS, and "my" generation which likes forums and lengthy comment threads. One would be naive to think this style (the one we're engaging in here) would last forever.

There are definitely very real criticisms of Discord, searchability and discoverability being the most common, but at this point I think the die has been cast. Young people have made their choice.

jrm4|2 years ago

I poked around there for a while, and my takeaway was "sub-par" all around, which might be the reason for its relative obscurity? The thing is, I can't tell to what extent it's the tech, and to what extent it's just "very uninteresting source material."

Like, there's a whole lot of "classic song done by presently popular rapper," and I'll be the first to insist that there is nearly nothing vocally interesting at all coming from today's popular hip-hop artists (and I say this as an extreme long-time hip-hop aficionado).

ddmf|2 years ago

The most recent episode of Tacoma FD covered something similar to this mixed with a messed up Christmas Carol.

dreamcompiler|2 years ago

> ... Tom Waits, LeBron James, Knuckles, and, uh, Adolf Hitler.

I can't figure out if this is an example of Godwin's Law or not.

satvikpendem|2 years ago

What's the best open source text to speech? Eleven Labs and others are interesting but closed source. I want to use them mainly for audiobooks as I have a lot of ePubs and I'm just using the basic Google text to speech voices on my Android, via Moon+ Reader. It works fine but it's still more robotic than state of the art.

entrepy123|2 years ago

POST-EDIT, CORRECTED ANSWER

I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].

Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (very natural, not annoying to listen to, actually sounded like a person by and large), and maybe 2 made the top (as in, a tossup for the most listenable, all factors considered).

IIRC, the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.

PRE-EDIT, ERRONEOUS ANSWER

Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.

  [0] https://github.com/snakers4/silero-models#text-to-speech
  [1] https://silero.ai
  [2] https://github.com/snakers4/silero-models#standalone-use
  [3] https://github.com/Grumbel/ttsprech#usage

lhl|2 years ago

For neutral-sounding, very fast/efficient voices, I find the Coqui TTS VITS models to be very good. For slower, more expressive voices or voice cloning, I think the Coqui TTS XTTS model is good (or you can look at mrq/tortoise-tts).

I'm still awaiting a StyleTTS2 implementation. The audio samples sound top notch: https://styletts2.github.io/

NoMoreNicksLeft|2 years ago

We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.

How many audio books is 40 hours?

Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down their throat, or maybe they had an LSD flashback. Normal normal normal then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?) then it goes monotone really quickly.

We're probably several years out from it being something people use personally for audio books.

modeless|2 years ago

I've tried a few, and I'm not an expert, but I think Coqui's new XTTS models are decent performance- and quality-wise (just in terms of how the speech sounds; I can't speak to the voice cloning fidelity as I don't care about that). The code is open source, but the model has a non-commercial license. They also have a bunch of models with more permissive licenses that aren't as good.

I doubt they're better than Google's TTS though.

follower|2 years ago

> What's the best open source text to speech?

I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:

* https://github.com/rhasspy/piper

I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.

The official samples are here: https://rhasspy.github.io/piper-samples/

Here's a small number of pre-rendered samples I've used that were generated from a WIP Piper port of my Dialogue Tool[0] project: https://rancidbacon.gitlab.io/piper-tts-demos/

While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there's multiple diverse voice options with licenses that suit my purposes.

(Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been up until their existence in terms of quality.)

[0] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

----

Edit to add links to some of my notes related to FLOSS TTS, in case they're of interest:

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

artninja1988|2 years ago

Would also like to know this. Can't seem to find an open source tts engine that works on mobile to read muh books

hinkley|2 years ago

> Artifacts aside, it sounds like Michael Jackson doing a Weird Al impression?! Every line has a distinctly “white and nerdy” vibe: it loses any seriousness and edge, exaggerating words for comic effect and enunciating lyrics really clearly so the punchlines can be heard.

No, it sounds like someone doing an impression of Weird Al doing an impression of Michael Jackson. Someone whose mom told them they were special and they believed it.

These examples are standing on a ridge line, surveying the uncanny valley and looking for the best way to cross.

blagie|2 years ago

... they're good enough.

I have an accent. If not for that, I'd be a great presenter.

If I could translate my voice into a poor Neil deGrasse Tyson, a poor Patrick Stewart, a poor Carl Sagan, a poor Morgan Freeman, etc., my presentations would be... better.

Calamitous|2 years ago

Key takeaway:

> No current artificial intelligence is powerful enough to hide the weirdness of Weird Al.