It's kind of wild that these tools just transfer a copy of these models every time they're spun up (whether it's to a Google Colab notebook or a local machine).
This must mean Hugging Face's bandwidth bill is crazy, or am I missing something (maybe they have a peering agreement, or are heavily caching things)?
Their Python module caches the downloads and checks that cache before downloading them again... but you're probably not wrong on the crazy bandwidth bill. Looks like they have crazy VC money though, considering the current climate.
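For anyone curious what "check the cache before downloading" looks like in practice, here is a minimal sketch of the general pattern. It is not Hugging Face's actual implementation: `cached_fetch`, the URL-hash cache key, and `fake_download` are all hypothetical names made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical cache key: a hash of the URL. Real tools may key on a
    # revision or content hash instead.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if not target.exists():
        # Cache miss: pay for the transfer once and store the result.
        target.write_bytes(download(url))
    # Cache hit (or freshly stored file): no further network traffic needed.
    return target


calls = []


def fake_download(url: str) -> bytes:
    """Stand-in for a real HTTP download; records each call."""
    calls.append(url)
    return b"model weights"


demo_dir = Path(tempfile.mkdtemp())
first = cached_fetch("https://example.com/model.bin", demo_dir, fake_download)
second = cached_fetch("https://example.com/model.bin", demo_dir, fake_download)
print(len(calls))  # number of times the (fake) network was actually hit
```

The catch for something like Colab is that the cache only helps if the disk survives between sessions; a fresh VM starts with an empty cache directory and pays for the transfer again.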
Unmetered 10+ gigabit connections were on the order of $1/Mbit/mo wholesale over a decade ago, when I priced out a custom CDN. So for the cost of 100 TB of data transfer out of AWS, you could get a 24/7 sustained 10 Gbit/s (>3 PB per month at 100% utilization).
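The ">3 PB per month" figure is easy to sanity-check. A quick sketch, assuming a 30-day month and decimal units (1 PB = 10^15 bytes), using the $1/Mbit/mo wholesale rate quoted above; the function names are just for illustration, and the AWS side of the comparison is left out because no price is given here.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def monthly_transfer_pb(link_gbit_per_s: float, utilization: float = 1.0) -> float:
    """Petabytes (decimal) moved in a month over a link at the given utilization."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8
    total_bytes = bytes_per_second * SECONDS_PER_MONTH * utilization
    return total_bytes / 1e15


def monthly_wholesale_cost_usd(link_gbit_per_s: float, usd_per_mbit_month: float = 1.0) -> float:
    """Monthly cost of an unmetered link priced per Mbit/s of capacity."""
    return link_gbit_per_s * 1000 * usd_per_mbit_month


transfer_pb = monthly_transfer_pb(10)
cost = monthly_wholesale_cost_usd(10)
print(f"{transfer_pb:.2f} PB/month for ${cost:,.0f}/month")
# prints "3.24 PB/month for $10,000/month"
```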
1. AWS is far behind Azure and GCP in AI, so they gotta partner up to gain credibility.
2. Hugging Face probably does face insane bills compared to GitHub. But AWS can probably develop some optimizations to save bandwidth costs. There's 100% some sort of generalized differential storage method being developed for AI models.
Is Hugging Face just a model repository, like GitHub is a code repository? It seems you can rent compute (both CPU and GPU), but you are right that most models seem to be run elsewhere.
This article only covers the musical aspects of AI voice cloning, but there's another dynamic to AI voice cloning that's more complicated: replacing general voice actors in movies/video games/anime (example: https://www.axios.com/2023/07/24/ai-voice-actors-victoria-at... )
Unlike musicians who can't be replaced without significant postprocessing, have enough money to not be impacted by competition, and have legal muscle, voice over artists:
- Can be reproduced with good-enough results from out-of-the-box voice cloning settings on ElevenLabs or an open source equivalent (Bark, VALL-E X)
- Are already underpaid for their work as-is
- Have no legal ownership of their voice since they are contractors, and their voicework is owned by their clients, who may not be as incentivised to protect the VO.
I want to write a blog post about it but I suspect most people on Hacker News won't be interested in a treatise on the cultural impacts of the voicework in Persona 5 and Genshin Impact.
What I find interesting is that eventually these companies will hire some college kids who need a couple thousand bucks and a free pizza, have them read the right scripts and sign the right 'give everything away' contract, and just use their voices forever. Or they'll do it sneakily: offer a voice assistant and put 'we can use a copy of your voice for anything' in the ToS.
The existing voice actors will just be out of work. There will be a small cadre of groups that want real voices, but for some projects that will not be that important.
HN isn't the only community to write for. While most people here seem to be unsympathetic to such job concerns, unconventional articles do hit the front page from time to time.
AI voice is cheaper, but it's also a more boring and generic performance. There is zero progress made towards any sort of creative AI that produces good unique work.
The market for this then is small businesses who can't afford a professional voice actor. AI is opening up new markets, not killing the jobs of the truly talented.
It saddens me because of how much impact they had on my family as we played through the story line in Genshin and immersed in the world. At some point we met a few of the voice actors at a convention and they were like stars to us, while I'm sure their circumstances are as you describe.
Most likely you'd see a lot of people saying that somehow getting rid of voice actors is good for "progress". Whatever that means.
Random aside: someone really needs to make a Hacker News that focuses more on game development and other arts, so blog posts like the one you're talking about would have a proper community to discuss them with.
It's sad if the only way voice actors are going to be able to make a living is by doing stuff like Critical Role on YouTube. I love Critical Role but it likely wouldn't be the same if those guys hadn't spent years honing their craft. Watching people play RPGs online has replaced a lot of my streaming viewing now, but the market is much smaller and I imagine it can only sustain a much smaller pool of creatives than the current voice over market can.
Wow. I just realized any one of us could redo Weird Al's songs with his lyrics, but with the original singer's voice. We could be listening to Michael Jackson singing "Just Eat It" by lunchtime.
I am constantly amazed at how the new AI tech can be used.
Of course this would be illegal under most countries' copyright laws.
There's also a Weird Al piece "I think I'm a clone now", for which an AI clone voice performance would definitely be fitting. (The original song was "I think we're alone now" by Tommy James and the Shondells, but it seems Weird Al was parodying the cover by Tiffany in the 1980s.)
While Weird Al himself asks for permission, it's well established that parody is not copyright infringement. There should be room for parody performances by AI voices as well, especially if argued by a good lawyer.
My absolute favorite application of this tech so far is The Beach Boys singing 'Hurt'. It's the first time I seriously didn't notice any artifacts, and it just works so well even though it really shouldn't.
I don't know what I was expecting but that isn't Hurt, it's Surfin' USA with Hurt's lyrics that sound extremely jittery and grainy.
I'm curious though if some AI soon could in fact synthesize the Beach Boys' style with the actual chords and melody from the NIN song, possibly with some of the pathos of Johnny Cash as well.
This account is one of the absolute top-tier creators of weird music mixes. The recent deepfake stuff has been shockingly good. I think this is a good example of an "acceptable" use of AI, as long as the rights of artists, composers, etc. are all settled.
It's always more fun when it's a real group of talented people being silly, but I'd listen to an album of weird mashups like this for sure.
The graininess of the recording covers over a lot of potential problems. But given that this attempt keeps the Beach Boys' tempo and enunciation, I think this technique, whatever it is, would make a much more compelling version of Michael Jackson covering Eat It.
The sampled voices sound neither like Michael Jackson nor Weird Al. A good effort, but a professional impersonator could likely do better on either front.
I know what you mean. It's more noticeable (imo) on the Michael one... but it's definitely in there. I think the pitch correction is to blame for a bit of the weirdness.
I did not know about this: "The center of the A.I. cover songs community is a massive 500,000+ member Discord called A.I. Hub, where members trade new tips, tools, techniques, and links to their original and cover songs."
Me neither. That’s what’s so weird about the internet.
Imagine half a million people out in the streets together. You’d definitely notice that. Meanwhile, we can have these massive online communities and you’d never know unless you accidentally stumbled across it or someone told you about it.
Something I think we're slowly coming to terms with is that the current generation of techies (the ones who can afford to spend hours upon hours tweaking models and sharing results) really prefers Discord over our Web 2.0 forum-type communities like this one. Even on Reddit, which is lagging in popularity among Gen Z compared to Discord or TikTok, you can immediately tell upon reading /r/LocalLLMs that a really big chunk of this community is underage. To be clear, I think this is a good thing!
There was a generation that preferred mailing lists. There was a generation that preferred IRC and BBS, and "my" generation which likes forums and lengthy comment threads. One would be naive to think this style (the one we're engaging in here) would last forever.
There are definitely very real criticisms of Discord, searchability and discoverability being the most common, but at this point I think the die has been cast. Young people have made their choice.
I poked around there for a while, and my takeaway was "sub-par" all around, which might be the reason for its relative obscurity? The thing is, I can't tell to what extent it's the tech, and to what extent it's just "very uninteresting source material."
Like, there's a whole lot of "classic song done by presently popular rapper," and I'll be the first to insist that there is nearly nothing vocally interesting at all coming from today's popular hip-hop artists (and I say this as an extreme long-time hip-hop aficionado).
What's the best open source text to speech? Eleven Labs and others are interesting but closed source. I want to use them mainly for audiobooks as I have a lot of ePubs and I'm just using the basic Google text to speech voices on my Android, via Moon+ Reader. It works fine but it's still more robotic than state of the art.
I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].
Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (very natural, not annoying to listen to, actually sounded like a person by and large), and maybe 2 made the top (as in, a tossup for the most listenable, all factors considered).
IIRC, the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.
PRE-EDIT, ERRONEOUS ANSWER
Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.
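The "for loop to try all the voices" mentioned above might look roughly like the sketch below. This is not the API of any particular TTS library: `render_all_voices`, the `synthesize` callable, and the voice names are hypothetical, and `fake_synthesize` is a stub standing in for a real model call so the example runs on its own.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def render_all_voices(
    text: str,
    voices: Iterable[str],
    synthesize: Callable[[str, str], bytes],
    out_dir: Path,
) -> List[Path]:
    """Render the same sample text once per voice, one output file per voice."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for voice in voices:
        audio = synthesize(text, voice)   # hypothetical: returns encoded audio bytes
        path = out_dir / f"{voice}.wav"   # file name derived from the voice id
        path.write_bytes(audio)
        written.append(path)
    return written


def fake_synthesize(text: str, voice: str) -> bytes:
    """Stub in place of a real TTS call; returns placeholder bytes, not audio."""
    return f"{voice}:{text}".encode("utf-8")


sample_dir = Path(tempfile.mkdtemp())
files = render_all_voices(
    "The quick brown fox.", ["voice_0", "voice_1", "voice_2"], fake_synthesize, sample_dir
)
print([p.name for p in files])
```

With a real model you would swap `fake_synthesize` for whatever call the library provides, then listen through the output folder and shortlist the voices you can stand.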
For neutral sounding very fast/efficient voices, I find Coqui TTS VITS models to be very good. For slower, more expressive voice or voice cloning I think the Coqui TTS XTTS is good (or you can look at the mrq/tortoise-tts).
We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.
How many audio books is 40 hours?
Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down their throat, or maybe they had an LSD flashback. Normal normal normal then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?) then it goes monotone really quickly.
We're probably several years out from it being something people use personally for audio books.
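To put a rough number on "how many audio books is 40 hours": the plan price and hour cap come from the comment above, but the 10-hour book length is purely my assumption (real audiobooks vary a lot), so treat the result as a ballpark.

```python
PLAN_USD_PER_MONTH = 300   # from the comment above
PLAN_AUDIO_HOURS = 40      # from the comment above
ASSUMED_BOOK_HOURS = 10    # assumption: a typical audiobook length, not from the thread

usd_per_audio_hour = PLAN_USD_PER_MONTH / PLAN_AUDIO_HOURS
books_per_month = PLAN_AUDIO_HOURS / ASSUMED_BOOK_HOURS
usd_per_book = usd_per_audio_hour * ASSUMED_BOOK_HOURS

print(f"${usd_per_audio_hour:.2f} per audio hour")
print(f"{books_per_month:.0f} books per month at ${usd_per_book:.0f} each")
```

Under that assumption it works out to about four books a month, and that is before any regenerations for the takes where the voice goes off the rails.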
I've tried a few, not an expert, but I think Coqui's new XTTS models are decent performance and quality wise (just in terms of how the speech sounds, can't speak to the voice cloning fidelity as I don't care about that). Open source code but non-commercial license for the model. They also have a bunch of models with more permissive licenses that aren't as good.
> Artifacts aside, it sounds like Michael Jackson doing a Weird Al impression?! Every line has a distinctly “white and nerdy” vibe: it loses any seriousness and edge, exaggerating words for comic effect and enunciating lyrics really clearly so the punchlines can be heard.
No, it sounds like someone doing an impression of Weird Al doing an impression of Michael Jackson. Someone whose mom told them they were special and they believed it.
These examples are standing on a ridge line, surveying the uncanny valley and looking for the best way to cross.
I have an accent. If not for that, I'd be a great presenter.
If I could translate my voice into a poor Neil deGrasse Tyson, a poor Patrick Stewart, a poor Carl Sagan, a poor Morgan Freeman, etc., my presentations would be... better.
mecredis|2 years ago
satertek|2 years ago
civilitty|2 years ago
Bandwidth has always been crazy cheap.
anonylizard|2 years ago
toddmorey|2 years ago
pdntspa|2 years ago
Or better yet, how about asking me where I want to store my models?
jonluca|2 years ago
minimaxir|2 years ago
sumtechguy|2 years ago
It's going to get crazy.
supriyo-biswas|2 years ago
I'd like to read it, in any case.
ImprobableTruth|2 years ago
GuB-42|2 years ago
Singers didn't want software clones, but voice actors are fair game.
sublinear|2 years ago
zerojames|2 years ago
foobarian|2 years ago
raytopia|2 years ago
dylan604|2 years ago
The work product produced by their voice for fulfilling the contract is owned. No corp owns someone else's voice.
aaroninsf|2 years ago
EGreg|2 years ago
rcarr|2 years ago
RecycledEle|2 years ago
unnah|2 years ago
greenhearth|2 years ago
RecycledEle|2 years ago
mckirk|2 years ago
Enjoy: https://youtu.be/gmNSFqyg_Z8
dwringer|2 years ago
danjc|2 years ago
code_runner|2 years ago
hinkley|2 years ago
nsbk|2 years ago
distantsounds|2 years ago
nemo44x|2 years ago
hinkley|2 years ago
code_runner|2 years ago
causi|2 years ago
ssalka|2 years ago
https://www.youtube.com/watch?v=CkQ-44PvTs8
shepherdjerred|2 years ago
https://www.youtube.com/watch?v=tJjhObngcxI
lostlogin|2 years ago
simonw|2 years ago
codetrotter|2 years ago
joenot443|2 years ago
jrm4|2 years ago
smath|2 years ago
https://arstechnica.com/information-technology/2022/09/james...
mito88|2 years ago
Watch Light My Fire on YouTube Music https://music.youtube.com/watch?v=lN3v3EfA6_A&si=_hcG3Wjakxd...
ddmf|2 years ago
dreamcompiler|2 years ago
I can't figure out if this is an example of Godwin's Law or not.
satvikpendem|2 years ago
entrepy123|2 years ago
lhl|2 years ago
I'm still awaiting a StyleTTS2 implementation. The audio samples sound top notch: https://styletts2.github.io/
NoMoreNicksLeft|2 years ago
modeless|2 years ago
I doubt they're better than Google's TTS though.
ticulatedspline|2 years ago
https://github.com/suno-ai/bark Demo at https://huggingface.co/spaces/suno/bark
In the couple samples I tried it was substantially better at picking up meaning compared to VALL-E-X
follower|2 years ago
I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:
* https://github.com/rhasspy/piper
I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.
The official samples are here: https://rhasspy.github.io/piper-samples/
Here's a small number of pre-rendered samples I've used that were generated from a WIP Piper port of my Dialogue Tool[0] project: https://rancidbacon.gitlab.io/piper-tts-demos/
While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there are multiple diverse voice options with licenses that suit my purposes.
(Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been up until their existence in terms of quality.)
[0] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...
----
Edit to add links to some of my notes related to FLOSS TTS, in case they're of interest:
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...
artninja1988|2 years ago
hinkley|2 years ago
blagie|2 years ago
Calamitous|2 years ago
> No current artificial intelligence is powerful enough to hide the weirdness of Weird Al.