top | item 35627790

NaturalSpeech 2: Zero-shot speech and singing synthesizers

235 points | tim_sw | 2 years ago | speechresearch.github.io

120 comments

[+] cs702|2 years ago|reply
I tell friends that the scene below from T2 doesn't feel futuristic anymore. In fact, it now feels... almost mundane. I mean, a smart "script kiddie" with a bit of ML expertise can pull off this kind of deepfake voice spoofing on a relatively cheap desktop computer nowadays. We live in interesting times.

SCENE:

T-800, speaking to John Connor in normal voice: "What's the dog's name?"

John Connor: "Max."

T-800, impersonating John, on the phone with T-1000: "Hey Janelle, what's wrong with Wolfie? I can hear him barking. Is he all right?"

T-1000, impersonating John's foster mother, Janelle: "Wolfie's fine, honey. Wolfie's just fine. Where are you?"

T-800 hangs up the phone and says to John in normal voice: "Your foster parents are dead."

--

Source: https://www.youtube.com/watch?v=MT_u9Rurrqg

[+] poulpy123|2 years ago|reply
Except that now the T-1000 will have access to Janelle's Facebook or Instagram and will know all about Max.
[+] ImHereToVote|2 years ago|reply
We aren't even far off from an LLM being able to infer that the parents are fake on the basis of the dog name. I'm not even gonna touch the chain gun shooting up a parking lot aspect.
[+] retrac|2 years ago|reply
This shouldn't be too surprising. Similar results were seen with GPT-3 a while back, which is more or less able to produce audio or images encoded as streams of tokens when trained on that task, despite not being designed for it.

A very interesting property was noted a few years ago by multiple researchers (I'm not sure who discovered it first): transfer learning is unreasonably effective. If you're training an image generator network, taking an already-trained model and fine-tuning it cuts training time significantly compared to starting from a model with truly random weights.

This isn't surprising when we're talking photos of ambulances and moving to photos of trucks. But it holds true when you train it on ... well, anything structured, really. A GPT-style transformer trained on online comments, or audio samples of music encoded as token streams, when switched to images of cars encoded as token streams, learns that task much more quickly than if it had been fully randomized.

I don't see how to escape the conclusion that these models learn some sort of general properties (something about arithmetic and mathematical relationships, maybe?). There's some sort of abstraction or internal model that is learned, one that is applicable across very different tasks.
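The fine-tuning speedup is easy to demonstrate even in a toy setting. The sketch below is my own stand-in (plain NumPy, linear regression rather than a real network, not anything from the paper): it counts gradient-descent steps to convergence from a warm start versus a random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(w0, X, y, lr=0.1, tol=1e-3):
    # Plain gradient descent on least squares; returns steps until the
    # gradient norm drops below tol (i.e. "convergence").
    w = w0.copy()
    for step in range(10_000):
        grad = X.T @ (X @ w - y) / len(y)
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return 10_000

X = rng.normal(size=(200, 20))
w_true = rng.normal(size=20)
y = X @ w_true

w_pretrained = w_true + 0.1 * rng.normal(size=20)  # "fine-tuning" start: near the optimum
w_random = rng.normal(size=20)                      # "from scratch" start

print(train(w_pretrained, X, y) < train(w_random, X, y))  # warm start converges in fewer steps
```

The warm start converges in a fraction of the steps; the (much stronger) empirical claim about real networks is that pretrained features give this kind of head start even across seemingly unrelated tasks.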

[+] famouswaffles|2 years ago|reply
There's something a bit more mind-blowing than that. Language models and vision models learn representations so similar that you can connect them with just a linear projection between image embedding and text embedding space (no training of the image encoder or LLM required).

https://arxiv.org/abs/2209.15162 https://llava-vl.github.io/

LLMs are already being grounded.
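The "just a linear projection" idea can be illustrated with a toy model (NumPy, synthetic data of my own; the actual paper fits a projection between a frozen vision encoder and a frozen LM): if two embedding spaces are linear views of shared underlying structure, a single least-squares map bridges them exactly, with neither "encoder" retrained.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: two "modalities" that are linear views of shared latents
latents = rng.normal(size=(500, 16))
A_img = rng.normal(size=(16, 32))
A_txt = rng.normal(size=(16, 24))
img_emb = latents @ A_img   # pretend image-encoder embeddings
txt_emb = latents @ A_txt   # pretend text-encoder embeddings

# Fit only a linear projection img -> txt; both "encoders" stay frozen
W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)
err = np.linalg.norm(img_emb @ W - txt_emb) / np.linalg.norm(txt_emb)
print(err < 1e-6)  # shared structure makes a pure linear bridge exact here
```

Real image and text encoders are of course not exact linear views of a common latent space, which is why the linked results are surprising; the toy only shows why a linear bridge is the natural first thing to try.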

[+] seydor|2 years ago|reply
We will find that language and visual perception are related. Geometry is the underlying structure in language and mathematics, and most of our logical concepts stem from geometric relations and constraints.
[+] delgaudm|2 years ago|reply
>"To avoid potential issues, we appeal to our practitioners to not abuse this technology and to develop defending tools to detect AI-synthesized voices"

Well. I'm sure that will take care of everything.

[+] SiempreViernes|2 years ago|reply
In the guide on how to make "Harry Potter by Balenciaga", the author shows you how to rip the audio from a Vanity Fair clip and upload it to a voice-cloning service, explicitly including how they clicked the little box affirming they have "all the necessary rights and consent" to clone the voice of Daniel Radcliffe... so I'm sure the industry is taking the potential for misuse seriously! /s
[+] ttul|2 years ago|reply
Transformers and Diffusion Models seem to be leading the pack lately in many tasks. It’s cool how these models can be used in a variety of quite different contexts without changing much about the network architecture.

That being said, I think it is only a matter of time before cybercriminals develop an end-to-end, fully automated penetration system that registers domain names, writes emails, makes phone calls, finds money mules, runs social media accounts, etc., all with a single console to run it. That is a scary prospect for humanity, and new tools for authenticating human identity will be needed - fast.

[+] XorNot|2 years ago|reply
We've had the solution in the form of basic TLS cryptography and verification for decades now, though; the problem is that no one's implementing it.

Governments already maintain registers of legally operating businesses: there's no reason that registration should not also be issuing cryptographic certificates which verify all forms of outbound communication by that business including phone calls.

But despite telecom being almost end-to-end digital (i.e. digital to the box on the street, pretty much), there's been no push to close the last 100m. "Phone lines" shouldn't exist anymore with packet-switched networking: you should just dial a path to a business, which verifies itself with TLS certificates linked to its business registry entry.

[+] tudorw|2 years ago|reply
Mostly agree. I think the web as we know it is over; maybe the solution will be the broken web plus some new system that ties into local regulated ID systems, so that you are accountable for your actions.
[+] msoad|2 years ago|reply
Compared to the first NaturalSpeech[1] I'm hearing a lot of white noise in the background. Singing is pretty cool but it feels like we need a few iterations before it can match the ground truth in the way speech does.

[1] https://speechresearch.github.io/naturalspeech/

[+] xutan|2 years ago|reply
Thanks for your interest in NaturalSpeech and NaturalSpeech 2!

NaturalSpeech focuses on synthesizing human-level high-quality speech, by training on a single-speaker recording-studio dataset.

NaturalSpeech 2 trains on 44K hours of multi-speaker in-the-wild data with more than 5K speakers and focuses on synthesizing any speaker's voice in a zero-shot way, given only a short speech prompt. When the speech prompt has background noise, NaturalSpeech 2 will mimic that noise as well. If you want a clean voice, just provide a clean speech prompt.

Check more discussions on reddit as well: https://www.reddit.com/r/singularity/comments/12rubq4/latent...

[+] zoogeny|2 years ago|reply
Some poking around the authors of the paper brought me to: AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models [1] with demos [2]

This really sparks my interest, since for the last few days I'd been wondering whether it was possible to use diffusion models on spectrograms to do audio-effects editing. Here is a paper submitted a couple of weeks ago doing just that, and the demo examples are exceptional.

I want all of this to start slowing down a bit so I have a chance to catch up. I was just watching Andrej Karpathy's excellent Zero to Hero syllabus [3] trying to wrap my head around LLMs and now I feel I absolutely must catch up on diffusion models.

1. https://arxiv.org/abs/2304.00830

2. https://audit-demo.github.io/

3. https://karpathy.ai/zero-to-hero.html
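For anyone catching up on the spectrogram representation these models edit: a magnitude spectrogram is just windowed FFTs stacked over time, which is what lets image-style diffusion operate on audio. A minimal NumPy sketch (my own toy with a synthetic sine input, not the AUDIT pipeline):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Hann-windowed short-time Fourier transform: one FFT per frame
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)        # one second of a 440 Hz tone
spec = np.abs(stft(x))                 # magnitude spectrogram, (frames, bins)
peak_bin = spec.mean(axis=0).argmax()
print(peak_bin * sr / 512)             # nearest FFT bin to the 440 Hz tone
```

Editing the spectrogram (as an "image") and inverting it back to a waveform is the lossy part; that inversion is where vocoders or phase-reconstruction methods come in.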

[+] Sol-|2 years ago|reply
Only a three-second sample needed for in-context learning to scam people now; very impressive.
[+] gigel82|2 years ago|reply
I like https://github.com/neonbjb/tortoise-tts ; it doesn't do singing, but the voice reproduction is very good and -most importantly- it's open source and you can run it locally.
[+] samuelzxu|2 years ago|reply
Woah! How is this not more popular? I don't see it referenced in the naturalspeech2 paper anywhere.
[+] irln|2 years ago|reply
I'm guessing emotional pre-prompts are difficult. The current offerings like ElevenLabs and WellSaidLabs provide amazing voices for narration but lack any way to change the emotion (e.g. happy, angry, excited, etc.). I wonder what the technical hurdles are to adding this variability.
[+] freedomben|2 years ago|reply
Are NaturalSpeech or NaturalSpeech 2 from the research open source and/or available for playing with? I see one implementation[1] but it seems to be from a third party (that might be totally fine, but wondering if there's an "official").

[1]: https://github.com/heatz123/naturalspeech

[+] brikwerk|2 years ago|reply
According to one of the authors on their group's GitHub, NaturalSpeech is being deployed exclusively for use on Microsoft Azure [1]. I might have missed a link; however, I think it's likely that NaturalSpeech 2 will follow the same path, seeing as the code and weights are seemingly not published.

[1] https://github.com/microsoft/NeuralSpeech/issues/40

[+] chikitabanana|2 years ago|reply
To those who were able to use it before it was nerfed, how does this compare to the elevenlabs one-shot?
[+] hbn|2 years ago|reply
Speaking of music and AI models - something I thought of yesterday: an application of AI that would be insanely useful to me is giving it an audio file of a song and having it spit out the chords. I've seen software that attempts this in the past, but it's all been unimpressive and inaccurate in my testing.

I'm still kinda ignorant with how these models work under the hood, and perhaps that would involve a bunch of new training on music that hasn't been done (and maybe that could be a difficult dataset to train on in terms of copyright). But I play piano, and I can play a song if given the chords, but I'm terrible at transcribing stuff myself. So I'd pay money for a service that does this.
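The classical (pre-deep-learning) approach those older tools use is worth knowing: extract a 12-bin chroma vector per frame (e.g. with librosa from real audio) and match it against chord templates. A minimal NumPy illustration of the matching step (my own toy, not any particular product, with a hand-built chroma frame instead of real audio):

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    # Binary 12-bin templates for all major and minor triads
    temps = {}
    for root in range(12):
        for name, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            temps[f"{NOTES[root]}{name}"] = t
    return temps

def guess_chord(chroma):
    # Pick the template with the highest cosine similarity to the chroma frame
    chroma = chroma / (np.linalg.norm(chroma) + 1e-9)
    best = max(chord_templates().items(),
               key=lambda kv: chroma @ (kv[1] / np.linalg.norm(kv[1])))
    return best[0]

# A chroma frame with energy on C, E, G should match a C major triad
frame = np.zeros(12)
frame[[0, 4, 7]] = 1.0
print(guess_chord(frame))  # → Cmaj
```

Template matching falls apart on dense mixes, inversions, and extended chords, which is largely why the results you've seen were unimpressive; the newer systems replace this step with learned models.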

[+] notefaker|2 years ago|reply
This actually exists! Check out fadr.com. They isolate the MIDI, drums, bass, vocals, and more for you.
[+] causality0|2 years ago|reply
Man I can't wait until my phone can use local resources to read my epubs in voices that match 11.ai's output. I'll never have to forlornly search Audible for novels that never got audiobook editions again.
[+] jtr1|2 years ago|reply
Wow, that ethics statement at the end
[+] Veen|2 years ago|reply
Yes, more "please don't be bad" than an ethics statement.
[+] PaulDavisThe1st|2 years ago|reply
Yep. Tell my 1986 self that this is real and see what I say:

"We will always take Microsoft AI Principles as guidelines to develop such AI models"

[+] angusturner|2 years ago|reply
Anyone else think it's an abuse of terminology to refer to speaker conditioning as "in-context learning" and "prompting" now?

Like, using a reference encoder to condition on an unseen target speaker has been around for 4 or 5 years now. These results are already cool without mystifying them by calling this "prompting" or "ICL".

Unless there is some subtle difference that warrants the new terminology?

[+] cwkoss|2 years ago|reply
I feel like "zero shot" is a misnomer here. Isn't the single attempt "one shot"?

Is zero shot a meaningfully useful term that I'm just not grokking?

[+] pstorm|2 years ago|reply
Zero-shot implies that it was given no direct examples [0]. So, in this case, it wasn't given any examples of the exact voice in combination with text; it's just using the prompt voice + a prompt text to generate new audio.

[0] https://en.wikipedia.org/wiki/Zero-shot_learning

[+] EntrePrescott|2 years ago|reply
Are there openly available models that are similar in output, i.e. generating speech or singing with a voice set by a given sample, but that instead of a text prompt would take a speech or singing input and take pitch-change and intonation cues from it, generating an output that generally follows those pitch and intonation changes but adapted to the different voice and diction of the provided sample? For example:

* provided voice sample: some clean voice samples from Homer Simpson

* provided prompt: audio sample of the "gunnery sergeant Hartman" monologue from "Full Metal Jacket": https://www.youtube.com/watch?v=tHxf17yJsKs

* result: that same monologue but spoken in the voice of Homer Simpson, while otherwise following the dynamics of the prompt sample, i.e. shouting and changing pitch or speed at pretty much the same times as gunnery sergeant Hartman does.

[+] andy_xor_andrew|2 years ago|reply
I find the three links at the top very interesting:

You have a link to the paper (makes sense), then a link to a reddit discussion, then a link to this hacker news post.

Not criticizing them for doing this. It just seems a bit unusual to me. I guess they really really want to generate buzz from this, or else they'd simply link the paper and let any discussion follow naturally.

[+] ImprobableTruth|2 years ago|reply
I mean, most researchers I know are just very excitable about their research and love to share it. I think that's a lot more likely than this being some PR masterplay - if MS wanted to really push this, they wouldn't release like this.
[+] varunjain99|2 years ago|reply
I actually have a chrome extension that will point me to HN discussions on a particular webpage - find it useful to get community context of pages I'm browsing!
[+] samuelzxu|2 years ago|reply
Does anyone know how many words would correspond to the diffusion model's batch size of 6000 frames?
[+] yding|2 years ago|reply
Very cool! Hopefully we can all use these in the future for commercial and open source projects.