I tell friends that the scene below from T2 doesn't feel futuristic anymore. In fact, it now feels... almost mundane. I mean, a smart "script kiddie" with a bit of ML expertise can pull off this kind of deepfake voice spoofing on a relatively cheap desktop computer nowadays. We live in interesting times.
SCENE:
T-800, speaking to John Connor in normal voice: "What's the dog's name?"
John Connor: "Max."
T-800, impersonating John, on the phone with T-1000: "Hey Janelle, what's wrong with Wolfie? I can hear him barking. Is he all right?"
T-1000, impersonating John's foster mother, Janelle: "Wolfie's fine, honey. Wolfie's just fine. Where are you?"
T-800 hangs up the phone and says to John in normal voice: "Your foster parents are dead."
We aren't even far off from an LLM being able to infer that the parents are fake on the basis of the dog name. I'm not even gonna touch the chain gun shooting up a parking lot aspect.
--
Source: https://www.youtube.com/watch?v=MT_u9Rurrqg
This shouldn't be too surprising. Similar results were observed with GPT-3 a while back, which is kind of able to produce audio or images, encoded as streams of tokens, when trained on that task, despite not being designed for it.
A very interesting property was noted a few years ago by multiple researchers; I'm not sure who discovered it first. Transfer learning is unreasonably effective. If you're training an image generator network, you get a significant reduction in training time by taking an already-trained model and fine-tuning it, compared to starting from a model with truly random weights.
This isn't surprising when we're talking about moving from photos of ambulances to photos of trucks. But it holds true when you train on ... well, anything structured, really. A GPT-style transformer trained on online comments, or on audio samples of music encoded as token streams, when switched to images of cars encoded as token streams, learns that task much more quickly than if it had started from fully random weights.
I don't see how to escape the conclusion that these models learn some sort of general properties (something about arithmetic and mathematical relationships, maybe?). There's some sort of abstraction or internal model that is learned, and it is applicable across very different tasks.
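The warm-start effect is easy to demonstrate even in a toy setting. The sketch below is purely illustrative (a linear model trained by gradient descent, not a transformer): "task B" shares structure with "task A", and starting from task A's weights converges in fewer steps than starting from random weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "task A" and a related "task B": two linear maps sharing most structure.
d = 20
w_task_a = rng.normal(size=d)
w_task_b = w_task_a + 0.1 * rng.normal(size=d)  # task B is a small shift of A

X = rng.normal(size=(200, d))
y_b = X @ w_task_b  # labels for task B

def steps_to_converge(w0, lr=0.01, tol=1e-3, max_steps=10000):
    """Run gradient descent on task B from initial weights w0;
    return how many steps it takes to reach a small training loss."""
    w = w0.copy()
    for step in range(max_steps):
        grad = 2 * X.T @ (X @ w - y_b) / len(X)
        w -= lr * grad
        if np.mean((X @ w - y_b) ** 2) < tol:
            return step
    return max_steps

from_scratch = steps_to_converge(rng.normal(size=d))  # truly random weights
fine_tuned = steps_to_converge(w_task_a)              # "pretrained" on task A
print(fine_tuned < from_scratch)  # True: the warm start converges faster
```

Obviously the interesting part of the real phenomenon is that the warm start helps even across very different modalities, which a linear toy can't capture.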
There's something a bit more mindblowing than that. Language models and vision models learn representations so similar that you can connect them with just a linear projection between the image embedding and text embedding spaces (no training of the image encoder or LLM required).
https://arxiv.org/abs/2209.15162 https://llava-vl.github.io/
LLMs are already being grounded.
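For intuition, "just a linear projection" amounts to solving a least-squares problem between paired embeddings. Here's a toy sketch; the embeddings and dimensions are synthetic stand-ins, not real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for paired embeddings: in the real setting these would come from
# a frozen image encoder and a frozen LLM's embedding space.
d_img, d_txt, n_pairs = 64, 128, 1000
true_map = rng.normal(size=(d_img, d_txt))   # hidden linear relation
img_emb = rng.normal(size=(n_pairs, d_img))  # "image" embeddings
txt_emb = img_emb @ true_map + 0.01 * rng.normal(size=(n_pairs, d_txt))

# Fit a single linear projection W minimizing ||img_emb @ W - txt_emb||^2.
W, *_ = np.linalg.lstsq(img_emb, txt_emb, rcond=None)

# If the two spaces really are (close to) linearly related, the residual is tiny.
residual = np.linalg.norm(img_emb @ W - txt_emb) / np.linalg.norm(txt_emb)
print(residual < 0.05)  # True
```

The surprising empirical claim is that independently trained image and text models end up close enough to linearly related for this to work; the toy only shows what "connect with a linear projection" means mechanically.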
We will find that language and visual perception are related. Geometry is the underlying structure in language and mathematics, and most of our logical concepts stem from geometric relations and constraints.
>"To avoid potential issues, we appeal to our practitioners to not abuse this technology and to develop defending tools to detect AI-synthesized voices"
Well. I'm sure that will take care of everything.
"We will always take Microsoft AI Principles as guidelines to develop such AI models"
In the guide on how to make "Harry Potter by Balenciaga", the author shows you how to rip the audio from a Vanity Fair clip and upload it to a voice-cloning service, explicitly including how they ticked the little box affirming they have "all the necessary rights and consent" to clone the voice of Daniel Radcliffe... so I'm sure the industry is taking the potential for misuse seriously! /s
Transformers and Diffusion Models seem to be leading the pack lately in many tasks. It’s cool how these models can be used in a variety of quite different contexts without changing much about the network architecture.
That being said, I think it is only a matter of time before cyber criminals develop an end-to-end, fully automated penetration system that registers domain names, writes emails, makes phone calls, finds money mules, runs social media accounts, etc., all with a single console to run it all. That is a scary prospect for humanity, and new tools for authenticating human identity will be needed, fast.
We've had the solution in the form of basic TLS cryptography and verification for decades now, though; the problem is that no one's implementing it.
Governments already maintain registers of legally operating businesses: there's no reason that registration should not also be issuing cryptographic certificates which verify all forms of outbound communication by that business including phone calls.
But despite telecom being almost end-to-end digital (i.e. digital up to the box on the street, pretty much), there's been no push to close the last 100m. "Phone lines" shouldn't exist anymore with packet-switched networking: you should just dial a path to a business, which verifies itself with TLS certificates linked to its business registry.
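As a sketch of the attest-and-verify flow being described: the registry binds a phone number to a registered business, and the callee checks the binding. This is a toy only; it uses an HMAC with a single registry key, whereas a real scheme (e.g. STIR/SHAKEN caller-ID attestation) uses X.509 certificates and asymmetric signatures so the callee never holds a signing secret.

```python
import hashlib
import hmac
import json

# Toy stand-in for the registry's key. Real deployments would publish a
# certificate chain instead of sharing a symmetric secret.
REGISTRY_KEY = b"registry-secret"

def attest(business_id: str, phone_number: str) -> dict:
    """Registry issues a signed attestation binding a number to a business."""
    claim = json.dumps({"biz": business_id, "tel": phone_number}, sort_keys=True)
    sig = hmac.new(REGISTRY_KEY, claim.encode(), hashlib.sha256).hexdigest()
    return {"claim": claim, "sig": sig}

def verify(token: dict) -> bool:
    """Callee recomputes the MAC to check the attestation wasn't forged."""
    expected = hmac.new(REGISTRY_KEY, token["claim"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])

token = attest("Acme Corp (reg. 12345)", "+1-555-0100")
print(verify(token))   # True: untampered attestation checks out
token["claim"] = token["claim"].replace("Acme", "Evil")
print(verify(token))   # False: a spoofed caller fails verification
```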
Mostly agree. I think the web as we know it is over; maybe the solution will be the broken web plus some new system that ties into local regulatory ID systems, so that you are accountable for your actions.
Compared to the first NaturalSpeech [1], I'm hearing a lot of white noise in the background. The singing is pretty cool, but it feels like we need a few more iterations before it can match the ground truth the way the speech does.
[1] https://speechresearch.github.io/naturalspeech/
Thanks for your interest in NaturalSpeech and NaturalSpeech 2!
NaturalSpeech focuses on synthesizing human-level, high-quality speech by training on a single-speaker, recording-studio dataset.
NaturalSpeech 2 trains on 44K hours of multi-speaker, in-the-wild data with more than 5K speakers, and focuses on synthesizing any speaker's voice in a zero-shot way given only a short speech prompt. When the speech prompt has background noise, NaturalSpeech 2 will mimic that noise as well. If you want a clean voice, just provide a clean speech prompt.
Check more discussions on reddit as well: https://www.reddit.com/r/singularity/comments/12rubq4/latent...
Some poking around the authors of the paper brought me to AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models [1], with demos [2].
This sparks my interest so much because for the last few days I was wondering whether it was possible to use diffusion models on spectrograms to do audio-effects editing. Here is a paper, submitted a couple of weeks ago, doing just that. And the demo examples are exceptional.
I want all of this to start slowing down a bit so I have a chance to catch up. I was just watching Andrej Karpathy's excellent Zero to Hero syllabus [3], trying to wrap my head around LLMs, and now I feel I absolutely must catch up on diffusion models.
1. https://arxiv.org/abs/2304.00830
2. https://audit-demo.github.io/
3. https://karpathy.ai/zero-to-hero.html
Riffusion does exactly this kind of diffusion on spectrograms: https://www.riffusion.com/about and https://github.com/riffusion/riffusion
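For anyone wondering what "diffusion on spectrograms" actually operates on: the model treats the magnitude spectrogram as a 2-D image. A minimal numpy sketch of that representation (just the STFT front end, not AUDIT's actual pipeline; the window/hop sizes are arbitrary):

```python
import numpy as np

def stft_mag(signal, win=256, hop=128):
    """Magnitude spectrogram: the 2-D 'image' a diffusion model would edit."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, time_frames)

# 1 second of a 440 Hz tone at an 8 kHz sample rate
sr = 8000
t = np.arange(sr) / sr
spec = stft_mag(np.sin(2 * np.pi * 440 * t))

# The brightest row sits at the FFT bin nearest 440 Hz.
peak_bin = spec.mean(axis=1).argmax()
print(peak_bin * sr / 256)  # 437.5, the bin frequency nearest 440 Hz
```

The hard part in practice is the inverse direction: after editing the spectrogram you still need phase reconstruction (e.g. Griffin-Lim) or a neural vocoder to get audio back.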
I like https://github.com/neonbjb/tortoise-tts ; it doesn't do singing, but the voice reproduction is very good and, most importantly, it's open source and you can run it locally.
I'm guessing emotional pre-prompts are difficult. The current offerings like ElevenLabs and WellSaidLabs provide amazing voices for narration but lack any way to change the emotion (e.g. happy, angry, excited, etc.). I wonder what the technical hurdles are to adding this variability?
Are NaturalSpeech or NaturalSpeech 2 from the research team open source and/or available for playing with? I see one implementation [1], but it seems to be from a third party (that might be totally fine, but I'm wondering if there's an "official" one).
[1]: https://github.com/heatz123/naturalspeech
According to one of the authors on their group's GitHub, NaturalSpeech is being deployed exclusively for use on Microsoft Azure [1]. I might have missed a link; however, I think it's likely that NaturalSpeech 2 will follow the same path, seeing as the code and weights are seemingly not published.
[1] https://github.com/microsoft/NeuralSpeech/issues/40
Speaking of music and AI models: something I thought of yesterday, an application of AI that would be insanely useful to me, is giving a model an audio file of a song and having it spit out the chords. I've seen software that attempts to do this in the past, but it's all been unimpressive and inaccurate in my testing.
I'm still kinda ignorant with how these models work under the hood, and perhaps that would involve a bunch of new training on music that hasn't been done (and maybe that could be a difficult dataset to train on in terms of copyright). But I play piano, and I can play a song if given the chords, but I'm terrible at transcribing stuff myself. So I'd pay money for a service that does this.
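The classical (pre-deep-learning) starting point for such a tool, for what it's worth, is to fold spectral energy into 12 pitch classes (a "chroma" vector) and match those against chord templates, which is roughly why the older software tends to be inaccurate on real mixes. A toy sketch of the idea, using a synthesized sine-wave triad rather than real audio:

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma(signal, sr):
    """Fold spectral energy into 12 pitch classes: the standard first step
    in chord recognition."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    energy = np.zeros(12)
    for f, mag in zip(freqs[1:], spectrum[1:]):  # skip the DC bin
        if 50 < f < 2000:
            # Map frequency to a MIDI-style pitch class (A4 = 440 Hz = "A").
            pc = int(round(12 * np.log2(f / 440.0) + 69)) % 12
            energy[pc] += mag
    return energy

# Synthesize a C major triad (C4, E4, G4) and read off the top pitch classes.
sr = 8000
t = np.arange(sr) / sr
triad = sum(np.sin(2 * np.pi * f * t) for f in (261.63, 329.63, 392.0))
top3 = sorted(np.argsort(chroma(triad, sr))[-3:])
print([NOTE_NAMES[i] for i in top3])  # ['C', 'E', 'G']
```

Real instruments add harmonics, bass notes, and percussion that smear the chroma vector, which is where the modern learned approaches come in.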
Man I can't wait until my phone can use local resources to read my epubs in voices that match 11.ai's output. I'll never have to forlornly search Audible for novels that never got audiobook editions again.
Anyone else think it's an abuse of terminology to refer to speaker conditioning as "in-context learning" and "prompting" now?
Like, using a reference encoder to condition on an unseen target speaker has been around for 4 or 5 years now. These results are already cool without mystifying them by calling this "prompting" or "ICL".
Unless there is some subtle difference that warrants the new terminology?
Is zero shot a meaningfully useful term that I'm just not grokking?
Zero-shot implies that the model was given no direct examples. [0] So, in this case, it wasn't given any examples of the exact voice in combination with text; it is just using the prompt voice plus a prompt text to generate new audio.
[0] https://en.wikipedia.org/wiki/Zero-shot_learning
Are there openly available models that are similar in output, i.e. generating speech or singing with a voice set by a given sample, but that, instead of a text prompt, take speech or singing as input and take pitch-change and intonation cues from it, generating output that generally follows those pitch and intonation changes but adapted to the different voice and diction of the provided sample? For example:
* provided voice sample: some clean voice samples from Homer Simpson
* provided prompt: audio sample of the "gunnery sergeant Hartman" monologue from "Full Metal Jacket": https://www.youtube.com/watch?v=tHxf17yJsKs
* result: that same monologue, but spoken in the voice of Homer Simpson while otherwise following the dynamics of the prompt sample, i.e. shouting, changing pitch or speed at pretty much the same times as Gunnery Sergeant Hartman does?
I find the three links at the top very interesting:
You have a link to the paper (makes sense), then a link to a reddit discussion, then a link to this hacker news post.
Not criticizing them for doing this. It just seems a bit unusual to me. I guess they really really want to generate buzz from this, or else they'd simply link the paper and let any discussion follow naturally.
I mean, most researchers I know are just very excitable about their research and love to share it. I think that's a lot more likely than this being some PR masterplay - if MS wanted to really push this, they wouldn't release like this.
I actually have a chrome extension that will point me to HN discussions on a particular webpage - find it useful to get community context of pages I'm browsing!