Nine years ago, my late wife developed a tumor in her throat next to her vocal cords. She was fighting cancer while trying to be a mom to our 3 young boys. Directed radiation treatment was ruled out for this tumor, leaving surgery as the only viable option. The downside was the very real risk of her permanently losing her voice.
Hoping that she’d one day beat the cancer but might not have a voice, I came up with an idea of trying to “capture it” in 2009 - hoping that it could be algorithmically rebuilt in the future. I reached out to a number of individuals who ultimately put me in touch with a research group that had a proprietary setup for capturing samples and rebuilding the voice. Over the Thanksgiving break, I managed to get access to a soundproof recording room and they worked with my wife to capture samples over a period of 4 hours.
Having worked in the infosec space since the 90s, my first reaction is often to ask how new tech/innovation can be used to bypass a control and how one could detect or prevent that. It’s easy to lose sight of how something like this could fundamentally change a person’s life.
This is a great post, although I am sorry for the experiences you went through to acquire this perspective.
Thinking more about the specific use-case you have in mind, I find myself wondering how sentiment and inflection might be captured via a synthetic voice. Would it be inferred from context? How would that inference deal with things like sarcasm/irony? I wonder if there could be some input mechanism for controlling the inflection - what would that input interface look like? Could it key off facial expressions?
I wonder where the existing tech sits in the uncanny valley for this space...
I went camping in the Alps once. On our last night, my friend took a bowl and gathered ashes from the campfire. Half ritualistic, half joking, she said that those ashes marked our trek and experiences, and that she would carry the ashes back home no matter what.
I was very confused about how she would get a bowl of unidentified ash past airport security (we only had a backpack each). She drafted a poorly done and obviously fake death certificate. It was not campfire ash anymore; it was the remains of her father.
The people at the airport were visibly awkward; they tried to be as accommodating as they could. She flew back home with a plastic bowl of ashes from our campfire - it even had bits of birch and branches in it.
Airport security was easily fooled. And the author's mom is easily fooled too, motherly instincts be damned. Would a neural net be fooled by the author's attempts? I know for sure that an automated security system would sound the alarm on my friend. I'd like to see adversarial networks fighting each other on such premises: a son network trying to fool the mother network and vice versa, ad infinitum, at least a billion simulation hours in. What kind of wonders would come out?
I think 'fooled' might be the wrong word to describe the airport security staff. "Had to make a judgement call and decided to err on the side of lenience" is probably a fairer description.
I agree that in the short term adversarial neural networks may be a good line of defense against machine-fabricated audio and video, but in the long term it’s a losing battle. Eventually the neural networks producing the audio and video will become pixel perfect, and at that point no neural network will be able to detect the manipulation. I think we need to seek out a different, more future-proof solution to the problem.
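For what it's worth, the arms race itself is easy to sketch. Here's a toy son-vs-mother training loop in PyTorch, on 1-D numbers rather than audio; every name here is illustrative and has nothing to do with Lyrebird's actual system:

    # Toy adversarial loop: a "son" network forges samples, a "mother"
    # network learns to tell them from the genuine distribution.
    import torch
    import torch.nn as nn

    son = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    mom = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                        nn.Linear(32, 1), nn.Sigmoid())

    bce = nn.BCELoss()
    opt_son = torch.optim.Adam(son.parameters(), lr=1e-3)
    opt_mom = torch.optim.Adam(mom.parameters(), lr=1e-3)

    for step in range(5000):
        real = torch.randn(64, 1) * 1.25 + 4.0   # the "genuine voice" distribution
        fake = son(torch.randn(64, 8))           # the son's forgeries

        # Mother: learn to score real samples 1 and forgeries 0.
        opt_mom.zero_grad()
        loss_mom = bce(mom(real), torch.ones(64, 1)) + \
                   bce(mom(fake.detach()), torch.zeros(64, 1))
        loss_mom.backward()
        opt_mom.step()

        # Son: learn to make the mother score forgeries as real.
        opt_son.zero_grad()
        loss_son = bce(mom(fake), torch.ones(64, 1))
        loss_son.backward()
        opt_son.step()

As both get better, the mother's score on forgeries drifts toward 0.5, i.e. a coin flip - which is exactly the long-term problem for detection.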
Many years ago, I briefly used to do the reverse. I was tired of constant calls by certain people, so I jokingly started to pick up and say "the subscriber is currently unavailable; please leave a message after the tone <BEEP>". Fooled two people with it before realizing that's a little too disrespectful and stopping.
I did this one time as a joke; the other person hung up immediately. I only figured out weeks later that it was a good friend I hadn't seen in ages who had happened to be in town that day and had wanted to get together.
They were ... not very happy when we realized what had happened.
I do recommend watching the episode of Follow This as suggested in the article (episode 7) if you’re interested in the latest deepfake tech, and its implications for fooling people who can’t readily tell it’s fake.
> required
Is this even legal according to GDPR?
https://screenshots.firefox.com/MnEgMtsGavMlxcts/www.buzzfee...
I was tempted to sign up and try using it for our daily scrum standup conf calls for fun. But after thinking about it I'm also terrified of the possibility that my account or data could be compromised. Imagine the damage someone could do by calling up a relative, posing as me, and saying that I'm in trouble and need money or something?
I was hoping there was another method of doing this instead of playing the audio file out the speaker and using the phone in speakerphone mode.
I have a stutter that is especially bad when I first talk on the phone. I used to do something similar where I would record an introduction and then play it when the phone connected.
The quality wasn't great, but it was better than me not being able to say anything.
The only reason this 'fooled' his mom is because cell phone sound is already bad, so the obvious garbles, fluctuating intonation and weird pauses seem normal.
So it's impressive, but let's not get ahead of ourselves here.
That's not as big a drawback as you might think. There's a guy on YouTube[1] with a channel fashioned after one of those Saturday morning edutainment shows who debunks hoax videos, and one of the tricks he frequently points out is when the video quality has been deliberately degraded to mask editing flaws. It's good enough to fool anybody not looking for it.
[1] https://www.youtube.com/channel/UCEOXxzW2vU0P-0THehuIIeg
I wonder if this would fool the "My voice is my passport" identification systems that I noticed cropped up on a few of the telephone services I was using in the UK?
Kind of a strange way to go about this, but interesting nonetheless. I don't know much about Lyrebird or how long they've been around, but as others have noted... this sounds like a really terrible voice call at best (so far as the samples have shown).
I want to use it, but I wouldn't use it for any real products today. Amazon Polly isn't amazing either, but it sounds more natural than these samples. Yes, it's only a few stock voices, but it's a lot closer.
Again, it's not the quality per se, but the software mimicking your personal voice, with accents and personal quirks (like how I think the BuzzFeed writer pronounces words with a high pitch in the middle?), to lead the person on the other side of the line to conclude "oh yeah, it's him/her".
You know, voice call sound quality goes full terrible in real life too. If, say, one of my ex-girlfriends or even my sister suddenly called and sounded a bit robotic like this, I would believe them to be who they claim to be, no contest.
Do you believe this isn't an advertisement for Lyrebird?
Public key verification is ultimately needed in all end-to-end encryption systems to offer a strong guarantee that the conversation is not subverted by a man-in-the-middle attacker. It can be done using the Socialist Millionaire Protocol (https://en.wikipedia.org/wiki/Socialist_millionaires) and a shared secret, but more often the verification is arranged in person, out-of-band, manually.
As realtime, realistic voice synthesis is thought to be difficult, a voice/phone call encryption system usually circumvents this problem by using the caller's voice as the proof, as both parties recognize each other's voice. Most phone encryption systems, including many commercial systems, the ZRTP protocol by Phil Zimmermann, and the "safety number" in Signal, let both parties read out their pubkey's SHA-256 hash digest (usually encoded to words) aloud as a means of verification.
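The word-encoding step is roughly the following (a toy sketch; real systems like ZRTP and the PGP word list use standardized lists and short authentication strings, so nothing here is the actual protocol):

    # Toy fingerprint-to-words sketch: hash a public key, then map the
    # first few 4-bit chunks of the digest onto a small spoken-word list.
    import hashlib

    WORDS = ["apple", "brick", "cedar", "delta", "ember", "fable",
             "gravel", "harbor", "ivory", "jungle", "kayak", "lantern",
             "marble", "nickel", "orbit", "pebble"]   # improvised, not ZRTP's

    def fingerprint_words(pubkey: bytes, n_words: int = 4) -> str:
        digest = hashlib.sha256(pubkey).digest()
        nibbles = [(digest[i // 2] >> (4 if i % 2 == 0 else 0)) & 0xF
                   for i in range(n_words)]
        return " ".join(WORDS[n] for n in nibbles)

    # Both callers compute this locally and read it aloud to each other;
    # a man-in-the-middle substituting keys would produce different words.
    print(fingerprint_words(b"alice-public-key-bytes"))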
If this type of AI-based voice synthesizer becomes widely available, it could be disastrous for cryptography. It is not the end of the world, of course, as targeted attacks with social engineering are not an issue for most people, and those who need this level of security are going to perform out-of-band exchange anyway, but still, the certainty of voice-based key verification would be greatly weakened.
Cryptography doesn't solve political problems.
Cryptography solves communication problems. That's. It.
The cadence of his speech in the video was clearly off. It was pretty jarring to me.
It didn't sound like the mom was buying it either. Her tone of voice was somewhere between "I'm going to play along with this" and "Dear god, I think my son is on drugs again."
Definitely not convincing to me either. But things sound so bad over a cell phone connection, which kinda makes this work if you squint your ears and drink a liter of spirits.
Not judging his work here, but it sounds like an unstable VoIP call.
http://www.voiptroubleshooter.com/problems/robotic.html
In what sense can this actually be considered to be "AI"? It's software that builds a voice profile from input sources and then uses that profile to generate waveforms. Where is the intelligence?
Artificial intelligence can be defined as training a model with real-world input/output pairs & approximating a general solution to generating output for arbitrary input.
In this case, your brain (the "non-artificial intelligence") can take some text and control your vocal cords to emit sound waves to produce speech. You can even learn different voices, like a cartoon character voice artist. The artificial intelligence can learn to do the same thing.
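To make that definition concrete in the degenerate case (a toy illustration, nowhere near a real text-to-speech model):

    # Fit a model on input/output pairs, then generate output for an
    # input that was never in the training data.
    import numpy as np

    x = np.linspace(0, 1, 50)
    y = np.sin(2 * np.pi * x)                    # "real-world" outputs to learn from

    model = np.poly1d(np.polyfit(x, y, deg=7))   # the trained approximation

    print(model(0.123))                  # prediction for an unseen input
    print(np.sin(2 * np.pi * 0.123))     # ground truth it approximates

A voice model does the same thing at vastly larger scale: the pairs are (text, recorded speech), and the "arbitrary input" is text the speaker never actually said.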
A few years ago, well before Trump was a thing, a writer asked me what I feared about the future. After thinking about it for a few minutes, I answered 'the post-truth society'.
Even without AI algos, we were already well on our way, with 'traditional' evidence-forging technology and public discourse manipulation, to casting reasonable suspicion on everything we see or hear.
Simple AI accelerated and deskilled the former and, combined with ubiquitous social networks, exponentially empowered the latter.
The thing is: these truth-undermining technologies need not be perfect to have the effect, just 'good enough' to cast significant doubt and allow nearly everyone to believe their own 'truths'.
The result will be a highly dis-empathic society, where trust beyond one's closest 'clan' is close to nil, and even there it frays.
Confusion has always empowered narcissists and sociopaths, con artists and cultists. It isn't so hard to see anymore how old civilizations could devolve into the dark ages.
We have already seen this work on "lo-fi" ads.
Ads with obvious spelling and grammar errors, meant to immediately engage those who have no concerns about such things (filtering out the critical thinkers).
Clearly there's a lot of potential for abuse here. On the other hand, similar technology has enabled radio reporter Jamie Dupree to get back on the air after losing his voice to a rare neurological condition:
http://jamiedupree.blog.wsbradio.com/2018/06/18/back-on-the-...
> I always wondered why emails were evidence. They're just text, anyone could fake an email.
Sure, you could say the same of any document, electronic or not. Most crime is not that complicated.
It turns out that the metadata footprint left on a computer creating a document - if you can seize it - is rich in detail, enabling creation date and the like to be identified. Mail servers may hold logs. Often a fake email or document is part of an offense, and proving where a faked email was sent from becomes quite relevant and, yes, ends up as evidence in court.
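For instance, the visible part of that footprint is easy to enumerate in Python (a sketch; "message.eml" is just a stand-in for a message obtained in discovery):

    # List the metadata an email drags along with it: every relaying
    # server stamps a Received header, alongside sender-claimed fields.
    from email import message_from_string

    msg = message_from_string(open("message.eml").read())

    for hop in msg.get_all("Received", []):
        print(hop)                          # server-stamped routing trail
    print(msg["Date"], msg["Message-ID"])   # sender-supplied, forgeable fields

A forger has to fake not just the body but a consistent story across all of those headers, plus the logs on every server named in them.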
You may hear a prosecutor say "and on the 30th of June 2011 did you send the following email..blah..blah"
There is a reason for that - they are establishing the possibility that it is actually evidence. It's not clear cut, otherwise we wouldn't have courts. You proffer evidence and convince people of its weight. And that will continue to be the case.
> I always wondered why emails were evidence. They're just text, anyone could fake an email.
The right question isn't usually whether they could, but how likely it is that they would. Also, I do hope cases don't usually hinge on a single e-mail, but include other evidence that together paints a coherent story.
IANAL but I expect emails are only evidence to the extent that the other party acknowledges (or at least, doesn't object to) their veracity. If they do object, then you can have fun tracking down logs and stuff. Of course, that's also just text, so if that's called into question, you start questioning the people involved in maintaining the logs, etc, and trying to figure out which story is more plausible. So in short, what courts do on a daily basis anyway. All evidence is circumstantial to some extent.
I rarely call or use the phone versus texting. I would think that almost half the population rarely uses the phone too.
Also, the perpetrator is going to have to spoof my exact number to trick friends & relatives - who, I would hope, after speaking to a fake me, would text me soon after with comments & questions.