Nine years ago, my late wife developed a tumor in her throat next to her vocal cords. She was fighting cancer while trying to be a mom to our 3 young boys. Directed radiation treatment was ruled out for this tumor, leaving surgery as the only viable option. The downside was the very real risk of her permanently losing her voice.
Hoping that she’d one day beat the cancer but might not have a voice, I came up with an idea of trying to “capture it” in 2009 - hoping that it could be algorithmically rebuilt in the future. I reached out to a number of individuals who ultimately put me in touch with a research group that had a proprietary setup for capturing samples and rebuilding the voice. Over the Thanksgiving break, I managed to get access to a soundproof recording room and they worked with my wife to capture samples over a period of 4 hours.
Having worked in the infosec space since the 90s, my first reaction is often to ask how new tech/innovation can be used to bypass a control and how one could detect or prevent that. It’s easy to lose sight of how something like this could fundamentally change a person’s life.
This is a great post, although I am sorry for the experiences you went through to acquire this perspective.
Thinking more about the specific use-case you have in mind, I find myself wondering how sentiment and inflection might be captured via a synthetic voice. Would it be inferred from context? How would that inference deal with things like sarcasm/irony? I wonder if there could be some input mechanism for controlling the inflection - what would that input interface look like? Could it key off facial expressions?
I wonder where the existing tech sits in the uncanny valley for this space...
I went camping in the Alps once. On our last night, my friend took a bowl and gathered ashes from the campfire. Half ritualistic, half joking, she said that those ashes marked our trek and experiences, and that she would carry the ashes back home no matter what.
I was very confused about how she would get a bowl of unidentified ash past airport security (we only had a backpack each). She drafted a poorly done and obviously fake death certificate. It was not campfire ash anymore; it was the remains of her father.
The people at the airport were visibly awkward; they tried to be as accommodating as they could. She flew back home with a plastic bowl of ashes from our campfire - it even had bits of birch and branches in it.
Airport security was easily fooled. And the author's mom is easily fooled too, motherly instincts be damned. Would a neural net be fooled by the author's attempts? I know for sure that an automated security system would sound the alarm on my friend. I'd like to see adversarial networks fighting each other on such premises: a son network trying to fool the mother network and vice versa, ad infinitum, at least a billion simulation hours in. What kind of wonders would come out?
I think 'fooled' might be the wrong word to describe the airport security staff. "Had to make a judgement call and decided to err on the side of lenience" is probably a fairer description.
I agree that in the short term adversarial neural networks may be a good line of defense against machine-fabricated audio and video, but in the long term it’s a losing battle. Eventually the neural networks producing the audio and video will become pixel perfect, and at that point no neural network will be able to detect the manipulation. I think we need to seek out a different, more future-proof solution to the problem.
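For what it's worth, the arms race itself is easy to sketch. Here's a toy son-vs-mother training loop in PyTorch, on 1-D numbers rather than audio; every name here is illustrative and has nothing to do with Lyrebird's actual system:

    # Toy adversarial loop: a "son" network forges samples, a "mother"
    # network learns to tell them from the genuine distribution.
    import torch
    import torch.nn as nn

    son = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
    mom = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                        nn.Linear(32, 1), nn.Sigmoid())

    bce = nn.BCELoss()
    opt_son = torch.optim.Adam(son.parameters(), lr=1e-3)
    opt_mom = torch.optim.Adam(mom.parameters(), lr=1e-3)

    for step in range(5000):
        real = torch.randn(64, 1) * 1.25 + 4.0   # the "genuine voice" distribution
        fake = son(torch.randn(64, 8))           # the son's forgeries

        # Mother: learn to score real samples 1 and forgeries 0.
        opt_mom.zero_grad()
        loss_mom = bce(mom(real), torch.ones(64, 1)) + \
                   bce(mom(fake.detach()), torch.zeros(64, 1))
        loss_mom.backward()
        opt_mom.step()

        # Son: learn to make the mother score forgeries as real.
        opt_son.zero_grad()
        loss_son = bce(mom(fake), torch.ones(64, 1))
        loss_son.backward()
        opt_son.step()

As both get better, the mother's score on forgeries drifts toward 0.5, i.e. a coin flip - which is exactly the long-term problem for detection.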
Many years ago, I briefly used to do the reverse. I was tired of constant calls by certain people, so I jokingly started to pick up and say "the subscriber is currently unavailable; please leave a message after the tone <BEEP>". Fooled two people with it before realizing that's a little too disrespectful and stopping.
I did this one time as a joke; the other person hung up immediately. I only figured out weeks later that it was a good friend I hadn't seen in ages who had happened to be in town that day and had wanted to get together.
They were ... not very happy when we realized what had happened.
I do recommend watching the episode of Follow This as suggested in the article (episode 7) if you’re interested in the latest deepfake tech, and its implications for fooling people who can’t readily tell it’s fake.
> required
Is this even legal according to GDPR?
https://screenshots.firefox.com/MnEgMtsGavMlxcts/www.buzzfee...
I was tempted to sign up and try using it for our daily scrum standup conf calls for fun. But after thinking about it I'm also terrified of the possibility that my account or data could be compromised. Imagine the damage someone could do by calling up a relative, posing as me, and saying that I'm in trouble and need money or something?
I was hoping there was another method of doing this instead of playing the audio file out the speaker and using the phone in speakerphone mode.
I have a stutter that is especially bad when I first talk on the phone. I used to do something similar where I would record an introduction and then play it when the phone connected.
The quality wasn't great, but it was better than me not being able to say anything.
The only reason this 'fooled' his mom is because cell phone sound is already bad, so the obvious garbles, fluctuating intonation and weird pauses seem normal.
So it's impressive, but let's not get ahead of ourselves here.
That's not as big a drawback as you might think. There's a guy on YouTube[1] with a channel fashioned after one of those Saturday morning edutainment shows who debunks hoax videos, and one of the tricks he frequently points out is when the video quality has been deliberately degraded to mask editing flaws. It's good enough to fool anybody not looking for it.
[1] https://www.youtube.com/channel/UCEOXxzW2vU0P-0THehuIIeg
I wonder if this would fool the "My voice is my passport" identification systems that I noticed cropped up on a few of the telephone services I was using in the UK?
Kind of a strange way to go about this, but interesting nonetheless. I don't know much about Lyrebird or how long they've been around, but as others have noted... this sounds like a really terrible voice call at best (so far as the samples have shown).
I want to use it, but I wouldn't use it for any real products today. Amazon Polly isn't amazing either, but it sounds more natural than these samples. Yes, it's only a few stock voices, but it's a lot closer.
Again, it's not the quality per se, but the software mimicking your personal voice, with accents and personal quirks (like how I think the BuzzFeed writer pronounces words with a high pitch in the middle?), to lead the person on the other side of the line to conclude "oh yeah, it's him/her".
You know, voice call sound quality goes full terrible in real life too. If, say, one of my ex-girlfriends or even my sister suddenly called and sounded a bit robotic like this, I would believe them to be who they claim to be, no contest.
Do you believe this isn't an advertisement for Lyrebird?
Public key verification is ultimately needed in all end-to-end encryption systems to offer a strong guarantee that the conversation is not subverted by a man-in-the-middle attacker. It can be done using the Socialist Millionaire Protocol (https://en.wikipedia.org/wiki/Socialist_millionaires) and a shared secret, but more often the verification is arranged in person, out-of-band, manually.
As realtime, realistic voice synthesis is thought to be difficult, a voice/phone call encryption system usually circumvents this problem by using the caller's voice as the proof, as both parties recognize each other's voice. Most phone encryption systems, including many commercial systems, the ZRTP protocol by Phil Zimmermann, and the "safety number" in Signal, let both parties read out their pubkey's SHA-256 hash digest (usually encoded to words) aloud as a means of verification.
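The word-encoding step is roughly the following (a toy sketch; real systems like ZRTP and the PGP word list use standardized lists and short authentication strings, so nothing here is the actual protocol):

    # Toy fingerprint-to-words sketch: hash a public key, then map the
    # first few 4-bit chunks of the digest onto a small spoken-word list.
    import hashlib

    WORDS = ["apple", "brick", "cedar", "delta", "ember", "fable",
             "gravel", "harbor", "ivory", "jungle", "kayak", "lantern",
             "marble", "nickel", "orbit", "pebble"]   # improvised, not ZRTP's

    def fingerprint_words(pubkey: bytes, n_words: int = 4) -> str:
        digest = hashlib.sha256(pubkey).digest()
        nibbles = [(digest[i // 2] >> (4 if i % 2 == 0 else 0)) & 0xF
                   for i in range(n_words)]
        return " ".join(WORDS[n] for n in nibbles)

    # Both callers compute this locally and read it aloud to each other;
    # a man-in-the-middle substituting keys would produce different words.
    print(fingerprint_words(b"alice-public-key-bytes"))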
If this type of AI-based voice synthesizer becomes widely available, it could be disastrous for cryptography. It is not the end of the world, of course, as targeted attacks with social engineering are not an issue for most people, and those who need this level of security are going to perform out-of-band exchange anyway, but still, the certainty of voice-based key verification would be greatly weakened.
Cryptography doesn't solve political problems.
Cryptography solves communication problems. That's. It.
The cadence of his speech in the video was clearly off. It was pretty jarring to me.
It didn't sound like the mom was buying it either. Her tone of voice was somewhere between "I'm going to play along with this" and "Dear god, I think my son is on drugs again."
Definitely not convincing to me either. But things sound so bad over a cell phone connection, which kinda makes this work if you squint your ears and drink a liter of spirits.
Not judging his work here, but it sounds like an unstable VoIP call.
http://www.voiptroubleshooter.com/problems/robotic.html
In what sense can this actually be considered to be "AI"? It's software that builds a voice profile from input sources and then uses that profile to generate waveforms. Where is the intelligence?
Artificial intelligence can be defined as training a model with real-world input/output pairs & approximating a general solution to generating output for arbitrary input.
In this case, your brain (the "non-artificial intelligence") can take some text and control your vocal cords to emit sound waves to produce speech. You can even learn different voices, like a cartoon character voice artist. The artificial intelligence can learn to do the same thing.
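To make that definition concrete in the degenerate case (a toy illustration, nowhere near a real text-to-speech model):

    # Fit a model on input/output pairs, then generate output for an
    # input that was never in the training data.
    import numpy as np

    x = np.linspace(0, 1, 50)
    y = np.sin(2 * np.pi * x)                    # "real-world" outputs to learn from

    model = np.poly1d(np.polyfit(x, y, deg=7))   # the trained approximation

    print(model(0.123))                  # prediction for an unseen input
    print(np.sin(2 * np.pi * 0.123))     # ground truth it approximates

A voice model does the same thing at vastly larger scale: the pairs are (text, recorded speech), and the "arbitrary input" is text the speaker never actually said.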
A few years ago, well before Trump was a thing, a writer asked me what I feared about the future. After thinking about it for a few minutes, I answered 'the post-truth society'.
Even without AI algos, we were already well on our way, with 'traditional' evidence-forging technology and public discourse manipulation, to casting reasonable suspicion on everything we see or hear.
Simple AI accelerated and deskilled the former and, combined with ubiquitous social networks, exponentially empowered the latter.
The thing is: these truth-undermining technologies need not be perfect to have the effect, just 'good enough' to cast significant doubt and allow nearly everyone to believe their own 'truths'.
The result will be a highly dis-empathic society, where trust beyond one's closest 'clan' is close to nil, and even there it frays.
Confusion has always empowered narcissists and sociopaths, con artists and cultists. It isn't so hard to see anymore how old civilizations could devolve into the dark ages.
We have already seen this work on "lo-fi" ads.
Ads with obvious spelling and grammar errors, meant to immediately engage those who have no concerns about such things (filtering out the critical thinkers).
Clearly there's a lot of potential for abuse here. On the other hand, similar technology has enabled radio reporter Jamie Dupree to get back on the air after losing his voice to a rare neurological condition:
http://jamiedupree.blog.wsbradio.com/2018/06/18/back-on-the-...
> I always wondered why emails were evidence. They're just text, anyone could fake an email.
Sure, you could say the same of any document, electronic or not. Most crime is not that complicated.
It turns out that the metadata footprint left on a computer creating a document - if you can seize it - is rich in detail, enabling creation date and the like to be identified. Mail servers may hold logs. Often a fake email or document is part of an offense, and proving where a faked email was sent from becomes quite relevant and, yes, ends up as evidence in court.
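For instance, the visible part of that footprint is easy to enumerate in Python (a sketch; "message.eml" is just a stand-in for a message obtained in discovery):

    # List the metadata an email drags along with it: every relaying
    # server stamps a Received header, alongside sender-claimed fields.
    from email import message_from_string

    msg = message_from_string(open("message.eml").read())

    for hop in msg.get_all("Received", []):
        print(hop)                          # server-stamped routing trail
    print(msg["Date"], msg["Message-ID"])   # sender-supplied, forgeable fields

A forger has to fake not just the body but a consistent story across all of those headers, plus the logs on every server named in them.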
You may hear a prosecutor say "and on the 30th of June 2011 did you send the following email..blah..blah"
There is a reason for that - they are establishing the possibility that it is actually evidence. It's not clear cut, otherwise we wouldn't have courts. You proffer evidence and convince people of its weight. And that will continue to be the case.
> I always wondered why emails were evidence. They're just text, anyone could fake an email.
The right question isn't usually whether they could, but how likely it is that they would. Also, I do hope cases don't usually hinge on a single e-mail, but include other evidence that together paints a coherent story.
IANAL but I expect emails are only evidence to the extent that the other party acknowledges (or at least, doesn't object to) their veracity. If they do object, then you can have fun tracking down logs and stuff. Of course, that's also just text, so if that's called into question, you start questioning the people involved in maintaining the logs, etc, and trying to figure out which story is more plausible. So in short, what courts do on a daily basis anyway. All evidence is circumstantial to some extent.
I rarely call or use the phone versus texting. I would think that almost half the population rarely uses the phone too.
Also, the perpetrator is going to have to spoof my exact number to trick friends & relatives - who, I would hope, after speaking to a fake me, would text me soon after with comments & questions.