Ask HN: My wife might lose the ability to speak in 3 weeks – how to prepare?
855 points| tech4all | 5 years ago
My ideal would be an open source "deepfake toolkit" that allows me to provide pre-recorded samples of her speech and then TTS in her voice. Unfortunately most articles and tools I'm finding are anti-deepfake. Any recommendations?
Fallback would be recording her speaking "phonetic pangrams" and then using her pre-recorded phonemes to recreate speech that sounds like her. I feel like the deepfake toolkit is the way to go. Appreciate any recommendations... There must be open source tools for this??
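To make the "phonetic pangram" fallback concrete, here is a minimal sketch of checking which phonemes a set of recording prompts covers. The pronunciation table and the target inventory below are toy, hand-written assumptions for illustration; a real check would use a full pronouncing dictionary and the full phoneme set of the language.

```python
from typing import Dict, Iterable, List, Set

# Toy pronunciation table (ARPAbet-style symbols); an assumption for illustration only.
TOY_PRONUNCIATIONS: Dict[str, List[str]] = {
    "the": ["DH", "AH"],
    "quick": ["K", "W", "IH", "K"],
    "fox": ["F", "AA", "K", "S"],
}

# Made-up subset of a phoneme inventory that the prompts are supposed to cover.
TARGET_PHONEMES: Set[str] = {"DH", "AH", "K", "W", "IH", "F", "AA", "S", "SH"}


def missing_phonemes(prompts: Iterable[str],
                     pronunciations: Dict[str, List[str]],
                     target: Set[str]) -> Set[str]:
    """Return the target phonemes that no word in the prompts covers."""
    covered: Set[str] = set()
    for prompt in prompts:
        for word in prompt.lower().split():
            # Words missing from the table contribute nothing.
            covered.update(pronunciations.get(word, []))
    return target - covered


still_needed = missing_phonemes(["the quick fox"], TOY_PRONUNCIATIONS, TARGET_PHONEMES)
```

Any phoneme left in `still_needed` means another prompt should be added before recording.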
[+] [-] audiohermit|5 years ago|reply
If you want to read up on the basics, check out the SV2TTS paper: https://arxiv.org/pdf/1806.04558.pdf Basically you use a speaker encoding to condition the TTS output. This paper/idea is used all over, even for speech-to-speech translation, with small changes.
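As a rough illustration of "use a speaker encoding to condition the TTS output", here is a toy sketch in plain Python: a fixed-size speaker vector is appended to every encoder frame before decoding. Averaging several utterance embeddings into one speaker vector is an assumption made here for simplicity, and the function names are hypothetical, not taken from any particular implementation.

```python
import math
from typing import List

Vector = List[float]


def speaker_embedding(utterance_embeddings: List[Vector]) -> Vector:
    """Average per-utterance embeddings and L2-normalize the result."""
    count = len(utterance_embeddings)
    dim = len(utterance_embeddings[0])
    mean = [sum(e[i] for e in utterance_embeddings) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid dividing by zero
    return [x / norm for x in mean]


def condition_on_speaker(encoder_frames: List[Vector], speaker: Vector) -> List[Vector]:
    """Concatenate the same speaker vector onto every encoder time step."""
    return [frame + speaker for frame in encoder_frames]


speaker = speaker_embedding([[3.0, 0.0], [3.0, 8.0]])
conditioned = condition_on_speaker([[1.0], [2.0]], speaker)
```

The decoder then sees the text content and the voice identity together at every step, which is why a new voice only needs a new speaker vector rather than a new model.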
There are a few open-source implementations, but most are outdated; the better ones are kept private for business or privacy reasons.
There's a lot of work on non-parallel transfer learning (aka subjects are saying different things) so TTS has progressed rapidly and most public implementations lag a bit behind the research. If you're willing to grok speech processing, I'd start with NeMo for overall simplicity--don't get distracted by Kaldi.
Edit: Important note! Utterances are usually clipped of silence before/after, so take that into account when analyzing corpus lengths. The quality of each utterance is far more important than its length; fifteen.ai's TTS is so good primarily because they got fans of each character to collect the data.
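To see how much silence clipping changes the corpus-length numbers, here is a minimal sketch that measures a clip's duration after dropping leading and trailing quiet samples. The amplitude threshold and the toy 10 Hz sample rate are assumptions for illustration; real pipelines typically use an energy- or VAD-based trim.

```python
from typing import Sequence


def trimmed_duration(samples: Sequence[float], sample_rate: int,
                     threshold: float = 0.01) -> float:
    """Seconds of audio between the first and last sample at or above threshold."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return 0.0  # the whole clip is silence
    return (voiced[-1] - voiced[0] + 1) / sample_rate


# 1 s of silence, 2 s of "speech", 1 s of silence at a toy rate of 10 samples/s.
clip = [0.0] * 10 + [0.5] * 20 + [0.0] * 10
raw_seconds = len(clip) / 10
speech_seconds = trimmed_duration(clip, 10)
```

Summing `speech_seconds` rather than `raw_seconds` across recordings gives a more honest estimate of how much usable speech has been collected.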
[+] [-] grogenaut|5 years ago|reply
But obviously attend to the human matters as well, e.g. spend time together.
[+] [-] lunixbochs|5 years ago|reply
You can set it up yourself with a bit of Python knowledge from this branch: https://github.com/talonvoice/noise/tree/speech-dataset
There are keyboard shortcuts - up/down/space to move through the list and record quickly.
If you want to use it on arbitrary text prompts, you can modify this function to return each line from a text file: https://github.com/talonvoice/noise/blob/speech-dataset/serv...
If you use this, before recording too much, do some test recordings and make sure they sound ok. Web audio can be unreliable in some browsers.
The uploaded files are named after the short name, so make sure you can match each short name back to its original text prompt, e.g. with string_to_shortname().
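One way to keep that mapping is to build a manifest before recording. The sketch below is only an illustration: `hypothetical_shortname` is a stand-in written for this example, not the repo's string_to_shortname(), whose real behaviour may differ.

```python
import hashlib
import re
from typing import Dict, Iterable


def hypothetical_shortname(prompt: str, max_len: int = 32) -> str:
    """Stand-in naming scheme: a readable slug plus a short hash of the prompt."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")[:max_len]
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()[:8]
    return f"{slug}_{digest}"


def build_manifest(lines: Iterable[str]) -> Dict[str, str]:
    """Map each short name back to its original prompt, skipping blank lines."""
    manifest: Dict[str, str] = {}
    for line in lines:
        prompt = line.strip()
        if prompt:
            manifest[hypothetical_shortname(prompt)] = prompt
    return manifest


manifest = build_manifest(["Good morning, love.", "", "Where did I leave my keys?"])
```

In practice the lines would come from the prompt text file, and the manifest would be saved next to the recordings so every audio file can be paired with its transcript later.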
If you aren’t easily able to do this yourself, I’d be happy to spin up an instance of it for you with text prompts of your choosing.
[+] [-] Veedrac|5 years ago|reply
[+] [-] ChrisGammell|5 years ago|reply
Is that something that would be useful to a researcher in any context? I am intrigued by the idea of having my voice preserved (you know, ego), but also am happy to donate the sound files if they would help researchers in any way for datasets.
If so: [email protected]
[+] [-] oh_sigh|5 years ago|reply
[+] [-] tech4all|5 years ago|reply
[+] [-] daanzu|5 years ago|reply
https://github.com/daanzu/speech-training-recorder
Originally intended for recording data for training speech recognition models [0], it should work just as well for recording to be used for speech synthesis.
[0] https://github.com/daanzu/kaldi-active-grammar
[+] [-] kemiller2002|5 years ago|reply
[+] [-] pmw|5 years ago|reply
I cannot find it now, but I believe he wrote about this exact phenomenon: even with the best technology, you cannot communicate as fluently as a conversation demands, so you're relegated to the background.
Here's one of his writings I was able to find: https://www.rogerebert.com/roger-ebert/i-think-im-musing-my-...
[+] [-] metrokoi|5 years ago|reply
[+] [-] bergerjac|5 years ago|reply
[+] [-] fxtentacle|5 years ago|reply
That way, you can retrain an existing AI to do text to speech with her own voice.
Edit: here's a link to the corpus that I believe Mozilla uses http://www.openslr.org/12/
[+] [-] asveikau|5 years ago|reply
[+] [-] audiohermit|5 years ago|reply
I think OP would ideally want the model to pick up on more natural intonation, instead of monotone dictation. Record everything from now on, as best you can with similar recording context, and hopefully that data will be enough to cover more natural nuances.
[+] [-] aerovistae|5 years ago|reply
[+] [-] trynewideas|5 years ago|reply
[+] [-] josinalvo|5 years ago|reply
[+] [-] tech4all|5 years ago|reply
[+] [-] Mo3|5 years ago|reply
[+] [-] Rotten194|5 years ago|reply
The downvoted commenter was being a jerk, but I do think learning ASL is an option worth looking into.
[+] [-] krisoft|5 years ago|reply
[+] [-] elil17|5 years ago|reply
Also, if you’re not in America, you can learn your local sign language (e.g. British Sign Language, Auslan).
[+] [-] imglorp|5 years ago|reply
[+] [-] zapzupnz|5 years ago|reply
Obviously, it takes great effort on the part of both the wife and OP, plus a rethinking of some social interactions and even social groups.
However, no problem is insurmountable with sufficient assistance and support from friends, family, and expert groups. Learning sign language is fun and a great way to meet new friends, hearing and Deaf alike.
It may be a last resort, but it's an option not to be ignored.
[+] [-] quiet_hacker|5 years ago|reply
Outside of social situations, it honestly hasn't been that big of a deal for me. As a remote developer, my job has remained the same. My managers and coworkers have been super supportive. I send messages during meetings to one person who will read them aloud for me.
With text and social media, I still keep up with friends and family. Most medical appointments, etc, can be made online. SprintIP relay is free for deaf/speech impaired, and it allows the caller to type what they want to say and a representative will relay this to the other party. It works via the web or a mobile app. https://www.sprintrelay.com/sprintiprelay
Banks, brokers, or anything involving personal info (like SS#) usually require a voice phone call. I have my wife call and explain the situation. I can whisper yes, as they occasionally require me to give permission. Some call center representatives have no idea how to handle this situation, and will just stick to the script, saying they have to speak to me the entire time. My wife just thanks them, calls back, and hopes for someone more understanding.
There are awkward encounters where people don't know you can't speak, and will respond by speaking louder and slower. These people will also assume you are not intelligent and be dismissive. This is just one of the things you have to deal with.
I sincerely hope the procedure goes well and your wife doesn't have to deal with this. Just know that even if the worst happens, she can have a normal and productive life!
[+] [-] aspaceman|5 years ago|reply
It sucks you have to just deal with it.
[+] [-] sesuximo|5 years ago|reply
[+] [-] civilian|5 years ago|reply
Did you ever consider learning sign language?
[+] [-] happycry|5 years ago|reply
[+] [-] cdolan|5 years ago|reply
We have a lot of tapes around of his voice, from voice mails to family videos to some things from his work. If you are open to reaching out that would be awesome, I’ll check out the site as well.
Edit: I’ve wanted to make some sort of soundboard + “text to talk” setup for this family member. He often can’t participate in conversations because he writes on a whiteboard, and the speed of chatter moves faster than his writing
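A minimal sketch of the soundboard half of that idea: typed text is normalized and looked up in a table of pre-recorded clips, and anything not found can be handed to a TTS voice instead. The class name and the clip paths are hypothetical; audio playback is left out on purpose.

```python
import re
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so lookups are forgiving."""
    cleaned = re.sub(r"[^a-z0-9\s]+", "", text.lower())
    return " ".join(cleaned.split())


class Soundboard:
    """Maps typed phrases to paths of pre-recorded clips."""

    def __init__(self) -> None:
        self._clips: Dict[str, str] = {}

    def add(self, phrase: str, clip_path: str) -> None:
        self._clips[normalize(phrase)] = clip_path

    def lookup(self, typed: str) -> Optional[str]:
        """Return the clip path, or None when the caller should fall back to TTS."""
        return self._clips.get(normalize(typed))


board = Soundboard()
board.add("I love you", "clips/i_love_you.wav")        # hypothetical file names
board.add("Happy Birthday!", "clips/happy_birthday.wav")
hit = board.lookup("happy birthday")
miss = board.lookup("what's for dinner")
```

Keeping the most common phrases as real recordings means the fastest replies are also the most natural-sounding ones.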
[+] [-] louwhopley|5 years ago|reply
Out of interest, what are the average response times to generate a clip of one or two sentences from a configured voice?
I'm imagining an easy text-to-speech solution the OP could build on this Resemble API.
[+] [-] archon810|5 years ago|reply
[+] [-] mattlondon|5 years ago|reply
Not only will you have your own personal "audio books" of Harry Potter/The Hobbit/Chronicles of Narnia/Oi Frog/Alice in Wonderland/Roald Dahls etc etc for any kids/grandkids/relatives etc that will hopefully be something treasured in its own right, but you'll also have a large corpus of training data from well-known texts that you can retrain over and over as the tech improves in the future. Might be worth chucking in some other well-known texts to avoid over-fitting on a "kids' story voice" - maybe something plain like inauguration speeches/declaration of independence/magna carta/etc.
Obviously I'd focus on gathering raw material now, and focus on the reconstruction later when you've all recovered mentally and physically from whatever happens. The more data the better when it comes to this sort of thing. There might not be something "simple" right now (e.g. you could probably implement WaveNet or a similar paper yourself today and train it on some GPUs in your spare room, but in a few years there might be a nice WYSIWYG/SaaS thing for it), but with the recordings safely stored you'll be able to use them in the future.
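To turn a long text into recording prompts whose transcripts stay aligned with the audio, something like the following sketch could work. The regex sentence split and the 25-word cap are assumptions for illustration; abbreviations such as "Mr." would need extra handling.

```python
import re
from typing import List


def split_into_prompts(text: str, max_words: int = 25) -> List[str]:
    """Split text at sentence-ending punctuation, then cap each prompt's length."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    prompts: List[str] = []
    for sentence in sentences:
        words = sentence.split()
        # Long sentences are broken into chunks of at most max_words words.
        for start in range(0, len(words), max_words):
            prompts.append(" ".join(words[start:start + max_words]))
    return prompts


prompts = split_into_prompts("We hold these truths. They are self-evident! Are they?")
```

Recording one file per prompt keeps each utterance short and makes it easy to re-record a single bad take.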
Best of luck to you both.
[+] [-] Zenbit_UX|5 years ago|reply
[+] [-] kerkeslager|5 years ago|reply
[+] [-] covercash|5 years ago|reply
Having spent a good deal of time in hospitals, a few things I recommend... 10’ phone cable since outlets can sometimes be far from the bed, cheap slippers she can wear to walk around (stepping in a hospital hallway mystery puddle wearing just socks is very unpleasant), comfy clothes that you don’t mind having ruined (T-shirts, underwear, shirts, pajama pants - they can temporarily unhook the IV so she can put a T-shirt on), earplugs, eye mask. If she’s going to be on a liquid-only diet, bring your own since hospital food is not great, not terrible. Soylent/Orgain/Ensure if she’s permitted that, otherwise good quality Italian ices are such a nice treat and most hospitals have a patient fridge/freezer you can store them in. Broth, but go to a restaurant or grocery store/farmers market with a hot soup bar and fill a container with just the broth from the chicken noodle soup. It’s INFINITELY better than boxed broth.
Hopefully all of your research and preparation will be for nothing, I wish you and your wife a successful surgery!
[+] [-] dawg-|5 years ago|reply
[+] [-] coronadisaster|5 years ago|reply
[+] [-] korethr|5 years ago|reply
I am going to assume that your wife and you have a healthy relationship with strong communication, in part because you've developed an intuition for her body language and other non-verbal communication methods. In the scenario where she loses her ability to speak, even if she happily and completely takes to whatever technical solution(s) you offer to replace that, I think it's likely she will reflexively lean more heavily on those non-verbal channels, and you're going to need to get better at reading them than you are now.
[+] [-] uberman|5 years ago|reply
https://speech.microsoft.com/customvoice
I imagine if MS offers custom voices then the other text to speech providers do as well.
Good luck
[+] [-] tech4all|5 years ago|reply
[+] [-] thaumasiotes|5 years ago|reply
> We evaluated our Kennedy results qualitatively along the following dimensions: ... naturalness of the composited articulation; ...
Obviously the state of the art will have advanced, but maybe this can point the way toward more current research.
While I tend to agree with everyone else that this can be a great idea, my instinct is to float the idea to your wife first and see how she responds. I can imagine someone taking this negatively.
[+] [-] foepys|5 years ago|reply
https://www.youtube.com/channel/UCID5qusrF32kSj-oSGq3rJg/vid...
[+] [-] watertom|5 years ago|reply
Just in case: record specific messages for the various people in her life (children, Mom, Dad, siblings, in-laws, friends) that can be used repeatedly. Messages like "X, I love you", "X, I miss you", "Mommy loves you!", "Give me a hug", a holiday greeting, "Happy Birthday", "I'm so proud of you!", a favorite happy saying, a frustration saying.
You get the idea.
[+] [-] arethuza|5 years ago|reply
Recording a message to a yet unborn grandchild is maybe something we could all do!
[+] [-] jasonhn9999|5 years ago|reply
We also used the Verbally premium iPad app to help give him a voice and make transactions easier.
Wishing you all the best.
[+] [-] fxtentacle|5 years ago|reply
The paper https://arxiv.org/abs/1904.05441 has a list of spoofing methods.
Here's one method as a paper: https://arxiv.org/pdf/1806.04558.pdf
And here on GitHub https://github.com/CorentinJ/Real-Time-Voice-Cloning
[+] [-] probably_wrong|5 years ago|reply
[+] [-] mbreese|5 years ago|reply
It’s a bit dated at this point, but I imagine the research has vastly improved since then.
It’s a very good question though. A decade ago this could be done for one man. Can it now be done for anyone? Like others, I’d guess the first step is to record everything while you can.