Ask HN: My wife might lose the ability to speak in 3 weeks – how to prepare?

[+] audiohermit|5 years ago|reply

Hey, speech ML researcher here. Make sure you have recordings from a variety of contexts. fifteen.ai's best TTS voices use ~90 min of utterances, some separated by emotion. If you're having her read a text, make sure it's engaging--we do a lot of unconscious voicing when reading aloud. Tbh, if she has a non-Anglophone accent, you're going to need more because the training data is biased towards UK/US speakers.

If you want to read up on the basics, check out the SV2TTS paper: https://arxiv.org/pdf/1806.04558.pdf Basically you use a speaker encoding to condition the TTS output. This paper/idea is used all over, even for speech-to-speech translation, with small changes.
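To make "use a speaker encoding to condition the TTS output" a bit more concrete, here's a toy sketch in plain Python. The numbers are made up and stand in for real network outputs, and averaging then normalizing is just one simple way to pool several utterances into a single speaker vector. The conditioning step itself (tacking the speaker vector onto every text-encoder frame) is the part that matches the paper's idea.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def speaker_embedding(utterance_embeddings):
    """Pool per-utterance embeddings into one speaker vector: mean, then L2-normalize."""
    dim = len(utterance_embeddings[0])
    count = len(utterance_embeddings)
    mean = [sum(e[i] for e in utterance_embeddings) / count for i in range(dim)]
    return l2_normalize(mean)

def condition_on_speaker(encoder_frames, speaker_vec):
    """Concatenate the same speaker vector onto every text-encoder frame."""
    return [list(frame) + list(speaker_vec) for frame in encoder_frames]

# Toy values (assumptions): two 2-dim utterance embeddings, two 3-dim encoder frames.
utterances = [[3.0, 4.0], [3.0, 4.0]]
speaker = speaker_embedding(utterances)
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
conditioned = condition_on_speaker(frames, speaker)
print(len(conditioned), len(conditioned[0]))
```

In a real system the decoder then attends over those widened frames, which is why the same text can come out in different voices.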

There are a few open-source implementations, but they're mostly outdated--the better ones are private, for either business or privacy reasons.

There's a lot of work on non-parallel transfer learning (aka subjects are saying different things) so TTS has progressed rapidly and most public implementations lag a bit behind the research. If you're willing to grok speech processing, I'd start with NeMo for overall simplicity--don't get distracted by Kaldi.

Edit: Important note! Utterances are usually clipped of silence before/after so take that into account when analyzing corpus lengths. The quality of each utterance is much much more important than the length--fifteen.ai's TTS is so good primarily because they got fans of each character to collect the data.
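A rough sketch of what that clipping does to corpus length, in plain Python. It assumes 16-bit PCM samples already loaded into lists, a 16 kHz sample rate, and a made-up amplitude threshold; real pipelines usually work on frame energy rather than single samples, so treat this as an illustration only.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose absolute amplitude is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def speech_seconds(utterances, sample_rate=16000, threshold=500):
    """Total duration in seconds after trimming silence from each utterance."""
    total = sum(len(trim_silence(u, threshold)) for u in utterances)
    return total / sample_rate

# Synthetic utterance: 0.5 s of silence, 1 s of "speech", 0.5 s of silence.
quiet = [0] * 8000
voiced = [1000, -1200] * 8000
utterance = quiet + voiced + quiet

raw_seconds = len(utterance) / 16000
kept_seconds = speech_seconds([utterance])
print(raw_seconds, kept_seconds)
```

So two seconds on disk can count as one second of training audio, which is worth remembering when comparing your recording time against a figure like ~90 min.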

[+] grogenaut|5 years ago|reply

I came here to say this. My brother has a PhD in chemistry and no coding experience. He was able to create a voice model of himself using basic nvidia example generators in a week. My dad lost his voice and it would have been very nice to have a TTS that was much closer to his own voice. I personally would think it would be worth it to have that database.

But obviously also attend to the human matters as well, eg spend time.

[+] lunixbochs|5 years ago|reply

I have an open source web service for rapidly recording lots of text prompts to flac: https://speech.talonvoice.com (right now the live site prompts for single words because I’m trying to build single word training data, but the prompts can be any length)

You can set it up yourself with a bit of Python knowledge from this branch: https://github.com/talonvoice/noise/tree/speech-dataset

There are keyboard shortcuts - up/down/space to move through the list and record quickly.

If you want to use it on arbitrary text prompts, you can modify this function to return each line from a text file: https://github.com/talonvoice/noise/blob/speech-dataset/serv...

If you use this, before recording too much, do some test recordings and make sure they sound ok. Web audio can be unreliable in some browsers.

The uploaded files are named after the short name, so make sure you can correspond the short name with the original text prompts, eg with string_to_shortname().
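For anyone wanting to keep that short-name-to-prompt mapping around, here's a minimal sketch. To be clear about assumptions: `load_prompts`, `hypothetical_shortname`, `build_manifest` and `manifest_tsv` are names invented for this example, and `hypothetical_shortname` is only a stand-in, NOT the repo's `string_to_shortname()`. Swap in the real function so the names match the uploaded files.

```python
import csv
import hashlib
import io

def load_prompts(text):
    """One prompt per non-empty line of a prompts file's contents."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def hypothetical_shortname(prompt):
    """Stand-in only (assumption): replace with the repo's string_to_shortname()."""
    slug = "_".join(prompt.lower().split()[:3])
    slug = "".join(c for c in slug if c.isalnum() or c == "_")
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()[:8]
    return f"{slug}_{digest}"

def build_manifest(prompts):
    """Map each short name back to the full prompt text."""
    return {hypothetical_shortname(p): p for p in prompts}

def manifest_tsv(manifest):
    """Serialize the manifest as tab-separated 'shortname<TAB>prompt' lines."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for name, prompt in manifest.items():
        writer.writerow([name, prompt])
    return buf.getvalue()

prompts = load_prompts("Good morning, love.\n\nWhat's for dinner?\n")
manifest = build_manifest(prompts)
print(manifest_tsv(manifest))
```

Saving that TSV next to the audio before you start recording means the transcripts can always be matched to the files later.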

If you aren’t easily able to do this yourself, I’d be happy to spin up an instance of it for you with text prompts of your choosing.

[+] Veedrac|5 years ago|reply

Also buy a (half) decent mic! They're much cheaper than you might expect.

[+] ChrisGammell|5 years ago|reply

I have about 500 hours of high quality, channel isolated (separate from the person I was speaking to) audio. It comes from my podcast that I have done for many years. It's probably closer to 75-100 hours audio of me actually speaking, since I am more the interviewer.

Is that something that would be useful to a researcher in any context? I am intrigued by the idea of having my voice preserved (you know, ego), but also am happy to donate the sound files if they would help researchers in any way for datasets.

If so: [email protected]

[+] oh_sigh|5 years ago|reply

Make sure to get a recording of her true, honest laughter too.

[+] tech4all|5 years ago|reply

Thanks so much for the resources and the well thought out reply!

[+] daanzu|5 years ago|reply

I wrote a simple little Python GUI app to record training audio. Given a text file containing prompts, it will choose a random selection and ordering of them, display them to be dictated by the user, and record the dictation audio and metadata to a .wav file and recorder.tsv file respectively. You can select a previous recording to play it back, delete it, and/or re-record it. It comes with a few selections of sentences designed to cover a broad diverse range of English (Arctic, TIMIT). Pretty simple and no-nonsense.
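The core loop described above (pick a random selection and ordering of prompts, then log which file goes with which prompt) can be sketched roughly like this. It is not the app's actual code: the function names, the file naming scheme and the two-column TSV layout are all assumptions made for illustration, and the real recorder.tsv may hold different columns.

```python
import csv
import io
import random

def choose_prompts(prompts, count, seed=None):
    """Pick up to `count` distinct prompts, returned in random order."""
    rng = random.Random(seed)
    return rng.sample(prompts, min(count, len(prompts)))

def write_metadata(rows, out):
    """Write one 'wav_name<TAB>prompt' line per recording (assumed layout)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for wav_name, prompt in rows:
        writer.writerow([wav_name, prompt])

prompts = [
    "The quick brown fox jumps over the lazy dog.",
    "She had your dark suit in greasy wash water all year.",
    "Please call Stella.",
]
session = choose_prompts(prompts, 2, seed=0)

# Hypothetical file names; the audio capture itself is left out of this sketch.
rows = [(f"recording_{i:04d}.wav", text) for i, text in enumerate(session, 1)]
buf = io.StringIO()
write_metadata(rows, buf)
print(buf.getvalue())
```

Passing a seed makes a session reproducible, which helps if a recording has to be redone with the same prompt order.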

https://github.com/daanzu/speech-training-recorder

Originally intended for recording data for training speech recognition models [0], it should work just as well for recording to be used for speech synthesis.

[0] https://github.com/daanzu/kaldi-active-grammar

[+] kemiller2002|5 years ago|reply

My mom lost her ability to speak, and what you are going to find is that your life and how you interact with everyone will have to change. Human verbal communication is very fast. She will find it difficult to be part of normal conversations. Without lots of help, she will start to fade into the background of conversations, because she can't keep up. You will have to help her be a part of things. It will be a depressing experience for her, and you will have to help her. People will look at her differently like she is mentally handicapped. (I know she won't be, but people will assume that she is even unconsciously). I recommend finding her a therapist if she has to go through this transition.

[+] pmw|5 years ago|reply

This reminded me of some wonderful writings from Roger Ebert, a brilliant man who also lost his ability to speak.

I cannot find it now, but I believe he wrote about this exact phenomenon: even with the best technology, you cannot communicate as fluently as a conversation demands, so you're relegated to the background.

Here's one of his writings I was able to find: https://www.rogerebert.com/roger-ebert/i-think-im-musing-my-...

[+] metrokoi|5 years ago|reply

Sometimes we have to be reminded that technology cannot solve all of our problems. I find one of the issues in my relationships is that I try too hard to solve other people's problems. Of course I have good intentions, but focusing on solving the problem can lead to forgetting the small things, like what you said about including her in conversations. Think about what she is experiencing. This is the most important comment I have seen.

[+] bergerjac|5 years ago|reply

Seems like a great application for Elon's Neuralink.

[+] fxtentacle|5 years ago|reply

Record her reading the texts of a standardized text training corpus.

That way, you can retrain an existing AI to do text to speech with her own voice.

Edit: here's a link to the corpus that I believe Mozilla uses http://www.openslr.org/12/

[+] asveikau|5 years ago|reply

Is she on board with this? I can imagine a lot of people being severely put off by being asked to record "a corpus of approximately 1000 hours" in advance of what sounds like a stressful surgery.

[+] audiohermit|5 years ago|reply

I'll push back on this. The quality of the read speech should be a higher concern than having parallel data. Unless OP's wife is a teacher or actor/voice actor, if LibriSpeech transcripts are boring, it will come out in the speech.

I think OP would ideally want the model to pick up on more natural intonation, instead of monotone dictation. Record everything from now on, as best you can with similar recording context, and hopefully that data will be enough to cover more natural nuances.

[+] aerovistae|5 years ago|reply

And get a high quality mic to do it with!

[+] trynewideas|5 years ago|reply

Mozilla's is licensed CC-BY, which is pretty liberal. In case the Attribution license is a blocker, here's CMU_ARCTIC's, which is built from copyright-free sources and has no licensing restraints: http://festvox.org/cmu_arctic/

[+] josinalvo|5 years ago|reply

I think this is backwards... This is a corpus to train speech to text, not text to speech, right?

[+] tech4all|5 years ago|reply

Thanks!

[+] Mo3|5 years ago|reply

This right here

[+] Rotten194|5 years ago|reply

I would also suggest looking into learning American Sign Language (of course alongside this project). While communicating via keyboard is workable and good for communicating with the wider world, ASL would be much more convenient for communicating between you two -- and a very interesting language to boot. It is a foreign language that's not related to English besides a few loan words, but there's tons of online resources and most universities have classes as well. Plus, you also can experience beautiful Deaf culture, with a rich storytelling and poetic tradition that blends language, gesture, acting, and pantomime in a way that's just impossible to translate to a spoken language.

The downvoted commenter was being a jerk, but I do think learning ASL is an option worth looking into.

[+] krisoft|5 years ago|reply

I think your answer misses the point of the question. Learning ASL can be done after the surgery if she lost her voice. The question was what can be done now before the surgery. The kind of things which, if it comes to the worst and she loses her voice, cannot be done after.

[+] elil17|5 years ago|reply

I strongly agree with this. Trying to type as your main form of communication is exhausting. With a sign language, even if you’re not very good at it, you’re having a face to face conversation and you feel a sense of connection.

Also, if you’re not in America, you can learn your local sign language (e.g. British Sign Language, Auslan).

[+] imglorp|5 years ago|reply

Agree. I'm learning it for my spouse. It's fun and easy and there's a ton of resources online and maybe at your local university. You can get good enough to have essential conversations with a few hundred signs. Deep and rapid skill takes study and practice, as you would expect.

[+] zapzupnz|5 years ago|reply

I agree, there is great value in sign languages for people who are unable to speak. (Disclaimer: I am hearing and learnt New Zealand Sign Language)

Obviously, it comes with great effort on both the part of the wife and OP, plus a rethinking of some social interactions and even social groups.

However, no problem is insurmountable with sufficient assistance and support from friends, family, and expert groups. Learning sign language is fun and a great way to meet new friends, hearing and Deaf alike.

It may be a last resort, but it's an option not to be ignored.

[+] quiet_hacker|5 years ago|reply

I have a progressive neurodegenerative disease and lost most of my ability to speak about 3 years ago. What you are proposing is super cool, but you might be overthinking this. These things (text to speech, etc) are more awkward than practical in real life. Also, make sure your wife is completely on board. Seeing old clips and hearing my voice is actually kind of depressing to me. Here is my actual advice:

Outside of social situations, it honestly hasn't been that big of a deal for me. As a remote developer, my job has remained the same. My managers and coworkers have been super supportive. I send messages during meetings to one person who will read them aloud for me.

With text and social media, I still keep up with friends and family. Most medical appointments, etc, can be made online. Sprint IP Relay is free for deaf/speech-impaired users: the caller types what they want to say and a representative relays it to the other party. It works via the web or a mobile app. https://www.sprintrelay.com/sprintiprelay

Banks, brokers, or anything involving personal info (like SS#) usually requires a voice phone call. I have my wife call and explain the situation. I can whisper yes, as they occasionally require me to give permission. Some call center representatives have no idea how to handle this situation, and will just stick to the script saying they have to speak to me the entire time. My wife just thanks them, calls back, and hopes for someone more understanding.

There are awkward encounters where people don't know you can't speak, and will respond by speaking louder and slower. These people will also assume you are not intelligent and be dismissive. This is just one of the things you have to deal with.

I sincerely hope the procedure goes well and your wife doesn't have to deal with this. Just know that even if the worst happens, she can have a normal and productive life!

[+] aspaceman|5 years ago|reply

> There are awkward encounters where people don't know you can't speak, and will respond by speaking louder and slower. These people will also assume you are not intelligent and be dismissive. This is just one of the things you have to deal with.

It sucks you have to just deal with it.

[+] sesuximo|5 years ago|reply

That’s terrible about the call centers who need verbal confirmation. Crazy that they didn’t set up an alternative.

[+] civilian|5 years ago|reply

How do you communicate with your wife?

Did you ever consider learning sign language?

[+] happycry|5 years ago|reply

We get quite a few requests for this at Resemble (https://resemble.ai). We can get her to record right on our website or you can upload an existing file (along with a video of her consent) on the platform. Feel free to shoot me a message and I'd be happy to help build a voice for her.

[+] cdolan|5 years ago|reply

I don't know how to send messages but I researched this space a few years ago. Unfortunately a family member of mine had a surgery that resulted in the loss of his speech.

We have a lot of tapes around of his voice, from voice mails to family videos to some things from his work. If you are open to reaching out that would be awesome, I’ll check out the site as well.

Edit: I’ve wanted to make some sort of soundboard + “text to talk” setup for this family member. He often can’t participate in conversations because he writes on a whiteboard, and the speed of chatter moves faster than his writing

[+] louwhopley|5 years ago|reply

Wow, this looks like a great service!

Out of interest what are the average response times to generate a clip of one or two sentences from a configured voice?

I'm imagining the easy text-to-speech solution the OP could build on the Resemble API.

[+] archon810|5 years ago|reply

Just FYI, your page keeps jumping on mobile as it renders and erases words. Not a good experience if I'm trying to read.

[+] mattlondon|5 years ago|reply

I don't know if you have kids/grandkids/nieces or nephews (or plan to have those) but it might be nice to record your wife reading some books out loud.

Not only will you have your own personal "audio books" of Harry Potter/The Hobbit/Chronicles of Narnia/Oi Frog/Alice in Wonderland/Roald Dahls etc etc for any kids/grandkids/relatives etc that will hopefully be something treasured in its own right, but you'll also have a large corpus of training data from well-known texts that you can retrain over and over as the tech improves in the future. Might be worth chucking in some other well-known texts to avoid over-fitting on a "kids' story voice" - maybe something plain like inauguration speeches/declaration of independence/magna carta/etc.

Obviously I'd focus on gathering raw material now, and focus on the reconstruction later when you've all recovered mentally and physically from whatever happens. The more data the better when it comes to this sort of thing. There might not be something "simple" right now (e.g. you could probably implement the WaveNet or similar paper yourself today, and train it up on some GPUs in your spare room etc, but in a few years there might be a nice WYSIWYG/SaaS thing for it), but with the recordings safely stored you'll obviously be able to use them in the future.

Best of luck to you both.

[+] Zenbit_UX|5 years ago|reply

I like this idea, but the specific examples you give would almost certainly be a terrible choice. Training on Tolkien or old legalese like the Magna Carta would give you a model with a lot of thee, thus, therefore and thou art, undertrained on modern English. His wife would sound like the second coming of Jesus or Shakespeare and less like a normal human being.

[+] kerkeslager|5 years ago|reply

I don't have any answers to give you, but I want to say that this is a really loving and beautiful thing you're trying to do.

[+] covercash|5 years ago|reply

Other resources you may want to explore are r/mute and r/deaf subreddits. Both also have Discord servers listed in the sidebars.

Having spent a good deal of time in hospitals, a few things I recommend... 10’ phone cable since outlets can sometimes be far from the bed, cheap slippers she can wear to walk around (stepping in a hospital hallway mystery puddle wearing just socks is very unpleasant), comfy clothes that you don’t mind having ruined (T-shirts, underwear, shirts, pajama pants - they can temporarily unhook the IV so she can put a T-shirt on), earplugs, eye mask. If she’s going to be on liquid-only diet, bring your own since hospital food is not great, not terrible. Soylent/Orgain/Ensure if she’s permitted that, otherwise good quality Italian ices are such a nice treat and most hospitals have a patient fridge/freezer you can store them in. Broth, but go to a restaurant or grocery store/farmers market with hot soup bar and fill a container with just the broth from the chicken noodle soup. It’s INFINITELY better than boxed broth.

Hopefully all of your research and preparation will be for nothing, I wish you and your wife a successful surgery!

[+] dawg-|5 years ago|reply

Speech-language Pathology student here. I would recommend going to see a speech therapist. It will likely be covered by your health insurance. Find an SLP who specializes in AAC (Augmentative and Alternative Communication) who can help your wife communicate if she loses her speech. Your DIY approach could work, but having support from an SLP to help her learn the system, and come up with other options if it doesn't cover all of her communication needs, will go a long way.

[+] coronadisaster|5 years ago|reply

Just have her carry a good microphone at all times to record everything she says until that point, to have a maximum amount of samples. If you can't "deepfake" it today, maybe you will be able to do it tomorrow, but at least you will have the data.

[+] korethr|5 years ago|reply

Others here are addressing technical solutions, but I don't see anyone here covering non-verbal communication. IMO, that's going to be just as important.

I am going to assume that your wife and you have a healthy relationship with strong communication, in part because you've developed an intuition for her body language and other non-verbal communication methods. In the scenario where she loses her ability to speak, even if she happily and completely takes to whatever technical solution(s) you offer to replace that, I think it's likely she will reflexively lean more heavily on those non-verbal channels, and you're going to need to get better at reading them than you are now.

[+] uberman|5 years ago|reply

This might get you started:

https://speech.microsoft.com/customvoice

I imagine if MS offers custom voices then the other text to speech providers do as well.

Good luck

[+] tech4all|5 years ago|reply

Thank you - great lead.

[+] thaumasiotes|5 years ago|reply

Some (decades old) research on this involved a research team creating a video of JFK saying "I never met Forrest Gump". I found a writeup in Google Books: https://books.google.com/books?id=mQtGVQeQplcC&pg=PA208&lpg=...

> We evaluated our Kennedy results qualitatively along the following dimensions: ... naturalness of the composited articulation; ...

Obviously the state of the art will have advanced, but maybe this can point the way toward more current research.

While I tend to agree with everyone else that this can be a great idea, my instinct is to float the idea to your wife first and see how she responds. I can imagine someone taking this negatively.

[+] foepys|5 years ago|reply

There is a YouTube channel called "Speaking of AI" that makes short fake speeches of some US public figures. The quality is quite good and a bit frightening.

https://www.youtube.com/channel/UCID5qusrF32kSj-oSGq3rJg/vid...

[+] watertom|5 years ago|reply

If she loses her ability to speak there are many ways to help her out, but nothing can replace the sound of her voice, especially for those important moments.

Just in case, record specific messages for the people in her life (children, Mom, Dad, siblings, in-laws, friends) that can be used repeatedly. Messages like: "X, I love you", "X, I miss you", "Mommy loves you!", "Give me a hug", a holiday greeting, "Happy Birthday", "I'm so proud of you!", a favorite happy saying, a frustration saying.

You get the idea.

[+] arethuza|5 years ago|reply

What about recording messages to other people for future events (e.g. graduation of a child, birth of grandchild etc.)?

Recording a message to a yet unborn grandchild is maybe something we could all do!

[+] jasonhn9999|5 years ago|reply

When my dad lost his speech, we had Boogie Board Jot devices all over the house. It made writing short notes and simple dialogs much less tedious.

We also used the Verbally premium iPad app to help give him a voice and make transactions easier.

Wishing you all the best.

[+] fxtentacle|5 years ago|reply

The paper "Generalization Of Audio Deepfake Detection" gives an overview.

The paper https://arxiv.org/abs/1904.05441 has a list of spoofing methods.

Here's one method as paper https://arxiv.org/pdf/1806.04558.pdf

And here on GitHub https://github.com/CorentinJ/Real-Time-Voice-Cloning

[+] probably_wrong|5 years ago|reply

For an open-source approach, the MaryTTS project has a guide on how to add new voices to their tool: https://github.com/marytts/marytts/wiki/VoiceImportToolsTuto...

[+] mbreese|5 years ago|reply

You may want to look up what was done for Roger Ebert. He lost his voice due to surgery, but because of the vast corpus of audio recordings of him, a viable text to speech engine could be created.

It’s a bit dated at this point, but I imagine the research has vastly improved since then.

It’s a very good question though. A decade ago this could be done for one man. Is it now possible for anyone? Like others, I’d guess the first step is to record everything while you can.

217 comments