item 15088989

Deep Learning for Siri’s Voice

234 points | by subset | 8 years ago | machinelearning.apple.com

90 comments

[+] StavrosK|8 years ago|reply
The iOS 11 Siri sounds like it's a real person talking, it's amazing. Does anyone know if there's an open-source TTS library available with such quality (or if anyone is working on one, from this paper)?

I would love to have my home speakers announce things in this voice.

[+] knolan|8 years ago|reply
She sounds younger to me, but very natural sounding.

Will be interesting to see how Siri on Home Pod works out.

[+] dmix|8 years ago|reply
I'd love to have my Instapaper articles read to me in that TTS voice.

Hopefully it gets ported to MacOS's say CLI utility. I typically use that with `pbpaste | say` to read my articles.
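For anyone who hasn't tried it, that pipeline generalizes a bit. A couple of hedged variations (macOS only; `say` and `pbpaste` ship with the OS, and the voice name is just an example installed voice):

```shell
# Speak the clipboard contents aloud
pbpaste | say

# Use a specific voice and slow the speaking rate (words per minute)
pbpaste | say -v Samantha -r 180

# Render the clipboard to an audio file instead of speaking it
pbpaste | say -o article.aiff
```

The `-o` variant is handy for turning saved articles into audio files you can queue up later.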

[+] beisner|8 years ago|reply
A research paper published by Apple? About Siri?! Unheard of! Last time I was at an NLP conference with Apple employees, they wouldn't say anything about how Siri's speech worked, despite being very inquisitive about everyone else's publications. Good to see some change.
[+] edwhitesell|8 years ago|reply
It's probably safe to assume a lot of that was due to some or most of Siri being licensed from Nuance initially. Who wants to talk about a new product that most people think is brand new and entirely innovative, just to say, "Oh yeah, we paid someone else to work with us to create it"?

Not that there's anything wrong with that and it certainly seems like Apple has been investing in-house pretty heavily in recent years for Siri improvement.

[+] captain_murdock|8 years ago|reply
You can read some more discussion on it here: https://news.ycombinator.com/item?id=14804018.

One of the ideas is that most ML researchers want to publish their work and Apple wasn't allowing it. Allowing ML researchers at Apple to publish in this journal was the only way they could get more ML researchers to work for them.

[+] jclardy|8 years ago|reply
Yeah, they do seem to be opening up a bit. They posted their first article on this ML blog a few weeks ago.
[+] sarabande|8 years ago|reply
Also glad to see this. Still curious as to why they wouldn't post it as a research paper on arXiv -- what's the point in reinventing the wheel here? I suppose it's nice for publicity, but would be great if they also played nicely with the ecosystem.
[+] jchw|8 years ago|reply
My favorite part is that the runtime runs on-device. I moved back to Android, but one thing Apple consistently does that I like is that they don't move things to the cloud as often as Google does. On Android, you get degraded TTS when your connection is shoddy.
[+] zionic|8 years ago|reply
It's two different philosophies. With Apple it's about providing sufficient value such that the consumer will pay a premium for the product. With Google it's about providing the minimum viable value such that the user will provide as much of their data as possible.
[+] MBCook|8 years ago|reply
That makes sense. I always wondered why improvements to Siri's voice required updating the device. I figured it had to be running there, getting just the text from the web service and not the audio.
[+] TazeTSchnitzel|8 years ago|reply
On iOS, by contrast, TTS quality depends on available disk space. If there's too little of it, iOS removes Siri's higher-quality voice files.
[+] X86BSD|8 years ago|reply
Their TV service does this too on Google Fiber. It sucks, it's a tire fire. It is laggy, wedges until you have to reboot the network and TV box. It's a horrible experience. I really wish they would stop trying to shove everything into the cloud. Seriously Google, STOP. JUST STOP!
[+] Eridrus|8 years ago|reply
Fundamentally, offline TTS/ASR/NLP is going to be degraded because you can't fit cloud-sized models onto a mobile device.

Could offline models be better? Definitely. But the only way to make them as good as cloud models is to make the cloud models worse.

[+] quiteawhile|8 years ago|reply
I haven't been able to read the paper yet, and I know very little about this, but listening to the audio samples, it seems one of the most notable changes is the intonation across phrases. Did anyone else catch something like that? I'm not sure I'm explaining it well. If you listen to all the iOS 11 samples, it'll stand out.

Anyway, it's the only way I can still identify this as a fake voice. The intonation always follows the same cadence (not sure if that's the word?). We really shouldn't have overused the word awesome before this kind of thing came along.

There's a kind of dread too, tbh; this kind of seamless TTS has the potential to change a lot of things. Criminals are going to love this, and YouTube pranksters too. Eventually it will shake up the voice acting industry, possibly in a way that isn't healthy for voice actors, while at the same time allowing projects with smaller budgets to have incredible voice work (dubbing, too).

What I think is really important, though, is that as we move away from the uncanny valley, our relationship with these voices changes; our brains don't have the capacity to listen to a voice this real and not imagine it as a person, even as adults.

Ironically, at this moment I'm wearing an old Threadless sweatshirt that says "this was supposed to be the future," but nowadays I can honestly say we're getting there.

[+] lawkwok|8 years ago|reply
Regarding voice acting, I think there is something to be said for human expression and ad-libbing. Sure, you could generate a natural-sounding computer voice, but in the context of the arts, we still have a ways to go before a computer can go off script and add just the perfect amount of intonation on a certain word to turn a phrase into an iconic quote.

Similarly, we don’t see CGI motion capture replacing Andy Serkis any time soon.

[+] ghaff|8 years ago|reply
I think you're overstating things. On the one hand, a lot of applications where quality wasn't that critical switched over ages ago. And, on the other hand, any application that would have spent the money on voice acting is still going to pay for both the higher quality and for a sound that isn't the same as everyone else is using. (Note that Siri's new iOS voice is based on a new training set from a new person.)

I do think there are applications that we just don't have today because TTS isn't good enough. I've had some ideas around Alexa apps built on TTS'd content, but the current Polly just isn't human enough. I don't think this is there yet either, but it's getting close.

[+] coldcode|8 years ago|reply
The difference between the Siri voices from iOS 9 to 11 is startling. I can still hear some issues, especially at the ends of phrases, but it's extremely good.
[+] pault|8 years ago|reply
iOS 11 sounds almost as good as the WaveNet demo. Considering it runs in real time, that's very impressive.
[+] MBCook|8 years ago|reply
I hope someone makes a YouTube video going back to when Siri first launched to show just how much it's evolved.

Listening to those samples I remember how big an advancement iOS 10 felt, but it's nothing compared to 11.

[+] default-kramer|8 years ago|reply
This just made me realize that every time you see a strong AI in fiction, it still has a computer-sounding voice. If we ever develop strong AI, we will probably already have perfectly natural speech synthesis. And if not, the AI could develop it for us.

But I suppose an AI might choose to use a computer-sounding voice to remind us that it is a computer. Kind of like those inaccurate sound effects in movies - they have become so common that it seems more wrong to omit them. (TV Tropes calls this "The Coconut Effect".)

[+] banderman|8 years ago|reply
I recommend watching the scifi film "Her", it has a different take on this.
[+] digi_owl|8 years ago|reply
Anyone else find themselves thinking about Data, and why he was portrayed the way he was?
[+] sib|8 years ago|reply
The prosody and continuity of the speech are dramatically improved. This is hard to do and very impressive (especially given that it is being done on-device).

Personally, I'm less pleased with the actual new voice itself, although that is more a subjective judgment. After listening to many hundreds of voice talent auditions for Alexa, it's hard to step back from that level of pickiness.

[+] ghaff|8 years ago|reply
As I indicated in another comment, the visual that the voice (together with other tweaks in some of Siri's responses) suggests to me is a perky twenty-something.

I actually tend to prefer some of the female British accents in several current TTS systems. (Amy is probably my favorite Polly voice.) Perhaps because I'm an American, the robotic-ness doesn't seem quite as obvious or grating.

[+] goespro_tocall|8 years ago|reply
How'd you get to listen to many hundreds of voice talent auditions for Alexa?
[+] ucaetano|8 years ago|reply
Kinda sad to see that the names of the authors are omitted, although you can infer some of them from the quote:

> For more details on the new Siri text-to-speech system, see our published paper “Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System”

[9] T. Capes, P. Coles, A. Conkie, L. Golipour, A. Hadjitarkhani, Q. Hu, N. Huddleston, M. Hunt, J. Li, M. Neeracher, K. Prahallad, T. Raitio, R. Rasipuram, G. Townsend, B. Williamson, D. Winarsky, Z. Wu, H. Zhang. Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System, Interspeech, 2017.

Why not just add the names by default?

[+] pault|8 years ago|reply
It might seem silly, but I'm looking forward to the first AI talk therapist. Most of the benefit of therapy is the talking, so it's not as crazy as it sounds.
[+] CharlesW|8 years ago|reply
> It might seem silly, but I'm looking forward to the first AI talk therapist. Most of the benefit of therapy is the talking, so it's not as crazy as it sounds.

Not crazy at all. At least some therapies provide benefits even with simple non-AI processes: "A meta-analyses of 15 studies, published in this month’s volume of Administration and Policy in Mental Health and Mental Health Services Research, found no significant difference in the treatment outcomes for patients who saw a therapist and those who followed a self-help book or online program."[0]

[0] https://qz.com/1057345/researchers-say-you-might-as-well-be-...

[+] briandear|8 years ago|reply
At my little company, iCouch, we have experimented with such things, but to actually make it effective — that requires a good amount of capital — capital that is very difficult to raise. I would need to hire 3 full time people just for the AI project and potentially more.

The VC world is interested in “traction” and not novel tech which means we have to divert effort into growing customers for our mental health practice management system to get “traction” before we can spend any notable time building AI therapists. As much as VCs talk about “looking for innovation” they really aren’t. They are just looking at current growth/revenue. The days of building something amazing and monetizing later seem to be over for all except for founders with marquee names.

We could launch AI therapists within a year, but in the meantime, I have to pay my team. So we are forced to subsidize moonshot R&D with our existing sales — but that is hard to do since existing sales have to finance customer acquisition. Finding an additional $500k per year to make AI therapy viable is impossible for us.

We are in a catch-22. The first question out of nearly every investor's mouth is "How many paid users do you have?", not "What technology do you have, or could you develop, that is truly disruptive?" We could start building AI therapy tomorrow for a summer 2018 launch if we could afford it, but if we diverted resources to that, we'd be out of business long before launch. Clinically effective AI therapy isn't a weekend side project.

[+] Keyframe|8 years ago|reply
ELIZA was already there.
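For context, ELIZA's DOCTOR script was mostly keyword spotting plus pronoun reflection. A minimal sketch of the idea in Python (the patterns and replies here are illustrative choices of mine, not Weizenbaum's original script):

```python
import re

# Swap first/second person so the reply echoes the user's statement back.
REFLECTIONS = {"i": "you", "my": "your", "am": "are", "you": "I", "your": "my"}

# A few DOCTOR-style rules: regex -> response template using the captured text.
RULES = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.I), "Tell me more about your {0}."),
]

def reflect(text: str) -> str:
    """Apply pronoun reflection word by word."""
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in text.split())

def respond(utterance: str) -> str:
    """Return the first matching rule's response, or a default prompt."""
    for pattern, template in RULES:
        m = pattern.search(utterance)
        if m:
            return template.format(reflect(m.group(1)))
    return "Please go on."

print(respond("I feel anxious about my job"))
# -> Why do you feel anxious about your job?
```

No model, no audio, no understanding; it's striking how far that illusion carried.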
[+] tiggybear|8 years ago|reply
The other day I was thinking about Uber Therapy, get a mini therapy session on the way to your destination!
[+] oefrha|8 years ago|reply
Heard of Emacs's M-x doctor?
[+] andreyk|8 years ago|reply
Good blog post and audio samples notwithstanding, it's annoying that they don't put the paper on arXiv. As they themselves note in the blog post, the learning architecture was introduced in 2014's "Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis," so it's not clear how much of this is good engineering vs. novel research.
[+] dkonofalski|8 years ago|reply
The paper was more than likely embargoed until the talk they gave about it was over. They're introducing some new things that they probably didn't want to release details on before they publicly made a statement.
[+] gok|8 years ago|reply
The big difference here is the application to unit selection synthesis as opposed to parametric synthesis.
[+] speakingmachine|8 years ago|reply
The obvious question would be a head-to-head qualitative comparison vs. WaveNet. They seem to have advanced Siri relative to its prior versions, but does this work advance the field?
[+] dharma1|8 years ago|reply
In terms of being feasible to actually use in production? Yes. It runs in real time locally on a mobile device at 48 kHz/16-bit. WaveNet doesn't run in real time even on a desktop GPU at 16 kHz/8-bit.

The WaveNet method of predicting the output sample by sample yields great results, but at a very high computational cost.
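Back-of-envelope arithmetic makes the gap concrete. The numbers below are illustrative assumptions, not measurements: a sample-by-sample autoregressive model needs one full network forward pass per output sample, while unit selection scores a comparatively small number of candidate speech units and then concatenates prerecorded waveforms.

```python
# Rough cost comparison for one second of synthesized audio.
# All constants are illustrative assumptions.

SAMPLE_RATE = 48_000      # output samples per second of audio

# Autoregressive (WaveNet-style): one forward pass per sample.
FLOPS_PER_PASS = 50e6     # assumed cost of a single network forward pass
autoregressive_flops = SAMPLE_RATE * FLOPS_PER_PASS

# Unit selection: score candidate units with a small DNN, then concatenate.
UNITS_PER_SECOND = 100    # assumed speech units per second of output
FLOPS_PER_UNIT = 5e6      # assumed cost to score one candidate unit
unit_selection_flops = UNITS_PER_SECOND * FLOPS_PER_UNIT

print(f"autoregressive: {autoregressive_flops:.1e} FLOPs per second of audio")
print(f"unit selection: {unit_selection_flops:.1e} FLOPs per second of audio")
print(f"ratio: {autoregressive_flops / unit_selection_flops:.0f}x")
# -> ratio: 4800x
```

Even if the per-pass costs are off by an order of magnitude in either direction, the sample-rate multiplier dominates, which is why sample-by-sample generation struggles to hit real time.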

[+] chiph|8 years ago|reply
There's no question the diction of iOS 11 is much improved. But I liked the voice & timbre of the old speaker better - it sounds more authoritative.
[+] TazeTSchnitzel|8 years ago|reply
Yes, it's a shame they didn't hire her to do the iOS 11 voice.
[+] BadassFractal|8 years ago|reply
Now if only it didn't feel like when I'm asking Siri to do a task it has a very small pool of pre-set options I get to choose from. It still feels rather restricted, but I'm excited they're really investing into it.
[+] remir|8 years ago|reply
The new voice sounds a lot like Google's current TTS voice.
[+] sangd|8 years ago|reply
I don't like the higher-pitched, sharper tone in iOS 11; I prefer the warmer, deeper tone of iOS 10. It feels like having a more mature, experienced assistant.
[+] EGreg|8 years ago|reply
It's also interesting how they made the pitch higher for the new voice, like Google has had all along.
[+] satyajeet23|8 years ago|reply
This is amazing, and it's beautifully written and presented too!
[+] seldomrandom|8 years ago|reply
Siri's voice update and the option to stop apps from always using your location were two of my favorites in iOS 11!