Microsoft turns spoken English into spoken Mandarin – in the same voice

[+] tokenadult|13 years ago|reply

To someone who spent years learning Chinese as a second language, and then made my living for years as a Chinese-English interpreter, that was pretty impressive.

The economics of the issue is that a machine interpreter just has to be as good as a human interpreter at the same cost. That's a reachable target with today's computer technology. EVERY time I've heard someone else interpreting English or Chinese into the other language, I have heard mistakes, and I am chagrined to remember mistakes that I made over the years. We can't count on error-free machine interpretation between any pair of languages (human language is too ambiguous in many daily life cases for that), but if companies develop tested, validated software solutions for consecutive interpreting (what I usually did, and what is shown in the video) or simultaneous interpreting (the harder kind of interpreting in demand at the United Nations, where even in the best case it is not always done well), then those companies will be able to displace a lot of human professionals who rely on their language ability to make a living.

Right now a lot of interpreters in the United States make a lot of part-time income from gigs that involve suddenly getting telephone calls and joining in to interpret a telephone conversation in two languages. This is often necessary, for example, for physician interviews of patients in emergency rooms or pharmacist consultations with patients buying prescribed drugs (where I last saw a posted notice on how to access such an interpretation service). The IBM Watson project is already targeted at becoming an expert system for medical diagnosis, and patient care markets will surely provide a lot of income for further development of software interpretation between human languages.

It's still good for human beings to spend the time and effort to learn another human language (as so many HN participants have by learning English as a second language). That's a broadening experience and an intellectual delight. But just as riding horses is more a form of recreation these days than a basis for being employed, so too speaking another language will be a declining factor in seeking employment in the next decade.

[+] qq66|13 years ago|reply

I don't think there will be much of an impact on the interpreter industry until the machine translations are significantly better than human translations.

Human translators are so expensive today that they are only used in situations where the translation has to be correct -- diplomacy, courtrooms, books, etc. Until a machine is much better than a human, these use cases won't switch to machine translation (similarly, self-driving cars won't be allowed until they are proven to be much safer than human drivers).

On the other hand, there's a large casual market for machine translations today for situations like reading foreign Web sites, chatting with people in different countries, reading Tweets in a different language, etc.

[+] sneak|13 years ago|reply

> That's a broadening experience and an intellectual delight.

I disagree. It's only broadening in the additional people it allows you to commune with. Other than that, it's a waste of time.

Having to convert between languages (I'm a native speaker of English who lives in Germany) all the time is huge overhead, sort of like if every country had its own system of measurement, except that the overhead is incurred much more often, not just for measuring things.

[+] paulgb|13 years ago|reply

This is the second time Deep Neural Network research from the University of Toronto has made the front page, the first being when it won first-place in a Kaggle competition http://news.ycombinator.com/item?id=4733335

[+] FrojoS|13 years ago|reply

Here is a GREAT talk by Geoffrey Hinton (the Prof running said lab) http://www.youtube.com/watch?v=DleXA5ADG78&hd=1 where he explains the method.

Unfortunately, even though it was posted three times to HN http://www.hnsearch.com/search#request/all&q=sex+machine... it never made the fron page.

Here is the my summary and comment: " Great talk. I don't know much about artificial neural networks (ANN) and even less about natural ones, but I have the feeling that I learnt a lot from this video.

If I understand correct, Hinton uses so many artificial neurons compared to the amount of learning data, that you would usually see an overfitting effect. However, his ANN's randomly shut of a substantial part (~50%) of the neurons during each learning iteration. He calls this "dropout". Therefore, a single ANN represents many different models. Most models never get trained, but they exist in the ANN, because they share their weights with the trained models. This learning method avoids over specializing and therefore improves robustness with respect to new data but it also allows for arbitrary combination of different models which tremendously enlarges the pool of testable models.

When using or testing these ANNs you also "dropout" neurons during every prediction. Practically, every rerun predicts a different result by using a different model. Afterwards, these results are averaged. The more results, the higher the chance, that the classification is correct.

Hinton argues, that our brains work in a similar way. This explains among other things a) Why are neurons firing in a random manner? It's an equivalent implementation to his "dropout" where only a part of the neurons is used at any given time. b) Why does spending more time on a decision improve the likely hood of success? Even though there might be more at work, his theory alone is able to explain the effect. The longer you think, the more models you test, simply by rerunning the prediction. The more such predictions the higher the chance, that the average prediction is correct.

To me, the latter also explains in an intuitive way, why the "wisdom of the crowds" works well when predicting events that many people have an, halfway sophisticated, understanding of. Examples are betting on sport events or movies box office success. As far as I know, no single expert beats the "wisdom of the crowd" in such cases.

What I would like to know is, how many, random model based predictions do you need until the improvement rate becomes insignificant? In other words, would humans act much smarter if they could afford more time to think about decisions? Put another way, does the "wisdom of the crowd" effect stem from the larger amount of combined neurons and the diversity of the available models that follows, or from the larger amount of predictions that are used to compute the average? How much less effective would the crowd be, if less people make more ("e.g. top 5") predictions or if the crowd was made up of few individuals which are cloned?

If the limiting factor for humans is the time to predict based on many different models and not the amount of neurons we have, this would have interesting implications. Once, a single computer would have sufficient complexity to compete with the human brain, you could merely build more of these computers and average there opinions to arrive at better conclusions that any human could [1]. Computers wouldn't be just faster than humans, they would be much smarter, too.

[1] I'm talking about brain like ANN implementations here. Obviously, we already use specialized software to predict complex events like weather, better than any single human could. But these are not general purpose machines. "

[+] boucher|13 years ago|reply

Geoffrey Hinton (who leads this research at the University of Toronto) is teaching a coursera course right now about machine learning with deep neural networks:

https://class.coursera.org/neuralnets-2012-001

He talks about a surprising amount of cutting edge achievements being made by deep neural networks just over the last few months.

[+] cdgore|13 years ago|reply

There was also a video presentation by Peter Norvig posted a few days ago explaining research at Google done in collaboration with Geoffrey Hinton from the University of Toronto on deep learning at Google: http://news.ycombinator.com/item?id=4733387

[+] scrrr|13 years ago|reply

My current client is specialising in speech recognition, speech synthesis and automatic translation. They have something similar, focused on enterprise customers. I find this subject very interesting.

I am a Ruby guy and I only marginally get in contact with their C++ code, but from what I learned so far this stuff is extremely memory and CPU hungry. It also depends on having been fed the right amounts of input. That's why Google Translate is so good. They have tons and tons of data from all the websites they parse, and in many cases the content can be obtained in different languages. Corporate pages are often translated paragraph by paragraph by humans which results in perfect raw data to train these algorithms. Also for example all documents that the European Parliament produces are translated into the languages of all member states.

Everything that has to do with translation has to do with context. I think the software right now is as smart as a six year old kid, except that it has a much bigger vocabulary. But if you say "The process has stalled. Let's kill it." it probably only makes sense if you know you are talking about computers.

It's hard to imagine that computers one day might really understand everything we say. But just by using Google Translate I think they really might. Это является удивительным. (I don't speak Russian. I hope I didn't insult anyone now. ;))

[+] evoxed|13 years ago|reply

> Corporate pages are often translated paragraph by paragraph by humans which results in perfect raw data to train these algorithms.

Actually this may be one of the reasons why Google's Japanese translations are so terrible. The why isn't really relevant here[0] (perhaps you already know anyway) but it there are times when the raw data becomes the most misleading.

[0] Obviously I still mean those actually translating by hand, not the companies which just throw all of their material into Google Translate and consider it a finely proofed document. There are plenty of the latter which makes for an amusing loop in the system.

[+] rmc|13 years ago|reply

That's why Google Translate is so good. They have tons and tons of data from all the websites they parse, and in many cases the content can be obtained in different languages. … Also for example all documents that the European Parliament produces are translated into the languages of all member states.

This can backfire. I remember hearing that sometimes/back in the day "Baile Átha Cliath" (the Irish for "Dublin", the capital city of Ireland) would sometimes get translated as "London" the capital of the UK. This is due to Google Translate trying to match up Laws in Ireland (in the Irish language) with UK laws (which would be very similar or potentially based on the same original law). However in the Irish law "Baile Átha Cliath" would be replaced with "London".

Here's an example of it: http://translate.google.com/#ga/en/L%C3%A1%20alainn%20inniu%...

[+] anonymfus|13 years ago|reply

You did not insult me, but you sound like a robot because of preserved English sentence structure.

[+] swalsh|13 years ago|reply

> and in many cases the content can be obtained in different languages

I wonder if there's some feedback loop caused by websites that used google translate itself to offer the alternative versions :)

[+] jivatmanx|13 years ago|reply

Speech Recognition would probably be best fed in the same way - find a neutral-sounding speeches for which transcripts exist.

Best would be parliamentary speeches with transcipts, and the closed captioning for national news programs. The main constraint is storage space/computational power.

[+] dbul|13 years ago|reply

Translation is as much of an art as it is a science, so I wonder where this project is headed. Le Ton beau de Marot is a great book for illustrating this point.

In college I had studied Japanese and a friend introduced me to the anime cartoon Initial D. His copy had the original Japanese with English subtitles, and so I could assess the translation to some degree -- it was very good. On Netflix you can watch Initial D, but after 2 minutes I had to turn it off because the English dubbing really failed to capture the characters.

As someone noted in this thread, the presenter's synthesized voice in the linked video doesn't seem to reflect his own. If he could have said something like "Wo hui shua putonghua" and had the machine output say the same, it might have been more convincing.

[+] pbhjpbhj|13 years ago|reply

I was just pondering today why PCs have adopted spell checking as a standard feature but don't appear to use context techniques for word checking or grammar checking yet. Perhaps I'm just using the wrong apps?

The speaker says "to take in much more data" but it gets parsed by the speech-to-text as "to take it much more data" which is such an unlikely phrase I can't really work out why it's not auto-corrected.

The phrase provided doesn't appear to be in either Google's nor Bing's web indexes. Typing "to take i" in to either Google or Bing's search box produces a hit for "to take in" as the most likely match; and within milliseconds.

Similarly (and ironically) with "about one error out of" being parsed as "about one air out of".

That he goes on to say that they use statistical techniques and phrase analysis for the translation makes this sort of error all the more intriguing, why isn't that same statistical approach weeding out these sorts of errors.

Nonetheless an impressive demonstration.

[+] evoxed|13 years ago|reply

Green squiggly excluded, I can think of two reasons off the top of my head why even the most advanced general purpose grammar checker would be a bit of a controversial feature:

- Because grammar is typically more expressive, and dependent upon a concept that otherwise may not exist in words. Thus statistical grammar models and context checkers would be much more volatile to generating nonsense from user input (along the lines of the Sokal hoax) or restricting output to a range of acceptable models (giving the machine its own voice in a sense). That leads to the second thing...

- It kills freedom and creativity (or at least, how we receive it). Imagine comedy routines in stoic deadpan. Perfunctory exchanges in formal constructions (and vice versa). Obviously you can avoid all of these situations if you wanted to, but in that case it should probably be saved for those special occasions. It could probably help a lot of businessmen wanting to write their statements and messages in shorthand without spewing boilerplate text. But it's potentially damaging to every child or student who is still finding out how they want to express themselves in the given context.

Note: I think it is fair to assume that grammar checking would include the ability to reformulate or generate text that obeys the relevant models. Spell checkers suggest spellings, grammar checkers have to suggest fixes and changes as well, and if we want to get any further than Win98 era Word it will probably have to have a plain old fix-it generator as well.

[+] zyb09|13 years ago|reply

Fun thing to do: you can turn on transcribe audio on Youtube and directly compare how Google's speech recognition tech stacks up against Microsoft's.

[+] zmmmmm|13 years ago|reply

I bet the audio directly from his mic would be enormously better quality than whatever YouTube has recorded. Plus Google can hardly afford to dedicate gigantic amounts of CPU to the transcription - they'll be going for a crude but useful job where for this demo he probably has a whole lot of CPU grunt just dedicated to it.

[+] bfung|13 years ago|reply

see below for some translation links, but google translate is pretty bad compared to bing translator in Chinese.

[+] dchuk|13 years ago|reply

The implications of this kind of technology reaching consumers in the next decade or so are really interesting.

If we can get to the point of having handheld devices that can accomplish live translation of spoken word, what exactly is the point of different languages anymore?

[+] Breakthrough|13 years ago|reply

I don't follow your logic... Without those different languages, you wouldn't have anything to translate in the first place. If anything, this type of technology promotes independent and different languages, as it makes it so much easier to communicate with others regardless of your native tongue.

Also, bravo to Microsoft; I'll remove my jaw from the floor after I watch your video a second time.

[+] gizmo686|13 years ago|reply

How about two years ago: https://play.google.com/store/apps/details?id=com.vandroid.b...

[+] zalew|13 years ago|reply

I have no doubt a few more long years will have to pass until those solutions reach the mass market, but this is extraordinary especially for someone like me, passionate about travelling. Our generation witnessed the shift towards cheaper flights, easier accommodation booking and web/mobile tools growing year by year becoming more helpful in organizing our visits and seeking information about places and cultures we don't know. We, or the future generation, will probably witness the fall of the language barrier, it's truly amazing and one of the most important shifts in our global experiences.

[+] Groxx|13 years ago|reply

Skip to 8 minutes to hear the actual translation.

I'd love some comparison - that doesn't sound like the same voice to me (awfully close to the 'standard' computer voice, IMO), but some of it is crummy recording quality, and showing the flexibility would go a long way toward convincing me.

[+] mpdaugherty|13 years ago|reply

I agree that it doesn't really sound like him, but the voice is far better than most Chinese computer voices that I've heard and is totally understandable.

Seems like my years of learning Chinese and living in China are about to become useless...

[+] ctingom|13 years ago|reply

Now imagine this on Skype as a premium feature.

[+] polshaw|13 years ago|reply

Near real time speech to speech translation is awesome[1], but the voice sounded more like how i would picture ASIMO speaking (ie a 1980s speech synthesis) than 'his' voice.

1. is anyone here fluent in mandarin to assess the quality of the output?

[+] ebzlo|13 years ago|reply

Mandarin speaker here. I'm more impressed that it was able to reconstruct the sentences properly (where traditional translation tools typically fail). The output was fine. Obviously doesn't sound like a human speaker, but the tones are correct.

[+] sterling312|13 years ago|reply

Yea, it sounds a bit mechanical. With that said, based on the priming he's given with the talk on waveform, I'm guess they are simply breaking the English speech into the waveform with corresponding frequency and simply mapping that over to the Chinese counterpart.

It would be even cooler if they created a distribution of possible sound frequency for each syllables in both English and Chinese, and determined where in the distribution his speech pattern lies, and transfer the "ranking" in the distribution. Hence you get a subjective transformation instead of a objective one. :)

[+] joering2|13 years ago|reply

Mark my words; some good changes are happening at MSFT. There are some indicators that suggest this may be a come back. Surface seems to gain momentum, while future of OS Windows will be freemium plus ads (as you can imagine hundreds of millions of "screens" plugged in).

[+] orjan|13 years ago|reply

Original post: http://blogs.technet.com/b/next/archive/2012/11/08/microsoft...

[+] mmuro|13 years ago|reply

It's pretty incredible how far language-to-language technologies have come and how far they still have to go.

Very cool stuff.

[+] hammock|13 years ago|reply

Putting on my tinfoil hat here, if all it takes to build a speech model to impersonate someone's voice is an hour's worth of them talking... what happens when the wrong person gets that? For example, the government or a corporation (a internet phone service, maybe) uses to fabricate evidence of conversations that never really happened; could also be used to aid in identity theft

[+] evoxed|13 years ago|reply

To indulge you just a little bit, I think it most likely result in an a rapid expansion of forensic industries. While I have no legitimate experience in signal processing, I imagine there would be ways to deduce whether or not such impersonations were credible to some degree. Whether or not that would stop tech-savvy marketers and con artists out of scamming grandma, I don't know. We'll have to wait and see what 21st century holds for future firewalls. Of course, if someone with any knowledge on the subject would like to step in and point out how stupid my response sounds to them, I'd be glad to become more informed!

[+] s_henry_paulson|13 years ago|reply

Did you watch the video where the computer was talking?

There is no possible way to confuse that robot voice with the speaker. Technology like you suggest is a long way away from the consumer market.

[+] aw3c2|13 years ago|reply

Aggregator spam, direct link is http://blogs.technet.com/b/next/archive/2012/11/08/microsoft...

[+] scep12|13 years ago|reply

My Android phone already does voice-to-text better than the system demoed in that video. Looks like Microsoft's research needs a bit more tuning before it can be declared 'amazing'

[+] bobwaycott|13 years ago|reply

Damnit. This is pretty much the very idea I had in college around 12 years ago. At the time, there was nowhere near the required technology to pull this off. Over the last few months, I'd begun rethinking through the idea again, feeling the time was right to pull this off as a killer idea. Even began trying to investigate how to pitch this to create a startup focused solely on this problem.

Now it seems the time may be too late. Rats.

[+] hnewser1|13 years ago|reply

You are a bit naive, eh? Also:

http://en.wikipedia.org/wiki/Universal_translator

[+] tsahyt|13 years ago|reply

This is really impressive, especially the speech recognition part. I can't really judge anything else, since I don't speak a word of mandarin. The speech recognition though is easily the best I've ever seen. This is almost the kind of recognition rate needed for voice controlled interfaces to finally work. Exciting stuff.

[+] ffk|13 years ago|reply

It looks like a translation we can hear occurs around 8:10. Is anyone able to verify the correctness of the speech? (Also, remembering it's a demo and it has probably been tested multiple times for that phrase).

121 comments