whatswrong | 3 years ago
In some situations, this seemingly impossible task is doable and can yield good results. Researchers sometimes need to kickstart their models by giving them a mapping between words of the two languages (for English <-> French: "cat" <-> "chat", "book" <-> "livre", and so on). That's just simple vocabulary. While it's technically possible to learn this mapping from scratch, it's too difficult for now.
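To make the idea of a seed word mapping concrete, here's a toy sketch of my own (the words are the ones from the comment, everything else is made up): a tiny English -> French lexicon, and a naive word-for-word substitution that uses it. It also hints at why a vocabulary alone isn't enough.

```python
# Tiny English -> French seed lexicon, as described above (toy example).
seed_lexicon = {
    "cat": "chat",
    "book": "livre",
    "the": "le",
    "reads": "lit",
}

def word_for_word(sentence):
    """Substitute each known word; keep unknown words as-is."""
    return " ".join(seed_lexicon.get(w, w) for w in sentence.split())

print(word_for_word("the cat reads the book"))
# -> "le chat lit le livre" -- it works here, but real translation also
# needs agreement, word order, and words the seed lexicon doesn't cover.
```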
Do you know of the encoder-decoder architecture? You feed something (an image, some text) to the encoder, which compresses it into a very dense representation, and the decoder tries to use the resulting dense vector to do something useful with it. The input could be a sentence in English; the encoder encodes it, and the decoder tries to use the encoder's output to generate the same sentence but in French. These architectures are useful because directly working with "plaintext" to learn translation is way too expensive. I mean, that's one of the reasons.
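Here's a deliberately crude sketch of that encode-then-decode flow (my own illustration, not a real NMT system): the "encoder" averages hashed word vectors into a small dense vector, and the "decoder" just does a nearest-neighbour lookup over a tiny parallel corpus, standing in for a real learned decoder that would generate French token by token.

```python
import hashlib

DIM = 8

def word_vec(word):
    """Deterministic pseudo-embedding derived from a hash of the word."""
    h = hashlib.md5(word.encode()).digest()
    return [b / 255.0 for b in h[:DIM]]

def encode(sentence):
    """Average the word vectors: a crude dense sentence representation."""
    vecs = [word_vec(w) for w in sentence.lower().split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Tiny made-up parallel corpus playing the role of training data.
pairs = [("the cat sleeps", "le chat dort"),
         ("the book is red", "le livre est rouge")]
memory = [(encode(en), fr) for en, fr in pairs]

def decode(z):
    """Return the French sentence whose English source is closest in latent space."""
    return min(memory, key=lambda m: dist(m[0], z))[1]

print(decode(encode("the cat sleeps")))  # -> "le chat dort"
```

A real decoder is trained jointly with the encoder and generates output word by word; the retrieval trick here is only meant to show the dense vector acting as the bridge between the two sides.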
What the encoder does is map a "sparse" representation of a sentence (plaintext) to a dense representation in a well-structured space (think of word2vec, which managed to find that "king" - "man" + "woman" ≈ "queen"). This space is called the "latent space". Some say it extracts the "meaning" of the sentence. To be more precise, it learns to extract enough information from the input and present it to the decoder in such a way that the decoder becomes able to solve a given task (machine translation, text summarization, etc.).
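The famous word2vec arithmetic can be shown with hand-made vectors (mine, for illustration — real embeddings are learned and have hundreds of dimensions). The three coordinates here stand roughly for "royalty", "male", "female":

```python
# Hand-crafted 3-d "embeddings" to illustrate king - man + woman ~= queen.
vecs = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def add(a, b):  return [x + y for x, y in zip(a, b)]
def sub(a, b):  return [x - y for x, y in zip(a, b)]
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b))

# king - man + woman = ?
target = add(sub(vecs["king"], vecs["man"]), vecs["woman"])
nearest = min((w for w in vecs if w != "king"),
              key=lambda w: dist(vecs[w], target))
print(nearest)  # -> "queen"
```

The point is that directions in the latent space carry information ("maleness", "royalty"), which is exactly the kind of structure the encoder is supposed to produce for whole sentences.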
One of the main assumptions of the unsupervised models using only monolingual data is that both languages can be mapped to the same latent space. In other words, we assume that every sentence/text in English has an exact French (or whatever) equivalent, and that the resulting translated sentences contain exactly the same information/meaning as the originals.
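If that assumption held perfectly, translation would reduce to "encode in one language, decode the nearest point in the other". A toy sketch of that (coordinates hand-assigned by me; a real system has to learn them, which is the hard part):

```python
# Pretend we already have per-language encoders that map equivalent words
# to the SAME point in a shared latent space (the assumption in question).
latent = {
    ("en", "cat"):   (0.9, 0.1),
    ("fr", "chat"):  (0.9, 0.1),
    ("en", "book"):  (0.1, 0.9),
    ("fr", "livre"): (0.1, 0.9),
}

def translate(word, src, tgt):
    """Encode in src, then pick the nearest tgt word in the shared space."""
    z = latent[(src, word)]
    candidates = [(w, p) for (lang, w), p in latent.items() if lang == tgt]
    return min(candidates,
               key=lambda wp: sum((a - b) ** 2 for a, b in zip(wp[1], z)))[0]

print(translate("cat", "en", "fr"))  # -> "chat"
```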
That's quite a dubious assumption. There are obviously some ideas, some things, that can be expressed in some languages but can't be expressed exactly in others. While theoretically unsound, however, these models have achieved pretty damn good results over the last couple of years.