Semi offtopic, but for some time I have been dreaming of training a chatbot to communicate in cuneiform or hieroglyphs, to bring some old languages back to life. Could it be possible, using old tablets as training data?
That's basically the problem of unsupervised machine translation using mainly monolingual corpora. It means giving a machine learning model tons of text in two languages and letting it figure out how to translate between some old language X and e.g. English. There's no need to feed it a parallel corpus, i.e. examples of sentences in language X and their translations in English.
In some situations, this seemingly impossible task is doable and can yield good results. Researchers sometimes need to kickstart their models by giving them a mapping between words of the two languages (for English <-> French: "cat" <-> "chat", "book" <-> "livre" and so on). That's just simple vocabulary. While it's technically possible to learn this mapping from scratch, it's too difficult for now.
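To make that concrete, here's a toy sketch (with made-up embeddings, not real data) of the classic trick for turning such a seed dictionary into a full mapping between two separately-trained embedding spaces: solve an orthogonal Procrustes problem, in the spirit of Mikolov et al.'s linear-mapping work and the MUSE line of research:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
pairs = [("cat", "chat"), ("book", "livre"), ("house", "maison"),
         ("water", "eau"), ("dog", "chien"), ("bread", "pain"),
         ("night", "nuit"), ("tree", "arbre"), ("sun", "soleil"),
         ("milk", "lait")]

# Pretend these were learned independently on English and French text.
# We simulate the French space as a rotated, slightly noisy copy.
en_vecs = {e: rng.normal(size=dim) for e, _ in pairs}
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
fr_vecs = {f: en_vecs[e] @ true_rotation + 0.01 * rng.normal(size=dim)
           for e, f in pairs}

seed = pairs[:8]        # the hand-provided "kickstart" dictionary
held_out = pairs[8:]    # pairs the mapping has never seen

X = np.stack([en_vecs[e] for e, _ in seed])
Y = np.stack([fr_vecs[f] for _, f in seed])

# Orthogonal Procrustes: the rotation W minimizing ||X @ W - Y||
# is U @ Vt, where U, Vt come from the SVD of X.T @ Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Map held-out English words into the French space and find the
# nearest French word by cosine similarity.
for e, f in held_out:
    q = en_vecs[e] @ W
    best = max(fr_vecs, key=lambda w: q @ fr_vecs[w]
               / (np.linalg.norm(q) * np.linalg.norm(fr_vecs[w])))
    print(e, "->", best, "(expected:", f + ")")
```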
Do you know of the Encoder-Decoder architecture? You feed something (an image, text) to the encoder, which compresses it into a very dense representation, and the decoder tries to use the resulting dense vector to do useful stuff with it. The input could be a sentence in English; the encoder encodes it, and the decoder tries to use the output of the encoder to generate the same sentence, but in French. These architectures are useful because directly working with "plaintext" to learn how to translate is way too expensive. I mean, that's one of the reasons.
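For illustration, here's a minimal sketch of that shape in PyTorch (all the sizes are made up, and a real translation model would add attention, padding, beam search, etc.):

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1200, 64, 128

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, src_ids):                # (batch, src_len)
        _, h = self.rnn(self.embed(src_ids))   # h: (1, batch, HID)
        return h                               # the dense representation

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, tgt_ids, h):             # teacher forcing
        o, _ = self.rnn(self.embed(tgt_ids), h)
        return self.out(o)                     # logits over target words

enc, dec = Encoder(), Decoder()
src = torch.randint(0, SRC_VOCAB, (2, 7))   # a fake batch of 2 sentences
tgt = torch.randint(0, TGT_VOCAB, (2, 9))
logits = dec(tgt, enc(src))
print(logits.shape)   # torch.Size([2, 9, 1200])
```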
What the encoder does is map a "sparse" representation of a sentence (plaintext) to a dense representation in a well-structured space (think of word2vec, which managed to find that "king" - "man" + "woman" ≈ "queen"). This space is called the "latent space". Some say it extracts the "meaning" of the sentence. To be more precise, it learns to extract enough information from the input and present it to the decoder in such a way that the decoder becomes able to solve a given task (machine translation, text summarization, etc.).
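You can try that analogy yourself with a small set of pretrained vectors (this assumes gensim is installed and downloads a small GloVe model on first use):

```python
import gensim.downloader as api

# Small pretrained GloVe vectors (~66 MB download the first time).
vectors = api.load("glove-wiki-gigaword-50")

# The classic analogy, computed as plain vector arithmetic in the
# latent space: king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))
# -> [('queen', 0.85...)]
```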
One of the main assumptions of the unsupervised models using monolingual data only is that both languages can be mapped to the same latent space. In other words, we assume that every sentence/text in English has its exact French (or whatever) equivalent, and that the resulting translated sentences contain exactly the same information/meaning as the original ones.
That's quite a dubious assumption. There are obviously some ideas, some things, that can be expressed in some languages but can't be exactly expressed in others. While theoretically unsound, however, these models have achieved pretty damn good results over the last couple of years.
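The workhorse behind those results is iterative back-translation: each translation direction generates synthetic training pairs for the other, using nothing but monolingual text. Here's the shape of the training loop (every helper below is a hypothetical stand-in, not a real API; a real system would plug in two encoder-decoder models like the sketch above, one per direction):

```python
# Schematic of iterative back-translation (as in Lample et al., 2018).

def translate(model, sentence):
    return model(sentence)        # stand-in for actual decoding

def train_step(model, src, tgt):
    pass                          # stand-in for a gradient update

en_to_fr = lambda s: s            # placeholder models, initialized
fr_to_en = lambda s: s            # from the seed dictionary mapping

english_corpus = ["the cat sleeps", "the book is red"]      # monolingual only,
french_corpus = ["le chien mange", "la maison est grande"]  # no aligned pairs

for epoch in range(3):
    for en in english_corpus:
        fake_fr = translate(en_to_fr, en)   # machine-made "source" sentence
        train_step(fr_to_en, fake_fr, en)   # real English sentence as target
    for fr in french_corpus:
        fake_en = translate(fr_to_en, fr)
        train_step(en_to_fr, fake_en, fr)
```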
I think we need a general AI before we're able to do that, as the data set is small and you would need to infer the rules.
A human is able to learn rules way more efficiently than ChatGPT.
Assuming all human languages share a common semantic representation in latent space (I am flipping cause and effect here, but for our purposes it doesn't really matter), and assuming that human languages largely follow the same patterns (this assumption is based on the fact that many modern writing systems can be traced back to the Phoenician script), it is reasonable to assume that we can fine-tune a self-supervised model on a tiny amount of data. (The emergent properties of an LLM are carrying a lot of weight here; many of these assumptions rely on the idea that the latent structure of the various languages is learnt by the model.)
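As a purely hypothetical sketch of what that fine-tuning could look like (the model choice, the stand-in Sumerian transliterations, and whether anything useful actually transfers to cuneiform are all assumptions on my part, not a known recipe):

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import Dataset

# Stand-in lines of transliterated tablets; a real corpus would come
# from something like the CDLI transliterations.
texts = ["lugal kalam-ma", "e2-gal nin-a-ni"]

# A multilingual masked LM, hoping its shared latent structure transfers.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

ds = Dataset.from_dict({"text": texts}).map(
    lambda b: tok(b["text"], truncation=True),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tablet-lm", num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```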
jokethrowaway|3 years ago
ChatGPT is just trained on a lot of data.
prox|3 years ago
A follow-up question: could a chatbot teach you said language?