No Language Left Behind

Etheryte|3 years ago

I'll believe it when I actually see it. I'm a native of a reasonably small language spoken by about a million people and never have I ever seen a good automatic translation for it. The only translations that are good are the ones that have been manually entered, and those that match the structure of the manually entered ones. I think the sentiment is laudable and wish godspeed to the people working on this, but for the time being I don't see it becoming a reality yet. When Google Translate regularly struggles even with big pairs such as German-English-German, I have reservations about someone making it work for languages where datasets are orders of magnitude smaller.

hello_im_angela|3 years ago

It's an extremely difficult problem indeed. A lot of people on the team speak low-resource languages too (my native language as well!), so definitely resonate with what you're saying. My overall feeling is: yeah it's hard, and after decades we can't even do German translation perfectly. But if we don't work on it, it's not gonna happen. I really hope that people who are excited about technology for more languages can use what we've open sourced.

yosito|3 years ago

I speak a medium-resource language with 11 million speakers. Google Translate works so poorly with it that translations are often nonsensical. But DeepL works so well with it that translations are often indistinguishable from native speaking translations. I'm a big believer that the model can make a huge difference.

alfiedotwtf|3 years ago

> I ever seen a good automatic translation for it. > > When Google Translate regularly struggles even with big pairs such as German-English-German, I have reservations about someone making it work for languages where datasets are orders of magnitude smaller.

I speak a language where I've never seen any translation for it... and when translated manually, my mum totally butchers the meaning lol.

Either way, any work in this area is more than welcome, but damn it's a hard problem.

bobsmooth|3 years ago

There's a section where you can try reading translated children's books. See if your language is supported and how good the translation is.

pesenti|3 years ago

Blog post: https://ai.facebook.com/blog/nllb-200-high-quality-machine-t...

Paper: https://research.facebook.com/publications/no-language-left-...

Github: https://github.com/facebookresearch/fairseq/tree/nllb/

robocat|3 years ago

Also note comments from hello_im_angela (= Angela Fan) and jw4ng (= Jeff Wang). Those are the HN accounts for Angela and Jeff from No Languages left Behind.

jkw|3 years ago

Hey all, I work on this project. Full list of languages can be found here: https://github.com/facebookresearch/flores/tree/main/flores2...

As well as in the research paper: https://research.facebook.com/publications/no-language-left-...

mikewarot|3 years ago

The analogy I like the most is that they've found the "shape" of languages in high dimensions, and if you rotate the shape for English the right way, you get an unreasonably good fit for the shape of Spanish, again for all the other languages.

We're at a point where it's now possible to determine the shape of every language, provided there are enough speakers of the language left who are both able and willing to help.

<Snark> Once done, Facebook can then commodify their dissent, and sell it back to them in their native language. </Snark>

gfaster|3 years ago

Anyone who knows or is learning another language can easily tell you that the "warping" methodology of MTL is insufficient. There was a really good video by Tom Scott [1] that talked about this but the short version is that there is critical bits of language in context and inferred by speakers. Any accurate MTL needs nearly full context both on the page and in the cultural moment, in addition to probably needing to ask questions of the author.

[1]: https://www.youtube.com/watch?v=GAgp7nXdkLU

goldemerald|3 years ago

The shape analogy doesn't really apply with modern language models. Each word gets its own context dependent high dimensional point. With everything being context dependent, simple transformations like rotations are impossible. A more accurate perception is that any concept expressible in language now has its own high dimensional representation, which can then be decoded into any other language.

Groxx|3 years ago

>REAL-WORLD APPLICATION

>Translating Wikipedia for everyone

Hmmm.

While there is very definitely utility in doing things like this, I do kinda fear "poisoning the well"-like effects of feeding (even partially-) AI-generated-data into extremely common AI-data-sources.

There's some info on it in a blog post[1] and the MediaWiki "Content translation" page[2], but does anyone know of any studies on the quality of the translations produced? I can absolutely see it being a huge time-saver for people who are essentially fluent in both (there's a lot of semi-mechanical drudgery in translating stuff like this that could be mostly eliminated)... but people are pretty darn good at choosing the easy option of trusting whatever they're given rather than being as careful as they should be. It kinda feels like it runs the risk of passively encouraging people to trust the machine's choice over their own, as long as it isn't obviously nonsense, and the cumulative effect could be rather large after a while.

[1]: https://diff.wikimedia.org/2021/11/16/content-translation-to...

[2]: https://www.mediawiki.org/wiki/Content_translation

jhugo|3 years ago

Yeah, I really hope they don't do this. I live in a country where I don't speak the language well, so I am using Google Translate and DeepL [0] all day every day. The quality of translations of real-world text is so incredibly variable. There is literally no way to know when it will suddenly reverse the meaning of a sentence, or produce something that sounds like it makes sense, but in terms of meaning bears no relation to the input at all.

A machine-translated Wikipedia would not be a trustworthy source of information at all, yet would look like one. I think that does significantly more harm than good.

[0] Suggestions for better alternatives welcomed.

debesyla|3 years ago

On top of that - a lot of language specific content has to include sources in that same language.

(As an example, it would be absurd for lithuanian wikipedia to include sources in japanese - that would be not usable AND not usefull for the wikipedia readers, editors...)

jw4ng|3 years ago

Jeff Wang here with my fellow Meta AI colleague Angela Fan from No Languages left Behind, seeing the comments flowing through. If you want to ask us anything, go for it!

david_allison|3 years ago

Hi Jeff,

I currently host the largest collection of bilingual Manx[0] <-> English texts (~1MM words). How would I formally get in contact to chat about the steps to make machine translation available (and would there be grant opportunities available for further production of machine-readable data?)

[0] https://en.wikipedia.org/wiki/Manx_language

tkgally|3 years ago

Thank you for your exciting work and for coming onto HN to respond to questions.

I am a former professional translator (Japanese to English) and am now supervising research at the University of Tokyo on the use of machine translation in second-language education. As I have written in a few papers and essays [1], advances in MT have raised serious questions for language teachers. The ready availability of MT today, including on Facebook and Instagram, means that language students use it a lot while studying. We don’t know yet, though, how that use of MT might affect our students’ acquisition of other languages or their motivation to keep studying those languages.

One of the hurdles educators and researchers face is finding out how MT is being used in the real world. Most education in modern languages is focused on giving students language skills that they will be able to use later in work, education, and daily life, and textbooks and other learning materials are typically shaped around real-world situations. We are now struggling to adapt those materials for the age of MT, because data on the actual use of MT is very hard to get.

Like Google, Microsoft, Baidu, DeepL, and others, Meta must have huge amounts of data on how your users are using MT to communicate. Any information and insights about that MT usage that you can share with world—just as you have generously shared your NLLB models—would be most welcome.

[1] http://gally.net/writings.html

kouteiheika|3 years ago

If your goal is to make inclusive translation more widely available why license the models under a non-commercial license? This basically makes it impossible to use legally (or at least without a lot of legal risk) for essentially anyone due to the vague definition of what's commercial. Is Facebook hurting for money and looking to commercially license this model on request?

shuraih|3 years ago

Hey Jeff, I’m a native speaker of Dhivehi — the language spoken by the people of Maldives. Since I couldn’t find a full list of supported languages I was wondering if Dhivehi is / would be integrated.

yosito|3 years ago

I'm curious how much work it takes to prepare training data for a language. From anecdotal experience, I've always been able to learn some basic survival skills in a new language by studying the translations of about 20 key phrases for a week or so, which give me the ability to combine them into a few hundred different phrases and survive most daily transactions. So I always imagine that training a language model is similar, just on a much larger scale. It seemed to me that there could be a standard text that includes a lot of important topics and contexts, which just needs to be manually translated into a target language and then fed to the model. I imagine it being about the size of a large book, so I imagine that adding a new language to a model would cost a similar amount to paying to have a book translated. Obviously the size of the input text would have an effect on how good the model's translations are, and domain specific translations would require more specific input. While having a full translation of an entire library seems like a good way to train a model that's used to translate everything, it seems like a small percentage of the library would be enough to produce native-level translations for most domains.

How far off are my intuitions on this? What are the costs of adding a new language to a model like this? Is there a ballpark dollar amount per language?

concinds|3 years ago

These initiatives are always couched in "inclusion" rhetoric (the very name of your project is telling); I don't doubt for a second that it's a genuine sentiment, but I strongly suspect your team hasn't thought through the full, self-defeating implications of universal language translation.

The problem is that it increases the risk of monoculture to 100%. Without language barriers, cultural diversity is lost, not gained, since you have winner-take-all effects[0]. Instead of helping revive languages, it'll make American ideas, mores, morality (Puritanism), philosophies, and political values more dominant worldwide.

To be clear, this will increase economic opportunity, but will inevitably kill cultural diversity.

Is your team considering or studying this?

[0]: https://www.sampaxinos.com/mental-models/the-winner-takes-al... (or see Taleb's works)

pagekicker|3 years ago

Hi, I'm putting together an online event called 31 Days of AI for Book-Lovers to coincide with US National Book Month, October 2022. I was struck by the specific call-out to translating literature on your demo page and would like to feature a specifically book-related application of NLLB on one of 'anchor days'. Can someone work with me on this?

Vetch|3 years ago

Hi, I'm looking but can't seem to find instructions on how to do tokenization. Where is spm model, is it "flores200_sacrebleu_tokenizer_spm.model" or something else? And is it direct or spm -> dict? Or how to prime model for a specific language pair?

pesenti|3 years ago

Are all the 200x200 translations going directly or is English (or another language) used as an intermediate for some of them?

dangom|3 years ago

What is the greatest insight you gained and could share with non-experts from working on this project?

jefflombardjr|3 years ago

Gangi þér vel!

mike8889|3 years ago

[deleted]

kgeist|3 years ago

I wonder how it differs from what Yandex.Translate did back in 2016: [0]

>The affinity of languages allows one common model to be trained for their translation. That is, “under the hood” of the translator, the same neural network translates into Russian from Yakut, Tatar, Chuvash and other Turkic languages. This approach is called many-to-one, that is, "from many languages \u200b\u200binto one." This is a more versatile tool than the classic bilingual neural network. And most importantly, it is the many-to-one approach that makes it possible to use knowledge about the structure and vocabulary of the Turkic languages, learned on the rich material of Turkish or Tatar, to translate languages like Chuvash or Yakut, which are less “resource-rich”, but no less important for the cultural diversity of the planet.

>In order to create a unified model for translating Turkic languages, Yandex developed a synthetic common script. Any Turkic language is translated into it, so that, for example, the Tatar “dүrt” (“four”) written in Cyrillic becomes similar to the Turkish dört (“four”), not only from the point of view of a person, but also at the level of similarity of lines for a computer.

This way they added support for Turkic and Uralic languages which are very underrepresented on the Internet. But I don't know what the quality of their translation is: even though I live in a region where Mari is spoken (indigenous Uralic language) and my wife is Mari, none of us, sadly, speak the language.

[0] https://techno-yandex-ru.translate.goog/machine-translation/...

hello_im_angela|3 years ago

We represent all languages in their natural script, rather than transliterating them into a common synthetic one.

Regarding Mari: extremely interesting language, exciting to hear that you are from that region. We are interested in working on this one (likely in the "Hill Mari" variant), but currently do not support it.

microtherion|3 years ago

As a native Swiss German speaker, my native language is not only low resource in general, but has the additional difficulty of not having a standardized orthography (many native speakers will exclusively write in Standard German, and use Swiss German only for spoken communication).

So you have a language with some economic opportunity (a few million speakers in a fairly wealthy country) but no clearly defined written interface, and an ambivalent attitude of many speakers towards the very idea of writing the language.

hello_im_angela|3 years ago

sooo real. Many low-resource languages have many different natural variants, can be written in multiple scripts, don't have as much written standardization, or are mainly oral. As part of the creation of our benchmark, FLORES-200, we tried to support languages in multiple scripts (if they are naturally written like that) and explored translating regional variants (such as Moroccan Arabic, not just Arabic).

As an aside, the question of how to think about language standardization is really complex. We wrote some thoughts in Appendix A of our paper: https://research.facebook.com/publications/no-language-left-...

visarga|3 years ago

Another avenue for machine translation is to use audio instead of text. There is much more audio data available and being generated on a daily basis, especially for cases like yours it would be very useful.

tsm|3 years ago

Similar issue with Scots, which has many variant orthographies but is frequently written in mostly-English anyway.

rmbyrro|3 years ago

This only makes the problem behind the NLLB project even more interesting to solve

otreblatercero|3 years ago

Not a single mesoamerican language is present. Maya, Náhuatl, Otomí, Zapoteco, etc. And these languages are big, they are spoken by millions and even have literature. Náhuatl and Maya are spoken in Central America.

bertil|3 years ago

Are there online corpora, like Wikipedia, that could be used to train the models? Are those under a permissive enough license to be used for model training?

If there are spoken, with enough budget, a library of voices could be recorded. I think you’d prefer that collection to be gathered and maintained by a non-profit rather than Meta.

albertzeyer|3 years ago

Note that very recently Google has done something very similar: "Building Machine Translation Systems for the Next Thousand Languages": https://arxiv.org/abs/2205.03983 https://ai.googleblog.com/2022/05/24-new-languages-google-tr...

The Facebook paper has some direct comparison to that work.

jkw|3 years ago

Evaluation was important to us, and we really wanted to have a benchmark that covers all 200 languages

yellowapple|3 years ago

Hopefully the Scots language model wasn't trained on Wikipedia.

btheshoe|3 years ago

I'm not entirely sure why low resource languages are seen as such a high priority for AI research. It seems that by definition there's little payoff to solving translation for them.

albertzeyer|3 years ago

I don't really remember the exact numbers anymore, but covering only the top 5 languages will cover maybe 40% of the world population, while covering the top 200 languages (many of them low resource) will cover maybe 90% of the world population.

Some numbers (but you can not exactly infer from them such accumulated numbers): https://en.wikipedia.org/wiki/List_of_languages_by_total_num...

Some more numbers from here: https://www.sciencedirect.com/science/article/pii/S016763931...

"96% of the world’s languages are spoken by only 4% of its people."

Although this statement is more about the tail from the approx 7000 languages.

goodside|3 years ago

"Low-resource language" isn't just a euphemism for "language almost nobody speaks". There are many languages that are widely spoken but nonetheless are hard to obtain training data for. Getting something like Wikipedia going for a minority language can be a difficult chicken-and-egg problem because users will use English for its completeness/recency, despite their limited fluency, and the native-language Wikipedia remains neglected. So you can end up in a situation where users use one language for social media and another for news/research, and Facebook is in a unique position to care about the former.

cyphar|3 years ago

Aside from the fact that being able to generalise a model with very little training data is an important AI research problem to solve, language death is a serious concern and is being accelerated due to the fact that many languages are not supported at all by modern technology (leading to "prestige language" pressures that are a known cause of historical language death).

For instance, Icelandic is not supported by any modern smartphone platform which has lead to Icelandic natives communicating with each other in English and very little information is translated to Icelandic[1,2].

That being said, I am worried that having translations that are "too good" could also act to accelerate language death as the importance of keeping languages alive will seem less significant (to non-language-nerds) if we can translate works written in that language to any other language with very small datasets. Luckily I'm not convinced that AI models will be able to produce convincing and consistent translations for a long time -- languages are so different in so many ways that I can't see how adding more dimensions and parameters to a model would account for them.

[1]: https://youtu.be/qYlmFfsyLMo?t=141 [2]: https://www.nytimes.com/2017/04/22/world/europe/iceland-icel...

wilde|3 years ago

The point is that there are lots of humans who speak these languages and use tech. They just don’t use Wikipedia so getting a good translation corpus going was harder.

Jabbles|3 years ago

Surely the fact that they did all the high-resource languages first and are only now getting round to the less-popular ones demonstrates that that is not, in fact, the case?

tehsauce|3 years ago

I think the reason low resource languages are prioritized is to compensate for the fact that AI research normally has a tendency to marginalize these languages.

quink|3 years ago

The examples given are, with native speaker numbers, Assamese (15 million), Catalan (4 million) and Kinyarwanda (10 million). These alone are more than an Australia.

Furthermore, Facebook considers the internet to consist of Facebook and Wikipedia (Zero).

I view this as just another extension of their Next Billion initiative, an effort to ensure that another billion people are monopolised by Facebook.

That's the payoff.

jw4ng|3 years ago

We think it's important for AI to truly support everyone in the world. A world where AI only serves a subset of the population is not ideal. In machine translation, this means supporting as many language as possible at high quality. We also imagine a future where anyone will be able to communicate with anyone else seamlessly; this also means solving translations for all languages.

dunefox|3 years ago

Small data, big meaning is much more important than big data, little meaning. Much closer to real intelligence.

munificent|3 years ago

Cynical answer: It's good PR.

onurcel|3 years ago

hi @btheshoe, I work on this project in the data part. As others mentioned, the amount of data available for a language is not correlated to the number of speakers of that language, which explains the potential impact of focusing on these.

labrador|3 years ago

I'll know AI translators are any good when the United Nations starts using them

"Skills required: United Nations translators are required to have a perfect command of their main language and an excellent knowledge of, in most cases, two other official languages"

https://www.un.org/dgacm/en/content/translation

epolanski|3 years ago

My ex is a translator at an embassy, and she always said that ai translators are a godsend.

On one side they make their work easier as they can focus more on correcting the ai produced text and focus on author's meaning while eliminating lots of plumbing.

On the other hand they increased the amount of business because much more text is translated than at any other point in history, which requires validation in most business, legal and even personal contexts. Without ai translators those translations would've not happened in the first place.

samatman|3 years ago

An organization built out of pure prestige, with no concept of monetary profit, has zero pressure to stop employing their classmates as translators, ever.

thamer|3 years ago

Does this mean that Facebook's advertising system will finally start rejecting ads calling for genocide in Myanmar, and that they will finally flag comments expressing the same intent? As recently as March of this year there were reports that Facebook accepted ads that said "The current killing of the Kalar is not enough, we need to kill more!" or "They are very dirty. The Bengali/Rohingya women have a very low standard of living and poor hygiene. They are not attractive".

Full story: https://abcnews.go.com/Business/wireStory/kill-facebook-fail...

These were submitted to test Facebook's systems, because there's a good reason not to trust their promises on this front. Facebook was used extensively to propagate hate speech in Myanmar during the crisis of 2017, with their moderation tools and hate speech detection system letting through a ton of hateful content with real-world consequences, in the course of an actual ethnic cleansing campaign.

Other references: "Facebook Admits It Was Used to Incite Violence in Myanmar" https://www.nytimes.com/2018/11/06/technology/myanmar-facebo... (2018)

"Violent hate speech continues to thrive on Facebook in Myanmar, AP report finds" https://www.cbsnews.com/news/myanmar-facebook-violent-hate-s... (9 months ago)

bertil|3 years ago

The issue here wasn’t that Facebook didn’t have resources for a basic translation tool (able to translate open death threats) but that Burmese had inconsistent encoding. That delayed the translation effort.

https://www.localizationlab.org/blog/2019/3/25/burmese-font-...

vjerancrnjak|3 years ago

What are hardware requirements to run this?

I see the mixture model is ~ 300 GB and was trained on 256 GPUs.

I assume distilled versions can easily be run on one GPU.

hello_im_angela|3 years ago

We release several smaller models as well: https://github.com/facebookresearch/fairseq/tree/nllb/exampl... that are 1.3B and 615M parameters. These are usable on smaller GPUs. To create these smaller models but retain good performance, we use knowledge distillation. If you're curious to learn more, we describe the process and results in Section 8.6 of our paper: https://research.facebook.com/publications/no-language-left-...

kwhitefoot|3 years ago

What is a "low resource language"?

jw4ng|3 years ago

hey there, I work on this project. We categorize a language as low-resource if there are fewer than 1M publicly available, de-duplicated bitext samples.

also see section 3, table 1 in the paper: https://research.facebook.com/publications/no-language-left-...

pesenti|3 years ago

https://datascience.stackexchange.com/questions/62868/high-l...

Tabular-Iceberg|3 years ago

My concern with this is that in low resource languages the unavoidable biases of the ML models might overpower their own organic development.

We shrug off all the little quirks of machine translated text because it usually gets the point across, and we recognize them as quirks because most of what we read was written by real people with no such quirks. But when most of what you read contain those quirks, I fear those will quickly become the standard way of writing and even speaking in those languages.

madrox|3 years ago

This happens without machine translation in the wild already with pidgin. If you want to see real life pidgin in action, watch korean and english gamers interact in FPS games. This has been common at the borders of cultures where two languages interact.

Point being, I'm not sure if language purity is more valuable than functionally allowing its people to interact with things they couldn't otherwise. Put another way, should we leave these people locked out of many online resources they can't read because we fear of corrupting their language? Give these people the option and let them decide. Language evolves over time anyway.

texaslonghorn5|3 years ago

In a worst case you can end up with the Scots Wikipedia situation, where some power editor created a bunch of pages using an entirely fabricated, overly stereotypical language and that influenced what people thought Scots actually was.

protomyth|3 years ago

I think it will interesting when it runs into a language (e.g. Dakota) where the women and men speak differently. Should be an interesting test.

HKH2|3 years ago

Won't people trying to learn a low resource language as as a second language also bring their influence?

TaupeRanger|3 years ago

So they have a system that can translate to languages for which there isn't as much data as English, Spanish, etc. Waiting for a Twitter thread from a native speaker of one of these "low resource languages" to let us know how good the actual translations are. Cynically, I'd venture that they hired some native speakers to cherry pick their best translations for the story books. But mostly this just seems like a nice bit of PR (calling it a "breakthrough", etc.). I can't imagine this is going to help anyone who actually speaks a random, e.g., Nilo-Saharan language.

hello_im_angela|3 years ago

If you're curious to try the system yourself, it's actually being used to help Wikipedia editors write articles for low-resource language Wikipedias: https://twitter.com/Wikimedia/status/1544699850960281601

alexott|3 years ago

Twitter may not be representative imho because of the short text. It should first come to a problem of reliable language detection, and Twitter is quite often wrong there

onurcel|3 years ago

in this work we tried to rely not only on automated evaluation scores but also on human evaluation for exactly this reason: we wanted to have a better understanding of how our model actually performs and how it correlates to automated scores.

account42|3 years ago

> Essential cookies

> These cookies are required to use Meta Products. They’re necessary for these sites to work as intended.

What cookies does Facebook "need" to serve a simple article?

LtWorf|3 years ago

Facebook translations are horrifying for the mainstream languages already. They go from completely wrong to kinda understandable but still wrong.

rmbyrro|3 years ago

Looks like they're investing to get better. The model is also available and they called for contributions to improve it.

ShamelessC|3 years ago

Look, I fucking hate Facebook to the point that I can't really be objective about their research. Whenever I see a section on ethical implications or impacts I just think about shit like Myanmar or the insurrection and laugh (cry).

But this is a shallow dismissal that doesn't add anything valuable to the discussion.

"Oh they made their _terrible_ (probably state of the art) machine translation _better_??! Those monsters!!"

NoInkling|3 years ago

I know DeepL doesn't do low-resource languages, but it would be interesting to see a translation quality comparison between the two.

enos_feedler|3 years ago

I was two sentences in before I realized the headline wasn’t “No Luggage Left Behind”

onurcel|3 years ago

this is actually our recurring joke for our team meeting offsites!

schoen|3 years ago

I wonder if spy agencies have already developed, but not published, high-quality SMT methods for lots of minority and little-known languages. :-(

(Edit: and speech-to-text models.)

_nalply|3 years ago

"No Language Left Behind" - really?

Did the people at Meta think about the Signed Languages of the Deaf?

I didn't find a mention. Even Ctrl-F deaf didn't yield anything.

langsoul-com|3 years ago

So so many words but not a hint of any demo. It's just magic according to Facebook. Plz couldn't they at least have a crappy demo to break?

pdonis|3 years ago

tl/dr: Now your words can be misconstrued by far more people than before, because AIs will translate the misunderstandings into as many languages as possible.

zzzeek|3 years ago

So glad it's Facebook doing this and not some other weird company, when translating and delivering information to every culture on the planet it's good to have a trustworthy, ethical company without any past (or heck, even any current, ongoing) issues in spreading misinformation around the globe and contributing to the rise of fascism across the world while profiting massively off of it and denying any culpability, making sure it all goes smoothly.

bvanderveen|3 years ago

Great! Facebook no longer have to provide content moderation in all the various corners of the world where they could accidentally enable the dissemination of misinformation and hate speech in minority languages. They can simply transform it into English and run it back through the existing moderation tooling!

Understanding foreign culture is about reading automated translations of online comments into your native language. It has nothing to do with putting the effort into learning a language and understanding the nuances and current events and issues of the culture it embeds.

The ESL (English as a single language) speakers over at Facebook don't even need to understand foreign cultures, because they already know everyone in the world needs to spend their lives staring into the Metaverse. So grateful that they are working on the world's fattest pipeline for exporting Anglophone culture to every corner of the planet!

159 comments