jw4ng
|
3 years ago
|
on: No Language Left Behind
Without any supervised training data, it's pretty difficult to create a very good translation model. For many languages, data might only be available through religious domains such as the Bible, or not available at all. We created a dataset called NLLB-Seed for this reason --- it's approximately 6K sentences available in 39 languages, translating a broad set of topics from Wikipedia. We found that with a dataset like NLLB-Seed, we're able to have sufficient supervised signal to jumpstart our automatic dataset creation pipeline. Of course, the more high-quality aligned data the better the model performance, but our project explores how we can make models more efficient at learning even when the training data is small.
Importantly, models can learn from other languages that are similar. If we train separate models for each direction on small amounts of data, the performance is significantly worse than grouping languages in one large multilingual model.
jw4ng
|
3 years ago
|
on: No Language Left Behind
This enables any researcher to use our code freely and build on top of it for their own research. We do not intend to license our project commercially.
jw4ng
|
3 years ago
|
on: No Language Left Behind
We think it's important for AI to truly support everyone in the world. A world where AI only serves a subset of the population is not ideal. In machine translation, this means supporting as many languages as possible at high quality. We also imagine a future where anyone will be able to communicate with anyone else seamlessly; this also means solving translation for all languages.
jw4ng
|
3 years ago
|
on: No Language Left Behind
I gained a deeper understanding of what it truly means to be inclusive. Every language is unique, just like every person. Making sure content works for all and including as many people as possible is really, really hard, but through this project I'm hopeful we are taking it one step further.
jw4ng
|
3 years ago
|
on: No Language Left Behind
All translation directions go directly from language X to language Y, with no intermediary. We evaluate quality across 40,602 different translation directions using FLORES-200. 2,440 directions contain supervised training data created through our data effort, and the remaining 38,162 are zero-shot.
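A quick sanity check of the numbers above: if every ordered pair of languages in the evaluation set is a direction, then 40,602 directions implies 202 languages (since 202 × 201 = 40,602), and the zero-shot count is simply the total minus the supervised directions. The sketch below assumes that pairing scheme; the variable names are illustrative, not from the NLLB codebase.

```python
def directions(n: int) -> int:
    """Number of directed translation pairs among n languages."""
    return n * (n - 1)

n_langs = 202                      # implied by 40,602 = 202 * 201
total = directions(n_langs)        # 40,602 evaluated directions
supervised = 2_440                 # directions with supervised training data
zero_shot = total - supervised     # 38,162 zero-shot directions

print(total, supervised, zero_shot)
```

This also shows why direction counts grow quadratically: adding one language to an n-language model adds 2n new directions to cover.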
jw4ng
|
3 years ago
|
on: No Language Left Behind
Jeff Wang here with my fellow Meta AI colleague Angela Fan from No Language Left Behind, watching the comments flow through. If you want to ask us anything, go for it!