jw4ng
|
3 years ago
|
on: No Language Left Behind
Without any supervised training data, it's pretty difficult to create a very good translation model. For many languages, data might only be available through religious domains such as the Bible, or not available at all. We created a dataset called NLLB-Seed for this reason --- it's approximately 6K sentences available in 39 languages, translating a broad set of topics from Wikipedia. We found that with a dataset like NLLB-Seed, we're able to have sufficient supervised signal to jumpstart our automatic dataset creation pipeline. Of course, the more high-quality aligned data the better the model performance, but our project explores how we can make models more efficient at learning even when the training data is small.
Importantly, models can learn from other languages that are similar. If we train separate models for each direction on small amounts of data, the performance is significantly worse than grouping languages in one large multilingual model.
jw4ng
|
3 years ago
|
on: No Language Left Behind
This enables any researcher to use our code freely and build on top of it for their own research. We do not intend to license our project commercially.
jw4ng
|
3 years ago
|
on: No Language Left Behind
We think it's important for AI to truly support everyone in the world. A world where AI only serves a subset of the population is not ideal. In machine translation, this means supporting as many languages as possible at high quality. We also imagine a future where anyone will be able to communicate with anyone else seamlessly; this also means solving translation for all languages.
jw4ng
|
3 years ago
|
on: No Language Left Behind
I gained a deeper understanding of what it truly means to be inclusive. Every language is unique, just like every person. Making sure content works for all and including as many people as possible is really, really hard, but through this project I'm hopeful we are taking it one step further.
jw4ng
|
3 years ago
|
on: No Language Left Behind
All translation directions go directly from language X to language Y, with no intermediary. We evaluate quality across 40,602 different translation directions using FLORES-200. 2,440 directions contain supervised training data created through our data effort, and the remaining 38,162 are zero-shot.
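A quick sanity check of the numbers above: if every ordered pair of languages in the evaluation set is a direction, then 40,602 directions implies 202 languages (since 202 × 201 = 40,602), and the zero-shot count is simply the total minus the supervised directions. The sketch below assumes that pairing scheme; the variable names are illustrative, not from the NLLB codebase.

```python
def directions(n: int) -> int:
    """Number of directed translation pairs among n languages."""
    return n * (n - 1)

n_langs = 202                      # implied by 40,602 = 202 * 201
total = directions(n_langs)        # 40,602 evaluated directions
supervised = 2_440                 # directions with supervised training data
zero_shot = total - supervised     # 38,162 zero-shot directions

print(total, supervised, zero_shot)
```

This also shows why direction counts grow quadratically: adding one language to an n-language model adds 2n new directions to cover.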
jw4ng
|
3 years ago
|
on: No Language Left Behind
Jeff Wang here with my fellow Meta AI colleague Angela Fan from No Language Left Behind, watching the comments flow through. If you want to ask us anything, go for it!