item 34624075

Ask HN: I want to train a LM on my home country's dialect, how can I do it?

24 points| the_generalist | 3 years ago | reply

I'm from Algeria. The language spoken on a daily basis by almost everybody is a weird mix of different languages: French, Arabic, English, etc.

I was thinking of grabbing data from tweets to fine-tune the model. I may be able to figure out other sources, but it's not gonna be much better than that. Just short-form text for the most part.

I was thinking of potentially leveraging the smaller models I came across recently (nanoGPT for example) or something similar.

I'm tech-savvy enough to make this work, but I'd like some feedback from people more knowledgeable than me before I spend time and effort on this.

Thanks!

12 comments

[+] ktrnka|3 years ago|reply
I'd suggest starting by building a high-quality data set with text from a variety of domains, and publishing that first. Maybe even develop some related tech, like adding the dialect to language-ID packages. Another key thing might be to build a nicely curated word list for the dialect, and to make sure there's good documentation for researchers wanting to work on the language.

Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links in here https://ai.googleblog.com/2023/01/google-research-2022-beyon...

But also when it's been successful, it's an effort of many different researchers. And it usually starts with data.

Training a language model on top of it is definitely doable even for individuals; you just might not be able to train on a huge data set, or you might hit a wall in terms of the perplexity you can reasonably reach.
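
The curation side of this advice can be sketched in a few lines. Below is a stdlib-only toy showing two of the steps mentioned (deduplication and a frequency word list); the tokenization regex, the normalization, and the `min_count` threshold are my own assumptions, not anything specified in the thread:

```python
import collections
import re

def dedup(texts):
    """Drop near-exact duplicates (case/whitespace-insensitive),
    preserving the order of first occurrence."""
    seen, out = set(), []
    for t in texts:
        key = " ".join(t.lower().split())  # normalize case and whitespace
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out

def build_wordlist(texts, min_count=2):
    """Count word frequencies across a corpus and return words seen
    at least min_count times, most frequent first."""
    counts = collections.Counter()
    for text in texts:
        # crude tokenization: runs of letters (covers Latin and Arabic script)
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return [w for w, c in counts.most_common() if c >= min_count]
```

Real pipelines would add near-duplicate detection (e.g. hashing shingles) and script-aware normalization, but even this level of hygiene pays off before any training.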

[+] ktrnka|3 years ago|reply
Also, I'm happy to help any way I can. I'm not sure of the best practices for sharing contact info on HN, but if you Google "K Trnka language modeling" I should be the only one.
[+] LunarAurora|3 years ago|reply
Sadly, the very best datasets that seem publicly available are for Gulf Arabic dialect (where the money is) [1]

I suggest you contact https://www.icompass.tn/, a (Tunisian) startup specialized in Natural Language Processing that processes Arabic dialects and African languages.

On a general note, I believe this kind of work should (urgently) be nationally funded, because these countries will be forced to use second languages like French, or literary Arabic, when AI/NLP becomes the dominant computing paradigm (bots, prompts...). A model in this respect is what Sweden is doing [2]. For mostly "oral" dialects (like Algerian, I guess), collaborating with big names on adapting the best transcription models (like Whisper) to them first is the key IMO.

[1] https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/...

[2] https://news.ycombinator.com/item?id=34492572

[+] the_generalist|3 years ago|reply
Hey, thanks for the reply, those are some very good points you raised. I'll explore the resources you shared as well.

The trick with this kind of project is the outcome. The way I was thinking about it was mostly as a personal side project. But if it requires more resources and effort than that, then it's a different story.

It's not clear who'd benefit from this, beyond an interesting curiosity to toy with here and there.

[+] yorwba|3 years ago|reply
If all you want is an LM and it doesn't need to be trained by you or run on infrastructure you control, you could try to see whether ChatGPT already understands the dialect well enough. A Tunisian friend of mine told me that he asked it to tell a joke in Tunisian Arabic and it worked, only the joke wasn't funny.

If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and use it to filter https://commoncrawl.org/
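
For the "build a language classifier first" idea, here is a minimal sketch of a character n-gram profile classifier, stdlib only. The overlap-based scoring is one simple choice of mine, not something the commenter specified; off-the-shelf tools like fastText's language ID would be the practical route:

```python
import collections

def ngram_profile(text, n=3):
    """Character n-gram frequency profile, normalized to probabilities."""
    text = f" {text.lower()} "
    grams = collections.Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def train(samples_by_lang):
    """samples_by_lang: {label: [texts]} -> {label: n-gram profile}."""
    return {lang: ngram_profile(" ".join(texts))
            for lang, texts in samples_by_lang.items()}

def classify(profiles, text):
    """Pick the language whose profile overlaps most with the text's."""
    p = ngram_profile(text)
    def overlap(lang):
        return sum(min(p.get(g, 0.0), profiles[lang].get(g, 0.0)) for g in p)
    return max(profiles, key=overlap)
```

With labeled samples of Darja vs. French vs. MSA, the same scheme could score each Common Crawl line and keep only the dialect hits.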

[+] the_generalist|3 years ago|reply
That's really interesting. I really wanted to keep things simple: get the data (scrape Twitter, for example), then use HuggingFace's AutoML or something similar. I'm not sure if this is even possible, but this was my initial "pipeline".
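
If the pipeline ends at nanoGPT rather than HuggingFace, the data-prep step is genuinely small. Here is a character-level sketch modeled loosely on nanoGPT's `shakespeare_char` prepare script; the 90/10 split and the char-level vocabulary are assumptions on my part (nanoGPT's larger configs use BPE tokens instead):

```python
def prepare(text, val_fraction=0.1):
    """Map a raw corpus to integer ids and split into train/val lists,
    the kind of input a nanoGPT-style training loop consumes."""
    chars = sorted(set(text))                    # character-level vocabulary
    stoi = {ch: i for i, ch in enumerate(chars)} # char -> id
    ids = [stoi[ch] for ch in text]
    n = int(len(ids) * (1 - val_fraction))
    return ids[:n], ids[n:], stoi
```

Char-level is a reasonable fit for a code-switching dialect with inconsistent spelling, since no tokenizer has to be trained first.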
[+] barrenko|3 years ago|reply
Do you know of further links or resources in this "direction"? This is awesome.
[+] tooltitude|3 years ago|reply
You could do data augmentation. You could automatically translate (there are open-source models to do so) into your language from close enough languages, and train your model on this data.