jcuenod | 6 months ago
A few thoughts:
1. You can't cut off the embedding layer or discard the tokenizer without throwing out the model you're starting with. The attention matrices were trained jointly with the token embedding layer, so they only make sense applied to those embeddings.
2. Basically the same thing regarding the tokenizer. If you need to add some tokens, that can be done (or you can repurpose existing tokens) if your script is unique (a problem I face periodically). But if you are initializing weights for new tokens, that means those tokens are untrained. So if you do that for all your data, you're training a new model.
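To make point 2 concrete: in Hugging Face transformers this is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Here's a minimal pure-Python sketch of what that resize does to the embedding table (the function name and shapes are illustrative, not the library's internals):

```python
import random

def resize_token_embeddings(emb, new_vocab_size, dim):
    """Mimic what happens when you add tokens to a pretrained model:
    existing rows (trained vectors) are kept untouched, and the new
    rows are just random noise until you fine-tune them."""
    n_new = new_vocab_size - len(emb)
    new_rows = [[random.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(n_new)]
    return emb + new_rows

# Three "trained" embedding rows for an existing vocab of 3 tokens.
dim = 4
trained = [[0.1 * (i + 1)] * dim for i in range(3)]

resized = resize_token_embeddings(trained, 5, dim)
assert resized[:3] == trained  # old tokens keep their trained vectors
assert len(resized) == 5       # two fresh, untrained rows appended
```

The point being: the random rows carry no signal at all, so the more of your vocabulary lives in new tokens, the closer you are to training from scratch.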
3. The Gemma model series sounds like a good fit for your use case. I'm not confident about Hebrew support, let alone Hasidic Yiddish, but it is relatively multilingual (more so than many other open models). Being multilingual means the odds are greater that it already has tokens relevant to your corpus, and that those tokens have been trained to a useful starting point for your dataset.
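One cheap way to check point 3 on your own corpus is to measure the tokenizer's "fertility" (tokens per character) on a sample: a high ratio means the tokenizer is falling back to character/byte-level pieces, i.e. the script is poorly covered. A sketch with toy tokenizers standing in for the real one (with an actual model you'd pass in its Hugging Face tokenizer's encode method instead):

```python
def fertility(tokenize, text):
    """Tokens per non-space character. Near (or above) 1.0 means the
    tokenizer is splitting the script down to characters; a well
    covered script comes out much lower."""
    chars = len(text.replace(" ", ""))
    return len(tokenize(text)) / chars

# Toy stand-ins: char-level (worst case) vs word-level (best case).
char_level = list        # every character becomes its own token
word_level = str.split   # every whitespace word becomes one token

sample = "דער רבי האט געזאגט א ווארט"  # a short Yiddish sample
print(fertility(char_level, sample))   # > 1.0: spaces tokenized too
print(fertility(word_level, sample))   # well under 1.0
```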
4. If you can generate synthetic data with synonyms or POS tags, then great. But this is a language model, so you need to think about how to usefully teach it natural sequences of text (not how to tag nouns or identify synonyms - I also did a bunch of classic NLP, and it's depressing how irrelevant all that work is these days). I suspect that repurposing this data won't be worth it. So, if anything, I'd do that as a second pass.
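If you do come back to that data for a second pass, the trick is to render it as fluent text rather than tag sequences. A hypothetical sketch for synonym pairs (the template wording is mine, not any standard format):

```python
def synonyms_to_sentences(pairs):
    """Render synonym pairs as natural-language sentences so a
    second-pass fine-tune still trains on fluent text, not tag data."""
    return [f"In this dialect, '{a}' means roughly the same as '{b}'."
            for a, b in pairs]

examples = synonyms_to_sentences([("gut", "fayn")])
print(examples[0])
```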
5. Take a look at the Unsloth notebooks for training a Gemma 3 model and load up your data. I reckon it'll surprise you how effective these models are...
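Those notebooks generally expect a dataset where each row has a "text" field. For continued pretraining on a raw corpus, a prep step along these lines (chunk size and field name are assumptions to adapt, not requirements of any library) gets you there:

```python
def chunk_corpus(text, max_chars=2000):
    """Split a raw corpus into roughly fixed-size training examples,
    breaking on paragraph boundaries so each example stays natural,
    coherent text."""
    chunks, buf = [], ""
    for para in text.split("\n\n"):
        if buf and len(buf) + len(para) > max_chars:
            chunks.append(buf.strip())
            buf = ""
        buf += para + "\n\n"
    if buf.strip():
        chunks.append(buf.strip())
    return chunks

raw_corpus = "פאראגראף איינס.\n\nפאראגראף צוויי.\n\nפאראגראף דריי."
# One {"text": ...} row per chunk, ready to wrap in a Dataset:
records = [{"text": c} for c in chunk_corpus(raw_corpus, max_chars=40)]
```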