miros_love | 10 months ago

Take the Slovene language, for example. You simply don't have enough data on it. But if you add all the data that is available for related languages, you get higher quality. LLMs lack this property for low-resource languages.

Etheryte | 10 months ago

I'm not sure I'm convinced. I speak a small European language, and the general experience is that LLMs are often wrong precisely because they think they can just borrow from a related language. The result is even worse and often makes no sense whatsoever. In other words, as far as translation goes, confidently incorrect is not useful.

yorwba | 10 months ago

They train on 14 billion tokens in Slovene. Are you sure that's not enough?

miros_love | 10 months ago

Unfortunately, yes.

We need more tokens, a wider variety of topics, and more complex texts.