For those curious, the 24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish.
Maltese, interestingly, is the only Afro-Asiatic derived language.
Hungarian, Finnish, and Estonian are the three Uralic languages.
All the others are Indo-European: Greek is the only Hellenic one, Irish the only Celtic one, and the rest are Baltic, Slavic, Italic, or Germanic.
(I originally used the term Balto-Slavic, though I was unaware of some of the connotations of that term until just now. Baltic and Slavic do share a common origin, but that was a very, very long time ago.)
Seems like the model isn't limited to those, though; from the paper:
> as well as some additional relevant languages (Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian).
The paper also goes into detail on the training set sources; I feel like the curation of those might be considered the main contribution of this publication.
>The EuroLLM Team brings together some of the brightest minds in AI including Unbabel, Instituto Técnico Lisbon, the University of Edinburgh, Instituto de Telecomunicações, Université Paris-Saclay, Aveni, Sorbonne University, Naver Labs, and the University of Amsterdam.
>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU). As soon as we received the EuroHPC JU access to the supercomputer, we were ready to roll up our sleeves and get to work. We developed the small model right away and in less than 6 months the second model was ready.
Aren't all frontier models already able to use all these languages? Support for specific languages doesn't need to be built in, LLMs support all languages because they are trained on multilingual data.
No, that's not how training works. It's not just about having examples in a given language, but also how many examples there are and their ratio relative to other languages. English hugely eclipses every other language in most US models' training data, and that's why performance in other languages is subpar compared to performance in English.
Training is a very different thing. Can't speak for European languages, but LLMs are often much worse in Japanese because tokenisation operates on Unicode bytes, and a single Japanese character often has to be represented by more than one token.
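The byte-count point can be made concrete. A minimal sketch in plain Python (no tokenizer library needed): every Japanese character occupies three bytes in UTF-8, so a byte-level or byte-fallback tokenizer has three times as much raw material to cover per character as it does for ASCII.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))

# ASCII is 1 byte, accented Latin is 2, Japanese (and CJK generally) is 3.
lengths = {ch: utf8_len(ch) for ch in ["a", "é", "日", "本"]}
```

This is why, all else being equal, a tokenizer trained mostly on English text tends to emit more tokens per character of Japanese.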
From what I'm aware of, LLM capability degrades once you move out of English, and many nation-states are either building their own LLMs or considering the option of building them.
Not natively; they all sound translated in languages other than English. I occasionally come across French people complaining about LLMs' use of non-idiomatic French, but it's probably not a French problem at all, considering that this effort includes so many Indo-European languages.
The term "support" is vague. Can you do basic interaction in most other languages? Sure. Is it anywhere close to the competence they have in English? No. Most models seem to just translate English responses, at a beginner's simplistic, monotone level.
Meh, it depends a lot on the dataset, which is heavily skewed towards the main languages. For example, they almost always confuse Czech and Slovak and often swap one for the other in the middle of a chat.
Some cursory clicking about didn't reveal the actual corpus they used, only that it is several trillion tokens 'divided across the languages'. I'm curious mainly because Irish (among some other similarly endangered languages on the list) typically has any large corpus come from legal/governmental texts that are required to be translated. There must surely be only a relatively tiny amount of colloquial Irish in the corpus. It'd be interesting to see some evals in each language, particularly with native speakers.
I think LLMs may be on the whole very positive for endangered languages such as Irish, but before it becomes positive I think there's an amount of danger to be navigated (see Scots Gaelic wikipedia drama for example)
I was thinking the same: why are so many superior models coming only from countries like the US and China? And why are European countries not on the list, other than France with Mistral? Why are so few companies in India, Japan, or South Korea even close to a promising new model, like what the Chinese companies did?
The leading European e-commerce company, Zalando, with 50M users, is now using the leading European AI platform, Hopsworks, to power its real-time AI. Zalando is Databricks' largest EU customer, but they are using Hopsworks instead for operational AI.
You would never hear of it, though, as the European IT press only promotes SV startups.
I used the 9B Instruct version; of the small models, it was the one with the best Latvian knowledge out there, bar none. GPT-OSS 20B, Qwen3 30B A3B, and similar ones weren't even close.
That said, the model itself was a little bit dumb and not something you'd really use for programming/autocomplete or tool calling or anything like that, which also presented some problems. Even for processing text, if you need RAG or tool-server calls, you need to use something like Qwen3 for the actual logic and then pass the contents to EuroLLM for translation/formatting along with the instructions. At that point your n8n workflow looks a bit messy, and you also have to run two models instead of only one.
Meanwhile, the best cloud model for Latvian that I've found so far is Google Gemini 2.5 Pro, but obviously you can't use cloud models in certain on-prem use cases.
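The two-model setup described above can be sketched roughly like this, assuming both models are served behind OpenAI-compatible endpoints (e.g. via vLLM or llama.cpp); the URLs, ports, and model names here are placeholders, not anything EuroLLM ships.

```python
import json
from urllib import request

def build_payload(model: str, messages: list) -> bytes:
    """JSON body for an OpenAI-compatible /v1/chat/completions call."""
    return json.dumps({"model": model, "messages": messages}).encode()

def chat(base_url: str, model: str, messages: list) -> str:
    """Send a chat request and return the assistant's reply text."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_payload(model, messages),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def answer_in_latvian(question: str) -> str:
    # Stage 1: reasoning/logic with the stronger general model.
    draft = chat("http://localhost:8001", "qwen3-30b-a3b",
                 [{"role": "user", "content": question}])
    # Stage 2: translation/formatting with the language-specialised model.
    prompt = f"Translate the following answer into Latvian, keeping the formatting:\n\n{draft}"
    return chat("http://localhost:8002", "eurollm-9b-instruct",
                [{"role": "user", "content": prompt}])
```

The cost of this design is exactly what the comment notes: two models resident in memory and two round trips per answer.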
While the grant process in the EU isn't fun, I think Levels has a bit of an ego issue. He mentioned that if he had issues like that on, e.g., X, he would see Elon himself in the replies.
While he is great at converting his influencer status into income with his micro-SaaS projects, I don't think running ad-fueled browser games on a state-sponsored supercomputer should really be the aim of these grant programs.
> What's REALLY much more important though if you want to be a part of the AI race and I've posted for years here with @euaccofficial
> is to make Europe a really extremely attractive place to start and run an AI business. Remove regulatory obstructions and give tax discounts for startups. Let them build a business first that can compete worldwide and once they make enough money (let's say $100M/y), then slowly start adding regulation.
When you talk to most EU business owners, even in tech, the limiting factor isn't regulations. This being the #1 reason is such a tired trope.
Ironically, China in some ways has a bigger regulatory burden when it comes to software: there, if the government doesn't approve, the business is dead in the water. I doubt that Klarna would've gotten off the ground there, for one; I could see them being shut down much earlier. In the EU, only now are some governments slowly starting to talk about some weak measures around their business model. But I've never, not once in my life, heard "Chinese software companies can't get off the ground due to the regulatory burden".
The same people who clamor about EU regulations are the ones who hate on the EU for its protectionist measures against US tech. Yet another bit of irony here: China's software industry has flourished exactly thanks to protectionist measures against US tech that are ten times stronger. So has Korea's, and their protectionism has never been anywhere near China's level; it sits somewhere between the EU's and China's. No, if there's anything that would help, it's much more tech protectionism in the EU.
Pieter Levels is at the end of the day an influencer, not a serious founder.
Is the point of these policies to pick winners? Or to upskill the creators and stimulate the economy by giving possible entrepreneurs experience Europeans can't get in big tech?
In the US, some ex-Googler might found a startup. Europe doesn't have the equivalent of FAANG. (Europe-wide companies are not quite as easy as US-wide ones.)
Even if the supercomputer itself "fails", is the goal actually the secondary impacts on the economy?
(And in the US, we do our own fair share of picking winners / losers, especially in the current regime)
Please don't fulminate on HN. A sentence like "Actually nuts to me the degree to which European policymakers do not even begin to understand" is inflammatory rhetoric of the kind we're trying to avoid here on HN. The question of how countries/economic unions can function most effectively is a topic worthy of serious discussion, but these discussions can be far more fruitful if they're approached with curiosity rather than rage.
Yeah, no, it's just not how it works. They're trying to support fundamental research, and they have limited resources to accomplish that. Some random dude who wants to build a company that generates pretty AI pictures is just not the target audience, and he rightly got rejected.
And frankly, the dream scenario that Pieter describes, where he somehow would qualify for these resources, also wouldn't help kickstart the tech industry, and it's also not how it works in the States.
What does help, and what European governments (at least the one in The Netherlands that Pieter is from) actually do, is more funding for startups. If you're a startup founder in NL almost every angel you talk to has a matched funding deal with the government. That's such a smart way of keeping up with the US. Do you think US startups get free compute from the government? They don't even get subsidies most of the time. What they get is better funding because there's more capital available, and helping investors with that is exactly how you solve that.
It’s about control. It is a fundamental difference in perspective and mentality regarding control. America is or at least was largely oriented around indirect control or outcome oriented objectives, where Europe is largely oriented around direct control or prescribed objectives.
It's also enshrined in our respective governments and the foundational philosophies that underpin them. The US Declaration of Independence sets out to describe that the natural rights of the men who created the USA are preeminent, and the Constitution lists how those rights may not be infringed upon, i.e., it creates laws that bind and limit the actions of government, something that has never been emulated since. Across Europe, you simply do not have anything even remotely similar; the law inversely describes what you are permitted by government to do instead.
It is effectively descriptive vs. prescriptive law and underlying philosophies. It is something I have had the hardest time on occasion getting my European friends to really internalize, seemingly because it's so contrary to what they're conditioned with all their lives, i.e., that the government is essentially the entity that grants you what it grants you, not that you have rights that the government may not infringe upon.
But to be fair, this possibly European tendency to dominate and control what you may and may not do, and when and how, has long been creeping into the USA too, arguably since even the 12th Amendment to the Constitution, and getting worse with every amendment since: layers upon layers of contradicting and conflicting flaws and bugs that will be rearing their ugly head here in about two years, when Trump may run for President again. And if you don't think he can, you simply don't understand what spaghetti code the Constitution is after the 11th Amendment: hack after hack, building up mountains of debt that are going to come due in our lifetimes.
If I want to use an LLM to do translation, should I use a base model or an instruction tuned version? I've had mixed results using the chat models and a simple "Translate this to <language>: "
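One common approach, for what it's worth: instruct models usually do fine with a plain directive plus the text, while base models often respond better when translation is framed as few-shot text completion. A sketch of the latter (the example pairs below are made up for illustration):

```python
def few_shot_translation_prompt(text: str, target: str) -> str:
    """Build a few-shot completion prompt so a base model continues
    the pattern by emitting the translation after the final colon."""
    examples = [
        ("Good morning.", "Bonjour."),
        ("Where is the station?", "Où est la gare ?"),
    ]
    shots = "\n".join(f"English: {en}\n{target}: {tr}" for en, tr in examples)
    return f"{shots}\nEnglish: {text}\n{target}:"

prompt = few_shot_translation_prompt("The train is late.", "French")
```

With a base model you would then sample a completion of `prompt` and cut at the first newline; with an instruct model, a simple "Translate this to French:" message is usually sufficient.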
I'm somewhat skeptical of taxpayer-funded innovation. I've seen a few Horizon grants from the sidelines; as a citizen I'd prefer not to pay for them, but unfortunately I can't opt out.
No GitHub link, a high-performance claim with zero numbers, no technical details, zero interesting links to learn more. But hey, a key-people page is there, so that's OK.
It seems like, in most ways, it would be bad to train on 24 separate languages. That's just 24 partitions of the data. It seems really inefficient; better to simply train on the biggest (English) and translate.
I do think this will introduce some biases that correlate with the English language, and it would be interesting to see more specifically what that means. But regardless, I don't think you can produce a competitive model with such a large subdivision of the training data.
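For what it's worth, multilingual training sets are usually not hard partitions: a standard technique is temperature (alpha) sampling, which draws from each language's corpus with probability proportional to its token count raised to a power below 1, upweighting low-resource languages without discarding the dominant ones. A sketch with made-up token counts:

```python
def sampling_weights(token_counts: dict[str, float], alpha: float = 0.3) -> dict[str, float]:
    """Probability of drawing from each language corpus under
    temperature sampling: p_i proportional to n_i ** alpha."""
    scaled = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: v / total for lang, v in scaled.items()}

# English at 1B-scale vs a 1000x smaller language: alpha < 1 flattens
# the mixture, so the small language is sampled far above its raw share.
w = sampling_weights({"en": 1_000_000, "lv": 1_000})
```

With `alpha = 1` this reduces to sampling by raw corpus size; with `alpha = 0` every language is drawn equally often.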
arbuge | 5 months ago
It's Semitic, to be precise.
https://en.wikipedia.org/wiki/Semitic_languages
Vinnl | 5 months ago
Best get to retraining those models.
cyfex | 5 months ago
Are there really any other Hellenic languages besides Greek?
sva_ | 5 months ago
https://arxiv.org/pdf/2409.16235
ChrisMarshallNY | 5 months ago
What about Basque? Is that too controversial?
[0] https://en.wikipedia.org/wiki/Hotel_Beau_Séjour
Stagnant | 5 months ago
0: https://sites.google.com/view/eurollm/home
htrp | 5 months ago
[1] https://www.eurohpc-ju.europa.eu/eurohpc-success-story-speak...
Repurposing some of that physics sim compute
melvinmelih | 5 months ago
But they were not trained on government-sanctioned homegrown EU data.
intended | 5 months ago
Plus, all your T&S/AI-safety work is not solved with translation; you need lexicons and datasets of examples.
Like, people use someone in Malaysia to label the Arabic spoken by someone playing a video game in Doha; the cultural context is missing.
The best proxy to show the degree of lopsidedness was this: https://cdt.org/insights/lost-in-translation-large-language-...
Which in turn had to base it on this: https://stats.aclrollingreview.org/submissions/linguistic-di...
whazor | 5 months ago
But European culture could also maybe make a difference? You can already see big differences between Grok and ChatGPT in terms of values.
adt | 5 months ago
Comparison with similar EU models + 600 other highlights:
https://lifearchitect.ai/models-table/
redfloatplane | 5 months ago
In any case I think this is a great initiative.
sireat | 5 months ago
Still, two months earlier a 19-European-language model with 30B parameters got almost no mention:
https://huggingface.co/TildeAI/TildeOpen-30b
Mind you, that is another open model that is begging for fine-tuning (it is not very good out of the box).
jamesblonde | 4 months ago
https://www.youtube.com/watch?v=u8QFiLhnuFg&feature=youtu.be
Disclaimer: I work at Hopsworks.
extraduder_ire | 5 months ago
>You need to agree to share your contact information to access this model
Is this common? I've never seen it on the site before, and it isn't on the smaller model. What are they collecting this information for?
Symmetry | 4 months ago
[1]https://genius.com/Qntal-vedes-amigo-lyrics
troupo | 5 months ago
Cluster: for public benefit, cutting edge research in biotech, medical, robotics.
Levels: I want to create AI photos of people for my AI Slop startup
thefz | 4 months ago
Ah the good old "Europe can't do Silicon Valley" trope.
DrNosferatu | 5 months ago
2. A credible at-scale effort for the EU's own silicon for AI compute wouldn't hurt either.
3. And this can only be achieved by vertical integration to combat fragmentation.
supermatt | 5 months ago
This model was released in 2024, and I couldn't find any links to the training data - is it just an open weights model?
cess11 | 5 months ago
https://www.swiss-ai.org/apertus