For those curious, the 24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish.
Maltese, interestingly, is the only Afro-Asiatic derived language.
Hungarian, Finnish, and Estonian are the three Uralic languages.
All the others are Indo-European: Greek is the only Hellenic one, Irish the only Celtic one, and the rest are Baltic, Slavic, Italic, or Germanic.
(I originally used the term Balto-Slavic, though I was unaware of some of the connotations of that term until just now. Baltic and Slavic do share a common origin, but that was a very, very long time ago.)
Seems like the model isn't limited to those, though; from the paper:
> as well as some additional relevant languages (Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian).
The paper also goes into detail on the training set sources; I feel like the curation of those might be considered the main contribution of this publication.
>The EuroLLM Team brings together some of the brightest minds in AI including Unbabel, Instituto Técnico Lisbon, the University of Edinburgh, Instituto de Telecomunicações, Université Paris-Saclay, Aveni, Sorbonne University, Naver Labs, and the University of Amsterdam.
>Europe is the only continent in the world to have a large public network of supercomputers that are managed by the EuroHPC Joint Undertaking (EuroHPC JU). As soon as we received the EuroHPC JU access to the supercomputer, we were ready to roll up our sleeves and get to work. We developed the small model right away and in less than 6 months the second model was ready.
Aren't all frontier models already able to use all these languages? Support for specific languages doesn't need to be built in, LLMs support all languages because they are trained on multilingual data.
No, that's not how training works. It's not just about having examples in a given language, but also how many examples there are and their ratio relative to other languages. English hugely eclipses every other language in most US models' training data, and that's why performance in other languages is subpar compared to performance in English.
Training is a very different thing. Can't speak for European languages, but LLMs are often much worse in Japanese because tokenisation operates on Unicode bytes, and a single Japanese character often has to be represented by more than one token.
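The byte-count point can be made concrete. A minimal sketch in plain Python (no tokenizer library needed): every Japanese character occupies three bytes in UTF-8, so a byte-level or byte-fallback tokenizer has three times as much raw material to cover per character as it does for ASCII.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))

# ASCII is 1 byte, accented Latin is 2, Japanese (and CJK generally) is 3.
lengths = {ch: utf8_len(ch) for ch in ["a", "é", "日", "本"]}
```

This is why, all else being equal, a tokenizer trained mostly on English text tends to emit more tokens per character of Japanese.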
From what I'm aware of, LLM capability degrades once you move out of English, and many nation-states are either building their own LLMs or considering the option of building them.
Not natively; they all sound translated in languages other than English. I occasionally come across French people complaining about LLMs' use of non-idiomatic French, but it's probably not a French problem at all, considering that this effort includes so many Indo-European languages.
The term "support" is vague. Can you do basic interaction in most other languages? Sure. Is it anywhere close to the competence they have in English? No. Most models seem to just translate English responses, at a beginner's simplistic, monotone level.
Meh, it depends a lot on the dataset, which is heavily skewed towards the main languages. For example, they almost always confuse Czech and Slovak and often swap one for the other in the middle of a chat.
Some cursory clicking about didn't reveal the actual corpus they used, only that it is several trillion tokens 'divided across the languages'. I'm curious mainly because Irish (among some other similarly endangered languages on the list) typically has any large corpus come from legal/governmental texts that are required to be translated. There must surely be only a relatively tiny amount of colloquial Irish in the corpus. It'd be interesting to see some evals in each language, particularly with native speakers.
I think LLMs may be on the whole very positive for endangered languages such as Irish, but before it becomes positive I think there's an amount of danger to be navigated (see Scots Gaelic wikipedia drama for example)
I was thinking the same: why are so many superior models coming only from countries like the US and China? And why are European countries not on the list, other than France with Mistral? Why are so few companies in India, Japan, or South Korea even close to a promising new model, like what the Chinese companies did?
The leading European e-commerce company, Zalando, with 50M users, is now using the leading European AI platform, Hopsworks, to power its real-time AI. Zalando is Databricks' largest EU customer, but they are using Hopsworks instead for operational AI.
You would never hear of it, though, as the European IT press only promotes SV startups.
I used the 9B Instruct version; of the small models, it was the one with the best Latvian knowledge out there, bar none. GPT-OSS 20B, Qwen3 30B A3B, and similar ones weren't even close.
That said, the model itself was a little bit dumb and not something you'd really use for programming/autocomplete or tool calling or anything like that, which also presented some problems. Even for processing text, if you need RAG or tool-server calls, you need to use something like Qwen3 for the actual logic and then pass the contents to EuroLLM for translation/formatting along with the instructions. At that point your n8n workflow looks a bit messy, and you also have to run two models instead of only one.
Meanwhile, the best cloud model for Latvian that I've found so far is Google Gemini 2.5 Pro, but obviously you can't use cloud models in certain on-prem use cases.
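The two-model setup described above can be sketched roughly like this, assuming both models are served behind OpenAI-compatible endpoints (e.g. via vLLM or llama.cpp); the URLs, ports, and model names here are placeholders, not anything EuroLLM ships.

```python
import json
from urllib import request

def build_payload(model: str, messages: list) -> bytes:
    """JSON body for an OpenAI-compatible /v1/chat/completions call."""
    return json.dumps({"model": model, "messages": messages}).encode()

def chat(base_url: str, model: str, messages: list) -> str:
    """Send a chat request and return the assistant's reply text."""
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_payload(model, messages),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def answer_in_latvian(question: str) -> str:
    # Stage 1: reasoning/logic with the stronger general model.
    draft = chat("http://localhost:8001", "qwen3-30b-a3b",
                 [{"role": "user", "content": question}])
    # Stage 2: translation/formatting with the language-specialised model.
    prompt = f"Translate the following answer into Latvian, keeping the formatting:\n\n{draft}"
    return chat("http://localhost:8002", "eurollm-9b-instruct",
                [{"role": "user", "content": prompt}])
```

The cost of this design is exactly what the comment notes: two models resident in memory and two round trips per answer.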
While the grant process in the EU isn't fun, I think Levels has a bit of an ego issue. He mentioned that if he had issues like that on, e.g., X, he would see Elon himself in the replies.
While he is great at converting his influencer status into income with his micro-SaaS projects, I don't think running ad-fueled browser games on a state-sponsored supercomputer should really be the aim of these grant programs.
> What's REALLY much more important though if you want to be a part of the AI race and I've posted for years here with @euaccofficial
> is to make Europe a really extremely attractive place to start and run an AI business. Remove regulatory obstructions and give tax discounts for startups. Let them build a business first that can compete worldwide and once they make enough money (let's say $100M/y), then slowly start adding regulation.
When you talk to most EU business owners, even in tech, the limiting factor isn't regulations. This being the #1 reason is such a tired trope.
Ironically, China in some ways has a bigger regulatory burden when it comes to software: there, if the government doesn't approve, the business is dead in the water. I doubt that Klarna would've gotten off the ground there, for one; I could see them being shut down much earlier. In the EU, only now are some governments slowly starting to talk about some weak measures around their business model. But I've never, not once in my life, heard "Chinese software companies can't get off the ground due to the regulatory burden".
The same people who clamor about EU regulations are the ones who hate on the EU for its protectionist measures against US tech. Yet another bit of irony here: China's software industry has flourished exactly thanks to protectionist measures against US tech that are ten times stronger. So has Korea's, and their protectionism has never been anywhere near China's level; it sits somewhere between the EU's and China's. No, if there's anything that would help, it's much more tech protectionism in the EU.
Pieter Levels is at the end of the day an influencer, not a serious founder.
Is the point of these policies to pick winners? Or to upskill the creators and stimulate the economy by giving possible entrepreneurs experience Europeans can't get in big tech?
In the US, some ex-Googler might found a startup. Europe doesn't have the equivalent of FAANG. (Europe-wide companies are not quite as easy as US-wide ones.)
Even if the supercomputer itself "fails", is the goal actually the secondary impacts on the economy?
(And in the US, we do our own fair share of picking winners / losers, especially in the current regime)
Please don't fulminate on HN. A sentence like "Actually nuts to me the degree to which European policymakers do not even begin to understand" is inflammatory rhetoric of the kind we're trying to avoid here on HN. The question of how countries/economic unions can function most effectively is a topic worthy of serious discussion, but these discussions can be far more fruitful if they're approached with curiosity rather than rage.
Yeah, no, it's just not how it works. They're trying to support fundamental research, and they have limited resources to accomplish that. Some random dude who wants to build a company that generates pretty AI pictures is just not the target audience, and he rightly got rejected.
And frankly, the dream scenario that Pieter describes, where he somehow would qualify for these resources, also wouldn't help kickstart the tech industry, and it's also not how it works in the States.
What does help, and what European governments (at least the one in The Netherlands that Pieter is from) actually do, is more funding for startups. If you're a startup founder in NL almost every angel you talk to has a matched funding deal with the government. That's such a smart way of keeping up with the US. Do you think US startups get free compute from the government? They don't even get subsidies most of the time. What they get is better funding because there's more capital available, and helping investors with that is exactly how you solve that.
It’s about control. It is a fundamental difference in perspective and mentality regarding control. America is or at least was largely oriented around indirect control or outcome oriented objectives, where Europe is largely oriented around direct control or prescribed objectives.
It's also enshrined in our respective governments and the foundational philosophies that underpin them. The US Declaration of Independence sets out to describe that the natural rights of the men who created the USA are preeminent, and the Constitution lists how those rights may not be infringed upon, i.e., it creates laws that bind and limit the actions of government, something that has never been emulated since. Across Europe, you simply do not have anything even remotely similar; the law inversely describes what you are permitted by government to do instead.
It is effectively descriptive vs. prescriptive law and underlying philosophies. It is something I have had the hardest time on occasion getting my European friends to really internalize, seemingly because it's so contrary to what they're conditioned with all their lives, i.e., that the government is essentially the entity that grants you what it grants you, not that you have rights that the government may not infringe upon.
But to be fair, this possibly European tendency to dominate and control what you may and may not do, and when and how, has long been creeping into the USA too, arguably since even the 12th Amendment to the Constitution, and getting worse with every amendment since: layers upon layers of contradicting and conflicting flaws and bugs that will be rearing their ugly head here in about two years, when Trump may run for President again. And if you don't think he can, you simply don't understand what spaghetti code the Constitution is after the 11th Amendment: hack after hack, building up mountains of debt that are going to come due in our lifetimes.
If I want to use an LLM to do translation, should I use a base model or an instruction tuned version? I've had mixed results using the chat models and a simple "Translate this to <language>: "
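One common approach, for what it's worth: instruct models usually do fine with a plain directive plus the text, while base models often respond better when translation is framed as few-shot text completion. A sketch of the latter (the example pairs below are made up for illustration):

```python
def few_shot_translation_prompt(text: str, target: str) -> str:
    """Build a few-shot completion prompt so a base model continues
    the pattern by emitting the translation after the final colon."""
    examples = [
        ("Good morning.", "Bonjour."),
        ("Where is the station?", "Où est la gare ?"),
    ]
    shots = "\n".join(f"English: {en}\n{target}: {tr}" for en, tr in examples)
    return f"{shots}\nEnglish: {text}\n{target}:"

prompt = few_shot_translation_prompt("The train is late.", "French")
```

With a base model you would then sample a completion of `prompt` and cut at the first newline; with an instruct model, a simple "Translate this to French:" message is usually sufficient.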
I'm somewhat skeptical of taxpayer-funded innovation. I've seen a few Horizon grants from the sidelines; as a citizen I'd prefer not to pay for them, but unfortunately I can't opt out.
No GitHub link, a high-performance claim with zero numbers, no technical details, zero interesting links to learn more. But hey, a key-people page is there, so that's OK.
It seems like, in most ways, it would be bad to train on 24 separate languages. That's just 24 partitions of the data. It seems really inefficient; better to simply train on the biggest (English) and translate.
I do think this will introduce some biases that correlate with the English language, and it would be interesting to see more specifically what that means. But regardless, I don't think you can produce a competitive model with such a large subdivision of the training data.
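For what it's worth, multilingual training sets are usually not hard partitions: a standard technique is temperature (alpha) sampling, which draws from each language's corpus with probability proportional to its token count raised to a power below 1, upweighting low-resource languages without discarding the dominant ones. A sketch with made-up token counts:

```python
def sampling_weights(token_counts: dict[str, float], alpha: float = 0.3) -> dict[str, float]:
    """Probability of drawing from each language corpus under
    temperature sampling: p_i proportional to n_i ** alpha."""
    scaled = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: v / total for lang, v in scaled.items()}

# English at 1B-scale vs a 1000x smaller language: alpha < 1 flattens
# the mixture, so the small language is sampled far above its raw share.
w = sampling_weights({"en": 1_000_000, "lv": 1_000})
```

With `alpha = 1` this reduces to sampling by raw corpus size; with `alpha = 0` every language is drawn equally often.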
arbuge | 5 months ago
It's Semitic, to be precise.
https://en.wikipedia.org/wiki/Semitic_languages
Vinnl | 5 months ago
Best get to retraining those models.
cyfex | 5 months ago
Are there really any other Hellenic languages besides Greek?
sva_ | 5 months ago
https://arxiv.org/pdf/2409.16235
ChrisMarshallNY | 5 months ago
What about Basque? Is that too controversial?
[0] https://en.wikipedia.org/wiki/Hotel_Beau_Séjour
Stagnant | 5 months ago
0: https://sites.google.com/view/eurollm/home
htrp | 5 months ago
[1] https://www.eurohpc-ju.europa.eu/eurohpc-success-story-speak...
Repurposing some of that physics sim compute
melvinmelih | 5 months ago
But they were not trained on government-sanctioned homegrown EU data.
intended | 5 months ago
Plus, all your T&S/AI-safety work is not solved with translation; you need lexicons and datasets of examples.
Like, people use someone in Malaysia to label the Arabic spoken by someone playing a video game in Doha; the cultural context is missing.
The best proxy to show the degree of lopsidedness was this: https://cdt.org/insights/lost-in-translation-large-language-...
Which in turn had to base it on this: https://stats.aclrollingreview.org/submissions/linguistic-di...
whazor | 5 months ago
But European culture could also maybe make a difference? You can already see big differences between Grok and ChatGPT in terms of values.
adt | 5 months ago
Comparison with similar EU models + 600 other highlights:
https://lifearchitect.ai/models-table/
redfloatplane | 5 months ago
In any case I think this is a great initiative.
sireat | 5 months ago
Still, two months earlier a 19-European-language model with 30B parameters got almost no mention:
https://huggingface.co/TildeAI/TildeOpen-30b
Mind you, that is another open model that is begging for fine-tuning (it is not very good out of the box).
jamesblonde | 4 months ago
https://www.youtube.com/watch?v=u8QFiLhnuFg&feature=youtu.be
Disclaimer: I work at Hopsworks.
extraduder_ire | 5 months ago
>You need to agree to share your contact information to access this model
Is this common? I've never seen it on the site before, and it isn't on the smaller model. What are they collecting this information for?
Symmetry | 4 months ago
[1]https://genius.com/Qntal-vedes-amigo-lyrics
troupo | 5 months ago
Cluster: for public benefit, cutting edge research in biotech, medical, robotics.
Levels: I want to create AI photos of people for my AI Slop startup
thefz | 4 months ago
Ah the good old "Europe can't do Silicon Valley" trope.
DrNosferatu | 5 months ago
2. A credible at-scale effort for the EU's own silicon for AI compute wouldn't hurt either.
3. And this can only be achieved by vertical integration to combat fragmentation.
supermatt | 5 months ago
This model was released in 2024, and I couldn't find any links to the training data - is it just an open weights model?
cess11 | 5 months ago
https://www.swiss-ai.org/apertus