TildeOpen LLM: A New Era of European Language AI with Over 30 Billion Parameters
In the rapidly evolving landscape of artificial intelligence, a new contender has emerged, poised to redefine the capabilities and accessibility of Large Language Models (LLMs) for a significant portion of the global population. Tilde AI, a leading Latvian language technology firm, has officially launched TildeOpen LLM, an open-source foundational large language model with an impressive 30 billion parameters. This release is not merely an incremental update; it represents a strategic leap forward, particularly for European languages, aiming to bridge the linguistic divide often present in current AI technologies and bolster digital sovereignty within the European Union.
Addressing the European Language Gap
The current AI ecosystem, dominated by models trained predominantly on English, often struggles to deliver comparable performance for the diverse array of European languages. This linguistic bias leads to noticeable deficiencies, including awkward sentence structures, grammatical inaccuracies, and the misapplication of terms, especially when dealing with more complex or specialized tasks. TildeOpen LLM was developed with the explicit goal of overcoming these limitations. By focusing on languages frequently underrepresented in mainstream LLMs—such as those of the Baltic countries, Ukrainian, and Turkish—TildeOpen LLM promises a more accurate, nuanced, and culturally aware AI experience for speakers of these languages. Artūrs Vasiļevskis, CEO of Tilde, highlighted this crucial distinction, explaining that while popular commercial models may excel in English, TildeOpen was meticulously tailored to ensure robust performance across a wider spectrum of European tongues.
Security, Sovereignty, and EU Compliance
Beyond linguistic accuracy, TildeOpen LLM places a strong emphasis on data security and privacy, aligning with the European Union's stringent regulatory framework. A key feature is the model's capability to be hosted on a local server or within secure cloud storage. This self-hosting option provides organizations with direct control over their data, ensuring that sensitive information remains within their premises or a trusted, EU-compliant environment. This is a critical differentiator from many global commercial models that are typically hosted in data centers outside the EU, potentially posing challenges for compliance with regulations like the GDPR and the upcoming AI Act. Tilde's commitment to European data protection standards is a cornerstone of TildeOpen LLM's design, offering a trustworthy AI solution for businesses and public institutions operating within the EU.
A Foundation Built on European Supercomputing Power
The development of TildeOpen LLM was significantly enabled by access to cutting-edge European supercomputing resources. As a recipient of the European Commission's "Large AI Grand Challenge," Tilde was awarded substantial computational power, including approximately 2 million GPU hours on the LUMI supercomputer in Finland and the JUPITER supercomputer in Germany. This access to high-performance computing was instrumental in training a model of TildeOpen LLM's scale and complexity. The training process itself involved sophisticated techniques, utilizing EleutherAI-inspired GPT-NeoX scripts over roughly 450,000 update steps and consuming approximately 2 trillion training tokens. The training regimen incorporated a three-stage sampling strategy designed to ensure equitable representation across languages, balancing uniform distribution with natural language distribution to boost performance for high-data-volume languages while also rebalancing rarer language examples.
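Tilde has not published the exact weights behind this schedule, but the idea can be illustrated with a minimal sketch. The stage logic and the corpus sizes below are assumptions for illustration, not Tilde's actual recipe:

```python
def stage_weights(token_counts, stage):
    """Per-language sampling probabilities for one stage of the schedule.

    Stage 1: uniform across languages (every language seen equally).
    Stage 2: natural distribution (favors high-data-volume languages).
    Stage 3: uniform again (rebalances rarer languages at the end).
    """
    langs = list(token_counts)
    if stage in (1, 3):  # the two uniform passes
        return {lang: 1.0 / len(langs) for lang in langs}
    total = sum(token_counts.values())  # natural-distribution pass
    return {lang: n / total for lang, n in token_counts.items()}

# Hypothetical corpus sizes in tokens (not real figures).
counts = {"en": 800_000_000, "lv": 40_000_000, "is": 10_000_000}

uniform = stage_weights(counts, 1)   # every language gets 1/3
natural = stage_weights(counts, 2)   # English dominates, Icelandic is rare
```

The point of bracketing the natural-distribution pass with two uniform passes is that high-resource languages get enough exposure to reach strong absolute quality, while low-resource languages are not drowned out over the full training run.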
Open Source: Fostering Collaboration and Innovation
A defining characteristic of TildeOpen LLM is its open-source nature. Released under a permissive CC-BY-4.0 license, the model is freely accessible to a wide range of users, including national authorities, companies, scientists, students, and industry sectors such as healthcare, finance, and insurance. This open approach is intended to democratize access to advanced AI technology, enabling developers to fine-tune the base model for specific tasks. For instance, organizations can develop custom AI assistants proficient in European languages or build specialized translation models. The open-source availability is expected to foster a vibrant ecosystem of innovation, encouraging community-driven development and the creation of tailored AI solutions across Europe.
Technical Architecture and Performance
TildeOpen LLM is built as a 30-billion-parameter dense decoder-only transformer. Its architecture features 60 transformer layers, an embedding size of 6,144, and 48 attention heads, with a context window capable of handling 8,192 tokens. The model employs SwiGLU activation functions, RoPE positional encoding, and RMSNorm layer normalization, design choices that favor efficient handling of long contexts and robust multilingual inference. The performance of TildeOpen LLM has been rigorously evaluated on various benchmarks. Notably, it sets a new state-of-the-art on the Belebele reading comprehension benchmark, achieving an average accuracy of 84.7%, outperforming leading models like Gemma-27B, ALIA-40B, and EuroLLM-22B. Its superiority is particularly evident in languages often underserved by global models, such as Icelandic and Finnish, where it demonstrates significantly higher accuracy. Furthermore, TildeOpen LLM exhibits remarkable efficiency gains in morphologically rich languages like Latvian and Lithuanian compared to models such as LLaMA-3, GPT-4o, and Mistral, making it a faster, more accurate, and sustainable alternative for European languages.
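Two of the named architectural choices, RMSNorm and SwiGLU, are compact enough to sketch. The NumPy example below uses toy dimensions so it runs instantly; TildeOpen's real embedding size is 6,144, and its feed-forward width is not published, so nothing here is to scale:

```python
import numpy as np

# Toy dimensions for illustration only (the real model uses d_model = 6144).
D_MODEL, D_FF = 8, 16

def rms_norm(x, weight, eps=1e-5):
    # RMSNorm: rescale by the root-mean-square of the activations.
    # Unlike LayerNorm, there is no mean-centering, which is cheaper.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward block: a SiLU-gated linear unit, then a
    # down-projection back to the model dimension.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((2, D_MODEL))          # a toy batch of 2 token vectors
g = rms_norm(x, np.ones(D_MODEL))              # normalized activations
y = swiglu_ffn(g,
               rng.standard_normal((D_MODEL, D_FF)),
               rng.standard_normal((D_MODEL, D_FF)),
               rng.standard_normal((D_FF, D_MODEL)))
```

In the full model, 60 such layers (each pairing attention with a SwiGLU feed-forward block, both preceded by RMSNorm) are stacked over the 6,144-dimensional embedding.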
A Strategic Vision for European AI
The release of TildeOpen LLM is more than just a technological achievement; it represents a strategic vision for Europe's role in the global AI landscape. By developing its own world-class foundational models, Europe aims to reduce its dependence on AI solutions developed elsewhere, thereby enhancing its technological independence and fostering its own AI infrastructure. Tilde's CEO, Artūrs Vasiļevskis, emphasized this point, stating that "For Europe to be truly sovereign in AI, we must move beyond dependence on English-centric models built elsewhere." Tilde is already working on extending TildeOpen's context length and developing instruction-tuned versions for specialized European tasks, such as legal translation and e-government services, further solidifying its commitment to building a robust and inclusive AI ecosystem for Europe.
The Path Forward
TildeOpen LLM is positioned as a foundational model, serving as a base for future specialized AI solutions. Its open-source nature, coupled with its strong performance and focus on European languages, makes it a compelling option for researchers, developers, and organizations looking to leverage the power of AI while respecting linguistic diversity and data privacy. The model is available for download and use, inviting the global community to explore its capabilities and contribute to its ongoing development.
AI Summary
Tilde AI, a prominent Latvian language technology firm, has introduced TildeOpen LLM, a significant advancement in the field of artificial intelligence. This open-source foundational large language model (LLM) has 30 billion parameters and is meticulously engineered to cater to the diverse linguistic landscape of Europe, with a particular emphasis on under-represented national and regional languages. The release signifies a strategic stride towards achieving linguistic equity and bolstering digital sovereignty across the European Union.

TildeOpen LLM was publicly released on September 3, 2025, and is available free of charge via the Hugging Face platform. Architecturally, it is a 30-billion-parameter dense decoder-only transformer, released under a permissive CC-BY-4.0 license, offering broad language support that spans from Latvian and Lithuanian to Ukrainian, Turkish, and numerous other European languages.

The training of this extensive model was conducted on European supercomputers, specifically LUMI in Finland and JUPITER, utilizing approximately 2 million GPU hours awarded through the European Commission's Large AI Grand Challenge. This substantial computational power enabled Tilde to undertake an extended training schedule and experiment with sophisticated language sampling strategies aimed at ensuring balanced model performance across languages. The training process itself employed EleutherAI-inspired GPT-NeoX scripts, involving around 450,000 update steps and consuming approximately 2 trillion training tokens. The sampling methodology was a three-stage regimen, beginning with a uniform pass across all languages, followed by a natural distribution phase to enhance exposure for high-data-volume languages, and concluding with a final uniform sweep to rebalance the representation of rarer languages.
Key technical specifications of the model include 60 transformer layers, an embedding dimension of 6,144, and 48 attention heads, with a context window supporting 8,192 tokens. Architectural choices such as SwiGLU activation functions, RoPE positional encoding, and RMSNorm layer normalization were made to optimize for long-context handling and efficient multilingual inference.

A critical issue addressed by TildeOpen LLM is the prevalent bias in mainstream large language models, which are predominantly trained on English and other major languages. This English-centric approach often results in performance degradation when these models are applied to Baltic, Slavic, or other smaller European languages, manifesting as grammatical errors, awkward phrasing, and an increased propensity for hallucinations. TildeOpen LLM tackles this by incorporating an "equitable tokenizer." This specialized tokenizer is engineered to represent text in a consistent manner across different languages, thereby reducing token counts, particularly for morphologically rich languages, and significantly improving inference efficiency for under-represented languages.
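A common way to quantify the efficiency gains described above is tokenizer "fertility": the average number of tokens produced per whitespace-delimited word, where lower is more efficient. The sketch below computes this metric; the fixed-chunk tokenizer is only a stand-in, not a reproduction of TildeOpen's actual tokenizer:

```python
def fertility(tokenize, text):
    # Average tokens per whitespace-delimited word; lower means the
    # tokenizer represents this language more compactly.
    words = text.split()
    return len(tokenize(text)) / len(words)

def toy_subword_tokenizer(text, chunk=3):
    # Stand-in tokenizer: split each word into fixed-size character chunks.
    # A real subword tokenizer (e.g. BPE) learns its vocabulary from data.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + chunk] for i in range(0, len(word), chunk))
    return tokens

# A Latvian sample sentence; morphologically rich languages tend to have
# longer word forms, which drives fertility up under English-centric vocabularies.
sample = "daudzvalodu modeļi saprot latviešu valodu"
rate = fertility(toy_subword_tokenizer, sample)
```

Comparing this metric for the same tokenizer across languages shows why an equitable vocabulary matters: every extra token per word translates directly into slower, more expensive inference for that language.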