Toms Bergmanis

2025

Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States
Jurgita Kapočiūtė-Dzikienė | Toms Bergmanis | Mārcis Pinnis
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defense, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight large language models support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.

2022

We present the MTee project - a research initiative funded via an Estonian public procurement to develop machine translation technology that is open-source and free of charge. The MTee project delivered an open-source platform serving state-of-the-art machine translation systems supporting four domains for six language pairs translating from Estonian into English, German, and Russian and vice-versa. The platform also features grammatical error correction and speech translation for Estonian and allows for formatted document translation and automatic domain detection. The software, data and training workflows for machine translation engines are all made publicly available for further use and research.

pdf bib abs

Open Terminology Management and Sharing Toolkit for Federation of Terminology Databases
Andis Lagzdiņš | Uldis Siliņš | Toms Bergmanis | Mārcis Pinnis | Artūrs Vasiļevskis | Andrejs Vasiļjevs
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Consolidated access to current and reliable terms from different subject fields and languages is necessary for content creators and translators. Terminology is also needed in AI applications such as machine translation, speech recognition, information extraction, and other natural language processing tools. In this work, we facilitate standards-based sharing and management of terminology resources by providing an open terminology management solution - the EuroTermBank Toolkit. It allows organisations to manage and search their terms, create term collections, and share them within and outside the organisation by participating in the network of federated databases. The data curated in the federated databases are automatically shared with EuroTermBank, the largest multilingual terminology resource in Europe, allowing translators and language service providers as well as researchers and students to access terminology resources in their most current version.

2021

pdf bib abs

Dynamic Terminology Integration for COVID-19 and Other Emerging Domains
Toms Bergmanis | Mārcis Pinnis
Proceedings of the Sixth Conference on Machine Translation

The majority of language domains require prudent use of terminology to ensure clarity and adequacy of information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom available for less-resourced languages and niche domains. Furthermore, as exemplified by COVID-19 recently, no domain-specific parallel data is readily available for emerging domains. However, the gravity of this recent calamity created a high demand for reliable translation of critical information regarding pandemic and infection prevention. This work is part of WMT2021 Shared Task: Machine Translation using Terminologies, where we describe Tilde MT systems that are capable of dynamic terminology integration at the time of translation. Our systems achieve up to 94% COVID-19 term use accuracy on the test set of the EN-FR language pair without having access to any form of in-domain information during system training.

bib abs

From Research to Production: Fine-Grained Analysis of Terminology Integration
Toms Bergmanis | Mārcis Pinnis | Paula Reichenberg
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

Dynamic terminology integration in neural machine translation (NMT) is a sought-after feature of computer-aided translation tools among language service providers and small to medium businesses. Despite the recent surge in research on terminology integration in NMT, it still is seldom or inadequately supported in commercial machine translation solutions. In this presentation, we will share our experience of developing and deploying terminology integration capabilities for NMT systems in production. We will look at the three core tasks of terminology integration: terminology management, terminology identification, and translation with terminology. This talk will be insightful for NMT system developers, translators, terminologists, and anyone interested in translation projects.

pdf bib abs

Facilitating Terminology Translation with Target Lemma Annotations
Toms Bergmanis | Mārcis Pinnis
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Most of the recent work on terminology integration in machine translation has assumed that terminology translations are given already inflected in forms that are suitable for the target language sentence. In day-to-day work of professional translators, however, it is seldom the case as translators work with bilingual glossaries where terms are given in their dictionary forms; finding the right target language form is part of the translation process. We argue that the requirement for apriori specified target language forms is unrealistic and impedes the practical applicability of previous work. In this work, we propose to train machine translation systems using a source-side data augmentation method that annotates randomly selected source language words with their target language lemmas. We show that systems trained on such augmented data are readily usable for terminology integration in real-life translation scenarios. Our experiments on terminology translation into the morphologically complex Baltic and Uralic languages show an improvement of up to 7 BLEU points over baseline systems with no means for terminology integration and an average improvement of 4 BLEU points over the previous work. Results of the human evaluation indicate a 47.7% absolute improvement over the previous work in term translation accuracy when translating into Latvian.

2020

pdf bib abs

Mitigating Gender Bias in Machine Translation with Target Gender Annotations
Artūrs Stafanovičs | Toms Bergmanis | Mārcis Pinnis
Proceedings of the Fifth Conference on Machine Translation

When translating “The secretary asked for details.” to a language with grammatical gender, it might be necessary to determine the gender of the subject “secretary”. If the sentence does not contain the necessary information, it is not always possible to disambiguate. In such cases, machine translation systems select the most common translation option, which often corresponds to the stereotypical translations, thus potentially exacerbating prejudice and marginalisation of certain groups and people. We argue that the information necessary for an adequate translation can not always be deduced from the sentence being translated or even might depend on external knowledge. Therefore, in this work, we propose to decouple the task of acquiring the necessary information from the task of learning to translate correctly when such information is available. To that end, we present a method for training machine translation systems to use word-level annotations containing information about subject’s gender. To prepare training data, we annotate regular source language words with grammatical gender information of the corresponding target language words. Using such data to train machine translation systems reduces their reliance on gender stereotypes when information about the subject’s gender is available. Our experiments on five language pairs show that this allows improving accuracy on the WinoMT test set by up to 25.8 percentage points.

pdf bib

A Tale of Eight Countries or the EU Council Presidency Translator in Retrospect
Mārcis Pinnis | Toms Bergmanis | Kristīne Metuzāle | Valters Šics | Artūrs Vasiļevskis | Andrejs Vasiļjevs
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

2019

pdf bib abs

Data Augmentation for Context-Sensitive Neural Lemmatization Using Inflection Tables and Raw Text
Toms Bergmanis | Sharon Goldwater
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. Using context can help, both for unseen and ambiguous words. Yet most context-sensitive approaches require full lemma-annotated sentences for training, which may be scarce or unavailable in low-resource languages. In addition (as shown here), in a low-resource setting, a lemmatizer can learn more from n labeled examples of distinct words (types) than from n (contiguous) labeled tokens, since the latter contain far fewer distinct types. To combine the efficiency of type-based learning with the benefits of context, we propose a way to train a context-sensitive lemmatizer with little or no labeled corpus data, using inflection tables from the UniMorph project and raw text examples from Wikipedia that provide sentence contexts for the unambiguous UniMorph examples. Despite these being unambiguous examples, the model successfully generalizes from them, leading to improved results (both overall, and especially on unseen words) in comparison to a baseline that does not use context.

2018

pdf bib abs

Context Sensitive Neural Lemmatization with Lematus
Toms Bergmanis | Sharon Goldwater
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The main motivation for developing contextsensitive lemmatizers is to improve performance on unseen and ambiguous words. Yet previous systems have not carefully evaluated whether the use of context actually helps in these cases. We introduce Lematus, a lemmatizer based on a standard encoder-decoder architecture, which incorporates character-level sentence context. We evaluate its lemmatization accuracy across 20 languages in both a full data setting and a lower-resource setting with 10k training examples in each language. In both settings, we show that including context significantly improves results against a context-free version of the model. Context helps more for ambiguous words than for unseen words, though the latter has a greater effect on overall performance differences between languages. We also compare to three previous context-sensitive lemmatization systems, which all use pre-extracted edit trees as well as hand-selected features and/or additional sources of information such as tagged training data. Without using any of these, our context-sensitive model outperforms the best competitor system (Lemming) in the fulldata setting, and performs on par in the lowerresource setting.

2017

pdf bib abs

From Segmentation to Analyses: a Probabilistic Model for Unsupervised Morphology Induction
Toms Bergmanis | Sharon Goldwater
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

A major motivation for unsupervised morphological analysis is to reduce the sparse data problem in under-resourced languages. Most previous work focus on segmenting surface forms into their constituent morphs (taking: tak +ing), but surface form segmentation does not solve the sparse data problem as the analyses of take and taking are not connected to each other. We present a system that adapts the MorphoChains system (Narasimhan et al., 2015) to provide morphological analyses that aim to abstract over spelling differences in functionally similar morphs. This results in analyses that are not compelled to use all the orthographic material of a word (stopping: stop +ing) or limited to only that material (acidified: acid +ify +ed). On average across six typologically varied languages our system has a similar or better F-score on EMMA (a measure of underlying morpheme accuracy) than three strong baselines; moreover, the total number of distinct morphemes identified by our system is on average 12.8% lower than for Morfessor (Virpioja et al., 2013), a state-of-the-art surface segmentation system.

pdf bib

Training Data Augmentation for Low-Resource Morphological Inflection
Toms Bergmanis | Katharina Kann | Hinrich Schütze | Sharon Goldwater
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

Toms Bergmanis

2025

2022

2021

2020

2019

2018

2017

Co-authors

Venues