Mārcis Pinnis

Also published as: Marcis Pinnis

2025

Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States
Jurgita Kapočiūtė-Dzikienė | Toms Bergmanis | Mārcis Pinnis
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defense, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight large language models support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.

2024

Code-Mixed Text Augmentation for Latvian ASR
Martins Kronis | Askars Salimbajevs | Mārcis Pinnis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Code-mixing has become mainstream in the modern, globalised world and affects low-resource languages, such as Latvian, in particular. Solutions to developing an automatic speech recognition (ASR) system for code-mixed speech often rely on specially created audio-text corpora, which are expensive and time-consuming to create. In this work, we attempt to tackle code-mixed Latvian-English speech recognition by improving the language model (LM) of a hybrid ASR system. We make a distinction between inflected transliterations and phonetic transcriptions as two different foreign word types. We propose an inflected transliteration model and a phonetic transcription model for the automatic generation of said word types. We then leverage a large human-translated English-Latvian parallel text corpus to generate synthetic code-mixed Latvian sentences by substituting in generated foreign words. Using the newly created augmented corpora, we train a new LM and combine it with our existing Latvian acoustic model (AM). For evaluation, we create a specialised foreign word test set on which our methods yield up to 15% relative CER improvement. We then further validate these results in a human evaluation campaign.

2022

We present the MTee project - a research initiative funded via an Estonian public procurement to develop machine translation technology that is open-source and free of charge. The MTee project delivered an open-source platform serving state-of-the-art machine translation systems supporting four domains for six language pairs translating from Estonian into English, German, and Russian and vice-versa. The platform also features grammatical error correction and speech translation for Estonian and allows for formatted document translation and automatic domain detection. The software, data and training workflows for machine translation engines are all made publicly available for further use and research.

Open Terminology Management and Sharing Toolkit for Federation of Terminology Databases
Andis Lagzdiņš | Uldis Siliņš | Toms Bergmanis | Mārcis Pinnis | Artūrs Vasiļevskis | Andrejs Vasiļjevs
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Consolidated access to current and reliable terms from different subject fields and languages is necessary for content creators and translators. Terminology is also needed in AI applications such as machine translation, speech recognition, information extraction, and other natural language processing tools. In this work, we facilitate standards-based sharing and management of terminology resources by providing an open terminology management solution - the EuroTermBank Toolkit. It allows organisations to manage and search their terms, create term collections, and share them within and outside the organisation by participating in the network of federated databases. The data curated in the federated databases are automatically shared with EuroTermBank, the largest multilingual terminology resource in Europe, allowing translators and language service providers as well as researchers and students to access terminology resources in their most current version.

In this work, we present the work carried out in the MT4All CEF project and the resources that it has generated by leveraging recent research in the field of unsupervised learning. In the course of the project, 18 monolingual corpora for specific domains and languages have been collected, and 12 bilingual dictionaries and translation models have been generated. As part of the research, the unsupervised MT methodology based only on monolingual corpora (Artetxe et al., 2017) has been tested on a variety of languages and domains. Results show that in specialised domains, when there is enough monolingual in-domain data, unsupervised results are comparable to those of general domain supervised translation, and that, at any rate, unsupervised techniques can be used to boost results whenever very little data is available.

2021

Dynamic Terminology Integration for COVID-19 and Other Emerging Domains
Toms Bergmanis | Mārcis Pinnis
Proceedings of the Sixth Conference on Machine Translation

The majority of language domains require prudent use of terminology to ensure clarity and adequacy of information conveyed. While the correct use of terminology for some languages and domains can be achieved by adapting general-purpose MT systems on large volumes of in-domain parallel data, such quantities of domain-specific data are seldom available for less-resourced languages and niche domains. Furthermore, as exemplified by COVID-19 recently, no domain-specific parallel data is readily available for emerging domains. However, the gravity of this recent calamity created a high demand for reliable translation of critical information regarding pandemic and infection prevention. This work is part of WMT2021 Shared Task: Machine Translation using Terminologies, where we describe Tilde MT systems that are capable of dynamic terminology integration at the time of translation. Our systems achieve up to 94% COVID-19 term use accuracy on the test set of the EN-FR language pair without having access to any form of in-domain information during system training.

Facilitating Terminology Translation with Target Lemma Annotations
Toms Bergmanis | Mārcis Pinnis
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Most of the recent work on terminology integration in machine translation has assumed that terminology translations are given already inflected in forms that are suitable for the target language sentence. In the day-to-day work of professional translators, however, this is seldom the case, as translators work with bilingual glossaries where terms are given in their dictionary forms; finding the right target language form is part of the translation process. We argue that the requirement for a priori specified target language forms is unrealistic and impedes the practical applicability of previous work. In this work, we propose to train machine translation systems using a source-side data augmentation method that annotates randomly selected source language words with their target language lemmas. We show that systems trained on such augmented data are readily usable for terminology integration in real-life translation scenarios. Our experiments on terminology translation into the morphologically complex Baltic and Uralic languages show an improvement of up to 7 BLEU points over baseline systems with no means for terminology integration and an average improvement of 4 BLEU points over the previous work. Results of the human evaluation indicate a 47.7% absolute improvement over the previous work in term translation accuracy when translating into Latvian.

From Research to Production: Fine-Grained Analysis of Terminology Integration
Toms Bergmanis | Mārcis Pinnis | Paula Reichenberg
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

Dynamic terminology integration in neural machine translation (NMT) is a sought-after feature of computer-aided translation tools among language service providers and small to medium businesses. Despite the recent surge in research on terminology integration in NMT, it still is seldom or inadequately supported in commercial machine translation solutions. In this presentation, we will share our experience of developing and deploying terminology integration capabilities for NMT systems in production. We will look at the three core tasks of terminology integration: terminology management, terminology identification, and translation with terminology. This talk will be insightful for NMT system developers, translators, terminologists, and anyone interested in translation projects.

The Neural Translation for the European Union (NTEU) engine farm enables direct machine translation for all 24 official languages of the European Union without the necessity to use a high-resourced language as a pivot. This amounts to a total of 552 translation engines for all combinations of the 24 languages. We have collected parallel data for all the language combinations, publicly shared in elrc-share.eu. The translation engines have been customized to domain, for the use of the European public administrations. The delivered engines will be published in the European Language Grid. In addition to the usual automatic metrics, all the engines have been evaluated by humans based on the direct assessment methodology. For this purpose, we built an open-source platform called MTET. The evaluation shows that most of the engines reach high quality and get better scores compared to an external machine translation service in a blind evaluation setup.

2020

Tilde at WMT 2020: News Task Systems
Rihards Krišlauks | Mārcis Pinnis
Proceedings of the Fifth Conference on Machine Translation

This paper describes Tilde’s submission to the WMT2020 shared task on news translation for both directions of the English-Polish language pair in both the constrained and the unconstrained tracks. We follow our submissions from the previous years and build our baseline systems to be morphologically motivated sub-word unit-based Transformer base models that we train using the Marian machine translation toolkit. Additionally, we experiment with different parallel and monolingual data selection schemes, as well as sampled back-translation. Our final models are ensembles of Transformer base and Transformer big models which feature right-to-left re-ranking.

A Tale of Eight Countries or the EU Council Presidency Translator in Retrospect
Mārcis Pinnis | Toms Bergmanis | Kristīne Metuzāle | Valters Šics | Artūrs Vasiļevskis | Andrejs Vasiļjevs
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

The Neural Translation for the European Union (NTEU) project aims to build a neural engine farm with all European official language combinations for eTranslation, without the necessity to use a high-resourced language as a pivot. NTEU started in September 2019 and will run until August 2021.

Mitigating Gender Bias in Machine Translation with Target Gender Annotations
Artūrs Stafanovičs | Toms Bergmanis | Mārcis Pinnis
Proceedings of the Fifth Conference on Machine Translation

When translating “The secretary asked for details.” to a language with grammatical gender, it might be necessary to determine the gender of the subject “secretary”. If the sentence does not contain the necessary information, it is not always possible to disambiguate. In such cases, machine translation systems select the most common translation option, which often corresponds to the stereotypical translations, thus potentially exacerbating prejudice and marginalisation of certain groups and people. We argue that the information necessary for an adequate translation cannot always be deduced from the sentence being translated or even might depend on external knowledge. Therefore, in this work, we propose to decouple the task of acquiring the necessary information from the task of learning to translate correctly when such information is available. To that end, we present a method for training machine translation systems to use word-level annotations containing information about the subject’s gender. To prepare training data, we annotate regular source language words with grammatical gender information of the corresponding target language words. Using such data to train machine translation systems reduces their reliance on gender stereotypes when information about the subject’s gender is available. Our experiments on five language pairs show that this allows improving accuracy on the WinoMT test set by up to 25.8 percentage points.

Customized Neural Machine Translation Systems for the Swiss Legal Domain
Rubén Martínez-Domínguez | Matīss Rikters | Artūrs Vasiļevskis | Mārcis Pinnis | Paula Reichenberg
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

2019

We present a portfolio of natural legal language processing and document curation services currently under development in a collaborative European project. First, we give an overview of the project and the different use cases, while, in the main part of the article, we focus upon the 13 different processing services that are being deployed in different prototype applications using a flexible and scalable microservices architecture. Their orchestration is operationalised using a content and document curation workflow manager.

Tilde’s Machine Translation Systems for WMT 2019
Marcis Pinnis | Rihards Krišlauks | Matīss Rikters
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

The paper describes the development process of Tilde’s NMT systems for the WMT 2019 shared task on news translation. We trained systems for the English-Lithuanian and Lithuanian-English translation directions in constrained and unconstrained tracks. We build upon the best methods of the previous year’s competition and combine them with recent advancements in the field. We also present a new method to ensure source domain adherence in back-translated data. Our systems achieved a shared first place in human evaluation.

2018

Tilde MT Platform for Developing Client Specific MT Solutions
Mārcis Pinnis | Andrejs Vasiļjevs | Rihards Kalniņš | Roberts Rozis | Raivis Skadiņš | Valters Šics
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Training and Adapting Multilingual NMT for Less-resourced and Morphologically Rich Languages
Matīss Rikters | Mārcis Pinnis | Rihards Krišlauks
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Developing a Neural Machine Translation Service for the 2017-2018 European Union Presidency
Mārcis Pinnis | Rihards Kalnins
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)

Tilde’s Machine Translation Systems for WMT 2018
Mārcis Pinnis | Matīss Rikters | Rihards Krišlauks
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The paper describes the development process of Tilde’s NMT systems that were submitted for the WMT 2018 shared task on news translation. We describe the data filtering and pre-processing workflows, the NMT system training architectures, and automatic evaluation results. For the WMT 2018 shared task, we submitted seven systems (both constrained and unconstrained) for English-Estonian and Estonian-English translation directions. The submitted systems were trained using Transformer models.

Tilde’s Parallel Corpus Filtering Methods for WMT 2018
Mārcis Pinnis
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The paper describes parallel corpus filtering methods that reduce the noise of noisy “parallel” corpora from a level where the corpora are not usable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality; well below 10 BLEU points) up to a level where the trained systems show decent translation quality (over 20 BLEU points on a 10 million word dataset and up to 30 BLEU points on a 100 million word dataset). The paper also documents Tilde’s submissions to the WMT 2018 shared task on parallel corpus filtering.

2017

NMT or SMT: Case Study of a Narrow-domain English-Latvian Post-editing Project
Inguna Skadiņa | Mārcis Pinnis
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The recent technological shift in machine translation from statistical machine translation (SMT) to neural machine translation (NMT) raises the question of the strengths and weaknesses of NMT. In this paper, we present an analysis of NMT and SMT systems’ outputs from narrow domain English-Latvian MT systems that were trained on a rather small amount of data. We analyze post-edits produced by professional translators and manually annotated errors in these outputs. Analysis of post-edits allowed us to conclude that both approaches are comparably successful, allowing for an increase in translators’ productivity, with the NMT system showing slightly worse results. Through the analysis of annotated errors, we found that NMT translations are more fluent than SMT translations. However, errors related to accuracy, especially mistranslation and omission errors, occur more often in NMT outputs. Word form errors, which reflect the morphological richness of Latvian, are frequent for both systems, but slightly fewer in NMT outputs.

Tilde’s Machine Translation Systems for WMT 2017
Mārcis Pinnis | Rihards Krišlauks | Toms Miks | Daiga Deksne | Valters Šics
Proceedings of the Second Conference on Machine Translation

2016

What Can We Really Learn from Post-editing?
Marcis Pinnis | Rihards Kalnins | Raivis Skadins | Inguna Skadina
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

Designing a Speech Corpus for the Development and Evaluation of Dictation Systems in Latvian
Mārcis Pinnis | Askars Salimbajevs | Ilze Auziņa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper the authors present a speech corpus designed and created for the development and evaluation of dictation systems in Latvian. The corpus consists of over nine hours of orthographically annotated speech from 30 different speakers. The corpus features spoken commands that are common for dictation systems for text editors. The corpus is evaluated in an automatic speech recognition scenario. Evaluation results in an ASR dictation scenario show that the addition of the corpus to the acoustic model training data in combination with language model adaptation makes it possible to decrease the WER by up to 41.36% relative (or 16.83% absolute) compared to a baseline system without language model adaptation. The contribution of acoustic data augmentation is 12.57% relative (or 3.43% absolute).

2015

Dynamic Terminology Integration Methods in Statistical Machine Translation
Marcis Pinnis
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Dynamic Terminology Integration Methods in Statistical Machine Translation
Mārcis Pinnis
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

Bilingual dictionaries for all EU languages
Ahmet Aker | Monica Paramita | Mārcis Pinnis | Robert Gaizauskas
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Bilingual dictionaries can be automatically generated using the GIZA++ tool. However, these dictionaries contain a lot of noise, which negatively affects the quality of the output of tools relying on them. In this work we present three different methods for cleaning noise from automatically generated bilingual dictionaries: LLR, pivot, and transliteration based approaches. We have applied these approaches to the GIZA++ dictionaries – dictionaries covering official EU languages – in order to remove noise. Our evaluation showed that all methods help to reduce noise. However, the best performance is achieved using the transliteration based approach. We provide all bilingual dictionaries (the original GIZA++ dictionaries and the cleaned ones) free for download. We also provide the cleaning tools and scripts for free download.

Terminology localization guidelines for the national scenario
Juris Borzovs | Ilze Ilziņa | Iveta Keiša | Mārcis Pinnis | Andrejs Vasiļjevs
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a set of principles and practical guidelines for terminology work in the national scenario to ensure a harmonized approach to term localization. These linguistic principles and guidelines are elaborated by the Terminology Commission in Latvia in the domain of Information and Communication Technology (ICT). We also present a novel approach to the corpus-based selection and evaluation of the most frequently used terms. Analysis of the terms proves that, in general, localized terms in normative terminology work in Latvia are coined according to these guidelines. We further evaluate how terms included in the database of official terminology are adopted in general use, such as newspaper articles, blogs, forums, and websites. Our evaluation shows that in a non-normative context the official terminology faces strong competition from other variations of localized terms. Conclusions and recommendations from the lexical analysis of localized terms are provided. We hope that the presented guidelines and evaluation approach will be useful to terminology institutions, regulatory authorities, and researchers in different countries who are involved in national terminology work.

Machine translation for e-government – the Baltic case
Andrejs Vasiļjevs | Rihards Kalniņš | Mārcis Pinnis | Raivis Skadiņš
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Users Track

Designing the Latvian Speech Recognition Corpus
Mārcis Pinnis | Ilze Auziņa | Kārlis Goba
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper the authors present the first Latvian speech corpus designed specifically for speech recognition purposes. The paper outlines the decisions made in the corpus design process through analysis of related work on speech corpora creation for different languages. The authors also provide the guidelines that were used for the creation of the Latvian speech recognition corpus. The corpus creation guidelines are general enough to be re-used by other researchers working on speech recognition corpora for other languages. The corpus consists of two parts: an orthographically annotated corpus containing 100 hours of orthographically transcribed audio data and a phonetically annotated corpus containing 4 hours of phonetically transcribed audio data. Metadata files in XML format provide additional details about the speakers, noise levels, speech styles, etc. The speech recognition corpus is phonetically balanced and phonetically rich, and the paper also describes the methodology used to assess its phonetic balance.

Application of machine translation in localization into low-resourced languages
Raivis Skadiņš | Mārcis Pinnis | Andrejs Vasiļjevs | Inguna Skadiņa | Tomas Hudik
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

Real-world challenges in application of MT for localization: the Baltic case
Mārcis Pinnis | Raivis Skadiņš | Andrejs Vasiļjevs
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Users Track

2013

Context Independent Term Mapper for European Languages
Mārcis Pinnis
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

Application of Online Terminology Services in Statistical Machine Translation
Raivis Skadins | Marcis Pinnis | Tatiana Gornostay | Andrejs Vasiljevs
Proceedings of Machine Translation Summit XIV: Posters

2012

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods for improving machine translation systems using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquisition of comparable corpora from the Web and other sources, for evaluation of the comparability of collected corpora, for multi-level alignment of comparable corpora, and for extraction of lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of collected corpora in domain-adapted machine translation and real-life applications.

Latvian and Lithuanian Named Entity Recognition with TildeNER
Mārcis Pinnis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper the author presents TildeNER, an open-source, freely available named entity recognition toolkit and the first multi-class named entity recognition system for the Latvian and Lithuanian languages. The system is built upon a supervised conditional random field classifier and features heuristic and statistical refinement methods that improve supervised classification, thus boosting the overall system's performance. The toolkit provides means for named entity recognition model bootstrapping, named entity tagging of plaintext documents and pre-processed (morpho-syntactically tagged) tab-separated documents, and evaluation on test data. The paper presents the design of the system, describes the most important data formats, and briefly discusses possibilities for extension to other languages. It also provides an evaluation on human-annotated gold standard test corpora for Latvian and Lithuanian as well as a comparative performance analysis against a state-of-the-art English named entity recognition system using parallel and strongly comparable corpora. The author analyses the annotation process of the Latvian and Lithuanian named entity tagged corpora and the resulting annotated corpora.
