Inguna Skadiņa

Also published as: Inguna Skadin̨a, Inguna Skadina

2026

Teaching NLP in the AI Era: Experiences from the University of Latvia
Inguna Skadina | Guntis Barzdins | Uldis Bojārs | Normunds Gruzitis | Pēteris Paikens
Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026)

From being a niche technology with practical applications in translation and speech recognition, NLP is now underpinning the AI era through LLMs, promising a universal economic impact in the future. Although the transition to the AI era is hyped by BigTech companies, practical adoption of LLM capabilities for economically impactful tasks and processes depends on educating specialists capable of applying them properly. Human-in-the-loop workflows, accuracy measurement, fine-tuning, and on-premises processing of sensitive data have become essential skills for applying NLP. This short paper introduces two language technology modules developed and piloted at the Faculty of Science and Technology of the University of Latvia.

Language Technology Initiative: Framework for Teaching NLP and Computational Linguistics at the Universities in Latvia
Inguna Skadina | Jana Kuzmina | Marina Platonova | Tatjana Smirnova | Sergei Kruk
Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026)

This short paper provides an overview of language-technology-related modules and courses developed at three leading universities of Latvia: the University of Latvia (UL), Riga Technical University (RTU) and Riga Stradiņš University (RSU).

2025

Anonymise: A Tool for Multilingual Document Pseudonymisation
Rinalds Vīksna | Inguna Skadina
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

According to EU legislation, documents containing personal information need to be anonymised before public sharing. However, manual anonymisation is a time-consuming and costly process. Thus, there is a need for a robust text de-identification technique that accurately identifies and replaces personally identifiable information. This paper introduces the Anonymise tool, a system for document de-identification. The tool accepts text documents of various types (e.g., MS Word, plain-text), de-identifies personal information, and saves the de-identified document in its original format. The tool employs a modular architecture, integrating list-based matching, regular expressions and deep-learning-based named entity recognition to detect spans for redaction. Our evaluation results demonstrate high recall rates, making Anonymise a reliable solution for ensuring no sensitive information is left exposed. The tool can be accessed through a user-friendly web-based interface or API, offering flexibility for both individual and large-scale document processing needs. By automating document de-identification with high accuracy and efficiency, Anonymise presents a reliable solution for ensuring compliance with EU privacy regulations while reducing the time and cost associated with manual anonymisation.
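The modular span detection described in this abstract can be illustrated with a minimal sketch. It is not the Anonymise implementation: the name list, the regular expressions, the placeholder format and the longest-span overlap rule are all assumptions made for illustration, and the deep-learning NER component is left out (it would simply contribute further spans to the same list).

```python
import re

# Hypothetical resources for illustration only; not taken from the Anonymise tool.
NAME_LIST = ["Anna Ozola", "Jānis Bērziņš"]
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d ]{6,}\d"),
}


def detect_spans(text):
    """Collect (start, end, label) spans from list matching and regular expressions."""
    spans = []
    for name in NAME_LIST:
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), "PERSON"))
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def resolve_overlaps(spans):
    """Keep the longest span when detectors overlap (an assumed policy); sort by start."""
    chosen = []
    # Sort by negative length so that longer spans are considered first.
    for span in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if all(span[1] <= kept[0] or span[0] >= kept[1] for kept in chosen):
            chosen.append(span)
    return sorted(chosen)


def pseudonymise(text):
    """Replace each detected span with a bracketed category placeholder."""
    pieces, cursor = [], 0
    for start, end, label in resolve_overlaps(detect_spans(text)):
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Keeping each detector independent and merging their spans afterwards is one simple way to favour recall: any detector can flag a span, and overlaps are resolved only at the end.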

First Steps in Benchmarking Latvian in Large Language Models
Inguna Skadina | Bruno Bakanovs | Roberts Darģis
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

The performance of multilingual large language models (LLMs) in low-resource languages, such as Latvian, has been under-explored. In this paper, we investigate the capabilities of several open and commercial LLMs in Latvian language understanding tasks. We evaluate these models across several well-known benchmarks, such as the Choice of Plausible Alternatives (COPA) and Measuring Massive Multitask Language Understanding (MMLU), which were adapted into Latvian using machine translation. Our results highlight significant variability in model performance, emphasizing the challenges of extending LLMs to low-resource languages. We also analyze the effect of post-editing on machine-translated datasets, observing notable improvements in model accuracy, particularly with BERT-based architectures. We also assess open-source LLMs using the Belebele dataset, showcasing competitive performance from open-weight models when compared to proprietary systems. This study reveals key insights into the limitations of current LLMs in low-resource settings and provides datasets for future benchmarking efforts.

2024

Evaluating Open-Source LLMs in Low-Resource Languages: Insights from Latvian High School Exams
Roberts Darģis | Guntis Bārzdiņš | Inguna Skadiņa | Normunds Grūzītis | Baiba Saulīte
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

The latest large language models (LLMs) have significantly advanced natural language processing (NLP) capabilities across various tasks. However, their performance in low-resource languages, such as Latvian with 1.5 million native speakers, remains substantially underexplored due to both limited training data and the absence of comprehensive evaluation benchmarks. This study addresses this gap by conducting a systematic assessment of prominent open-source LLMs on natural language understanding (NLU) and natural language generation (NLG) tasks in Latvian. We utilize standardized high school centralized graduation exams as a benchmark dataset, offering relatable and diverse evaluation scenarios that encompass multiple-choice questions and complex text analysis tasks. Our experimental setup involves testing models from the leading LLM families, including Llama, Qwen, Gemma, and Mistral, with OpenAI’s GPT-4 serving as a performance reference. The results reveal that certain open-source models demonstrate competitive performance in NLU tasks, narrowing the gap with GPT-4. However, all models exhibit notable deficiencies in NLG tasks, specifically in generating coherent and contextually appropriate text analyses, highlighting persistent challenges in NLG for low-resource languages. These findings contribute to efforts to develop robust multilingual benchmarks and improve LLM performance in diverse linguistic contexts.

MultiLeg: Dataset for Text Sanitisation in Less-resourced Languages
Rinalds Vīksna | Inguna Skadiņa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Text sanitization is the task of detecting and removing personal information from the text. While it has been well-studied in monolingual settings, today, there is also a need for multilingual text sanitization. In this paper, we introduce MultiLeg: a parallel, multilingual named entity (NE) dataset consisting of documents from the Court of Justice of the European Union annotated with semantic categories suitable for text sanitization. The dataset is available in 8 languages, and it contains 3082 parallel text segments for each language. We also show that the pseudonymized dataset remains useful for downstream tasks.

2023

Large Language Models for Multilingual Slavic Named Entity Linking
Rinalds Vīksna | Inguna Skadiņa | Daiga Deksne | Roberts Rozis
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)

This paper describes our submission for the 4th Shared Task on SlavNER on three Slavic languages: Czech, Polish and Russian. We use the pre-trained multilingual XLM-R language model (Conneau et al., 2020) and fine-tune it for the three Slavic languages using datasets provided by the organizers. Our multilingual NER model achieves an F-score of 0.896 on all corpora, with the best result for Czech (0.914) and the worst for Russian (0.880). Our cross-language entity linking module achieves an F-score of 0.669 in the official SlavNER 2023 evaluation.

2022

Assessing Multilinguality of Publicly Accessible Websites
Rinalds Vīksna | Inguna Skadiņa | Raivis Skadiņš | Andrejs Vasiļjevs | Roberts Rozis
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Although information on the Internet can be shared in many languages, the language presence on the World Wide Web is very disproportionate. The problem of multilingualism on the Web, in particular access, availability and quality of information in the world’s languages, has been the subject of UNESCO focus for several decades. Making European websites more multilingual is also one of the focal targets of the Connecting Europe Facility Automated Translation (CEF AT) digital service infrastructure. In order to monitor this goal, alongside other possible solutions, CEF AT needs a methodology and an easy-to-use tool to assess the degree of multilingualism of a given website. In this paper we investigate methods and tools that automatically analyse the language diversity of the Web and propose indicators and a methodology for measuring the multilingualism of European websites. We also introduce a prototype tool based on open-source software that helps to assess multilingualism of the Web and can be independently run at set intervals. We also present initial results obtained with our tool that allow us to conclude that multilingualism on the Web is still a problem not only at the world level, but also at the European and regional level.

LNCC is a diverse collection of Latvian language corpora representing both written and spoken language and is useful for both linguistic research and language modelling. The collection is intended to cover diverse Latvian language use cases and all the important text types and genres (e.g. news, social media, blogs, books, scientific texts, debates, essays, etc.), taking into account both quality and size aspects. To reach this objective, LNCC is a continuous multi-institutional and multi-project effort, supported by the Digital Humanities and Language Technology communities in Latvia. LNCC includes a broad range of Latvian texts from the Latvian National Library, Culture Information Systems Centre, Latvian National News Agency, Latvian Parliament, Latvian web crawl, various Latvian publishers, and from the Latvian language corpora created by the Institute of Mathematics and Computer Science and its partners, including spoken language corpora. All corpora of LNCC are re-annotated with a uniform morpho-syntactic annotation scheme, which enables federated search and consistent linguistic analysis across all the LNCC corpora and makes it easier to select and mix various corpora for pre-training large Latvian language models like BERT and GPT.

2021

Domain Expert Platform for Goal-Oriented Dialog Collection
Didzis Goško | Arturs Znotins | Inguna Skadina | Normunds Gruzitis | Gunta Nešpore-Bērzkalne
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Today, most dialogue systems are fully or partly built using neural network architectures. A crucial prerequisite for the creation of a goal-oriented neural network dialogue system is a dataset that represents typical dialogue scenarios and includes various semantic annotations, e.g. intents, slots and dialogue actions, that are necessary for training a particular neural network architecture. In this demonstration paper, we present an easy-to-use interface and its back-end, oriented to domain experts, for the collection of goal-oriented dialogue samples. The platform not only allows sample dialogues to be collected or written in a structured way, but also provides a means for simple annotation and interpretation of the dialogues. The platform itself is language-independent; it depends only on the availability of particular language processing components for a specific language. It is currently being used to collect dialogue samples in Latvian (a highly inflected language) which represent typical communication between students and the student service.

Multilingual Slavic Named Entity Recognition
Rinalds Vīksna | Inguna Skadina
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

Named entity recognition, in particular for morphologically rich languages, is a challenging task due to the richness of inflected forms and ambiguity. This challenge is being addressed by the SlavNER Shared Task. In this paper we describe the system submitted to this task. Our system uses a pre-trained multilingual BERT language model and is fine-tuned for the six Slavic languages of this task on texts distributed by the organizers. In our experiments this multilingual NER model achieved an F1 score of 96 on in-domain data and an F1 score of 83 on out-of-domain data. The entity coreference module achieved an F1 score of 47.6 as evaluated by the bsnlp2021 organizers.

2020

This paper presents the key results of a study on the global competitiveness of the European Language Technology market for three areas – Machine Translation, speech technology, and cross-lingual search. EU competitiveness is analyzed in comparison to North America and Asia. The study focuses on seven dimensions (research, innovations, investments, market dominance, industry, infrastructure, and Open Data) that have been selected to characterize the language technology market. The study concludes that while Europe still has strong positions in Research and Innovation, it lags behind North America and Asia in scaling innovations and conquering market share.

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

2019

2017

NMT or SMT: Case Study of a Narrow-domain English-Latvian Post-editing Project
Inguna Skadiņa | Mārcis Pinnis
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The recent technological shift in machine translation from statistical machine translation (SMT) to neural machine translation (NMT) raises the question of the strengths and weaknesses of NMT. In this paper, we present an analysis of NMT and SMT systems’ outputs from narrow domain English-Latvian MT systems that were trained on a rather small amount of data. We analyze post-edits produced by professional translators and manually annotated errors in these outputs. Analysis of post-edits allowed us to conclude that both approaches are comparably successful, allowing for an increase in translators’ productivity, with the NMT system showing slightly worse results. Through the analysis of annotated errors, we found that NMT translations are more fluent than SMT translations. However, errors related to accuracy, especially, mistranslation and omission errors, occur more often in NMT outputs. The word form errors, that characterize the morphological richness of Latvian, are frequent for both systems, but slightly fewer in NMT outputs.

2016

What Can We Really Learn from Post-editing?
Marcis Pinnis | Rihards Kalnins | Raivis Skadins | Inguna Skadina
Conferences of the Association for Machine Translation in the Americas: MT Users' Track

Syntax-based Multi-system Machine Translation
Matīss Rikters | Inguna Skadiņa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes a hybrid machine translation system that exploits a parser to acquire syntactic chunks of a source sentence, translates the chunks with multiple online machine translation (MT) system application program interfaces (APIs) and creates output by combining translated chunks to obtain the best possible translation. The selection of the best translation hypothesis is performed by calculating the perplexity for each translated chunk. The goal of this approach is to enhance the baseline multi-system hybrid translation (MHyT) system, which uses only a language model to select the best translation from translations obtained with different APIs, and to improve overall English-Latvian machine translation quality over each of the individual MT APIs. The presented syntax-based multi-system translation (SyMHyT) system demonstrates an improvement in terms of BLEU and NIST scores compared to the baseline system. Improvements range from 1.74 to 2.54 BLEU points.
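The perplexity-based selection step mentioned in this abstract can be sketched as follows. This is only a toy illustration under stated assumptions: the add-one-smoothed unigram model stands in for whatever language model the paper uses, the candidate translations are invented, and the function names `train_unigram`, `perplexity` and `select_best` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(corpus_sentences):
    """Build an add-one-smoothed unigram model (a toy stand-in for a real language model)."""
    counts = Counter(tok for sentence in corpus_sentences for tok in sentence)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab_size)


def perplexity(prob, tokens):
    """Perplexity is the exponential of the average negative log-probability per token."""
    log_sum = sum(math.log(prob(tok)) for tok in tokens)
    return math.exp(-log_sum / len(tokens))


def select_best(prob, candidates_per_chunk):
    """For each chunk, keep the candidate translation with the lowest perplexity.

    Assumes every candidate is a non-empty, whitespace-tokenisable string.
    """
    return [
        min(candidates, key=lambda cand: perplexity(prob, cand.split()))
        for candidates in candidates_per_chunk
    ]


# Invented example data: a tiny Latvian "corpus" and two candidates per chunk.
prob = train_unigram([["es", "lasu", "grāmatu"], ["es", "lasu", "avīzi"]])
chunks = [["es lasu", "es lasīt"], ["grāmatu", "grāmata"]]
best = select_best(prob, chunks)  # the per-chunk winners are then concatenated
```

Scoring each chunk separately lets the final output mix chunks from different MT systems rather than committing to a single system for the whole sentence.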

2014

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress and innovation in our field.

Application of machine translation in localization into low-resourced languages
Raivis Skadiņš | Mārcis Pinnis | Andrejs Vasiļjevs | Inguna Skadiņa | Tomas Hudik
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

CLARA (Common Language Resources and Their Applications) is a Marie Curie Initial Training Network which ran from 2009 until 2014 with the aim of providing researcher training in crucial areas related to language resources and infrastructure. The scope of the project was broad and included infrastructure design, lexical semantic modeling, domain modeling, multimedia and multimodal communication, applications, and parsing technologies and grammar models. An international consortium of 9 partners and 12 associate partners employed researchers in 19 new positions and organized a training program consisting of 10 thematic courses and summer/winter schools. The project has resulted in new theoretical insights as well as new resources and tools. Most importantly, the project has trained a new generation of researchers who can perform advanced research and development in language resources and technologies.

2013

2012

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods for improving machine translation systems by using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquisition of comparable corpora from the Web and other sources, for evaluation of the comparability of collected corpora, for multi-level alignment of comparable corpora and for extraction of lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of collected corpora in domain-adapted machine translation and real-life applications.

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, main actors in the academy, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The three horizontal multilingual actions in META-NORD are overviewed in this paper: linking and validating Nordic and Baltic wordnets, the harmonisation of multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

2011

Evaluation of SMT in localization to under-resourced inflected language
Raivis Skadiņš | Maris Puriņš | Inguna Skadiņa | Andrejs Vasiļjevs
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

META-NORD: Towards Sharing of Language Resources in Nordic and Baltic Countries
Inguna Skadiņa | Andrejs Vasiļjevs | Lars Borin | Koenraad De Smedt | Krister Lindén | Eiríkur Rögnvaldsson
Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm

Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)
Bolette Sandford Pedersen | Gunta Nešpore | Inguna Skadiņa
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

Towards Improving English-Latvian Translation: A System Comparison and a New Rescoring Feature
Maxim Khalilov | José A. R. Fonollosa | Inguna Skadin̨a | Edgars Brālītis | Lauma Pretkalnin̨a
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Translation into languages with relatively free word order has received much less attention than translation into fixed word order languages (English) or into analytical languages (Chinese). At the same time, this translation task is among the most difficult challenges for machine translation (MT), and intuitively there seems to be room for improvement by reflecting the free word order structure of the target language. This paper presents a comparative study of two alternative approaches to statistical machine translation (SMT) and their application to the task of English-to-Latvian translation. Furthermore, a novel feature intended to reflect the relatively free word order scheme of the Latvian language is proposed and successfully applied at the n-best list rescoring step. Moving beyond the automatic scores of translation quality that are classically presented in MT research papers, we contribute a manual error analysis of the MT systems' output that helps to shed light on the advantages and disadvantages of the SMT systems under consideration.

Currently, research infrastructures are being designed and established in many disciplines, since they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools, the CLARIN initiative has been funded since 2008 to overcome many of the integration and interoperability hurdles. CLARIN can build on knowledge and work from many projects carried out in recent years and aims to build stable and robust services that can be used by researchers. Service centres that have the potential to be persistent and that adhere to the criteria established by CLARIN will play an important role here. In the last year of the so-called preparatory phase, these centres are developing four use cases that demonstrate how the various pillars CLARIN has been working on can be integrated. All four use cases fulfil the criterion of being cross-national.

2009

English-Latvian SMT: knowledge or data?
Inguna Skadiņa | Edgars Brālītis
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

English–Latvian Toponym Processing: Translation Strategies and Linguistic Patterns
Tatiana Gornostay | Inguna Skadiņa
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Pattern-based English-Latvian Toponym Translation
Tatiana Gornostay | Inguna Skadiņa
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

Dictionary of Multiword Expressions for Translation into highly Inflected Languages
Daiga Deksne | Raivis Skadiņš | Inguna Skadiņa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Treatment of Multiword Expressions (MWEs) is one of the most complicated issues in natural language processing, especially in Machine Translation (MT). The paper presents a dictionary of MWEs for an English-Latvian MT system, demonstrating how MWEs could be handled for inflected languages with rich morphology and rather free word order. The proposed dictionary of MWEs consists of two constituents: a lexicon of phrases and a set of MWE rules. The lexicon of phrases is rather similar to the translation lexicon of the MT system, while the MWE rules describe the syntactic structure of the source and target sentence, allowing correct transformation of different MWE types into the target language and ensuring correct syntactic structure. The paper demonstrates this approach on different MWE types, starting from simple syntactic structures, followed by more complicated cases and including fully idiomatic expressions. Automatic evaluation shows that the described approach increases the quality of translation by 0.6 BLEU points.
