pdf
bib
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen
|
Michael Rießler
|
Eiaki V. Morooka
|
Lev Kharlashkin
pdf
bib
abs
From NLG Evaluation to Modern Student Assessment in the Era of ChatGPT: The Great Misalignment Problem and Pedagogical Multi-Factor Assessment (P-MFA)
Mika Hämäläinen
|
Kimmo Leiviskä
This paper explores the growing epistemic parallel between NLG evaluation and the grading of students at a Finnish university. We argue that both domains are experiencing a Great Misalignment Problem. As students increasingly use tools like ChatGPT to produce sophisticated outputs, traditional assessment methods that focus on final products rather than learning processes have lost their validity. To address this, we introduce the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework inspired by the logic of multi-factor authentication.
pdf
bib
abs
Benchmarking Finnish Lemmatizers across Historical and Contemporary Texts
Emily Öhman
|
Leo Huovinen
|
Mika Hämäläinen
Lemmatization is crucial in natural language processing (NLP) for languages like Finnish, where complex inflectional morphology significantly affects downstream tasks such as parsing, named entity recognition, and sentiment analysis. This study evaluates the accuracy and efficiency of several Finnish lemmatizers, utilizing the Project Gutenberg corpus, which includes diverse Finnish-language texts from different periods. Notably, this is the first study to employ Trankit for Finnish lemmatization, providing novel insights into its performance. Additionally, the integration of Murre preprocessing has been emphasized, demonstrating substantial improvements in lemmatization results. By comparing traditional and neural-network-based approaches, this paper aims to provide insights into tool selection for NLP practitioners working with Finnish based on dataset characteristics and processing constraints.
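The benchmark described above compares lemmatizers by how often their output matches a gold lemma. As a rough illustration (not the paper's actual evaluation script), the core metric can be sketched in a few lines of Python; the Finnish lemmas below are purely illustrative:

```python
def lemma_accuracy(gold, predicted):
    """Token-level lemmatization accuracy: the share of tokens whose
    predicted lemma exactly matches the gold-standard lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token-for-token")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Hypothetical aligned gold/predicted lemmas for a three-token sentence
gold = ["talo", "olla", "suuri"]
pred = ["talo", "olla", "suuria"]  # last token left unlemmatized
print(lemma_accuracy(gold, pred))  # 2 of 3 tokens match
```

Per-tool scores computed this way can then be broken down by text period or genre, which is how dataset characteristics would feed into tool selection.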
pdf
bib
abs
The world’s first South Sámi TTS – a revitalisation effort of an endangered language by reviving a legacy voice
Katri Hiovain-Asikainen
|
Thomas B. Kjærstad
|
Maja Lisa Kappfjell
|
Sjur N. Moshagen
South Sámi (ISO 639: SMA) is a severely endangered language spoken by the South Sámi people in Norway and Sweden. Estimates of the number of speakers vary from 500 to 600. Recent advances in speech technology and the general increase in popularity of spoken language and audio content have facilitated the development of modern speech technology tools also for minority languages, such as the Sámi languages. The current paper documents the development process of the world’s first South Sámi text-to-speech (TTS) system, using only digitized archive materials from 1989–1993 as the training material. To reach an end-user suitable quality of the TTS, we have used a neural, end-to-end approach with a rule-based text processing module. The aim of our project is to contribute to language revitalization by offering tools for language users to use spoken language in new contexts. Since the modern written standard of South Sámi was established as late as 1978, the rise of speech technology might encourage language use even for people who are not accustomed to the written standard.
pdf
bib
abs
Can advances in NLP lead to worse results for Uralic languages and how can we fight back? Experiences from the world of automatic spell-checking and correction for Finnish
Flammie A Pirinen
Spell-checking and correction is a ubiquitous application within text input in modern technology: in one way or another, if you type texts on a keyboard or a mobile phone, there is probably an underlying spelling corrector running. Spell-checkers have been around for decades, initially based on dictionaries and grammar rules, nowadays increasingly based on statistical data or large language models. In recent years, however, there has been a growing concern about the quality of these modern spell-checkers. In this article, we show that the spell-checkers for Finnish have gotten significantly worse in their modern implementations compared to their traditional knowledge-driven versions. We propose that this can have critical consequences for the quality of texts produced, as well as for literacy overall. We furthermore speculate whether it would be possible to get spell-checking and correction back on track for Uralic languages in modern systems.
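The traditional knowledge-driven approach the abstract contrasts with modern systems is, at its simplest, dictionary lookup plus edit-distance ranking of correction candidates. A minimal sketch (the word list is a toy example, not a real speller lexicon):

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance via dynamic programming."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(a)][len(b)]

def suggestions(word, lexicon, max_dist=1):
    """Dictionary-based correction: lexicon entries within max_dist edits,
    closest first."""
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for dist, w in scored if dist <= max_dist]

lexicon = ["talo", "talon", "talot", "kala"]  # toy Finnish word list
print(suggestions("talno", lexicon))
```

Real Finnish spellers of this kind (e.g. finite-state ones) additionally need full morphological generation, since no finite word list covers Finnish inflection; that is precisely where the knowledge-driven tradition invested its effort.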
pdf
bib
abs
A Hybrid Multilingual Approach to Sentiment Analysis for Uralic and Low-Resource Languages: Combining Extractive and Abstractive Techniques
Mikhail Krasitskii
|
Olga Kolesnikova
|
Grigori Sidorov
|
Alexander Gelbukh
This paper introduces a novel hybrid architecture for multilingual sentiment analysis specifically designed for morphologically complex Uralic languages. Our approach synergistically combines extractive and abstractive summarization with specialized morphological processing for agglutinative structures. The proposed model integrates dynamic thresholding mechanisms and culturally-aware attention layers, achieving statistically significant improvements of 12% accuracy for Uralic languages (p < 0.01) while outperforming state-of-the-art alternatives in summarization quality (ROUGE-1: 0.60 vs. 0.52). Key innovations include language-specific stemmers for Finno-Ugric languages and cross-Uralic transfer learning, yielding 15.7% improvement in recall while maintaining 98.2% precision. Comprehensive evaluations across multiple datasets demonstrate consistent superiority over contemporary baselines, with particular emphasis on addressing Uralic language processing challenges.
pdf
bib
abs
Language technology for the minority Finnic languages
Flammie A Pirinen
|
Trond Trosterud
|
Jack Rueter
This article gives an overview of the state of the art in language technology tools for Balto-Finnic minority languages, i.e., Balto-Finnic languages other than Estonian and Finnish. For simplicity, we will use the term Finnic in this article when referring to all members of this language branch except the Estonian and Finnish literary languages. All in all, there are nine standardised languages represented in existing language technology infrastructures with keyboards, grammatical language models, proofing tools, annotated corpora and (for one of the languages) extensive ICALL programs. This article presents these tools and resources, discusses the relation between language models and proofing tool quality, as well as the (potential) impact of these tools on the respective language communities. The article rounds off with a discussion on prospects for future development.
pdf
bib
abs
Kildin Saami-Russian-(English) Parallel Corpus Building
Evan Hansen
This paper presents two parallel corpora of written Kildin Saami and the process of their compilation. The first, a dictionary corpus, contains 101,889 Kildin Saami tokens of example phrases/sentences from three Russian-Kildin Saami dictionaries and the glossary of the nonfiction book Saami ornaments, accompanied by the examples’ respective headwords and translations into up to four other languages. Headwords where possible are paired with their underived base, making it a suitable resource for investigating questions surrounding morphological derivation in Kildin Saami. The second corpus comprises 23,884 Kildin Saami tokens and was compiled from Saami ornaments, a trilingual (Russian-Kildin Saami-English) book introducing various Saami handicrafts and their creators from across Russian Sápmi.
pdf
bib
abs
SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
Iaroslav Chelombitko
|
Ekaterina Chelombitko
|
Aleksey Komissarov
The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues, making it well suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k–256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the “elbow points” of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available.
pdf
bib
abs
Timur and the Mansi spellchecker
Csilla Horváth
The article presents the results of an experiment involving the use of the Mansi FST and spellchecker created by the GiellaLT infrastructure. The Mansi are one of the indigenous peoples of the Russian Federation. The Mansi language is an endangered Uralic language primarily spoken in western Siberia, along the Ob River and its tributaries. The present article discusses the efficiency of the Mansi FST and spellchecker when used for translating Mansi literature from the 1950s.
pdf
bib
abs
ORACLE: Time-Dependent Recursive Summary Graphs for Foresight on News Data Using LLMs
Lev Kharlashkin
|
Eiaki V. Morooka
|
Yehor Tereschenko
|
Mika Hämäläinen
ORACLE turns daily news into week-over-week, decision-ready insights for a Finnish university of applied sciences. The platform crawls and versions news, applies university-specific relevance filtering, embeds content, classifies items into PESTEL dimensions and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed or changed, then groups differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production and present a curriculum-intelligence use case with an evaluation plan.
pdf
bib
abs
Creating a multi-layer Treebank for Tundra Nenets
Nikolett Mus
|
Bruno Guillaume
|
Sylvain Kahane
|
Daniel Zeman
This paper presents the development of the Tundra Nenets Universal Dependencies (UD) Treebank, the first syntactically annotated resource for the Samoyedic branch of the Uralic family. The treebank integrates spoken-language data and adopts the morphologically enhanced Surface-Syntactic UD (mSUD) framework to capture inflectional morphology and morphology-based syntactic relations. It further incorporates Information Structure annotation. The methodological workflow includes data selection, transcription conventions, sentence and lexeme segmentation, annotation of spoken-language features, lemmatization, treatment of morpheme status, part-of-speech and morphological tagging, and syntactic annotation based on the functional and distributional properties of syntactic elements. We also outline the principles guiding multi-level annotation and justify the theoretical choices underlying the integration of prosodic, morphological, and syntactic information.
pdf
bib
abs
Benchmarking Large Language Models for Lemmatization and Translation of Finnic Runosongs
Lidia Pivovarova
|
Kati Kallio
|
Antti Kanner
|
Jakob Lindström
|
Eetu Mäkelä
|
Liina Saarlo
|
Kaarel Veskis
|
Mari Väina
We investigate the use of large language models (LLMs) for translation and annotation of Finnic runosongs—a highly variable multilingual poetic corpus with limited linguistic or NLP resources. We manually annotated a corpus of about 200 runosongs in a variety of languages, dialects and genres with lemmas and English translations. Using this manually annotated test set, we benchmark several large language models. We tested several prompt types and developed a collective prompt-writing methodology involving specialists from different backgrounds. Our results highlight both the potential and the limitations of current LLMs for cultural heritage NLP, and point towards strategies for prompt design, evaluation, and integration with linguistic expertise.
pdf
bib
abs
Fine-Tuning Whisper for Kildin Sami
Enzo Gamboni
For this study, Whisper, an automatic speech recognition system, was fine-tuned on Kildin Sami, an endangered and low-resource Uralic language, using an automatic speech recognition-tailored dataset of less than 30 minutes. Three different Whisper models were trained with this dataset—each one with a different base language (English, Finnish, or Russian)—to examine which model provided the best result. Results were measured using Word Error Rate; fine-tuning the Russian-base Whisper model resulted in the lowest Word Error Rate at 68.55%. While still high, this result is impressive for only a small amount of language-specific training data, and the training process yielded insights relevant for further work.
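Word Error Rate, the metric used above, is word-level Levenshtein edit distance divided by the length of the reference transcript. A minimal self-contained sketch (libraries such as jiwer implement the same idea; the example strings are illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed as word-level Levenshtein distance over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution + one deletion against a four-word reference → WER 0.5
print(word_error_rate("a b c d", "a x c"))
```

Note that WER can exceed 1.0 when the hypothesis inserts many spurious words, which is common for ASR models decoding a language they were never trained on.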
pdf
bib
abs
Digitization Work at the Finno-Ugrian Society: Livonian Case Study
Niko Partanen
|
Jack Rueter
|
Valts Ernštreits
This article discusses the recent digitization project of the Finno-Ugrian Society, using the work on Livonian publications, especially those from Seppo Suhonen’s Liivin kielen näytteitä from 1975, as a case study. We start by contextualizing and motivating these undertakings, both from the point of view of the Finno-Ugrian Society and the University of Latvia Livonian Institute, and then describe the workflows we have developed and foresee for the next steps.
pdf
bib
abs
Siberian Ingrian Finnish: FST and IGTs
Ivan Ubaleht
This paper presents the current version of the finite-state transducer for Siberian Ingrian Finnish. Our finite-state transducer uses two-level morphology. We use the LexC and TwolC languages together with HFST tools to develop lexicons and phonological rules, as well as to compile the transducer. The paper also provides a description of the morphological system of Siberian Ingrian Finnish. In addition, we present a collection of interlinear glossed texts in Siberian Ingrian Finnish, provided in a machine-readable format.
pdf
bib
abs
Case–Number Dissociation in Finnish Noun Embeddings: fastText vs. BERT Layer Effects
Alexandre Nikolaev
|
Yu-Ying Chuang
|
R. Harald Baayen
Motivated by how inflectional morphology is encoded in modern embeddings, we revisit the 55,271 inflected forms from the 2,000 most frequent Finnish nouns analyzed by Nikolaev et al. (2022) using fastText and ask a single question: where does inflectional morphology emerge in BERT? For each form, we extract minimal-context FinBERT vectors from every layer (1–12) by running each word in isolation and averaging its WordPiece vectors into a single representation. Using the same generating model as in Nikolaev et al. (2022), we impute latent vectors for the stem, NUMBER, CASE, POSSESSIVE, and CLITIC, plus a higher-order interaction, and evaluate by rank-1 nearest correlation. Within BERT, accuracy follows an emergence curve from 67.21% (layer 1) to 86.16% (layer 12). The error mix shifts with depth: middle layers show a lower share of CASE errors but a higher share of NUMBER errors, whereas the top layer reverses this tendency; clitic-only errors are rare throughout. For context, the fastText ceiling is slightly higher (≈89%), but our focus is the layer-resolved profile inside BERT. The result is a compact, reproducible map of Finnish noun inflection across the BERT stack, showing how different inflectional cues become recoverable at different depths (BERT layers) under an identical modeling and evaluation pipeline.
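Rank-1 nearest correlation, the evaluation criterion in the abstract above, counts a prediction as correct when the gold form's vector is the single most correlated candidate. A minimal sketch of that check, not the authors' code; the candidate forms and tiny three-dimensional vectors are invented for illustration:

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def rank1_correct(predicted_vec, candidates, gold_form):
    """Rank-1 check: the prediction is correct iff the gold form's vector
    correlates most strongly with the predicted vector."""
    best = max(candidates, key=lambda form: pearson(predicted_vec, candidates[form]))
    return best == gold_form

# Toy candidate embeddings for two inflected forms (hypothetical values)
candidates = {"talossa": [1.0, 0.1, 0.0], "taloissa": [0.0, 1.0, 0.2]}
print(rank1_correct([0.9, 0.2, 0.05], candidates, "talossa"))
```

In the full evaluation the candidate set is the whole inventory of inflected forms, so rank-1 accuracy is a strict criterion: a near-miss on a single inflectional feature (e.g. NUMBER) already counts as an error.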
pdf
bib
abs
Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures
Yehor Tereschenko
|
Mika Hämäläinen
|
Svitlana Myroniuk
The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI’s GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.
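The refusal-rate analysis above reduces to a simple proportion per model family, compared in percentage points. A minimal sketch of that bookkeeping; the per-prompt outcomes below are invented to mirror the reported 16-point gap, not the paper's data:

```python
def refusal_rate(attempted_flags):
    """Percentage of prompts the model refused to translate
    (True = attempted a translation, False = refused)."""
    refused = sum(1 for attempted in attempted_flags if not attempted)
    return 100 * refused / len(attempted_flags)

# Hypothetical outcomes over 100 prompts for each model family
reasoning = [True] * 84 + [False] * 16      # 16% refusal
non_reasoning = [True] * 68 + [False] * 32  # 32% refusal

gap = refusal_rate(non_reasoning) - refusal_rate(reasoning)
print(gap)  # 16.0 percentage points
```

Reporting the gap in percentage points rather than as a relative change matters here: a drop from 32% to 16% is 16 points but a 50% relative reduction, and the two framings read very differently.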