International Workshop on Computational Linguistics for Uralic Languages (2024)


Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Flammie Pirinen | Melany Macias | Mario Crespo Avila

Aspect Based Sentiment Analysis of Finnish Neighborhoods: Insights from Suomi24
Laleh Davoodi | Anssi Öörni | Ville Harkke

This study presents an approach to Aspect-Based Sentiment Analysis (ABSA) using Natural Language Processing (NLP) techniques to explore public sentiment across 12 suburban neighborhoods in Finland. We employed and compared a range of machine learning models for sentiment classification, with the RoBERTa model emerging as the best performer. Using RoBERTa, we conducted a comprehensive sentiment analysis (SA) on a manually annotated dataset and a predicted dataset comprising 32,183 data points to investigate sentiment trends over time in these areas. The results provide insights into fluctuations in public sentiment, highlighting both the robustness of the RoBERTa model and significant shifts in sentiment for specific neighborhoods over time. This research contributes to a deeper understanding of neighborhood sentiment dynamics in Finland, with potential implications for social research and urban development.

Political Stance Detection in Estonian News Media
Lauri Lüüsi | Uku Kangur | Roshni Chakraborty | Rajesh Sharma

Newspapers have always remained an important medium for disseminating information to the masses. With continuous access to and availability of news, there is severe competition among news media agencies to attract user attention. Therefore, ensuring fairness in news reporting, such as politically stance-neutral reporting, has become more crucial than before. Although several research studies have explored and detected political stance in English news articles, there is a lack of research focusing on low-resource languages like Estonian. To address this gap, this paper examines the effectiveness of established stance-detection features that have been successful for English news media, while also proposing novel features tailored specifically for Estonian. Our study covers 32 different features comprising lexical, Estonian-specific, framing, and sentiment-related features, of which we identify 15 as useful for stance detection.

Universal-WER: Enhancing WER with Segmentation and Weighted Substitution for Varied Linguistic Contexts
Samy Ouzerrout

Word Error Rate (WER) is a crucial metric for evaluating the performance of automatic speech recognition (ASR) systems. However, its traditional calculation, based on Levenshtein distance, does not account for lexical similarity between words and treats each substitution in a binary manner, while also ignoring segmentation errors. This paper proposes an improvement to WER by introducing a weighted substitution method, based on lexical similarity measures, and incorporating splitting and merging operations to better handle segmentation errors. Unlike other WER variants, our approach is easily integrable and generalizable to various languages, providing a more nuanced and accurate evaluation of ASR transcriptions, particularly for morphologically complex or low-resource languages.
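The weighted-substitution idea can be sketched as a small change to the usual Levenshtein-based WER computation. In the sketch below, character-level similarity from Python's standard library stands in for the paper's lexical similarity measure, and the splitting and merging operations are omitted; both simplifications are assumptions for illustration, not the authors' exact formulation.

```python
from difflib import SequenceMatcher

def weighted_wer(reference, hypothesis, sub_cost=None):
    """Levenshtein-style WER where substituting similar words costs less.

    `sub_cost` maps a (ref_word, hyp_word) pair to a cost in [0, 1]; by
    default we use 1 minus character-level similarity as a stand-in for
    a lexical similarity measure (an assumption, not the paper's metric).
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 1.0 - SequenceMatcher(None, a, b).ratio()
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)          # deletions only
    for j in range(1, m + 1):
        d[0][j] = float(j)          # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                # substitution weighted by lexical (dis)similarity
                d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
            )
    return d[n][m] / max(n, 1)
```

Under this scheme, mistaking an inflected form for its stem (e.g. kissat for kissa) is penalized far less than replacing a word with an unrelated one, which is precisely the nuance the binary substitution cost of classical WER cannot express.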

DAG: Dictionary-Augmented Generation for Disambiguation of Sentences in Endangered Uralic Languages using ChatGPT
Mika Hämäläinen

We showcase that ChatGPT can be used to disambiguate lemmas in two endangered languages that ChatGPT is not proficient in, namely Erzya and Skolt Sami. We augment our prompt by providing dictionary translations of the candidate lemmas into a majority language (Finnish in our case). This dictionary-augmented generation approach results in 50% accuracy for Skolt Sami and 41% accuracy for Erzya. On closer inspection, many of the error types were of a kind even an untrained human annotator would make.
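A dictionary-augmented prompt of this kind can be assembled mechanically. The template below is a hypothetical illustration (the paper's exact prompt wording is not reproduced here), with the dictionary mapping each candidate lemma to its Finnish translations.

```python
def build_dag_prompt(sentence, candidate_lemmas, dictionary):
    """Build a disambiguation prompt augmented with dictionary glosses.

    `dictionary` maps each candidate lemma to a list of Finnish
    translations. The wording is illustrative only; the paper's actual
    prompt template may differ.
    """
    lines = [
        f"Sentence: {sentence}",
        "Which of the following candidate lemmas fits this sentence?",
    ]
    for lemma in candidate_lemmas:
        glosses = ", ".join(dictionary.get(lemma, ["(no translation)"]))
        # the glosses give the model a bridge into a language it knows well
        lines.append(f"- {lemma} (Finnish: {glosses})")
    return "\n".join(lines)
```

The point of the augmentation is that the model need not know the endangered language itself: it can match the sentence context against the majority-language glosses instead.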

Leveraging Transformer-Based Models for Predicting Inflection Classes of Words in an Endangered Sami Language
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter

This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in NLP for endangered languages.

Multilingual Approaches to Sentiment Analysis of Texts in Linguistically Diverse Languages: A Case Study of Finnish, Hungarian, and Bulgarian
Mikhail Krasitskii | Olga Kolesnikova | Liliana Chanona Hernandez | Grigori Sidorov | Alexander Gelbukh

This article is dedicated to the study of multilingual approaches to sentiment analysis of texts in Finnish, Hungarian, and Bulgarian. For Finnish and Hungarian, which are characterized by complex morphology and agglutinative grammar, an analysis was conducted using both traditional rule-based methods and modern machine learning techniques. In the study, BERT, XLM-R, and mBERT models were used for sentiment analysis, demonstrating high accuracy in sentiment classification. The inclusion of Bulgarian was motivated by the opportunity to compare results across languages with varying degrees of morphological complexity, which allowed for a better understanding of how these models can adapt to different linguistic structures. Datasets such as the Hungarian Emotion Corpus, FinnSentiment, and SentiFi were used to evaluate model performance. The results showed that transformer-based models, particularly BERT, XLM-R, and mBERT, significantly outperformed traditional methods, achieving high accuracy in sentiment classification tasks for all the languages studied.

Towards standardized inflected lexicons for the Finnic languages
Jules Bouton

We introduce three richly annotated lexicons of nouns for Livonian, standard Finnish and Livvi Karelian. Our datasets are distributed in the machine-readable Paralex standard, which consists of linked CSV tables described in a JSON metadata file. We built on the morphological dictionary of Livonian, the VepKar database and the Omorfi software to provide inflected forms. All noun forms were transcribed with grapheme-to-phoneme conversion rules and the paradigms annotated for both overabundance and defectivity. The resulting datasets are usable for quantitative studies of morphological systems and for qualitative investigations. They are linked to the original resources and can be easily updated.
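Because Paralex datasets are distributed as linked CSV tables described by a JSON metadata file, they can be loaded with nothing but the standard library. The metadata file name and table names in this sketch (package.json, a forms table) are assumptions for illustration; a given dataset's metadata lists its actual resource names and paths.

```python
import csv
import json
from pathlib import Path

def load_paralex_tables(package_path):
    """Load the linked CSV tables of a Paralex-style dataset from its
    JSON metadata file (a data-package descriptor listing resources).

    Returns a dict mapping each table name to a list of row dicts.
    """
    package_path = Path(package_path)
    package = json.loads(package_path.read_text(encoding="utf-8"))
    tables = {}
    for resource in package.get("resources", []):
        # each resource entry points at one CSV table relative to the package
        with open(package_path.parent / resource["path"], encoding="utf-8", newline="") as f:
            tables[resource["name"]] = list(csv.DictReader(f))
    return tables
```

Keeping the tables in this linked, machine-readable form is what makes the lexicons directly usable for quantitative studies: each inflected form, its phonemic transcription, and its paradigm annotations arrive as plain rows.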

On Erzya and Moksha Corpora and Analyzer Development, ERME-PSLA 1950s
Jack Rueter | Olga Erina | Nadezhda Kabaeva

This paper describes materials and annotation facilitation pertinent to the «Erzya-Moksha Electronic Resources and Linguistic Diversity» (EMERALD) project. It addresses work following the construction of finite-state analyzers for the Mordvin languages, the gathering of test corpora, and the development of metadata strategies for descriptive research. In this paper, we provide three descriptors for a set of new Erzya and Moksha research materials at the Language Bank of Finland. The descriptors illustrate (1) a set of low-annotation subcorpora from the «Electronic Resources for Moksha and Erzya» (ERME); (2) the state of the open-source analyzers used in their automatic annotation; and (3) the development of metadata documentation for the «EMERALD» project associated with this endeavor. Outcomes of the article include an introduction to new research materials, an illustration of the state of the Mordvin annotation pipeline, and perspectives for the further enhancement of the annotation pipeline.

Towards the speech recognition for Livonian
Valts Ernštreits

This article outlines the path toward the development of speech synthesis and speech recognition technologies for Livonian, a critically endangered Uralic language with around 20 contemporary fluent speakers. It presents the rationale behind the creation of these technologies and introduces the hypotheses and planned approaches to achieve this goal. The article discusses the four-stage approach of leveraging existing data and multiplying voice data through speech synthesis and voice cloning to generate the necessary data for building and training speech recognition for Livonian.

Using Large Language Models to Transliterate Endangered Uralic Languages
Niko Partanen

This study investigates whether large language models (LLMs) are able to transliterate and normalize endangered Uralic languages, specifically when they have been written in early 20th-century Latin-script-based transcription systems. We test commercially available closed-source systems for which there is no reason to expect that the models would be particularly adjusted to this task or these languages. The output of the transliteration in all experiments is contemporary Cyrillic orthography. We conclude that some of the newer LLMs, especially Claude 3.5 Sonnet, are able to produce high-quality transliterations even for the smaller languages in our test set, both in zero-shot scenarios and with a prompt that contains an example of the desired output. We assume that these good results are connected to the large presence of materials in these languages online, which the LLMs have learned to represent.

Specialized Monolingual BPE Tokenizers for Uralic Languages Representation in Large Language Models
Iaroslav Chelombitko | Aleksey Komissarov

Large language models show significant inequality in language representation, particularly for Uralic languages. Our analysis found that existing tokenizers allocate minimal tokens to Uralic languages, highlighting this imbalance. To address this, we developed a pipeline to create clean monolingual datasets from Wikipedia articles for four Uralic languages. We trained Byte Pair Encoding (BPE) tokenizers with a vocabulary size of 256,000 tokens, though Northern Sami had only 93,187 due to limited data. Our findings revealed most tokens are unique to each language, with 8,102 shared across all four, and 25,876 shared among Estonian, Finnish, and Hungarian. Using the Compression Ratio metric, our tokenizers outperformed popular ones like LLaMA-2 and Gemma 2, reducing Finnish’s compression ratio from 3.41 to 1.18. These results demonstrate the importance of specialized tokenizers for underrepresented languages, improving model performance and lowering costs. By sharing our tokenizers and datasets, we provide crucial resources for further research, emphasizing the need for equitable language representation.
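The compression-ratio metric can be illustrated with a toy segmenter. In the sketch below, greedy longest-match segmentation over a fixed subword vocabulary stands in for a trained BPE tokenizer, and the ratio is computed as the average number of subword tokens per whitespace-separated word; the paper's exact definition of the metric may differ.

```python
def greedy_segment(word, vocab):
    """Greedy longest-match segmentation: a simple stand-in for
    applying trained BPE merges from a subword vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # fall back to a single character when no vocab entry matches
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def compression_ratio(words, tokenize):
    """Average number of subword tokens per word: a lower value means
    the vocabulary covers the language more efficiently."""
    return sum(len(tokenize(w)) for w in words) / len(words)
```

A vocabulary rich in Finnish subwords segments talossa into two pieces rather than seven single characters, lowering the ratio; that is the effect a specialized monolingual tokenizer aims for compared to a multilingual one that allocates Finnish few tokens.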

Compressing Noun Phrases to Discover Mental Constructions in Corpora – A Case Study for Auxiliaries in Hungarian
Balázs Indig | Tímea Borbála Bajzát

The quantitative turn in functional linguistics has emphasised the importance of data-oriented methods in describing linguistic patterns. However, there are significant differences between constructions and the examples they cover, which need to be properly formalised. For example, noun chains introduce significant variation into the examples, making it difficult to identify underlying patterns. Compressing noun chains into their minimal form (e.g. as they appear in abstract constructions) is a promising method for revealing linguistic patterns in corpora through their examples. This method, combined with identifying the appropriate level of abstraction for the additional elements present, allows for the systematic extraction of good construction candidates. A pilot has been developed for Hungarian infinitive structures, but the method is adaptable to various linguistic structures and other agglutinative languages.

On Erzya and Moksha Corpora and Analyzer Development, ERME-PSLA 1950s
Aleksei Dorkin | Taido Purason | Kairit Sirts

Adapting multilingual language models to specific languages can enhance both their efficiency and performance. In this study, we explore how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance on the Named Entity Recognition (NER) task. The motivations for adjusting the vocabulary are twofold: practical benefits affecting the computational cost, such as reducing the input sequence length and the model size, and performance enhancements by tailoring the vocabulary to the particular language. We evaluate the effectiveness of two vocabulary adaptation approaches—retraining the tokenizer and pruning unused tokens—and assess their impact on the model’s performance, particularly after continual training. While retraining the tokenizer degraded the performance of the NER task, suggesting that longer embedding tuning might be needed, we observed no negative effects from pruning.
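Of the two approaches, pruning is the easier to sketch: embedding rows for tokens never observed in the target-language corpus are dropped and the remaining ids remapped. The function below is a minimal illustration of that idea (using plain lists as embedding rows), not the authors' implementation.

```python
def prune_embeddings(embeddings, id_to_token, used_ids):
    """Keep only the embedding rows for token ids seen in the target corpus.

    Returns the smaller embedding matrix, the new token->id vocabulary,
    and an old-id -> new-id mapping for re-encoding existing inputs.
    """
    kept = sorted(set(used_ids))
    old_to_new = {old: new for new, old in enumerate(kept)}
    # shrink the matrix: one row per surviving token, in the new id order
    new_embeddings = [embeddings[old] for old in kept]
    new_vocab = {id_to_token[old]: new for old, new in old_to_new.items()}
    return new_embeddings, new_vocab, old_to_new
```

Because the surviving rows keep their pretrained values unchanged, this shrinks the model without disturbing what it has already learned, which is consistent with the finding that pruning had no negative effects.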

On the Role of New Technologies in the Documentation and Revitalization of Uralic Languages of Russia in Historical and Contemporary Contexts
Alexander Nazarenko

The Uralic languages spoken in Russia face significant challenges due to historical and sociopolitical factors, resulting in their endangered status. While only Finnish, Estonian, and Hungarian enjoy solid support as official languages, most Uralic languages suffer from limited resources and declining speaker populations. This paper examines the development of written Uralic languages, the impact of the Russian language and its writing system on them, and the consequences that the lack of state interest in these languages has had for preservation efforts. Despite these challenges, technological advancements present valuable opportunities for revitalization. Existing projects, such as dictionaries and language corpora, highlight both the potential and the shortcomings of current linguistic resources. Innovative approaches, including AI-based applications and user-driven platforms, can enhance public engagement. By emphasizing the importance of high-quality linguistic data, this study advocates for a more proactive and collaborative effort in the preservation and promotion of Uralic languages.

Applying the transformer architecture on the task of headline selection for Finnish news texts
Maria Adamova | Maria Khokhlova

The paper evaluates the possibilities of using transformer architecture in creating headlines for news texts in Finnish. The authors statistically analyse the original and generated headlines according to three criteria: informativeness, relevance and impact. The study also substantiates for the first time the effectiveness of a fine-tuned text-to-text transfer transformer model within the task of generating headlines for news articles in Finnish. The results show that there is no statistically significant difference between the scores obtained by the original and generated headlines on the mentioned criteria of informativeness, relevance and impact.

Keeping Up Appearances—or how to get all Uralic languages included into bleeding edge research and software: generate, convert, and LLM your way into multilingual datasets
Flammie A Pirinen

The current trends in natural language processing strongly favor large language models and generative AIs as the basis for everything. For Uralic languages that are not largely present in publicly available data on the Internet, this can be problematic. In the current computational linguistic scene, it is very important to have representation of your language in popular datasets. Languages that are included in well-known datasets are also included in shared tasks, products by large technology corporations, and so forth. This inclusion is especially important for under-resourced, under-studied minority, and Indigenous languages, which will otherwise be easily forgotten. In this article, we present the resources that are often deemed necessary for the digital presence of a language in today's world of large language models. We show that there are methods and tricks available to alleviate the problems of a lack of data and a lack of creators and annotators of the data, some more successful than others.

Scaling Sustainable Development Goal Predictions across Languages: From English to Finnish
Melany Macias | Lev Kharlashkin | Leo Huovinen | Mika Hämäläinen

In this paper, we leverage an exclusively English dataset to train diverse multilingual classifiers and investigate their efficacy in adapting to Finnish data. We employ an English-only classification dataset of UN Sustainable Development Goals (SDGs) in an education context to train various multilingual classifiers and examine how well these models can adapt to recognizing the same classes within Finnish university course descriptions. It is worth noting that Finnish, with a mere 5 million native speakers, presents a significantly less-resourced linguistic context compared to English. The best-performing model in our experiments was mBART, with an F1-score of 0.843.

Kola Saami Christian Text Corpus
Michael Rießler

Christian texts have been known to be printed in Kola Saami languages since 1828; the most extensive publication is the Gospel of Matthew, different translations of which have been published three times since 1878, most recently in 2022. The Lord’s Prayer was translated in several more versions in Kildin Saami and Skolt Saami, first in 1828. All of these texts seem to go back to translations from Russian. Such characteristics make these publications just right for parallel text alignment. This paper describes ongoing work on building a Kola Saami Christian Text Corpus, including conceptual and technical decisions. Thus, it describes a resource rather than a study. However, computational studies based on these data will hopefully take place in the near future, after the Kildin Saami subset of this corpus is finished and published by the end of 2024. In addition to computation, this resource will also allow for comparative linguistic studies on diachronic and synchronic variation and change in the Kola Saami languages, which are among the most endangered and least described Uralic languages.