Jack Rueter

2025

A Mansi FST and spellchecker
Jack Rueter | Csilla Horváth | Trond Trosterud
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP

The article presents a finite state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich but mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of proper nouns, we were able to build a transducer covering 98.9% of a large (700k) newspaper corpus. Being a part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels, and for the 1000 most common words containing long vowels, the spellchecker was able to give a correct suggestion among the top five in 98.3% of the cases, and as the first suggestion in 91.3% of the cases.

2024

Leveraging Transformer-Based Models for Predicting Inflection Classes of Words in an Endangered Sami Language
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in endangered NLP.

On Erzya and Moksha Corpora and Analyzer Development, ERME-PSLA 1950s
Jack Rueter | Olga Erina | Nadezhda Kabaeva
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

This paper describes materials and annotation facilitation pertinent to the «Erzya-Moksha Electronic Resources and Linguistic Diversity» (EMERALD) project. It addresses work following the construction of finite-state analyzers for the Mordvin languages, the gathering of test corpora, and the development of metadata strategies for descriptive research. In this paper, we provide three descriptors for a set of new Erzya and Moksha research materials at the Language Bank of Finland. The descriptors illustrate (1) a low-annotation subcorpora set of the «Electronic Resources for Moksha and Erzya» (ERME); (2) the state of the open-source analyzers used in their automatic annotation, and (3) the development of metadata documentation for the «EMERALD» project, associated with this endeavor. Outcomes of the article include an introduction to new research materials, an illustration of the state of the Mordvin annotation pipeline, and perspectives for the further enhancement of the annotation pipeline.

The Low Saxon LSDC Dataset at Universal Dependencies
Janine Siewert | Jack Rueter
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present an extension of the Low Saxon Universal Dependencies dataset and discuss a few annotation-related challenges. Low Saxon is a West-Germanic low-resource language that lacks a common standard and therefore poses challenges for NLP. The 1,000 sentences in our dataset cover the last 200 years and 8 of the 9 major dialects. They are presented both in original and in normalised spelling, and two lemmata are provided: a Modern Low Saxon lemma and a Middle Low Saxon lemma. Several annotation-related issues result from dialectal variation in morphological categories, and we explain differences in the pronoun, gender, case, and mood system. Furthermore, we take up three syntactic constructions that do not occur in Standard Dutch or Standard German: the possessive dative, pro-drop in pronominal adverbs, and complementiser doubling in subordinate interrogative clauses. These constructions are also rare in the other Germanic UD datasets and have not always been annotated consistently.

Analyzing Pokémon and Mario Streamers’ Twitch Chat with LLM-based User Embeddings
Mika Hämäläinen | Jack Rueter | Khalid Alnajjar
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

We present a novel digital humanities method for representing our Twitch chatters as user embeddings created by a large language model (LLM). We cluster these embeddings automatically using affinity propagation and further narrow this clustering down through manual analysis. We analyze the chat of one stream by each Twitch streamer: SmallAnt, DougDoug and PointCrow. Our findings suggest that each streamer has their own type of chatters; however, two categories emerge for all of the streamers: supportive viewers, and emoji and reaction senders. Repetitive message spammers are a shared chatter category for two of the streamers.

Investigating Multilinguality in the Plenary Sessions of the Parliament of Finland with Automatic Language Identification
Tommi Jauhiainen | Jussi Piitulainen | Erik Axelson | Ute Dieckmann | Mietta Lennes | Jyrki Niemi | Jack Rueter | Krister Lindén
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

In this paper, we use automatic language identification to investigate the usage of different languages in the plenary sessions of the Parliament of Finland. Finland has two national languages, Finnish and Swedish. The plenary sessions are published as transcriptions of speeches in Parliament, reflecting the language the speaker used. In addition to charting out language use, we demonstrate how language identification can be used to audit the quality of the dataset. On the one hand, we made slight improvements to our language identifier; on the other hand, we made a list of improvement suggestions for the next version of the dataset.

2023

Modelling the Reduplicating Lushootseed Morphology with an FST and LSTM
Jack Rueter | Mika Hämäläinen | Khalid Alnajjar
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

In this paper, we present an FST-based approach for conducting morphological analysis, lemmatization and generation of Lushootseed words. Furthermore, we use the FST to generate training data for an LSTM-based neural model and train this model to do morphological analysis. The neural model reaches 71.9% accuracy on the test data. Furthermore, we discuss reduplication types in the Lushootseed language forms. The approach uses both attested instances of reduplication and bare stems to which a variety of reduplications are applied, as it is unclear just how much variation can be attributed to the individual speakers and authors of the source materials. That is, there may be areal factors that can be aligned with certain types of reduplication and their frequencies.

Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods
Mika Hämäläinen | Jack Rueter | Khalid Alnajjar | Niko Partanen
Proceedings of the Big Picture Workshop

We present our work towards building an infrastructure for documenting endangered languages, with a focus on Uralic languages in particular. Our infrastructure consists of tools to write dictionaries so that entries are structured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively towards enhancing these dictionaries and tools with state-of-the-art neural models, generating training data for them through rules and lexica.

Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Emily Öhman | Flammie Pirinen | Khalid Alnajjar | So Miyagawa | Yuri Bizzoni | Niko Partanen | Jack Rueter
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Bootstrapping Moksha-Erzya Neural Machine Translation from Rule-Based Apertium
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Neural Machine Translation (NMT) has made significant strides in breaking down language barriers around the globe. For lesser-resourced languages like Moksha and Erzya, however, the development of robust NMT systems remains a challenge due to the scarcity of parallel corpora. This paper presents a novel approach to address this challenge by leveraging the existing rule-based machine translation system Apertium as a tool for synthetic data generation. We fine-tune NLLB-200 for Moksha-Erzya translation and obtain a BLEU of 0.73 on the Apertium-generated data. On real-world data, we obtained an improvement of 0.058 BLEU over Apertium.

Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages with the only requirement being a dictionary between the endangered language and a majority language.

2022

Using Graph-Based Methods to Augment Online Dictionaries of Endangered Languages
Khalid Alnajjar | Mika Hämäläinen | Niko Tapio Partanen | Jack Rueter
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Many endangered Uralic languages have multilingual machine-readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs. For instance, the Livonian dictionary has some translations to Finnish, Latvian and Estonian, and the Komi-Zyrian dictionary has some translations to Finnish, English and Russian. We utilize graph-based approaches to augment such dictionaries by predicting new translations to existing and new languages based on different dictionaries for endangered languages and Wiktionaries. Our study focuses on the lexical resources for Komi-Zyrian (kpv), Erzya (myv) and Livonian (liv). Our approach is evaluated by human judges fluent in the three endangered languages in question. Based on the evaluation, the method predicted good or acceptable translations 77% of the time. Furthermore, we train a neural prediction model to predict the quality of the automatically predicted translations with 81% accuracy. The resulting extensions to the dictionaries are made available on the online dictionary platform used by the speakers of these languages.

Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

2021

Apurinã Universal Dependencies Treebank
Jack Rueter | Marília Fernanda Pereira de Freitas | Sidney Da Silva Facundes | Mika Hämäläinen | Niko Partanen
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences and applies 14 parts-of-speech, as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop a finite-state description of the language and facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon. The source materials used in the initial treebank represent fieldwork practices where not all tokens of all sentences are equally annotated. For this reason, establishing regular annotation practices for the entire Apurinã treebank is an ongoing project.

Finnish Dialect Identification: The Effect of Audio and Text
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Finnish is a language with multiple dialects that differ from each other not only in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and a transcript with audio recording in a dataset consisting of 23 different dialects. Our results show that the best accuracy is achieved by combining both of the modalities, as text alone reaches an overall accuracy of 57%, whereas text and audio reach 85%. Our code, models and data have been released openly on Github and Zenodo.

Overview of Open-Source Morphology Development for the Komi-Zyrian Language: Past and future
Jack Rueter | Niko Partanen | Mika Hämäläinen | Trond Trosterud
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

Linguistic change and historical periodization of Old Literary Finnish
Niko Partanen | Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

In this study, we have normalized and lemmatized an Old Literary Finnish corpus using a lemmatization model trained on texts from Agricola. We analyse the error types that occur in different decades, and use word error rate (WER) and different error types as a proxy for measuring linguistic innovation and change. We show that the proposed approach works, and that the errors are connected to accumulating changes and innovations, which also result in a continuous decrease in the accuracy of the model. The described error types also guide further work in improving these models, and document the currently observed issues. We have also trained word embeddings for four centuries of lemmatized Old Literary Finnish, which are available on Zenodo.

Proceedings of the Workshop on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

Processing M.A. Castrén’s Materials: Multilingual Historical Typed and Handwritten Manuscripts
Niko Partanen | Jack Rueter | Khalid Alnajjar | Mika Hämäläinen
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813–1852). The Finno-Ugrian Society is publishing Castrén’s manuscripts as new critical and digital editions, and at the same time different research groups have also paid attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created to further improve the usability of these materials, and also to aid the further processing of similar archived collections. We specifically focus on the parts of the collections that are processed in a way that improves their usability in more technical applications, complementing the earlier work on the cultural and linguistic aspects of these materials. Most of these datasets are openly available in Zenodo. The study points to specific areas where further research is needed, and provides benchmarks for text recognition tasks.

Never guess what I heard... Rumor Detection in Finnish News: a Dataset and a Baseline
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM-based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and the best rumor label accuracy of 96.0%. However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%. Our results suggest that the performance difference is due to a difference in the original training data. Furthermore, we find that a regular LSTM model works better than one trained with a pretrained word2vec model. These findings suggest that more work needs to be done on pretrained models for the Finnish language, as they have been trained on small and biased corpora.

Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered
Mika Hämäläinen | Niko Partanen | Jack Rueter | Khalid Alnajjar
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible to use them as fallback systems together with the FSTs. The source code, models and datasets have been released on Zenodo.

Numerals and what counts
Jack Rueter | Niko Partanen | Flammie A. Pirinen
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

Detecting Depression in Thai Blog Posts: a Dataset and a Baseline
Mika Hämäläinen | Pattama Patpong | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled from expert-verified cases of depression in several online blogs. We experiment with two different LSTM-based models and two different BERT-based models. We achieve 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a good baseline for future research on the same corpus. Furthermore, we identify a need for Thai embeddings that have been trained on a more varied corpus than Wikipedia. Our corpus, code and trained models have been released openly on Zenodo.

2020

Ve’rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter | Niko Partanen
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

We present an open-source online dictionary editing system, Ve′rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary oriented are lacking the technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.

On the questions in developing computational infrastructure for Komi-Permyak
Jack Rueter | Niko Partanen | Larisa Ponomareva
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

On Editing Dictionaries for Uralic Languages in an Online Environment
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

Open-Source Morphology for Endangered Mordvinic Languages
Jack Rueter | Mika Hämäläinen | Niko Partanen
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

This document describes the shared development of a finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon the morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.

FST Morphology for the Endangered Skolt Sami Language
Jack Rueter | Mika Hämäläinen
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and the work presented in this paper forms a part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.

2019

Revisiting NMT for Normalization of Early English Letters
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only less frequent deviant forms are left out without normalization. This paper discusses different methods for improving the normalization of these deviant forms by using different approaches. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model together with lemmatization improves results.

Finding Sami Cognates with a Character-Based NMT Approach
Mika Hämäläinen | Jack Rueter
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Morphosyntactic Disambiguation in an Endangered Language Setting
Jeff Ens | Mika Hämäläinen | Jack Rueter | Philippe Pasquier
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Endangered Uralic languages present a high variety of inflectional forms in their morphology. This results in a high number of homonyms in inflections, which introduces a lot of morphological ambiguity in sentences. Previous research has employed constraint grammars (CGs) to address this problem; however, CGs are often unable to fully disambiguate a sentence, and their development is labour intensive. We present an LSTM-based model for automatically ranking morphological readings of sentences based on their quality. This ranking can be used to evaluate the existing CG disambiguators or to directly morphologically disambiguate sentences. Our approach works on a morphological abstraction and it can be trained with a very small dataset.

Survey of Uralic Universal Dependencies development
Niko Partanen | Jack Rueter
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

Combining Concepts and Their Translations from Structured Dictionaries of Uralic Minority Languages
Mika Hämäläinen | Liisa Lotta Tarvainen | Jack Rueter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages
Tommi A. Pirinen | Michael Rießler | Jack Rueter | Trond Trosterud | Francis M. Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Development of an Open Source Natural Language Generation Tool for Finnish
Mika Hämäläinen | Jack Rueter
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Towards an open-source universal-dependency treebank for Erzya
Jack Rueter | Francis Tyers
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Normalizing Early English Letters to Present-day English Spelling
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated. All of the methods have their own strengths in word normalization. This calls for finding ways of combining the results from these methods to leverage their individual strengths.
