Olli Kuparinen

2026

Effects of Speaker Bias in Dialect Identification and Automatic Transcription with Self-Supervised Speech Models
Olli Kuparinen
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects

A major issue in audio modeling is speaker bias, in which the models learn language-external traits, such as a speaker’s timbre or pitch, and use this information as a shortcut to a language task. This is especially problematic for dialectology, as it is typical in dialect corpora that only a few speakers represent a complete dialect area. In this paper, we explore the effects of speaker bias in two dialectal tasks: dialect identification and automatic dialectal transcription. We build two different data partitions of dialect interviews in Finnish and Norwegian: 1) a speaker-dependent partition in which all of the speakers appear in training, development, and test sets, and 2) a speaker-independent partition where each speaker appears in exactly one set. We further experiment with modifications of the training data by augmenting the original audio with pitch shifts and noise, as well as changing the original speakers’ voices with voice conversion models. We show that the dialect identification models are highly affected by speaker bias, whereas automatic dialectal transcription models are not. The audio modifications do not offer major performance gains for either of the languages or tasks.
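The speaker-independent partition described in the abstract can be illustrated with a minimal sketch. This is not the paper's actual procedure: the function name `speaker_independent_split`, the `(speaker_id, dialect, item)` tuple layout, and the split fractions are all assumptions made for illustration.

```python
import random


def speaker_independent_split(utterances, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign every speaker to exactly one of train/dev/test.

    `utterances` is a list of (speaker_id, dialect, item) tuples
    (a hypothetical layout, not taken from the paper).
    """
    # Shuffle the unique speakers deterministically.
    speakers = sorted({u[0] for u in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)

    # Reserve at least one speaker each for the test and dev sets.
    n_test = max(1, round(len(speakers) * test_frac))
    n_dev = max(1, round(len(speakers) * dev_frac))
    test_speakers = set(speakers[:n_test])
    dev_speakers = set(speakers[n_test:n_test + n_dev])

    # Route each utterance according to its speaker's set.
    splits = {"train": [], "dev": [], "test": []}
    for utt in utterances:
        if utt[0] in test_speakers:
            splits["test"].append(utt)
        elif utt[0] in dev_speakers:
            splits["dev"].append(utt)
        else:
            splits["train"].append(utt)
    return splits
```

Because a dialect area may be represented by only a few speakers, a purely random speaker split like this one can leave a dialect without training data; a real setup would likely need to stratify the speaker assignment by dialect.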

2025

Interactive maps for corpus-based dialectology
Yves Scherrer | Olli Kuparinen
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Traditional data collection methods in dialectology rely on structured surveys, whose results can be easily presented on printed or digital maps. But in recent years, corpora of transcribed dialect speech have become a valuable alternative data source for data-driven linguistic analysis. For example, topic models can be used to discover both general dialectal variation patterns and specific linguistic features that are most characteristic of certain dialects. Multilingual (or rather, multilectal) language modeling tasks can also be used to learn speaker-specific embeddings. In connection with this paper, we introduce a website that presents the results of two recent studies in the form of interactive maps, allowing visitors to explore the effects of various parameter settings. The website covers two tasks (topic models and speaker embeddings) and three language areas (Finland, Norway, and German-speaking Switzerland). It is available at https://www.corcodial.net/.

2024

Murre24: Dialect Identification of Finnish Internet Forum Messages
Olli Kuparinen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper presents Murre24, a collection of dialectal messages posted on the largest Finnish internet forum, Suomi24. The messages posted in Finnish on the forum between 2001 and 2020 are classified as representing either the standard language, one of the seven traditional dialects, a colloquial style, or Helsinki slang. We present a manually annotated dataset used to train dialect identification models as well as the automatic annotation of almost 94 million messages in total. We experiment with five different dialect identification methods and evaluate them on dialectally balanced and random test samples. The best performing method for differentiating standard Finnish from non-standard Finnish is a character n-gram based support vector machine (SVM), while fine-tuning a BERT-based model achieves the best scores in the final dialect identification task. According to the automatic classification, most of the messages written on the forum are in standard Finnish, and most of the non-standard messages are in a colloquial variety typically used by young speakers in Finland. We moreover show that the proportion of non-standard messages declines over time, but the proportion of the traditional dialects stays relatively steady.

2023

Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets
Olli Kuparinen
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologically annotated and manually normalized to a standard form. The dataset can be used as a test set for dialect identification and dialect-to-standard normalization, for instance. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus and three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.

Dialect Representation Learning with Neural Dialect-to-Standard Normalization
Olli Kuparinen | Yves Scherrer
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Language label tokens are often used in multilingual neural language modeling and sequence-to-sequence learning to enhance the performance of such models. An additional product of the technique is that the models learn representations of the language tokens, which in turn reflect the relationships between the languages. In this paper, we study the learned representations of dialects produced by neural dialect-to-standard normalization models. We use two large datasets of typologically different languages, namely Finnish and Norwegian, and evaluate the learned representations against traditional dialect divisions of both languages. We find that the inferred dialect embeddings correlate well with the traditional dialects. The methodology could be further used in noisier settings to find new insights into language variation.

Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Olli Kuparinen | Aleksandra Miletić | Yves Scherrer
Findings of the Association for Computational Linguistics: EMNLP 2023

Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization – i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety – as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained ByT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.
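The "sliding windows of three words" context mentioned in the abstract can be pictured with a small sketch. The exact windowing used in the paper (stride, padding, handling of short sentences) is not stated here, so the choices below are assumptions, and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(words, size=3):
    """Split a tokenized sentence into overlapping windows of `size` words.

    Assumptions (not from the paper): stride of one word, no padding,
    and sentences shorter than `size` are kept as a single window.
    """
    if not words:
        return []
    if len(words) <= size:
        return [list(words)]
    # One window per starting position that still fits `size` words.
    return [list(words[i:i + size]) for i in range(len(words) - size + 1)]
```

For example, `sliding_windows("a b c d".split())` yields the two windows `["a", "b", "c"]` and `["b", "c", "d"]`, each of which would then be presented to a character-level model as one training sequence.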

CorCoDial - Machine translation techniques for corpus-based computational dialectology
Yves Scherrer | Olli Kuparinen | Aleksandra Miletić
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper presents CorCoDial, a research project funded by the Academy of Finland aiming to leverage machine translation technology for corpus-based computational dialectology. In this paper, we briefly present intermediate results of our project-related research.

The Helsinki-NLP Submissions at NADI 2023 Shared Task: Walking the Baseline
Yves Scherrer | Aleksandra Miletić | Olli Kuparinen
Proceedings of ArabicNLP 2023

The Helsinki-NLP team participated in the NADI 2023 shared tasks on Arabic dialect translation with seven submissions. We used statistical (SMT) and neural machine translation (NMT) methods and explored character- and subword-based data preprocessing. Our submissions placed second in both tracks. In the open track, our best submission is a character-level SMT system with additional Modern Standard Arabic language models. In the closed track, our best BLEU scores were obtained with the leave-as-is baseline, a simple copy of the input, narrowly followed by SMT systems. In both tracks, fine-tuning existing multilingual models such as AraT5 or ByT5 did not yield superior performance compared to SMT.
