Mikko Kurimo

2025

Proceedings of AAAS Workshop 2025 – Automatic Assessment of Atypical Speech
Mikko Kurimo | Tamas Grosz
Proceedings of AAAS Workshop 2025 – Automatic Assessment of Atypical Speech

Leveraging Uncertainty for Finnish L2 Speech Scoring with LLMs
Ekaterina Voskoboinik | Nhan Phan | Tamás Grósz | Mikko Kurimo
Proceedings of AAAS Workshop 2025 – Automatic Assessment of Atypical Speech

Towards large-scale speech foundation models for a low-resource minority language
Yaroslav Getman | Tamás Grósz | Katri Hiovain-Asikainen | Tommi Lehtonen | Mikko Kurimo
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Modern ASR systems require massive amounts of training data. While ASR training data for most languages are scarce and expensive to transcribe, a practical solution is to collect huge amounts of raw untranscribed speech and pre-train the ASR model in a self-supervised manner. Unfortunately, for many low-resource minority languages, even untranscribed speech data are scarce. In this paper, we propose a solution for the Northern Sámi language with 22,400 hours of speech extracted from the Finnish radio and television archives. We evaluated the model performance with different decoding algorithms and examined the models’ internal behavior with interpretation-based techniques.

2024

LLMs’ morphological analyses of complex FST-generated Finnish words
Anssi Moisio | Mathias Creutz | Mikko Kurimo
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Rule-based language processing systems have been overshadowed by neural systems in terms of utility, but it remains unclear whether neural NLP systems, in practice, learn the grammar rules that humans use. This work aims to shed light on the issue by evaluating state-of-the-art LLMs in a task of morphological analysis of complex Finnish noun forms. We generate the forms using an FST tool, and they are unlikely to have occurred in the training sets of the LLMs, therefore requiring morphological generalisation capacity. We find that GPT-4-turbo has some difficulties in the task while GPT-3.5-turbo struggles and smaller models Llama2-70B and Poro-34B fail nearly completely.

Improved Spoken Emotion Recognition With Combined Segment-Based Processing And Triplet Loss
Dejan Porjazovski | Tamas Grosz | Mikko Kurimo
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

This paper reports on the experience of collecting a number of corpora of Nordic languages spoken by children. The aim of the data collection is to provide annotated data to develop and evaluate computer-assisted pronunciation assessment systems both for non-native children learning a Nordic language (L2) and for L1 children with speech sound disorder (SSD). The paper presents the challenges encountered in recording and annotating data for Finnish, Swedish and Norwegian, as well as the ethical considerations related to making this data publicly available. We hope that sharing this experience will encourage others to collect similar data for other languages. Of the different data collections, we were able to make the Norwegian corpus publicly available in the hope that it will serve as a reference in pronunciation assessment research.

2023

On using distribution-based compositionality assessment to evaluate compositional generalisation in machine translation
Anssi Moisio | Mathias Creutz | Mikko Kurimo
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

Compositional generalisation (CG), in NLP and in machine learning more generally, has been assessed mostly using artificial datasets. It is important to develop benchmarks to assess CG also in real-world natural language tasks in order to understand the abilities and limitations of systems deployed in the wild. To this end, our GenBench Collaborative Benchmarking Task submission utilises the distribution-based compositionality assessment (DBCA) framework to split the Europarl translation corpus into a training and a test set in such a way that the test set requires compositional generalisation capacity. Specifically, the training and test sets have divergent distributions of dependency relations, testing NMT systems’ capability of translating dependencies that they have not been trained on. This is a fully-automated procedure to create natural language compositionality benchmarks, making it simple and inexpensive to apply it further to other datasets and languages. The code and data for the experiments are available at https://github.com/aalto-speech/dbca.

Automated Assessment of Task Completion in Spontaneous Speech for Finnish and Finland Swedish Language Learners
Ekaterina Voskoboinik | Yaroslav Getman | Ragheb Al-Ghezi | Mikko Kurimo | Tamas Grosz
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

CaptainA - A mobile app for practising Finnish pronunciation
Nhan Phan | Tamás Grósz | Mikko Kurimo
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Learning a new language is often difficult, especially practising it independently. The main issue with self-study is the absence of accurate feedback from a teacher, which would enable students to learn unfamiliar languages. In recent years, with advances in Artificial Intelligence and Automatic Speech Recognition, it has become possible to build applications that can provide valuable feedback on the users’ pronunciation. In this paper, we introduce the CaptainA app explicitly developed to aid students in practising their Finnish pronunciation on handheld devices. Our app is a valuable resource for immigrants who are busy with school or work, and it helps them integrate faster into society. Furthermore, by providing this service for L2 speakers and collecting their data, we can continuously improve our system and provide better aid in the future.

Evaluating Morphological Generalisation in Machine Translation by Distribution-Based Compositionality Assessment
Anssi Moisio | Mathias Creutz | Mikko Kurimo
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Compositional generalisation refers to the ability to understand and generate a potentially infinite number of novel meanings using a finite group of known primitives and a set of rules to combine them. The degree to which artificial neural networks can learn this ability is an open question. Recently, some evaluation methods and benchmarks have been proposed to test compositional generalisation, but not many have focused on the morphological level of language. We propose an application of the previously developed distribution-based compositionality assessment method to assess morphological generalisation in NLP tasks, such as machine translation or paraphrase detection. We demonstrate the use of our method by comparing translation systems with different BPE vocabulary sizes. The evaluation method we propose suggests that small vocabularies help with morphological generalisation in NMT.

2022

When to Laugh and How Hard? A Multimodal Approach to Detecting Humor and Its Intensity
Khalid Alnajjar | Mika Hämäläinen | Jörg Tiedemann | Jorma Laaksonen | Mikko Kurimo
Proceedings of the 29th International Conference on Computational Linguistics

Prerecorded laughter accompanying dialog in comedy TV shows encourages the audience to laugh by clearly marking humorous moments in the show. We present an approach for automatically detecting humor in the Friends TV show using multimodal data. Our model is capable of recognizing whether an utterance is humorous or not and of assessing its intensity. We use the prerecorded laughter in the show as annotation, as it marks humor, and the length of the audience’s laughter tells us how funny a given joke is. We evaluate the model on episodes the model has not been exposed to during the training phase. Our results show that the model is capable of correctly detecting whether an utterance is humorous 78% of the time and how long the audience’s laughter reaction should last with a mean absolute error of 600 milliseconds.

Semiautomatic Speech Alignment for Under-Resourced Languages
Juho Leinonen | Niko Partanen | Sami Virpioja | Mikko Kurimo
Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference

Cross-language forced alignment is a solution for linguists who create speech corpora for very low-resource languages. However, the cross-language setting is an additional challenge that makes an already complex task, forced alignment, even more difficult. We study how linguists can impart domain expertise to the tasks to increase the performance of automatic forced aligners while keeping the time effort still lower than with manual forced alignment. First, we show that speech recognizers have a clear bias in starting the word later than a human annotator, which results in micro-pauses in the results that do not exist in manual alignments, and study which is the best way to automatically remove these silences. Second, we ask the linguists to simplify the task by splitting long interview audios into shorter lengths by providing some manually aligned segments and evaluating the results of this process. We also study how correlated source language performance is with target language performance, since often it is an easier task to find a better source model than to adapt to the target language.

Morfessor-enriched features and multilingual training for canonical morphological segmentation
Aku Rouhe | Stig-Arne Grönroos | Sami Virpioja | Mathias Creutz | Mikko Kurimo
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

In our submission to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation, we study whether an unsupervised morphological segmentation method, Morfessor, can help in a supervised setting. Previous research has shown the effectiveness of the approach in semi-supervised settings with small amounts of labeled data. The current tasks vary in data size: the amount of word-level annotated training data is much larger, but the amount of sentence-level annotated training data remains small. Our approach is to pre-segment the input data for a neural sequence-to-sequence model with the unsupervised method. As the unsupervised method can be trained with raw text data, we use Wikipedia to increase the amount of training data. In addition, we train multilingual models for the sentence-level task. The results for the Morfessor-enriched features are mixed, showing benefit for all three sentence-level tasks but only some of the word-level tasks. The multilingual training yields considerable improvements over the monolingual sentence-level models, but it negates the effect of the enriched features.

2021

Speaker Verification Experiments for Adults and Children Using Shared Embedding Spaces
Tuomas Kaseva | Hemant Kumar Kathania | Aku Rouhe | Mikko Kurimo
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

For children, the system trained on a large corpus of adult speakers performed worse than a system trained on a much smaller corpus of children’s speech. This is due to the acoustic mismatch between training and testing data. To capture more acoustic variability we trained a shared system with mixed data from adults and children. The shared system yields the best EER for children with no degradation for adults. Thus, the single system trained with mixed data is applicable for speaker verification for both adults and children.

Spectral modification for recognition of children’s speech under mismatched conditions
Hemant Kumar Kathania | Sudarsana Reddy Kadiri | Paavo Alku | Mikko Kurimo
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

In this paper, we propose spectral modification by sharpening formants and by reducing the spectral tilt to recognize children’s speech by automatic speech recognition (ASR) systems developed using adult speech. In this type of mismatched condition, the ASR performance is degraded due to the acoustic and linguistic mismatch in the attributes between children and adult speakers. The proposed method is used to improve the speech intelligibility to enhance the children’s speech recognition using an acoustic model trained on adult speech. In the experiments, WSJCAM0 and PFSTAR are used as databases for adults’ and children’s speech, respectively. The proposed technique gives a significant improvement in the context of the DNN-HMM-based ASR. Furthermore, we validate the robustness of the technique by showing that it performs well also in mismatched noise conditions.

Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages
Juho Leinonen | Sami Virpioja | Mikko Kurimo
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Forced alignment is an effective process to speed up linguistic research. However, most forced aligners are language-dependent, and under-resourced languages rarely have enough resources to train an acoustic model for an aligner. We present a new Finnish grapheme-based forced aligner and demonstrate its performance by aligning multiple Uralic languages and English as an unrelated language. We show that even a simple non-expert created grapheme-to-phoneme mapping can result in useful word alignments.

2020

Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Twelfth Language Resources and Evaluation Conference

Data-driven segmentation of words into subword units has been used in various natural language processing applications such as automatic speech recognition and statistical machine translation for almost 20 years. Recently it has become more widely adopted, as models based on deep neural networks often benefit from subword units even for morphologically simpler languages. In this paper, we discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning. Using English, Finnish, North Sami, and Turkish data sets, we show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm. The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard. We publish implementations of the new algorithms in the widely-used Morfessor software package.

Effects of Language Relatedness for Cross-lingual Transfer Learning in Character-Based Language Models
Mittul Singh | Peter Smit | Sami Virpioja | Mikko Kurimo
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Character-based Neural Network Language Models (NNLM) have the advantage of smaller vocabulary and thus faster training times in comparison to NNLMs based on multi-character units. However, in low-resource scenarios, both the character and multi-character NNLMs suffer from data sparsity. In such scenarios, cross-lingual transfer has improved multi-character NNLM performance by allowing information transfer from a source to the target language. In the same vein, we propose to use cross-lingual transfer for character NNLMs applied to low-resource Automatic Speech Recognition (ASR). However, applying cross-lingual transfer to character NNLMs is not as straightforward. We observe that relatedness of the source language plays an important role in cross-lingual pretraining of character NNLMs. We evaluate this aspect on ASR tasks for two target languages: Finnish (with English and Estonian as source) and Swedish (with Danish, Norwegian, and English as source). Prior work has observed no difference between using the related or unrelated language for multi-character NNLMs. We, however, show that for character-based NNLMs, only pretraining with a related language improves the ASR performance, and using an unrelated language may deteriorate it. We also observe that the benefits are larger when there is much less target data than source data.

Graph-based Syntactic Word Embeddings
Ragheb Al-Ghezi | Mikko Kurimo
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)

We propose a simple and efficient framework to learn syntactic embeddings based on information derived from constituency parse trees. Using biased random walk methods, our embeddings not only encode syntactic information about words, but they also capture contextual information. We also propose a method to train the embeddings on multiple constituency parse trees to ensure the encoding of global syntactic representation. Quantitative evaluation of the embeddings shows a competitive performance on a POS tagging task when compared to other types of embeddings, and qualitative evaluation reveals interesting facts about the syntactic typology learned by these embeddings.

Service registration chatbot: collecting and comparing dialogues from AMT workers and service’s users
Luca Molteni | Mittul Singh | Juho Leinonen | Katri Leino | Mikko Kurimo | Emanuele Della Valle
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Crowdsourcing is the go-to solution for data collection and annotation in the context of NLP tasks. Nevertheless, crowdsourced data is noisy by nature; the source is often unknown and additional validation work is performed to guarantee the dataset’s quality. In this article, we compare two crowdsourcing sources on a dialogue paraphrasing task revolving around a chatbot service. We observe that workers hired on crowdsourcing platforms produce lexically poorer and less diverse rewrites than service users engaged voluntarily. Notably enough, on dialogue clarity and optimality, the two paraphrase sources’ human-perceived quality does not differ significantly. Furthermore, for the chatbot service, the combined crowdsourced data is enough to train a transformer-based Natural Language Generation (NLG) system. To enable similar services, we also release tools for collecting data and training the dialogue-act-based transformer-based NLG module.

2019

North Sámi morphological segmentation with low-resource semi-supervised sequence labeling
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages

A user study to compare two conversational assistants designed for people with hearing impairments
Anja Virkkunen | Juri Lukkarila | Kalle Palomäki | Mikko Kurimo
Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies

Participating in conversations can be difficult for people with hearing loss, especially in acoustically challenging environments. We studied the preferences the hearing impaired have for a personal conversation assistant based on automatic speech recognition (ASR) technology. We created two prototypes which were evaluated by hearing impaired test users. This paper qualitatively compares the two based on the feedback obtained from the tests. The first prototype was a proof-of-concept system running real-time ASR on a laptop. The second prototype was developed for a mobile device with the recognizer running on a separate server. In the mobile device, augmented reality (AR) was used to help the hearing impaired observe gestures and lip movements of the speaker simultaneously with the transcriptions. Several testers found the systems useful enough to use in their daily lives, with the majority preferring the mobile AR version. The biggest concern of the testers was the accuracy of the transcriptions and the lack of speaker identification.

2018

The MeMAD Submission to the IWSLT 2018 Speech Translation Task
Umut Sulubacak | Jörg Tiedemann | Aku Rouhe | Stig-Arne Grönroos | Mikko Kurimo
Proceedings of the 15th International Conference on Spoken Language Translation

This paper describes the MeMAD project entry to the IWSLT Speech Translation Shared Task, addressing the translation of English audio into German text. Between the pipeline and end-to-end model tracks, we participated only in the former, with three contrastive systems. We also tried the latter, but were not able to finish our end-to-end model in time. All of our systems start by transcribing the audio into text through an automatic speech recognition (ASR) model trained on the TED-LIUM English Speech Recognition Corpus (TED-LIUM). Afterwards, we feed the transcripts into English-German text-based neural machine translation (NMT) models. Our systems employ three different translation models trained on separate training sets compiled from the English-German part of the TED Speech Translation Corpus (TED-TRANS) and the OPENSUBTITLES2018 section of the OPUS collection. In this paper, we also describe the experiments leading up to our final systems. Our experiments indicate that using OPENSUBTITLES2018 in training significantly improves translation performance. We also experimented with various pre- and postprocessing routines for the NMT module, but we did not have much success with these. Our best-scoring system attains a BLEU score of 16.45 on the test set for this year’s task.

New Baseline in Automatic Speech Recognition for Northern Sámi
Juho Leinonen | Peter Smit | Sami Virpioja | Mikko Kurimo
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Cognate-aware morphological segmentation for multilingual neural translation
Stig-Arne Grönroos | Sami Virpioja | Mikko Kurimo
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This article describes the Aalto University entry to the WMT18 News Translation Shared Task. We participate in the multilingual subtrack with a system trained under the constrained condition to translate from English to both Finnish and Estonian. The system is based on the Transformer model. We focus on improving the consistency of morphological segmentation for words that are similar orthographically, semantically, and distributionally; such words include etymological cognates, loan words, and proper names. For this, we introduce Cognate Morfessor, a multilingual variant of the Morfessor method. We show that our approach improves the translation quality particularly for Estonian, which has fewer resources for training the translation model.

This paper describes the MeMAD project entry to the WMT Multimodal Machine Translation Shared Task. We propose adapting the Transformer neural machine translation (NMT) architecture to a multi-modal setting. In this paper, we also describe the preliminary experiments with text-only translation systems leading us up to this choice. We have the top scoring system for both English-to-German and English-to-French, according to the automatic metrics for flickr18. Our experiments show that the effect of the visual features in our system is small. Our largest gains come from the quality of the underlying text-only NMT system. We find that appropriate use of additional data is effective.

String segmentation is an important and recurring problem in natural language processing and other domains. For morphologically rich languages, the number of different word forms caused by morphological processes like agglutination, compounding and inflection may be huge and cause problems for the traditional word-based language modeling approach. Segmenting text into units that are easier to model is thus an important part of the modeling task. This work presents methods and a toolkit for learning segmentation models from text. The methods may be applied to lexical unit selection for speech recognition and also other segmentation tasks.

Part-of-Speech Tagging using Conditional Random Fields: Exploiting Sub-Label Dependencies for Improved Accuracy
Miikka Silfverberg | Teemu Ruokolainen | Krister Lindén | Mikko Kurimo
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Studies on training text selection for conversational Finnish language modeling
Seppo Enarvi | Mikko Kurimo
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

Current ASR and MT systems do not operate on conversational Finnish, because training data for colloquial Finnish has not been available. Although speech recognition performance on literary Finnish is already quite good, those systems have very poor baseline performance in conversational speech. Text data for relevant vocabulary and language models can be collected from the Internet, but web data is very noisy and most of it is not helpful for learning good models. The Finnish language is highly agglutinative and written phonetically. Even phonetic reductions and sandhi are often written down in informal discussions. This increases vocabulary size dramatically and causes word-based selection methods to fail. Our selection method explicitly optimizes the perplexity of a subword language model on the development data, and requires only a very limited amount of speech transcripts as development data. The language models have been evaluated for speech recognition using a new data set consisting of generic colloquial Finnish.

Supervised Morphological Segmentation in a Low-Resource Learning Setting using Conditional Random Fields
Teemu Ruokolainen | Oskar Kohonen | Sami Virpioja | Mikko Kurimo
Proceedings of the Seventeenth Conference on Computational Natural Language Learning