Mika Hämäläinen

2025

Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Michael Rießler | Eiaki V. Morooka | Lev Kharlashkin
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

Benchmarking Finnish Lemmatizers across Historical and Contemporary Texts
Emily Öhman | Leo Huovinen | Mika Hämäläinen
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

Lemmatization is crucial in natural language processing (NLP) for languages like Finnish, where complex inflectional morphology significantly affects downstream tasks such as parsing, named entity recognition, and sentiment analysis. This study evaluates the accuracy and efficiency of several Finnish lemmatizers, utilizing the Project Gutenberg corpus, which includes diverse Finnish-language texts from different periods. Notably, this is the first study to employ Trankit for Finnish lemmatization, providing novel insights into its performance. Additionally, the integration of Murre preprocessing has been emphasized, demonstrating substantial improvements in lemmatization results. By comparing traditional and neural-network-based approaches, this paper aims to provide insights into tool selection for NLP practitioners working with Finnish based on dataset characteristics and processing constraints.

Threefold model for AI Readiness: A Case Study with Finnish Healthcare SMEs
Mohammed Alnajjar | Khalid Alnajjar | Mika Hämäläinen
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI’s potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.

Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Emily Öhman | Yuri Bizzoni | So Miyagawa | Khalid Alnajjar
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient
Yehor Tereshchenko | Mika Hämäläinen
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand new DeepSeek-V3 (R1 with reasoning and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini) and Gemini (1.5 flash, 2.0 flash and 2.0 flash exp), and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called Relative Danger Coefficient (RDC).

Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures
Yehor Tereschenko | Mika Hämäläinen | Svitlana Myroniuk
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI’s GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.

From NLG Evaluation to Modern Student Assessment in the Era of ChatGPT: The Great Misalignment Problem and Pedagogical Multi-Factor Assessment (P-MFA)
Mika Hämäläinen | Kimmo Leiviskä
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

This paper explores the growing epistemic parallel between NLG evaluation and the grading of students at a Finnish university. We argue that both domains are experiencing a Great Misalignment Problem. As students increasingly use tools like ChatGPT to produce sophisticated outputs, traditional assessment methods that focus on final products rather than learning processes have lost their validity. To address this, we introduce the Pedagogical Multi-Factor Assessment (P-MFA) model, a process-based, multi-evidence framework inspired by the logic of multi-factor authentication.

ORACLE: Time-Dependent Recursive Summary Graphs for Foresight on News Data Using LLMs
Lev Kharlashkin | Eiaki V. Morooka | Yehor Tereschenko | Mika Hämäläinen
Proceedings of the 10th International Workshop on Computational Linguistics for Uralic Languages

ORACLE turns daily news into week-over-week, decision-ready insights for one of the Finnish Universities of Applied Sciences. The platform crawls and versions news, applies University-specific relevance filtering, embeds content, classifies items into PESTEL dimensions and builds a concise Time-Dependent Recursive Summary Graph (TRSG): two clustering layers summarized by an LLM and recomputed weekly. A lightweight change detector highlights what is new, removed or changed, then groups differences into themes for PESTEL-aware analysis. We detail the pipeline, discuss concrete design choices that make the system stable in production and present a curriculum-intelligence use case with an evaluation plan.

Studying the Representation of the LGBTQ+ Community in RuPaul’s Drag Race with LLM-Based Topic Modeling
Mika Hämäläinen
Proceedings of the Queer in AI Workshop

This study investigates the representation of the LGBTQ+ community in the widely acclaimed reality television series RuPaul’s Drag Race through a novel application of large language model (LLM)-based topic modeling. By analyzing subtitles from seasons 1 to 16, the research identifies a spectrum of topics ranging from empowering themes, such as self-expression through drag, community support and positive body image, to challenges faced by the LGBTQ+ community, including homophobia, HIV and mental health. Employing an LLM allowed for nuanced exploration of these themes, overcoming the limitations of traditional word-based topic modeling.

On Psychology of AI – Does Primacy Effect Affect ChatGPT and Other LLMs?
Mika Hämäläinen
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

We study the primacy effect in three commercial LLMs: ChatGPT, Gemini and Claude. We do this by repurposing the famous experiment Asch (1946) conducted using human subjects. The experiment is simple: given two candidates with equal descriptions, which one is preferred if one description lists positive adjectives before negative ones and the other lists negative adjectives before positive ones? We test this in two experiments. In one experiment, LLMs are given both candidates simultaneously in the same prompt, and in another experiment, LLMs are given both candidates separately. We test all the models with 200 candidate pairs. We found that, in the first experiment, ChatGPT preferred the candidate with positive adjectives listed first, while Gemini preferred both equally often. Claude refused to make a choice. In the second experiment, ChatGPT and Claude were most likely to rank both candidates equally. In the case where they did not give an equal rating, both showed a clear preference for a candidate that had negative adjectives listed first. Gemini was most likely to prefer a candidate with negative adjectives listed first.

LLM-Assisted, Iterative Curriculum Writing: A Human-Centered AI Approach in Finnish Higher Education
Leo Huovinen | Mika Hämäläinen
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

This paper details an LLM-assisted system designed to support curriculum writing within a Finnish higher education institution. Developed over 18 months through iterative prototyping, workshops, and user testing with faculty, the tool functions as a collaborative partner. It provides structured suggestions and analyzes course content for alignment with institutional goals and standards like UN SDGs, aiming to reduce educator cognitive load while keeping humans central to the process. The paper presents the system’s technical architecture, findings from user feedback (including quotes and evaluation metrics), and discusses its potential to aid complex educational planning compared to generic AI tools.

2024

Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Flammie Pirinen | Melany Macias | Mario Crespo Avila
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

Analyzing Pokémon and Mario Streamers’ Twitch Chat with LLM-based User Embeddings
Mika Hämäläinen | Jack Rueter | Khalid Alnajjar
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

We present a novel digital humanities method for representing our Twitch chatters as user embeddings created by a large language model (LLM). We cluster these embeddings automatically using affinity propagation and further narrow this clustering down through manual analysis. We analyze the chat of one stream by each Twitch streamer: SmallAnt, DougDoug and PointCrow. Our findings suggest that each streamer has their own type of chatters; however, two categories emerge for all of the streamers: supportive viewers and emoji and reaction senders. Repetitive message spammers form a shared chatter category for two of the streamers.

Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Emily Öhman | So Miyagawa | Khalid Alnajjar | Yuri Bizzoni
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Scaling Sustainable Development Goal Predictions across Languages: From English to Finnish
Melany Macias | Lev Kharlashkin | Leo Huovinen | Mika Hämäläinen
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

In this paper, we leverage an exclusive English dataset to train diverse multilingual classifiers, investigating their efficacy in adapting to Finnish data. We employ an exclusively English classification dataset of UN Sustainable Development Goals (SDG) in an education context, to train various multilingual classifiers and examine how well these models can adapt to recognizing the same classes within Finnish university course descriptions. It’s worth noting that Finnish, with a mere 5 million native speakers, presents a significantly less-resourced linguistic context compared to English. The best performing model in our experiments was mBART with an F1-score of 0.843.

Legal and Ethical Considerations that Hinder the Use of LLMs in a Finnish Institution of Higher Education
Mika Hämäläinen
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024

Large language models (LLMs) make it possible to solve many business problems more easily than ever before. However, embracing LLMs in an organization may be slowed down by ethical and legal considerations. In this paper, we will describe some of these issues we have faced at our university while developing university-level NLP tools to empower teaching and study planning. The identified issues touch upon topics such as GDPR, copyright, user account management and fear of the new technology.

DAG: Dictionary-Augmented Generation for Disambiguation of Sentences in Endangered Uralic Languages using ChatGPT
Mika Hämäläinen
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

We showcase that ChatGPT can be used to disambiguate lemmas in two endangered languages ChatGPT is not proficient in, namely Erzya and Skolt Sami. We augment our prompt by providing dictionary translations of the candidate lemmas to a majority language, Finnish in our case. This dictionary-augmented generation approach results in 50% accuracy for Skolt Sami and 41% accuracy for Erzya. On closer inspection, many of the error types were of the kind even an untrained human annotator would make.

Leveraging Transformer-Based Models for Predicting Inflection Classes of Words in an Endangered Sami Language
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in NLP for endangered languages.

Empowering Teachers with Usability-Oriented LLM-Based Tools for Digital Pedagogy
Melany Vanessa Macias | Lev Kharlashkin | Leo Einari Huovinen | Mika Hämäläinen
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

We present our work on two LLM-based tools that utilize artificial intelligence and creative technology to improve education. The first tool is a Moodle AI plugin, which helps teachers manage their course content more efficiently using AI-driven analysis, content generation, and an interactive chatbot. The second one is a curriculum planning tool that provides insight into the sustainability, work-life relevance, and workload of each course. Both of these tools have the common goal of integrating sustainable development goals (UN SDGs) into teaching, among other things. We will describe the usability-focused and user-centric approach we have embraced when developing these tools.

2023

Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Emily Öhman | Flammie Pirinen | Khalid Alnajjar | So Miyagawa | Yuri Bizzoni | Niko Partanen | Jack Rueter
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

RooAd: A Computationally Creative Online Advertisement Generator
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 1st International Workshop on Multilingual, Multimodal and Multitask Language Generation

Bootstrapping Moksha-Erzya Neural Machine Translation from Rule-Based Apertium
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Neural Machine Translation (NMT) has made significant strides in breaking down language barriers around the globe. For lesser-resourced languages like Moksha and Erzya, however, the development of robust NMT systems remains a challenge due to the scarcity of parallel corpora. This paper presents a novel approach to address this challenge by leveraging the existing rule-based machine translation system Apertium as a tool for synthetic data generation. We fine-tune NLLB-200 for Moksha-Erzya translation and obtain a BLEU of 0.73 on the Apertium-generated data. On real-world data, we obtained an improvement of 0.058 BLEU over Apertium.

The Great Digital Humanities Disconnect: The Failure of DH Publishing
Emily Öhman | Michael Piotrowski | Mika Hämäläinen
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

In this paper, we discuss the disconnect in interdisciplinary publishing from a disciplinary divide perspective as to how research is expected to be presented and published according to disciplinary conventions. We argue that this divide hinders interdisciplinary collaboration and even more so the dissemination of research results from interdisciplinary projects to other interdisciplinary researchers. The disconnect is not simply theoretical but also encompasses practical considerations such as manuscript creation standards. The disconnect can also be detrimental to academic careers in terms of evaluations by peers on funding and tenure committees as well as peer reviews. With this analysis, we want to foster further discussion about the state of academic publishing from a digital humanities perspective.

Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods
Mika Hämäläinen | Jack Rueter | Khalid Alnajjar | Niko Partanen
Proceedings of the Big Picture Workshop

We present our work towards building an infrastructure for documenting endangered languages with a focus on Uralic languages in particular. Our infrastructure consists of tools to write dictionaries so that entries are structured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively towards enhancing these dictionaries and tools by using the latest state-of-the-art neural models by generating training data through rules and lexica.

Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language data through the aligned word embeddings. To test our model, we annotated a small sentiment analysis corpus for the 4 endangered languages and Finnish. Our method reached at least 56% accuracy for each endangered language. The models and the sentiment corpus will be released together with this paper. Our research shows that state-of-the-art neural models can be used with endangered languages with the only requirement being a dictionary between the endangered language and a majority language.

Modelling the Reduplicating Lushootseed Morphology with an FST and LSTM
Jack Rueter | Mika Hämäläinen | Khalid Alnajjar
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

In this paper, we present an FST-based approach for conducting morphological analysis, lemmatization and generation of Lushootseed words. Furthermore, we use the FST to generate training data for an LSTM-based neural model and train this model to do morphological analysis. The neural model reaches 71.9% accuracy on the test data. Furthermore, we discuss reduplication types in the Lushootseed language forms. The approach involves the use of both attested instances of reduplication and bare stems to which a variety of reduplications are applied, as it is unclear just how much variation can be attributed to the individual speakers and authors of the source materials. That is, there may be areal factors that can be aligned with certain types of reduplication and their frequencies.

2022

Emotion Conditioned Creative Dialog Generation
Khalid Alnajjar | Mika Hämäläinen
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

We present a DialGPT-based model for generating creative dialog responses that are conditioned on one of the following emotions: anger, disgust, fear, happiness, pain, sadness and surprise. Our model is capable of producing a contextually apt response given an input sentence and a desired emotion label. Our model is capable of expressing the desired emotion with an accuracy of 0.6. The best performing emotions are neutral, fear and disgust. When measuring the strength of the expressed emotion, we find that anger, fear and disgust are expressed most strongly by the model.

Using Graph-Based Methods to Augment Online Dictionaries of Endangered Languages
Khalid Alnajjar | Mika Hämäläinen | Niko Tapio Partanen | Jack Rueter
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

Many endangered Uralic languages have multilingual machine-readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs; for instance, the Livonian dictionary has some translations to Finnish, Latvian and Estonian, and the Komi-Zyrian dictionary has some translations to Finnish, English and Russian. We utilize graph-based approaches to augment such dictionaries by predicting new translations to existing and new languages based on different dictionaries for endangered languages and Wiktionaries. Our study focuses on the lexical resources for Komi-Zyrian (kpv), Erzya (myv) and Livonian (liv). We evaluate our approach with human judges fluent in the three endangered languages in question. Based on the evaluation, the method predicted good or acceptable translations 77% of the time. Furthermore, we train a neural prediction model to predict the quality of the automatically predicted translations with an 81% accuracy. The resulting extensions to the dictionaries are made available on the online dictionary platform used by the speakers of these languages.

Help from the Neighbors: Estonian Dialect Normalization Using a Finnish Dialect Generator
Mika Hämäläinen | Khalid Alnajjar | Tuuli Tuisk
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

While standard Estonian is not a low-resourced language, the different dialects of the language are under-resourced from the point of view of NLP, given that there are no vast hand-normalized resources available for training a machine learning model to normalize dialectal Estonian to standard Estonian. In this paper, we crawl a small parallel corpus of dialectal Estonian and standard Estonian sentences. In addition, we take a savvy approach of generating more synthetic training data for the normalization task by using an existing dialect generator model built for Finnish to “dialectalize” standard Estonian sentences from the Universal Dependencies treebanks. Our BERT-based normalization model achieves a word error rate that is 26.49 points lower when using both the synthetic data and Estonian data in comparison to training the model with only the available Estonian data. Our results suggest that synthetic data generated by a model trained on a more resourced related language can indeed boost the results for a less resourced language.

Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

Automatic Generation of Factual News Headlines in Finnish
Maximilian Koppatz | Khalid Alnajjar | Mika Hämäläinen | Thierry Poibeau
Proceedings of the 15th International Conference on Natural Language Generation

When to Laugh and How Hard? A Multimodal Approach to Detecting Humor and Its Intensity
Khalid Alnajjar | Mika Hämäläinen | Jörg Tiedemann | Jorma Laaksonen | Mikko Kurimo
Proceedings of the 29th International Conference on Computational Linguistics

Prerecorded laughter accompanying dialog in comedy TV shows encourages the audience to laugh by clearly marking humorous moments in the show. We present an approach for automatically detecting humor in the Friends TV show using multimodal data. Our model is capable of recognizing whether an utterance is humorous or not and assessing its intensity. We use the prerecorded laughter in the show as annotation, as it marks humor and the length of the audience’s laughter tells us how funny a given joke is. We evaluate the model on episodes the model has not been exposed to during the training phase. Our results show that the model is capable of correctly detecting whether an utterance is humorous 78% of the time and how long the audience’s laughter reaction should last with a mean absolute error of 600 milliseconds.

Ring That Bell: A Corpus and Method for Multimodal Metaphor Detection in Videos
Khalid Alnajjar | Mika Hämäläinen | Shuo Zhang
Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)

We present the first openly available multimodal metaphor annotated corpus. The corpus consists of videos including audio and subtitles that have been annotated by experts. Furthermore, we present a method for detecting metaphors in the new dataset based on the textual content of the videos. The method achieves a high F1-score (62%) for metaphorical labels. We also experiment with other modalities and multimodal methods; however, these methods did not outperform the text-based model. In our error analysis, we do identify cases where video could help in disambiguating metaphors; however, the visual cues are too subtle for our model to capture. The data is available on Zenodo.

2021

Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered
Mika Hämäläinen | Niko Partanen | Jack Rueter | Khalid Alnajjar
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible to use them as fallback systems together with the FSTs. The source code, models and datasets have been released on Zenodo.

Processing M.A. Castrén’s Materials: Multilingual Historical Typed and Handwritten Manuscripts
Niko Partanen | Jack Rueter | Khalid Alnajjar | Mika Hämäläinen
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813–1852). The Finno-Ugrian Society is publishing Castrén’s manuscripts as new critical and digital editions, and at the same time different research groups have also paid attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created to further improve the usability of these materials, and also to aid the further processing of similar archived collections. We specifically focus on the parts of the collections that are processed in a way that improves their usability in more technical applications, complementing the earlier work on the cultural and linguistic aspects of these materials. Most of these datasets are openly available in Zenodo. The study points to specific areas where further research is needed, and provides benchmarks for text recognition tasks.

Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

We survey human evaluation in papers presenting work on creative natural language generation that have been published in INLG 2020 and ICCC 2020. The most typical human evaluation method is a scaled survey, typically on a 5 point scale, while many other less common methods exist. The most commonly evaluated parameters are meaning, syntactic correctness, novelty, relevance and emotional value, among many others. Our guidelines for future evaluation include clearly defining the goal of the generative system, asking questions as concrete as possible, testing the evaluation setup, using multiple different evaluation setups, reporting the entire evaluation process and potential biases clearly, and finally analyzing the evaluation results in a more profound way than merely reporting the most typical statistics.

Rules Ruling Neural Networks - Neural vs. Rule-Based Grammar Checking for a Low Resource Language
Linda Wiechetek | Flammie A Pirinen | Mika Hämäläinen | Chiara Argese
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

We investigate both rule-based and machine learning methods for the task of compound error correction and evaluate their efficiency for North Sámi, a low resource language. The lack of error-free data needed for a neural approach is a challenge to the development of these tools, which is not shared by bigger languages. In order to compensate for that, we used a rule-based grammar checker to remove erroneous sentences and insert compound errors by splitting correct compounds. We describe how we set up the error detection rules, and how we train a bi-RNN based neural network. The precision of the rule-based model tested on a corpus with real errors (81.0%) is slightly better than the neural model (79.4%). The rule-based model is also more flexible with regard to fixing specific errors requested by the user community. However, the neural model has a better recall (98%). The results suggest that an approach that combines the advantages of both models would be desirable in the future. Our tools and data sets are open-source and freely available on GitHub and Zenodo.

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography
Mika Hämäläinen | Niko Partanen | Khalid Alnajjar
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Texts written in Old Literary Finnish represent the first literary works ever written in Finnish, starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods on such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy on texts written by Agricola and 87.7% accuracy on other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub.

The Great Misalignment Problem in Human Evaluation of NLP Methods
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

We outline the Great Misalignment Problem in natural language processing research: the problem definition is not in line with the method proposed, and the human evaluation is in line with neither the definition nor the method. We study this misalignment problem by surveying 10 randomly sampled papers published in ACL 2020 that report results with human evaluation. Our results show that only one paper was fully in line in terms of problem definition, method and evaluation. Only two papers presented a human evaluation that was in line with what was modeled in the method. These results highlight that the Great Misalignment Problem is a major one and it affects the validity and reproducibility of results obtained by a human evaluation.

Proceedings of the Workshop on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

Apurinã Universal Dependencies Treebank
Jack Rueter | Marília Fernanda Pereira de Freitas | Sidney Da Silva Facundes | Mika Hämäläinen | Niko Partanen
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop a finite-state description of the language and facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon. The source materials used in the initial treebank represent fieldwork practices where not all tokens of all sentences are equally annotated. For this reason, establishing regular annotation practices for the entire Apurinã treebank is an ongoing project.

The Current State of Finnish NLP
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

Finnish Dialect Identification: The Effect of Audio and Text
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and on a transcript with an audio recording, in a dataset consisting of 23 different dialects. Our results show that the best accuracy is achieved by combining both modalities, as text alone reaches an overall accuracy of 57%, whereas text and audio reach 85%. Our code, models and data have been released openly on GitHub and Zenodo.

Developing Keyboards for the Endangered Livonian Language
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the Fifth Workshop on Widening Natural Language Processing

We present our current work on developing keyboard layouts for a critically endangered Uralic language called Livonian. Our layouts work on Windows, macOS and Linux. In addition, we have developed keyboard apps with predictive text for Android and iOS. This work has been conducted in collaboration with the language community.

Overview of Open-Source Morphology Development for the Komi-Zyrian Language: Past and future
Jack Rueter | Niko Partanen | Mika Hämäläinen | Trond Trosterud
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages

Linguistic change and historical periodization of Old Literary Finnish
Niko Partanen | Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

In this study, we have normalized and lemmatized an Old Literary Finnish corpus using a lemmatization model trained on texts from Agricola. We analyse the error types that occur and appear in different decades, and use word error rate (WER) and different error types as a proxy for measuring linguistic innovation and change. We show that the proposed approach works, and the errors are connected to accumulating changes and innovations, which also results in a continuous decrease in the accuracy of the model. The described error types also guide further work in improving these models, and document the currently observed issues. We also have trained word embeddings for four centuries of lemmatized Old Literary Finnish, which are available on Zenodo.

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish
Quan Duong | Mika Hämäläinen | Simon Hengchen
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.

Never guess what I heard... Rumor Detection in Finnish News: a Dataset and a Baseline
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and a rumor label accuracy of 96.0%. However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%. Our results suggest that the performance difference is due to a difference in the original training data. Furthermore, we find that a regular LSTM model works better than one trained with a pretrained word2vec model. These findings suggest that more work needs to be done on pretrained models for the Finnish language, as they have been trained on small and biased corpora.

TFW2V: An Enhanced Document Similarity Method for the Morphologically Rich Finnish Language
Quan Duong | Mika Hämäläinen | Khalid Alnajjar
Proceedings of the Workshop on Natural Language Processing for Digital Humanities

Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language. At the same time, we propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data. Furthermore, we design an objective evaluation method which can be used as a framework for benchmarking text similarity approaches.

¡Qué maravilla! Multimodal Sarcasm Detection in Spanish: a Dataset and a Baseline
Khalid Alnajjar | Mika Hämäläinen
Proceedings of the Third Workshop on Multimodal Artificial Intelligence

We construct the first ever multimodal sarcasm dataset for Spanish. The audiovisual dataset consists of sarcasm annotated text that is aligned with video and audio. The dataset represents two varieties of Spanish, a Latin American variety and a Peninsular Spanish variety, which ensures a wider dialectal coverage for this global language. We present several models for sarcasm detection that will serve as baselines in future research. Our results show that results with text only (89%) are worse than when combining text with audio (91.9%). Finally, the best results are obtained when combining all the modalities: text, audio and video (93.1%). Our dataset will be published on Zenodo with access granted by request.

Detecting Depression in Thai Blog Posts: a Dataset and a Baseline
Mika Hämäläinen | Pattama Patpong | Khalid Alnajjar | Niko Partanen | Jack Rueter
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled from expert-verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a good baseline for future research on the same corpus. Furthermore, we identify a need for Thai embeddings that have been trained on a more varied corpus than Wikipedia. Our corpus, code and trained models have been released openly on Zenodo.

2020

Open-Source Morphology for Endangered Mordvinic Languages
Jack Rueter | Mika Hämäläinen | Niko Partanen
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

This document describes the shared development of finite-state descriptions of two closely related but endangered minority languages, Erzya and Moksha. It touches upon morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.

Morphological Disambiguation of South Sámi with FSTs and Neural Networks
Mika Hämäläinen | Linda Wiechetek
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it usable and applicable in the contexts of any other endangered language as well.

Ve’rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter | Niko Partanen
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

We present an open-source online dictionary editing system, Ve′rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary-oriented lack the technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.

On Editing Dictionaries for Uralic Languages in an Online Environment
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter
Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages

Speech Recognition for Endangered and Extinct Samoyedic languages
Niko Partanen | Mika Hämäläinen | Tiina Klooster
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

FST Morphology for the Endangered Skolt Sami Language
Jack Rueter | Mika Hämäläinen
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and the work presented in this paper forms a part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.

2019

Dialect Text Normalization to Normative Standard Finnish
Niko Partanen | Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We compare different LSTMs and transformer models in terms of their effectiveness in normalizing dialectal Finnish into the normative standard Finnish. As dialect is the common way of communication for people online in Finnish, such a normalization is a necessary step to improve the accuracy of the existing Finnish NLP tools that are tailored for normative Finnish text. We work on a corpus consisting of dialectal data of 23 distinct Finnish dialects. The best functioning BRNN approach lowers the initial word error rate of the corpus from 52.89 to 5.73.

Co-Operation as an Asymmetric Form of Human-Computer Creativity. Case: Peace Machine
Mika Hämäläinen | Timo Honkela
Proceedings of the First Workshop on NLP for Conversational AI

This theoretical paper identifies a need for a definition of asymmetric co-creativity where creativity is expected from the computational agent but not from the human user. Our co-operative creativity framework takes into account that the computational agent has a message to convey in a co-operative fashion, which introduces a trade-off on how creative the computer can be. The requirements of co-operation are identified from an interdisciplinary point of view. We divide co-operative creativity into message creativity, contextual creativity and communicative creativity. Finally, these notions are applied in the context of the Peace Machine system concept.

Revisiting NMT for Normalization of Early English Letters
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only less frequent deviant forms are left out without normalization. This paper discusses different methods for improving the normalization of these deviant forms by using different approaches. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model together with lemmatization improves results.

Morphosyntactic Disambiguation in an Endangered Language Setting
Jeff Ens | Mika Hämäläinen | Jack Rueter | Philippe Pasquier
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Endangered Uralic languages present a high variety of inflectional forms in their morphology. This results in a high number of homonyms in inflections, which introduces a lot of morphological ambiguity in sentences. Previous research has employed constraint grammars (CGs) to address this problem; however, CGs are often unable to fully disambiguate a sentence, and their development is labour intensive. We present an LSTM-based model for automatically ranking morphological readings of sentences based on their quality. This ranking can be used to evaluate the existing CG disambiguators or to directly morphologically disambiguate sentences. Our approach works on a morphological abstraction and it can be trained with a very small dataset.

Generating Modern Poetry Automatically in Finnish
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We present a novel approach for generating poetry automatically for the morphologically rich Finnish language by using a genetic algorithm. The approach improves the state of the art of the previous Finnish poem generators by introducing a higher degree of freedom in terms of structural creativity. Our approach is evaluated and described within the paradigm of computational creativity, where the fitness functions of the genetic algorithm are assimilated with the notion of aesthetics. The output is considered to be a poem 81.5% of the time by human evaluators.

Let’s FACE it. Finnish Poetry Generation with Aesthetics and Framing
Mika Hämäläinen | Khalid Alnajjar
Proceedings of the 12th International Conference on Natural Language Generation

We present a creative poem generator for the morphologically rich Finnish language. Our method falls into the master-apprentice paradigm, where a computationally creative genetic algorithm teaches a BRNN model to generate poetry. We model several parts of poetic aesthetics in the fitness function of the genetic algorithm, such as sonic features, semantic coherence, imagery and metaphor. Furthermore, we justify the creativity of our method based on the FACE theory on computational creativity and take additional care in evaluating our system by automatic metrics for concepts together with human evaluation for aesthetics, framing and expressions.

From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction
Mika Hämäläinen | Simon Hengchen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.

Finding Sami Cognates with a Character-Based NMT Approach
Mika Hämäläinen | Jack Rueter
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

Development of an Open Source Natural Language Generation Tool for Finnish
Mika Hämäläinen | Jack Rueter
Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages

Combining Concepts and Their Translations from Structured Dictionaries of Uralic Minority Languages
Mika Hämäläinen | Liisa Lotta Tarvainen | Jack Rueter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Poem Machine - a Co-creative NLG Web Application for Poem Writing
Mika Hämäläinen
Proceedings of the 11th International Conference on Natural Language Generation

We present Poem Machine, an interactive online tool for co-authoring Finnish poetry with a computationally creative agent. Poem Machine can produce poetry of its own and assist the user in authoring poems. The main target group for the system is primary school children, and its use as a part of teaching is currently under study.

Normalizing Early English Letters to Present-day English Spelling
Mika Hämäläinen | Tanja Säily | Jack Rueter | Jörg Tiedemann | Eetu Mäkelä
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated. All of the methods have their own strengths in word normalization. This calls for finding ways of combining the results from these methods to leverage their individual strengths.

A Master-Apprentice Approach to Automatic Creation of Culturally Satirical Movie Titles
Khalid Alnajjar | Mika Hämäläinen
Proceedings of the 11th International Conference on Natural Language Generation

Satire has played a role in indirectly expressing critique towards an authority or a person from time immemorial. We present an autonomously creative master-apprentice approach consisting of a genetic algorithm and an NMT model to produce humorous and culturally apt satire out of movie titles automatically. Furthermore, we evaluate the approach in terms of its creativity and its output. We provide a solid definition for creativity to maximize the objectiveness of the evaluation.