Petter Mæhlum - ACL Anthology

Petter Mæhlum

2025

LTG at VarDial 2025 NorSID: More and Better Training Data for Slot and Intent Detection
Marthe Midtgaard | Petter Mæhlum | Yves Scherrer
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the LTG submission to the VarDial 2025 shared task, where we participate in the Norwegian slot and intent detection subtasks. The shared task focuses on Norwegian dialects, which present challenges due to their low-resource nature and variation. We test a variety of neural models and training data configurations, with the focus on improving and extending the available Norwegian training data. This includes automatically re-aligning slot spans in Norwegian Bokmål, as well as re-translating the original English training data into both Bokmål and Nynorsk. We also re-annotate an external Norwegian dataset to augment the training data. Our best models achieve first place in both subtasks, with a span F1 score of 0.893 for slot filling and an accuracy of 0.980 for intent detection. Our results indicate that while translation quality is less critical, improving the slot labels has a notable impact on slot performance. Moreover, adding more standard Norwegian data improves performance, but incorporating even small amounts of dialectal data leads to greater gains.

The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for and the results of empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. When evaluated on a diverse set of tasks, we found that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.

Multi-label Scandinavian Language Identification (SLIDE)
Mariia Fedorova | Jonas Sebulon Frydenberg | Victoria Handford | Victoria Ovedie Chruickshank Langø | Solveig Helene Willoch | Marthe Løken Midtgaard | Yves Scherrer | Petter Mæhlum | David Samuel
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.

Improved Norwegian Bokmål Translations for FLORES
Petter Mæhlum | Anders Næss Evensen | Yves Scherrer
Proceedings of the Tenth Conference on Machine Translation

FLORES+ is a collection of parallel datasets obtained by translation from originally English source texts. FLORES+ contains Norwegian translations for the two official written variants of Norwegian: Norwegian Bokmål and Norwegian Nynorsk. However, the earliest Bokmål version contained non-native-like mistakes, and even after a later revision, the dataset contained grammatical and lexical errors. This paper aims at correcting unambiguous mistakes, and thus creating a new version of the Bokmål dataset. At the same time, we provide a translation into Radical Bokmål, a sub-variety of Norwegian which is closer to Nynorsk in some aspects, while still being within the official norms for Bokmål. We discuss existing errors and differences in the various translations and the corrections that we provide.

A Collection of Question Answering Datasets for Norwegian
Vladislav Mikhailov | Petter Mæhlum | Victoria Ovedie Chruickshank Langø | Erik Velldal | Lilja Øvrelid
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian – Bokmål and Nynorsk – our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.

Mixed Feelings: Cross-Domain Sentiment Classification of Patient Feedback
Egil Rønningstad | Lilja Charlotte Storset | Petter Mæhlum | Lilja Øvrelid | Erik Velldal
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Sentiment analysis of patient feedback from the public health domain can aid decision makers in evaluating the provided services. The current paper focuses on free-text comments in patient surveys about general practitioners and psychiatric healthcare, annotated with four sentence-level polarity classes - positive, negative, mixed and neutral - while also attempting to alleviate data scarcity by leveraging general-domain sources in the form of reviews. For several different architectures, we compare in-domain and out-of-domain effects, as well as the effects of training joint multi-domain models.

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

Findings of the VarDial Evaluation Campaign 2025: The NorSID Shared Task on Norwegian Slot, Intent and Dialect Identification
Yves Scherrer | Rob van der Goot | Petter Mæhlum
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

The VarDial Evaluation Campaign 2025 was organized as part of the twelfth workshop on Natural Language Processing for Similar Languages, Varieties and Dialects (VarDial), colocated with COLING 2025. It consisted of one shared task with three subtasks: intent detection, slot filling and dialect identification for Norwegian dialects. This report presents the results of this shared task. Four participating teams submitted systems with very high performance (> 97% accuracy) for intent detection, whereas slot detection and dialect identification proved to be much more challenging, with span-F1 scores of up to 89% and weighted dialect F1 scores of 84%, respectively.

2024

Estimating Lexical Complexity from Document-Level Distributions
Sondre Wold | Petter Mæhlum | Oddbjørn Hove
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Existing methods for complexity estimation are typically developed for entire documents. This limitation in scope makes them inapplicable for shorter pieces of text, such as health assessment tools. These typically consist of lists of independent sentences, all of which are too short for existing methods to apply. The choice of wording in these assessment tools is crucial, as both the cognitive capacity and the linguistic competency of the intended patient groups could vary substantially. As a first step towards creating better tools for supporting health practitioners, we develop a two-step approach for estimating lexical complexity that does not rely on any pre-annotated data. We implement our approach for the Norwegian language and verify its effectiveness using statistical testing and a qualitative evaluation of samples from real assessment tools. We also investigate the relationship between our complexity measure and certain features typically associated with complexity in the literature, such as word length, frequency, and the number of syllables.

It’s Difficult to Be Neutral – Human and LLM-based Sentiment Annotation of Patient Comments
Petter Mæhlum | David Samuel | Rebecka Maria Norman | Elma Jelin | Øyvind Andresen Bjertnæs | Lilja Øvrelid | Erik Velldal
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024

Sentiment analysis is an important tool for aggregating patient voices, in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, comparing their performance to human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.

NoMusic - The Norwegian Multi-Dialectal Slot and Intent Detection Corpus
Petter Mæhlum | Yves Scherrer
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

This paper presents a new textual resource for Norwegian and its dialects. The NoMusic corpus contains Norwegian translations of the xSID dataset, an evaluation dataset for spoken language understanding (slot and intent detection). The translations cover Norwegian Bokmål, as well as eight dialects from three of the four major Norwegian dialect areas. To our knowledge, this is the first multi-parallel resource for written Norwegian dialects, and the first evaluation dataset for slot and intent detection focusing on non-standard Norwegian varieties. In this paper, we describe the annotation process and provide some analyses on the types of linguistic variation that can be found in the dataset.

EDEN: A Dataset for Event Detection in Norwegian News
Samia Touileb | Jeanett Murstad | Petter Mæhlum | Lubos Steskal | Lilja Charlotte Storset | Huiling You | Lilja Øvrelid
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present EDEN, the first Norwegian dataset annotated with event information at the sentence level, adapting the widely used ACE event schema to Norwegian. The paper describes the manual annotation of Norwegian text as well as transcribed speech in the news domain, together with inter-annotator agreement and discussions of relevant dataset statistics. We also present preliminary modeling results using a graph-based event parser. The resulting dataset will be freely available for download and use.

2023

A Diagnostic Dataset for Sentiment and Negation Modeling for Norwegian
Petter Mæhlum | Erik Velldal | Lilja Øvrelid
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

Negation constitutes a challenging phenomenon for many natural language processing tasks, such as sentiment analysis (SA). In this paper we investigate the relationship between negation and sentiment in the context of Norwegian professional reviews. The first part of this paper includes a corpus study which investigates how negation is tied to sentiment in this domain, based on existing annotations. In the second part, we introduce NoReC-NegSynt, a synthetically augmented test set for negation and sentiment, to allow for a more detailed analysis of the role of negation in current neural SA models. This diagnostic test set, containing both clausal and non-clausal negation, allows for analyzing and comparing models’ abilities to treat several different types of negation. We also present a case-study, applying several neural SA models to the diagnostic data.

Identifying Token-Level Dialectal Features in Social Media
Jeremy Barnes | Samia Touileb | Petter Mæhlum | Pierre Lison
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Dialectal variation is present in many human languages and is attracting a growing interest in NLP. Most previous work concentrated on either (1) classifying dialectal varieties at the document or sentence level or (2) performing standard NLP tasks on dialectal data. In this paper, we propose the novel task of token-level dialectal feature prediction. We present a set of fine-grained annotation guidelines for Norwegian dialects, expand a corpus of dialectal tweets, and manually annotate them using the introduced guidelines. Furthermore, to evaluate the learnability of our task, we conduct labeling experiments using a collection of baselines, weakly supervised and supervised sequence labeling models. The obtained results show that, despite the difficulty of the task and the scarcity of training data, many dialectal features can be predicted with reasonably high accuracy.

Phonotactics as an Aid in Low Resource Loan Word Detection and Morphological Analysis in Sakha
Petter Mæhlum | Sardana Ivanova
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

Obtaining information about loan words and irregular morphological patterns can be difficult for low-resource languages. Using Sakha as an example, we show that it is possible to exploit known phonemic regularities such as vowel harmony and consonant distributions to identify loan words and irregular patterns, which can be helpful in rule-based downstream tasks such as parsing and POS-tagging. We evaluate phonemically inspired methods for loanword detection, combined with bi-gram vowel transition probabilities to inspect irregularities in the morphology of loanwords. We show that both these techniques can be useful for the detection of such patterns. Finally, we inspect the plural suffix -ЛАр [-LAr] to observe some of the variation in morphology between native and foreign words.

2022

NorDiaChange: Diachronic Semantic Change Dataset for Norwegian
Andrey Kutuzov | Samia Touileb | Petter Mæhlum | Tita Enstad | Alexandra Wittemann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We describe NorDiaChange: the first diachronic semantic change dataset for Norwegian. NorDiaChange comprises two novel subsets, covering about 80 Norwegian nouns manually annotated with graded semantic change over time. Both datasets follow the same annotation procedure and can be used interchangeably as train and test splits for each other. NorDiaChange covers the time periods related to pre- and post-war events, oil and gas discovery in Norway, and technological developments. The annotation was done using the DURel framework and two large historical Norwegian corpora. NorDiaChange is published in full under a permissive licence, complete with raw annotation data and inferred diachronic word usage graphs (DWUGs).

Annotating Norwegian language varieties on Twitter for Part-of-speech
Petter Mæhlum | Andre Kåsen | Samia Touileb | Jeremy Barnes
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text and a large amount of dialectal variety. In this paper we present a novel Norwegian Twitter dataset annotated with POS-tags. We show that models trained on Universal Dependencies (UD) data perform worse when evaluated against this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that performance on dialectal tweets is comparable to the written standards for some models. Finally, we perform a detailed analysis of the errors that models commonly make on this data.

NARC – Norwegian Anaphora Resolution Corpus
Petter Mæhlum | Dag Haug | Tollef Jørgensen | Andre Kåsen | Anders Nøklestad | Egil Rønningstad | Per Erik Solberg | Erik Velldal | Lilja Øvrelid
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

We present the Norwegian Anaphora Resolution Corpus (NARC), the first publicly available corpus annotated with anaphoric relations between noun phrases for Norwegian. The paper describes the annotated data for 326 documents in Norwegian Bokmål, together with inter-annotator agreement and discussions of relevant statistics. We also present preliminary modelling results which are comparable to existing corpora for other languages, and discuss relevant problems in relation to both modelling and the annotations themselves.

2021

Negation in Norwegian: an annotated dataset
Petter Mæhlum | Jeremy Barnes | Robin Kurtz | Lilja Øvrelid | Erik Velldal
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

This paper introduces NorecNeg – the first annotated dataset of negation for Norwegian. Negation cues and their in-sentence scopes have been annotated across more than 11K sentences spanning more than 400 documents for a subset of the Norwegian Review Corpus (NoReC). In addition to providing in-depth discussion of the annotation guidelines, we also present a first set of benchmark results based on a graph-parsing approach.

NorDial: A Preliminary Corpus of Written Norwegian Dialect Use
Jeremy Barnes | Petter Mæhlum | Samia Touileb
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Norway has a large amount of dialectal variation, as well as a general tolerance to its use in the public sphere. There are, however, few available resources to study this variation and its change over time and in more informal areas, such as social media. In this paper, we propose a first step to creating a corpus of dialectal variation of written Norwegian. We collect a small corpus of tweets and manually annotate them as Bokmål, Nynorsk, any dialect, or a mix. We further perform preliminary experiments with state-of-the-art models, as well as an analysis of the data to expand this corpus in the future. Finally, we make the annotations available for future work.

2020

A Fine-grained Sentiment Dataset for Norwegian
Lilja Øvrelid | Petter Mæhlum | Jeremy Barnes | Erik Velldal
Proceedings of the Twelfth Language Resources and Evaluation Conference

We here introduce NoReC_fine, a dataset for fine-grained sentiment analysis in Norwegian, annotated with respect to polar expressions, targets and holders of opinion. The underlying texts are taken from a corpus of professionally authored reviews from multiple news-sources and across a wide variety of domains, including literature, games, music, products, movies and more. We here present a detailed description of this annotation effort. We provide an overview of the developed annotation guidelines, illustrated with examples, and present an analysis of inter-annotator agreement. We also report the first experimental results on the dataset, intended as a preliminary benchmark for further experiments.

2019

Annotating evaluative sentences for sentiment analysis: a dataset for Norwegian
Petter Mæhlum | Jeremy Barnes | Lilja Øvrelid | Erik Velldal
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper documents the creation of a large-scale dataset of evaluative sentences – i.e. both subjective and objective sentences that are found to be sentiment-bearing – based on mixed-domain professional reviews from various news-sources. We present both the annotation scheme and first results for classification experiments. The effort represents a step toward creating a Norwegian dataset for fine-grained sentiment analysis.

Co-authors

Andrey Kutuzov 3

Vladislav Mikhailov 3

Mariia Fedorova 2

Victoria Ovedie Chruickshank Langø 2

Stephan Oepen 2

Egil Rønningstad 2

Lilja Charlotte Storset 2

Nikolay Arefyev 1

Marta Bañón 1

Magnus Breder Birkenes 1

Øyvind Andresen Bjertnæs 1

Rolv-Arild Braaten 1

Svein Arne Brygfjeld 1

Laurie Burchell 1

Javier De La Rosa 1

Hans Christian Farsethås 1

Jonas Sebulon Frydenberg 1

Rob Van Der Goot 1

Liane Guillou 1

Jon Atle Gulla 1

Victoria Handford 1

Jindřich Helcl 1

Erik Henriksson 1

Oddbjørn Hove 1

Sardana Ivanova 1

Tollef Jørgensen 1

Mateusz Klimaszewski 1

Ville Komulainen 1

Joona Kytöniemi 1

Veronika Laippala 1

Bhavitvya Malik 1

Farrokh Mehryary 1

Marthe Midtgaard 1

Marthe Løken Midtgaard 1

Jeanett Murstad 1

Aslak Sira Myhre 1

Amanda Myntti 1

Rebecka Maria Norman 1

Anders Næss Evensen 1

Anders Nøklestad 1

Dayyán O’Brien 1

Sampo Pyysalo 1

Gema Ramírez-Sánchez 1

Per Erik Solberg 1

Pavel Stepachev 1

Lubos Steskal 1

Jörg Tiedemann 1

Tereza Vojtěchová 1

Freddy Wetjen 1

Solveig Helene Willoch 1

Alexandra Wittemann 1

Jaume Zaragoza-Bernabeu 1

Ona de Gibert 1

Wilfred Østgulen 1

Venues