Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Richard Johansson | Sara Stymne
Annotating and Classifying Direct Speech in Historical Danish and Norwegian Literary Texts
Ali Al-Laith | Alexander Conroy | Kirstine Nielsen Degn | Jens Bjerring-Hansen | Daniel Hershcovich
Analyzing direct speech in historical literary texts provides insights into character dynamics, narrative style, and discourse patterns. In late 19th-century Danish and Norwegian fiction, direct speech reflects characters’ social and geographical backgrounds. However, inconsistent typographic conventions in Scandinavian literature complicate computational methods for distinguishing direct speech from other narrative elements. To address this, we introduce an annotated dataset from the MeMo corpus, capturing speech markers and tags in Danish and Norwegian novels. We evaluate pre-trained language models for classifying direct speech, with results showing that a Danish Foundation Model (DFM), trained on extensive Danish data, achieves the highest performance. Finally, we conduct a classifier-assisted quantitative corpus analysis and find a downward trend in the prevalence of speech over time.
Diachronic Analysis of Phrasal Verbs in English Scientific Writing
Diego Alves
Phrasal verbs (PVs) are a type of multi-word expression and a characteristic feature of the English language. However, their usage in scientific prose is limited. Our study focuses on the analysis of phrasal verbs in the scientific domain, using information-theoretic methods to describe diachronic phenomena such as conventionalization and diversification in the usage of PVs. We analysed their developmental trajectory from the mid-17th century to the end of the 20th century by measuring relative entropy (Kullback-Leibler divergence), the predictability in context of phrasal verb particles (surprisal), and paradigmatic variability using word embedding spaces. We identified phenomena such as a process of conventionalization over the 20th century and peaks of diversification throughout the centuries.
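
A minimal Python sketch of the two information-theoretic measures mentioned above: smoothed Kullback-Leibler divergence between the particle distributions of two time slices, and surprisal of a particle given raw context counts. The counts, function names, and smoothing scheme are illustrative, not the paper's data or code.

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, smoothing=1e-6):
    """D_KL(P || Q) over the shared vocabulary, with additive smoothing."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(vocab)
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    dkl = 0.0
    for w in vocab:
        p = (p_counts.get(w, 0) + smoothing) / p_total
        q = (q_counts.get(w, 0) + smoothing) / q_total
        dkl += p * math.log2(p / q)
    return dkl

def surprisal(word, context_counts, smoothing=1e-6):
    """-log2 P(word | context), estimated from raw co-occurrence counts."""
    total = sum(context_counts.values()) + smoothing * len(context_counts)
    return -math.log2((context_counts.get(word, 0) + smoothing) / total)

# Hypothetical particle counts for two time slices of a scientific corpus:
slice_1800s = Counter({"up": 40, "out": 25, "off": 10})
slice_1900s = Counter({"up": 90, "out": 30, "off": 35})
print(kl_divergence(slice_1900s, slice_1800s))  # divergence between periods
print(surprisal("off", slice_1900s))            # predictability of a particle
```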
Applying and Optimising a Multi-Scale Probit Model for Cross-Source Text Complexity Classification and Ranking in Swedish
Elsa Andersson | Johan Falkenjack | Arne Jönsson
We present results from using Probit models to classify and rank texts of varying complexity from multiple sources. We use multiple linguistic sources, including Swedish easy-to-read books, and investigate data augmentation and feature regularisation as optimisation methods for text complexity assessment. Multi-Scale and Single-Scale Probit models are implemented using different ratios of training data and then compared. Overall, the findings suggest that the Multi-Scale Probit model is an effective method for classifying text complexity and ranking new texts, and that it could be used to improve performance on small datasets as well as to normalise datasets labelled using different scales.
Playing by the Rules: A Benchmark Set for Standardized Icelandic Orthography
Bjarki Ármannsson | Hinrik Hafsteinsson | Jóhannes B. Sigtryggsson | Atli Jasonarson | Einar Freyr Sigurðsson | Steinþór Steingrímsson
We present the Icelandic Standardization Benchmark Set: Spelling and Punctuation (IceStaBS:SP), a dataset designed to provide standardized text examples for Icelandic orthography. The dataset includes non-standard orthography examples and their standardized counterparts, along with detailed explanations based on official Icelandic spelling rules. IceStaBS:SP aims to support the development and evaluation of automatic spell and grammar checkers, particularly in educational settings. We evaluate various spell and grammar checkers using IceStaBS:SP, demonstrating its utility as a benchmarking tool and highlighting areas for future improvement.
An Icelandic Linguistic Benchmark for Large Language Models
Bjarki Ármannsson | Finnur Ágúst Ingimundarson | Einar Freyr Sigurðsson
This paper introduces a linguistic benchmark for Icelandic-language LLMs, the first of its kind manually constructed by native speakers. We report on the scores obtained by current state-of-the-art models, which indicate room for improvement, and discuss the theoretical problems involved in creating such a benchmark and scoring a model’s performance.
Transfer-Learning German Metaphors Inspired by Second Language Acquisition
Maria Berger
A major part of work on figurative meaning prediction is based on English-language training corpora. One strategy for applying these techniques to languages other than English is transfer learning, which can correct this imbalance. However, in previous studies we learned that the bilingual representations of current transformer models are incapable of encoding the deep semantic knowledge necessary for such a transfer learning step, especially for metaphor prediction. Hence, inspired by second language acquisition, we attempt to improve German metaphor prediction in transfer learning by modifying the context windows of our input samples to align with lower readability indices, achieving an up to 13% higher F1 score.
Comparative Concepts or Descriptive Categories: a UD Case study
Matthieu Pierre Boyer | Mathieu Dehouck
In this paper, we present a series of methods used to quantify the soundness of using the same names to annotate cases in different languages. Following Martin Haspelmath’s idea that descriptive categories and comparative concepts are different objects, we look at the necessary simplification made by the Universal Dependencies project. We thus compare cases in closely related languages as belonging to commensurable descriptive categories, and then look at the corresponding underlying comparative concepts. Finally, we look at the possibility of assigning cases to adpositions.
Investigating the effectiveness of Data Augmentation and Contrastive Learning for Named Entity Recognition
Noel Chia | Ines Rehbein | Simone Paolo Ponzetto
Data Augmentation (DA) and Contrastive Learning (CL) are widely used in NLP, but their potential for NER has not yet been investigated in detail. Existing work is mostly limited to zero- and few-shot scenarios where improvements over the baseline are easy to obtain. In this paper, we address this research gap by presenting a systematic evaluation of DA for NER on small, medium-sized and large datasets with coarse and fine-grained labels. We report results for a) DA only, b) DA in combination with supervised contrastive learning, and c) DA with transfer learning. Our results show that DA on its own fails to improve results over the baseline and that supervised CL works better on larger datasets while transfer learning is beneficial if the target dataset is very small. Finally, we investigate how contrastive learning affects the learned representations, based on dimensionality reduction and visualisation techniques, and show that CL mostly helps to separate named entities from non-entities.
Comparing Human and Machine Translations of Generative Language Model Evaluation Datasets
Sander Bijl de Vroe | George Stampoulidis | Kai Hakala | Aku Rouhe | Mark van Heeswijk | Jussi Karlgren
The evaluation of Large Language Models (LLMs) is one of the crucial current challenges in the field of Natural Language Processing (NLP) and becomes even more challenging in the multilingual setting. Since the majority of the community’s benchmarks exist only in English, test sets are now being machine translated at scale into dozens of languages. This work explores the feasibility of that approach, comparing a Finnish machine translation (MT) of ARC-Challenge with a new human translated version. Our findings suggest that since absolute scores are fairly close and model size rankings are preserved, machine translation is adequate in this case. Surprisingly, however, the datasets reverse the order of base models compared to their chat-finetuned counterparts.
GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian
Aleksei Dorkin | Kairit Sirts
We present GliLem, a novel hybrid lemmatization system for Estonian that enhances the highly accurate rule-based morphological analyzer Vabamorf with an external disambiguation module based on GliNER, an open-vocabulary NER model that is able to match text spans with text labels in natural language. We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf by 10% compared to its original disambiguation module and achieve an improvement over the token classification-based baseline. To measure the impact of improvements in lemmatization accuracy on the information retrieval downstream task, we first create an information retrieval dataset for Estonian by automatically translating the DBpedia-Entity dataset from English. We benchmark several token normalization approaches, including lemmatization, on the created dataset using the BM25 algorithm. We observe a substantial improvement in IR metrics when using lemmatization over simplistic stemming. The benefits of improved lemma disambiguation accuracy manifest in a small but consistent improvement in the IR recall measure, especially in the setting of high k.
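
A minimal sketch of the BM25 benchmarking step with lemma-normalized tokens, assuming the rank_bm25 package; `lemmatize` is a hypothetical stand-in for the paper's Vabamorf+GliNER pipeline, and the documents are illustrative.

```python
from rank_bm25 import BM25Okapi

def lemmatize(text):
    # Placeholder: a real system would return disambiguated lemmas.
    return text.lower().split()

docs = [
    "Eesti Vabariik on riik Põhja-Euroopas.",
    "Tallinn on Eesti pealinn ja suurim linn.",
]
bm25 = BM25Okapi([lemmatize(d) for d in docs])

query = lemmatize("Eesti pealinn")
scores = bm25.get_scores(query)  # one BM25 score per document
best = max(zip(scores, docs))
print(best)
```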
Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
Tita Enstad | Trond Trosterud | Marie Iversdatter Røsok | Yngvil Beyer | Marie Roald
Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN’s collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN’s collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
Naome A. Etori | Arturs Kanepajs | Kevin Lu | Randu Karisa
This paper evaluates the language understanding capabilities of various large language models (LLMs) through an analysis of 112 translated and human-edited questions from the Massive Multitask Language Understanding (MMLU) dataset, focusing specifically on two underrepresented languages: Latvian and Giriama. The study compares the performance of six state-of-the-art (SOTA) models, with OpenAI’s o1-preview model demonstrating superior performance across all languages, significantly outperforming non-proprietary models in Latvian and all other models in Giriama. Human editing of automated translations from English to Latvian yielded only a small, statistically insignificant improvement in performance estimates, suggesting that machine-translated benchmarks may be sufficient for comparing model performance in languages with established digital resources like Latvian. However, automated translation to Giriama proved infeasible, and model performance in Giriama remained poor, highlighting the persistent challenges LLMs face with low-resource languages. These findings underscore the need for more comprehensive datasets and improved machine translation capabilities for underrepresented languages, while emphasizing the importance of localized benchmarks and human evaluation in addressing cultural and contextual limitations in AI models.
Better Benchmarking LLMs for Zero-Shot Dependency Parsing
Ana Ezquerro | Carlos Gómez-Rodríguez | David Vilares
While LLMs excel in zero-shot tasks, their performance in linguistic challenges like syntactic parsing has been less scrutinized. This paper studies state-of-the-art open-weight LLMs on the task by comparing them to baselines that do not have access to the input sentence, including baselines that have not been used in this context such as random projective trees or optimal linear arrangements. The results show that most of the tested LLMs cannot outperform the best uninformed baselines, with only the newest and largest versions of LLaMA doing so for most languages, and still achieving rather low performance. Thus, accurate zero-shot syntactic parsing is not forthcoming with open LLMs.
Optimizing Estonian TV Subtitles with Semi-supervised Learning and LLMs
Artem Fedorchenko | Tanel Alumäe
This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitles of close to human quality and could be extended to real-time applications.
Modeling Multilayered Complexity in Literary Texts
Pascale Feldkamp | Márton Kardos | Kristoffer Nielbo | Yuri Bizzoni
We explore the relationship between stylistic and sentimental complexity in literary texts, analyzing how they interact and affect overall complexity. Using a dataset of over 9,000 English novels (19th-20th century), we find that complexity at the stylistic/syntactic and sentiment levels tends to show a linear association. Finally, using dedicated datasets, we show that both stylistic/syntactic features – particularly those relating to information density – and sentiment features are related to text difficulty rank as well as to average processing time.
Does Preprocessing Matter? An Analysis of Acoustic Feature Importance in Deep Learning for Dialect Classification
Lea Fischbach | Caroline Kleen | Lucie Flek | Alfred Lameli
This paper examines the effect of preprocessing techniques on spoken dialect classification using raw audio data. We focus on modifying Root Mean Square (RMS) amplitude, DC offset, articulation rate (AR), pitch, and Harmonics-to-Noise Ratio (HNR) to assess their impact on model performance. Our analysis determines whether these features are important, irrelevant, or misleading for the classification task. To evaluate these effects, we use a pipeline that tests the significance of each acoustic feature through distortion and normalization techniques. While preprocessing did not directly improve classification accuracy, our findings reveal three key insights: first, deep learning models for dialect classification are generally robust to variations in the tested audio features, suggesting that normalization may not be necessary; second, articulation rate is a critical factor, directly affecting the amount of information in audio chunks; and third, intonation, specifically the pitch range, plays a vital role in dialect recognition.
Language of the Swedish Manosphere with Swedish FrameNet
Emilie Marie Carreau Francis
The manosphere is a loose group of online communities centred around the themes of anti-feminism, misogyny, and hetero-masculinity. It has gained a reputation for violent extremism, particularly from members of the incel community. Sweden sees one of the highest volumes of online traffic to well-known incel forums in all of Europe. In spite of this, there is little information on manosphere/incel culture in Swedish. This paper uses posts from Flashback’s manosphere subforum, automatically annotated with Swedish FrameNet, to analyse this language community in a Swedish context. To do so, a lexicon for the Swedish manosphere was created and terms of interest were identified in the Swedish discourse. Analysis of prominent semantic frames linked to these terms of interest presents a detailed look into the language of the Swedish manosphere.
Hotter and Colder: A New Approach to Annotating Sentiment, Emotions, and Bias in Icelandic Blog Comments
Steinunn Rut Friðriksdóttir | Dan Saattrup Nielsen | Hafsteinn Einarsson
This paper presents Hotter and Colder, a dataset designed to analyze various types of online behavior in Icelandic blog comments. Building on previous work, we used GPT-4o mini to annotate approximately 800,000 comments for 25 tasks, including sentiment analysis, emotion detection, hate speech, and group generalizations. Each comment was automatically labeled on a 5-point Likert scale. In a second annotation stage, comments with high or low probabilities of containing each examined behavior were subjected to manual revision. By leveraging crowdworkers to refine these automatically labeled comments, we ensure the quality and accuracy of our dataset, resulting in 12,232 uniquely annotated comments and 19,301 annotations. Hotter and Colder provides an essential resource for advancing research in content moderation and in automatically detecting harmful online behaviors in Icelandic. We release both the dataset and the annotation interface.
Towards large-scale speech foundation models for a low-resource minority language
Yaroslav Getman | Tamás Grósz | Katri Hiovain-Asikainen | Tommi Lehtonen | Mikko Kurimo
Modern ASR systems require massive amounts of training data. While ASR training data for most languages are scarce and expensive to transcribe, a practical solution is to collect huge amounts of raw untranscribed speech and pre-train the ASR model in a self-supervised manner. Unfortunately, for many low-resource minority languages, even untranscribed speech data are scarce. In this paper, we propose a solution for the Northern Sámi language with 22,400 hours of speech extracted from the Finnish radio and television archives. We evaluated the model performance with different decoding algorithms and examined the models’ internal behavior with interpretation-based techniques.
OpusDistillery: A Configurable End-to-End Pipeline for Systematic Multilingual Distillation of Open NMT Models
Ona de Gibert | Tommi Nieminen | Yves Scherrer | Jörg Tiedemann
In this work, we introduce OpusDistillery, a novel framework to streamline the Knowledge Distillation (KD) process of multilingual NMT models. OpusDistillery’s main features are the integration of openly available teacher models from OPUS-MT and Hugging Face, comprehensive multilingual support and robust GPU utilization tracking. We describe the tool in detail and discuss the individual contributions of its pipeline components, demonstrating its flexibility for different use cases. OpusDistillery is open-source and released under a permissive license, aiming to facilitate further research and development in the field of multilingual KD for any sequence-to-sequence task. Our code is available at https://github.com/Helsinki-NLP/OpusDistillery.
Mind the Gap: Diverse NMT Models for Resource-Constrained Environments
Ona de Gibert | Dayyán O’Brien | Dušan Variš | Jörg Tiedemann
We present fast Neural Machine Translation models for 17 diverse languages, developed using Sequence-level Knowledge Distillation. Our selected languages span multiple language families and scripts, including low-resource languages. The distilled models achieve comparable performance while being 10 times faster than transformer-base and 35 times faster than transformer-big architectures. Our experiments reveal that teacher model quality and capacity, as well as the language script, strongly influence distillation success. We also explore the effectiveness of multilingual students. We publicly release our code and models in our GitHub repository: anonymised.
Testing relevant linguistic features in automatic CEFR skill level classification for Icelandic
Isidora Glišić | Caitlin Laura Richter | Anton Karl Ingason
This paper explores the use of various linguistic features to develop models for automatic classification of language proficiency on the CEFR scale for Icelandic, a low-resourced and morphologically complex language. We train two classifiers to assess the skill level of learner texts. One is used as a baseline: it takes in the original, unaltered text written by a learner and uses predominantly surface features to assess the level. The other uses surface, morphological, and lexical features, as well as context vectors from a transformer model (IceBERT). It takes in both the original and corrected versions of the text and takes into account errors/deviations of the original texts compared to the corrected versions. Both classifiers show promising results, with the baseline models achieving between 62.2% and 67.1% accuracy and the dual-version models between 75% and 80.3%.
MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling
Rob van der Goot | Anette Jensen | Emil Allerslev Schledermann | Mikkel Wildner Kildeberg | Nicolaj Larsen | Mike Zhang | Elisa Bassignana
Current language models (LMs) mostly exploit subwords as input units, based on statistical co-occurrences of characters. Relatedly, previous work has shown that modeling morphemes can aid performance for Natural Language Processing (NLP) models. However, morphemes are challenging to obtain, as there is no annotated data in most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a range of unsupervised token segmenters and evaluate the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an F1 of 57.6 compared to 68.0 for an unsupervised morphological segmenter (Morfessor). Furthermore, we evaluate a range of segmenters on the task of language modeling.
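
A minimal sketch of the kind of unsupervised Morfessor baseline mentioned above, following the Morfessor 2.0 Python API; the corpus file and the example compound are illustrative, and exact signatures may differ between Morfessor versions.

```python
import morfessor

io = morfessor.MorfessorIO()
# train.txt: a hypothetical plain-text corpus, whitespace-separated tokens.
train_data = list(io.read_corpus_file("train.txt"))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()  # unsupervised training by minimising coding cost

# Segment an unseen Danish compound into morph-like units.
segments, cost = model.viterbi_segment("solskinsvejr")
print(segments)  # output depends entirely on the training corpus
```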
Opinion Units: Concise and Contextualized Representations for Aspect-Based Sentiment Analysis
Emil Häglund | Johanna Björklund
We introduce opinion units, a contribution to the field of Aspect-Based Sentiment Analysis (ABSA) that extends aspect-sentiment pairs by including substantiating excerpts, derived through hybrid abstractive-extractive summarisation. The goal is to provide fine-grained information without sacrificing succinctness and abstraction. Evaluations on review datasets demonstrate that large language models (LLMs) can accurately extract opinion units through few-shot learning. The main types of errors are providing incomplete contexts for opinions and mischaracterising objective statements as opinions. The method reduces the need for labelled data and allows the LLM to dynamically define aspect types. As a practical evaluation, we present a case study on similarity search across academic datasets and public review data. The results indicate that searches leveraging opinion units are more successful than those relying on traditional data-segmentation strategies, showing robustness across datasets and embeddings.
Aligning Language Models for Icelandic Legal Text Summarization
Þórir Hrafn Harðarson | Hrafn Loftsson | Stefán Ólafsson
The integration of language models in the legal domain holds considerable promise for streamlining processes and improving efficiency in managing extensive workloads. However, the specialized terminology, nuanced language, and formal style of legal texts can present substantial challenges. This study examines whether preference-based training techniques, specifically Reinforcement Learning from Human Feedback and Direct Preference Optimization, can enhance models’ performance in generating Icelandic legal summaries that align with domain-specific language standards and user preferences. We compare models fine-tuned with preference training to those using conventional supervised learning. Results indicate that preference training improves the legal accuracy of generated summaries over standard fine-tuning but does not significantly enhance the overall quality of Icelandic language usage. Discrepancies between automated metrics and human evaluations further underscore the importance of qualitative assessment in developing language models for the legal domain.
Question-parsing with Abstract Meaning Representation enhanced by adding small datasets
Johannes Heinecke | Maria Boritchev | Frédéric Herledan
Abstract Meaning Representation (AMR) is a graph-based formalism for representing meaning in sentences. As the annotation is quite complex, few annotated corpora exist. The most well-known and widely used corpora are LDC’s AMR 3.0 and the datasets available on the new AMR website. Models trained on the LDC corpora work well on texts of similar genre and style, such as sentences extracted from news or Wikipedia articles. However, other types of texts, in particular questions, are less well processed by models trained on this data. We analyse how adding a few sentence-type-specific annotations can steer the model to improve the parsing of questions in English.
FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
Erik Henriksson | Otto Tarkka | Filip Ginter
Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
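
A minimal sketch of how such a line-level filter can be applied at scale, assuming the Hugging Face transformers pipeline API; the model name, label, and threshold are hypothetical, not the released FinerWeb classifier.

```python
from transformers import pipeline

clf = pipeline("text-classification", model="my-org/line-quality-classifier")

def filter_document(text, keep_label="Clean", threshold=0.5):
    """Keep only lines the classifier labels as clean with enough confidence."""
    kept = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pred = clf(line, truncation=True)[0]  # {'label': ..., 'score': ...}
        if pred["label"] == keep_label and pred["score"] >= threshold:
            kept.append(line)
    return "\n".join(kept)
```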
Margins in Contrastive Learning: Evaluating Multi-task Retrieval for Sentence Embeddings
Tollef Emil Jørgensen | Jens Breitung
This paper explores retrieval with sentence embeddings by fine-tuning sentence-transformer models for classification while preserving their ability to capture semantic similarity. To evaluate this balance, we introduce two opposing metrics – polarity score and semantic similarity score – that measure the model’s capacity to separate classes and retain semantic relationships between sentences. We propose a system that augments supervised datasets with contrastive pairs and triplets, training models under various configurations and evaluating their performance on top-k sentence retrieval. Experiments on two binary classification tasks demonstrate that reducing the margin parameter of loss functions greatly mitigates the trade-off between the metrics. These findings suggest that a single fine-tuned model can effectively handle joint classification and retrieval tasks, particularly in low-resource settings, without relying on multiple specialized models.
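
A minimal sketch of fine-tuning a sentence-transformer with a triplet loss whose margin can be reduced, using sentence-transformers' classic `fit()` API; the model, data, and margin value are illustrative, not the paper's configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
train_examples = [  # (anchor, positive, negative) triplets
    InputExample(texts=["great service", "friendly staff", "terrible food"]),
    InputExample(texts=["awful experience", "never again", "loved the place"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# A smaller triplet_margin separates classes less aggressively, which
# preserves more of the embedding space's semantic-similarity geometry.
loss = losses.TripletLoss(model=model, triplet_margin=0.3)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```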
Database of Latvian Morphemes and Derivational Models: ideas and expected results
Andra Kalnača | Tatjana Pakalne | Kristīne Levāne-Petrova
In this paper, we describe “The Database of Latvian Morphemes and Derivational Models”, a large-scale, corpus-based and manually validated database of Latvian derivational morphology currently in development at the University of Latvia. The database contains morpheme-level data – morphemes, incl. morpheme variants (allomorphs), morpheme types, morpheme homonymy/homography resolution, hierarchical relations between root morphemes, and links to word families – and lemma-level data – incl. base form, morphemic segmentation, POS, grammatical features, derivational motivation (incl. compounding), and word-family membership. The focus of the database is on providing linguistically accurate, comprehensive data as a reliable basis for future work in different fields.
Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States
Jurgita Kapočiūtė-Dzikienė | Toms Bergmanis | Mārcis Pinnis
Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defense, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight large language models support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.
How Aunt-Like Are You? Exploring Gender Bias in the Genderless Estonian Language: A Case Study
Elisabeth Kaukonen | Ahmed Sabir | Rajesh Sharma
This paper examines gender bias in Estonian, a grammatically genderless Finno-Ugric language that has neither a gendered noun system nor gendered pronouns, but expresses gender through vocabulary. In this work, we focus on male-female compound words ending in -tädi ‘aunt’ and -onu ‘uncle’, aiming to pinpoint the occupations these words signify for women and men and to examine whether they reveal occupational differentiation and gender stereotypes. The findings indicate that these compounds go beyond occupational titles and highlight prevalent gender bias.
Estonian isolated-word text-to-speech synthesiser
Indrek Kiissel | Liisi Piits | Heete Sahkai | Indrek Hein | Liis Ermus | Meelis Mihkla
This paper presents the development and evaluation of an Estonian isolated-word text-to-speech (TTS) synthesiser. Unlike conventional TTS systems that convert continuous text into speech, this system focuses on the synthesis of isolated words, which is crucial for applications such as pronunciation training, speech therapy, and (learners’) dictionaries. The system addresses two key challenges: generating natural prosody for isolated words and context-free disambiguation of homographs. We conducted a perception test to evaluate the performance of the TTS system in terms of pronunciation accuracy. We used 16 pairs of homographs that differ in palatalisation and 16 pairs of homographs that differ in quantity. Given that all the test items were correctly recognised by a majority of the evaluators, the performance of the synthesiser can be considered very good.
BiaSWE: An Expert Annotated Dataset for Misogyny Detection in Swedish
Kätriin Kukk | Danila Petrelli | Judit Casademont | Eric J. W. Orlowski | Michal Dzielinski | Maria Jacobson
In this study, we introduce the process for creating BiaSWE, an expert-annotated dataset tailored for misogyny detection in the Swedish language. To address the cultural and linguistic specificity of misogyny in Swedish, we collaborated with experts from the social sciences and humanities. Our interdisciplinary team developed a rigorous annotation process, incorporating both domain knowledge and language expertise, to capture the nuances of misogyny in a Swedish context. This methodology ensures that the dataset is not only culturally relevant but also aligned with broader efforts in bias detection for low-resource languages. The dataset, along with the annotation guidelines, is publicly available for further research.
Predictability of Microsyntactic Units across Slavic Languages: A translation-based Study
Maria Kunilovskaya | Iuliia Zaitova | Wei Xue | Irina Stenger | Tania Avgustinova
The paper presents the results of a free translation experiment, which was set up to explore Slavic cross-language intelligibility. In the experiment, native speakers of Russian were asked to read a sentence in one of the five Slavic languages and return a Russian translation of a highlighted item. The experiment is focused on microsyntactic units because they offer an increased intercomprehension difficulty due to opaque semantics. Each language is represented by at least 50 stimuli, and each stimulus has generated at least 20 responses. The levels of intercomprehension are captured by categorising participants’ responses into seven types of translation solutions (paraphrase, correct, fluent_literal, awkward_literal, fantasy, noise, and empty), generally reflecting the level of the cross-linguistic intelligibility of the stimuli. The study aims to reveal linguistic factors that favour intercomprehension across Slavic languages. We use regression and correlation analysis to identify the most important intercomprehension predictors and statistical analysis to bring up the most typical cases and outliers. We explore several feature types that reflect the properties of the translation tasks and their outcomes, including point-wise phonological and orthographic distances, cosine similarities, surprisals, translation quality scores and translation solution entropy indices. The experimental data confirms the expected gradual increase of intelligibility from West-Slavic to East-Slavic languages for the speakers of Russian. We show that intelligibility is highly contingent on the ability of speakers to recognise and interpret formal similarities between languages as well as on the size of these similarities. For several Slavic languages, the context sentence complexity was a significant predictor of intelligibility.
Train More Parameters But Mind Their Placement: Insights into Language Adaptation with PEFT
Jenny Kunz
Smaller LLMs still face significant challenges even in medium-resourced languages, particularly when it comes to language-specific knowledge – a problem not easily resolved with machine-translated data. In this case study on Icelandic, we aim to enhance the generation performance of an LLM by specialising it using unstructured text corpora. A key focus is on preventing interference with the model’s ability to handle longer contexts during this adaptation. Through ablation studies using various parameter-efficient fine-tuning (PEFT) methods and setups, we find that increasing the number of trainable parameters leads to better and more robust language adaptation. LoRAs placed in the feed-forward layers and bottleneck adapters show promising results with sufficient parameters, while prefix tuning and (IA)³ are not suitable. Although improvements are consistent in 0-shot summarisation, some adapted models struggle with longer context lengths, an issue that can be mitigated by adapting only the final layers.
SweSAT-1.0: The Swedish University Entrance Exam as a Benchmark for Large Language Models
Murathan Kurfalı | Shorouq Zahra | Evangelia Gogoulou | Luise Dürlich | Fredrik Carlsson | Joakim Nivre
This paper introduces SweSAT-1.0, a new benchmark dataset created from the Swedish university entrance exam (Högskoleprovet) to assess large language models in Swedish. The current version of the benchmark includes 867 questions across six different tasks, including reading comprehension, mathematical problem solving, and logical reasoning. We find that some widely used open-source and commercial models excel in verbal tasks, but we also see that all models, even the commercial ones, struggle with reasoning tasks in Swedish. We hope that SweSAT-1.0 will facilitate research on large language models for Swedish by enriching the breadth of available tasks, offering a challenging evaluation benchmark that is free from any translation biases.
How Well do LLMs know Finno-Ugric Languages? A Systematic Assessment
Hele-Andra Kuulmets | Taido Purason | Mark Fishel
We present a systematic evaluation of the multilingual capabilities of open large language models (LLMs), specifically focusing on five Finno-Ugric (FiU) languages. Our investigation covers multiple prompting strategies across several benchmarks and reveals that Llama-2 7B and Llama-2 13B perform weakly on most FiU languages. In contrast, Llama 3.1 models show impressive improvements, even for extremely low-resource languages such as Võro and Komi, indicating successful cross-lingual knowledge transfer inside the models. Finally, we show that stronger base models outperform weaker, language-adapted models, thus emphasizing the importance of the base model in successful language adaptation.
Mapping Faroese in the Multilingual Representation Space: Insights for ASR Model Optimization
Dávid í Lág | Barbara Scalvini | Jon Gudnason
ASR development for low-resource languages like Faroese faces significant challenges due to the scarcity of large, diverse datasets. While fine-tuning multilingual models using related languages is a common practice, there is no standardized method for selecting these auxiliary languages, leading to a computationally expensive trial-and-error process. By analyzing Faroese’s positioning among other languages in wav2vec2’s multilingual representation space, we find that Faroese’s closest neighbors are influenced not only by linguistic similarity but also by historical, phonetic, and cultural factors. These findings open new avenues for auxiliary language selection to improve Faroese ASR and underscore the potential value of data-driven factors in ASR fine-tuning.
Towards a Derivational Semantics Resource for Latvian
Ilze Lokmane | Mikus Grasmanis | Agute Klints | Gunta Nešpore-Bērzkalne | Pēteris Paikens | Lauma Pretkalniņa | Laura Rituma | Madara Stāde | Evelīna Tauriņa
In this paper we describe the implementation of the first structured resource of semantic derivational links for Latvian, basing it on the largest online dictionary Tēzaurs.lv and linking it to the Latvian WordNet. We separate two kinds of derivational links: semantic derivation links between senses and morphological derivation links between lexemes. The semantic links between senses are defined as a pair of semantic labels assigned to both ends of the link. The process of semantic linking involves revising the sense inventory of both the base word and the derivative, defining semantic labels for lexemes of four basic word classes – nouns, verbs, adjectives and adverbs, and adding the appropriate labels to the corresponding senses. We exemplify our findings with a detailed representation of sense relations between a base verb and its nominal derivatives.
Poro 34B and the Blessing of Multilinguality
Risto Luukkonen | Jonathan Burdge | Elaine Zosa | Aarne Talman | Ville Komulainen | Väinö Hatanpää | Peter Sarlin | Sampo Pyysalo
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than is available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
Can summarization approximate simplification? A gold standard comparison
Giacomo Magnifico | Eduard Barbu
This study explores the overlap between text summarization and simplification outputs. While summarization evaluation methods are streamlined, simplification lacks cohesion, prompting the question: how closely can abstractive summarization resemble gold-standard simplification? We address this by applying two BART-based BRIO summarization methods to the Newsela corpus, comparing outputs with manually annotated simplifications and achieving a top ROUGE-L score of 0.654. This provides insight into where summarization and simplification outputs converge and differ.
A Comparative Study of PEFT Methods for Python Code Generation
Johanna Männistö | Joseph Attieh | Jörg Tiedemann
Fine-tuning language models incurs high costs in training, inference and storage. Parameter-efficient fine-tuning (PEFT) methods have emerged as a more cost-effective alternative to full fine-tuning. However, limited work has compared different PEFT approaches for tasks like code generation. In this study, we examine the effect of various PEFT training methods on model performance in the task of Python code generation. We fine-tune four model families, ranging from 124M to 7B parameters, using three PEFT approaches alongside standard full fine-tuning. Our findings reveal that the effectiveness of each PEFT method varies with the model size and the corpus used.
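
A minimal sketch of one of the compared PEFT families (LoRA) applied to a small causal LM via the Hugging Face peft library; the base model and hyperparameters are illustrative, not the paper's exact setup.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # ~124M parameters

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```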
A Collection of Question Answering Datasets for Norwegian
Vladislav Mikhailov | Petter Mæhlum | Victoria Ovedie Chruickshank Langø | Erik Velldal | Lilja Øvrelid
This paper introduces a new suite of question answering datasets for Norwegian: NorOpenBookQA, NorCommonSenseQA, NorTruthfulQA, and NRK-Quiz-QA. The data covers a wide range of skills and knowledge domains, including world knowledge, commonsense reasoning, truthfulness, and knowledge about Norway. Covering both of the written standards of Norwegian – Bokmål and Nynorsk – our datasets comprise over 10k question-answer pairs, created by native speakers. We detail our dataset creation approach and present the results of evaluating 11 language models (LMs) in zero- and few-shot regimes. Most LMs perform better in Bokmål than Nynorsk, struggle most with commonsense reasoning, and are often untruthful in generating answers to questions. All our datasets and annotation materials are publicly available.
Incorporating Target Fuzzy Matches into Neural Fuzzy Repair
Tommi Nieminen | Jörg Tiedemann | Sami Virpioja
Neural fuzzy repair (NFR) is a simple implementation of retrieval-augmented translation (RAT) based on data augmentation. In NFR, a translation database is searched for translation examples where the source sentence is similar to the sentence being translated, and the target side of the example is concatenated with the source sentence. We experiment with introducing retrieval based on target similarity into NFR during training. The results of our experiments confirm that including target-similarity matches during training supplements source-similarity matches and leads to better translations at translation time.
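
A minimal sketch of the NFR-style input augmentation described above: look up the most similar source sentence in a translation memory and concatenate the target side of the match to the input. The similarity measure, separator token, and memory contents are illustrative.

```python
from difflib import SequenceMatcher

translation_memory = [  # hypothetical (source, target) pairs
    ("hyvää huomenta kaikille", "good morning everyone"),
    ("kiitos paljon avusta", "thank you very much for the help"),
]

def fuzzy_augment(source, memory, threshold=0.5, sep=" ||| "):
    """Return 'source ||| fuzzy-match target' if a close enough match exists."""
    best_sim, best_tgt = 0.0, None
    for mem_src, mem_tgt in memory:
        sim = SequenceMatcher(None, source, mem_src).ratio()
        if sim > best_sim:
            best_sim, best_tgt = sim, mem_tgt
    if best_tgt is not None and best_sim >= threshold:
        return source + sep + best_tgt
    return source  # no fuzzy match: train on the plain source sentence

print(fuzzy_augment("hyvää huomenta", translation_memory))
```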
Constructions and Strategies in Universal Dependencies
Joakim Nivre
Is the framework of Universal Dependencies (UD) compatible with findings from linguistic typology? One way to find out is to investigate whether UD can adequately represent constructions of the world’s languages, as described in William Croft’s recent book Morphosyntax. This paper discusses how such an investigation could be carried out and why it would be useful.
Finnish SQuAD: A Simple Approach to Machine Translation of Span Annotations
Emil Nuutinen | Iiro Rastas | Filip Ginter
We apply a simple method to machine translate datasets with span-level annotation using the DeepL MT service and its ability to translate formatted documents. Using this method, we produce a Finnish version of the SQuAD2.0 question answering dataset and train QA retriever models on this new dataset. We evaluate the quality of the dataset, and more generally the MT method, through direct evaluation, indirect comparison to other similar datasets, a backtranslation experiment, as well as through the performance of downstream trained QA models. In all these evaluations, we find that the method of transfer is not only simple to use but also produces consistently better translated data. Given its good performance on the SQuAD dataset, the method can likely be used to translate other similar span-annotated datasets for other tasks and languages as well. All code and data are available under an open license: data at HuggingFace TurkuNLP/squad_v2_fi, code on GitHub TurkuNLP/squad2-fi, and model at HuggingFace TurkuNLP/bert-base-finnish-cased-squad2.
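
A minimal sketch of the general idea of carrying a span annotation through formatted-document translation: wrap the span in inline tags, translate the tagged text, then recover the span from the tags. `translate_html` stands in for a document-level MT service such as DeepL's and is hypothetical.

```python
import re

def mark_span(context, start, end, tag="b"):
    """Wrap the character span [start, end) in an inline tag."""
    return (context[:start] + f"<{tag}>" + context[start:end]
            + f"</{tag}>" + context[end:])

def recover_span(translated, tag="b"):
    """Strip the tags and return the plain text plus the new span offsets."""
    m = re.search(f"<{tag}>(.*?)</{tag}>", translated, flags=re.DOTALL)
    if m is None:
        return None, None  # tags lost in translation: discard the example
    plain = re.sub(f"</?{tag}>", "", translated)
    start = m.start(1) - len(f"<{tag}>")
    return plain, (start, start + len(m.group(1)))

marked = mark_span("The capital of Finland is Helsinki.", 26, 34)
# translated = translate_html(marked, target_lang="FI")  # hypothetical MT call
translated = "Suomen pääkaupunki on <b>Helsinki</b>."
print(recover_span(translated))
```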
How to Tune a Multilingual Encoder Model for Germanic Languages: A Study of PEFT, Full Fine-Tuning, and Language Adapters
Romina Oji | Jenny Kunz
This paper investigates the optimal use of the multilingual encoder model mDeBERTa for tasks in three Germanic languages – German, Swedish, and Icelandic – representing varying levels of presence and likely data quality in mDeBERTa’s pre-training data. We compare full fine-tuning with the parameter-efficient fine-tuning (PEFT) methods LoRA and Pfeiffer bottleneck adapters, finding that PEFT is more effective for the higher-resource language, German. However, results for Swedish and Icelandic are less consistent. We also observe differences between tasks: while PEFT tends to work better for question answering, full fine-tuning is preferable for named entity recognition. Inspired by previous research on modular approaches that combine task and language adapters, we evaluate the impact of adding PEFT modules trained on unstructured text, finding that this approach is not beneficial.
Match ’em: Multi-Tiered Alignment for Error Analysis in ASR
Phoebe Parsons | Knut Kvale | Torbjørn Svendsen | Giampiero Salvi
We introduce “Match ’em”: a new framework for aligning output from automatic speech recognition (ASR) with reference transcriptions. This allows a more detailed analysis of errors produced by end-to-end ASR systems compared to word error rate (WER). Match ’em performs the alignment on both the word and character level, each relying on information from the other to provide the most meaningful global alignment. At the character level, we define a speech-production-motivated character similarity metric. At the word level, we rely on character similarities to define word similarity and, additionally, we reconcile compounding (insertion or deletion of spaces). We evaluated Match ’em on transcripts of three European languages produced by wav2vec2 and Whisper. We show that Match ’em results in more similar word substitution pairs and that compound reconciling can capture a broad range of spacing errors. We believe Match ’em to be a valuable tool for ASR error analysis across many languages.
Adding Metadata to Existing Parliamentary Speech Corpus
Phoebe Parsons | Per Erik Solberg | Knut Kvale | Torbjørn Svendsen | Giampiero Salvi
Parliamentary proceedings are convenient data sources for creating corpora for speech technology. Given its public nature, there is an abundance of extra information about the speakers that can be legally and ethically harvested to enrich this kind of corpora. This paper describes the methods we have used to add speaker metadata to the Stortinget Speech Corpus (SSC) containing over 5,000 hours of Norwegian speech with non-verbatim transcripts but without speaker metadata. The additional metadata for each speech segment includes speaker ID, gender, date of birth, municipality of birth, and counties represented. We also infer speaker dialect from their municipality of birth using a manually designed mapping between municipalities and Norwegian dialects. We provide observations on the SSC data and give suggestions for how it may be used for tasks other than speech recognition. Finally, we demonstrate the utility of this new metadata through a dialect identification task. The described methods can be adapted to add metadata information to parliamentary corpora in other languages.
Paragraph-Level Machine Translation for Low-Resource Finno-Ugric Languages
Dmytro Pashchenko | Lisa Yankovskaya | Mark Fishel
We develop paragraph-level machine translation for four low-resource Finno-Ugric languages: Proper Karelian, Livvi, Ludian, and Veps. The approach is based on sentence-level pre-trained translation models, which are fine-tuned with paragraph-parallel data. This allows the resulting model to develop a native ability to handle discourse-level phenomena correctly, in particular translating from grammatically gender-neutral input in Finno-Ugric languages. We collect monolingual and parallel paragraph-level corpora for these languages. Our experiments show that paragraph-level translation models can translate sentences no worse than sentence-level systems, while handling discourse-level phenomena better. For evaluation, we manually translate part of FLORES-200 into these four languages. All our results, data, and models are released openly.
Evaluating LLM-Generated Explanations of Metaphors – A Culture-Sensitive Study of Danish
Bolette S. Pedersen | Nathalie Sørensen | Sanni Nimb | Dorte Haltrup Hansen | Sussi Olsen | Ali Al-Laith
In this study, we examine how well Danish culture-specific metaphors are explained by two of the best-performing language models for Danish, namely ChatGPT and Llama. For comparison, the explanations are measured against how well cross-lingual (or ‘universal’) metaphors are explained by the models; referring here to metaphors that exist in Danish as well as across cultures and languages, and in particular in English. To perform our study, we compile a pilot dataset of 150 Danish metaphors and idioms, divided tentatively by culture specificity. We prompt the two models and perform a careful qualitative evaluation of the explanations against a four-graded scale. Our studies show that both models are heavily biased towards English, since they have much more success in explaining the metaphors that also exist in English than the culture-specific ones, relying presumably on erroneous transfer from English when dealing with the latter. In particular, the sentiment of the culture-specific metaphors often seems to be ‘lost in translation’. We further claim that this strong colouring towards English poses a serious problem in the era of LLMs with regard to developing and maintaining cultural and linguistic diversity in other languages.
Tokenization on Trial: The Case of Kalaallisut–Danish Legal Machine Translation
Esther Ploeger | Paola Saucedo | Johannes Bjerva | Ross Deans Kristensen-McLachlan | Heather Lent
The strengths of subword tokenization have been widely demonstrated when applied to higher-resourced, morphologically simple languages. However, it is not self-evident that these results transfer to lower-resourced, morphologically complex languages. In this work, we investigate the influence of different subword segmentation techniques on machine translation between Danish and Kalaallisut, the official language of Greenland. We present the first semi-manually aligned parallel corpus for this language pair, and use it to compare subwords from unsupervised tokenizers and morphological segmenters. We find that Unigram-based segmentation both preserves morphological boundaries and handles out-of-vocabulary words adequately, but that this does not directly correspond to superior translation quality. We hope that our findings lay further groundwork for future efforts in neural machine translation for Kalaallisut.
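
A minimal sketch of training and applying one of the unsupervised segmenters involved in such a comparison, a Unigram model, with the sentencepiece package; the corpus file, vocabulary size, and example word are illustrative.

```python
import sentencepiece as spm

# Train a Unigram LM tokenizer on a hypothetical plain-text corpus
# (one sentence per line).
spm.SentencePieceTrainer.train(
    input="kalaallisut.txt",
    model_prefix="kal_unigram",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="kal_unigram.model")
print(sp.encode("nuummiippunga", out_type=str))  # subword pieces
```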
The Roles of English in Evaluating Multilingual Language Models
Wessel Poelman | Miryam de Lhoneux
Multilingual natural language processing is getting increased attention, with numerous models, benchmarks, and methods being released for many languages. English is often used in multilingual evaluation to prompt language models (LMs), mainly to overcome the lack of instruction tuning data in other languages. In this position paper, we lay out two roles of English in multilingual LM evaluations: as an interface and as a natural language. We argue that these roles have different goals: task performance versus language understanding. This discrepancy is highlighted with examples from datasets and evaluation setups. Numerous works explicitly use English as an interface to boost task performance. We recommend to move away from these imprecise methods and instead focus on language understanding.
Revisiting Projection-based Data Transfer for Cross-Lingual Named Entity Recognition in Low-Resource Languages
Andrei Politov | Oleh Shkalikov | Rene Jäkel | Michael Färber
Cross-lingual Named Entity Recognition (NER) leverages knowledge transfer between languages to identify and classify named entities, making it particularly useful for low-resource languages. We show that the data-based cross-lingual transfer method is an effective technique for cross-lingual NER and can outperform multilingual language models for low-resource languages. This paper introduces two key enhancements to the annotation projection step in cross-lingual NER for low-resource languages. First, we explore refining word alignments using back-translation to improve accuracy. Second, we present a novel formalized projection approach of matching source entities with extracted target candidates. Through extensive experiments on two datasets spanning 57 languages, we demonstrate that our approach surpasses existing projection-based methods in low-resource settings. These findings highlight the robustness of projection-based data transfer as an alternative to model-based methods for cross-lingual named entity recognition in low-resource languages.
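
A minimal sketch of the basic annotation-projection step that this line of work refines: copying source-side BIO labels to target tokens through word alignments. Real systems additionally repair broken spans and, as the paper proposes, match source entities against extracted target candidates; the example is illustrative.

```python
def project_entities(src_labels, alignments, n_tgt):
    """src_labels: one BIO tag per source token; alignments: (src_i, tgt_j) pairs."""
    tgt_labels = ["O"] * n_tgt
    for src_i, tgt_j in sorted(alignments, key=lambda pair: pair[1]):
        if src_labels[src_i] != "O":
            tgt_labels[tgt_j] = src_labels[src_i]
    return tgt_labels

# "Angela Merkel visited Paris" -> "Angela Merkel besuchte Paris"
src_tags = ["B-PER", "I-PER", "O", "B-LOC"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(project_entities(src_tags, alignment, 4))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```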
Empathy vs Neutrality: Designing and Evaluating a Natural Chatbot for the Healthcare Domain
Cristina Reguera-Gómez | Denis Paperno | Maaike H. T. de Boer
As lifestyle-related diseases rise due to unhealthy habits such as smoking, poor diet, lack of exercise, and alcohol consumption, the role of Conversational AI in healthcare is increasingly significant. We provide an empirical study of the design and evaluation of a natural and intuitive healthcare chatbot, specifically focusing on the impact of empathetic responses on user experience regarding lifestyle changes. Findings reveal a strong preference for the empathetic chatbot, with results showing statistical significance (p < 0.001), highlighting the importance of empathy in enhancing user interaction with healthcare chatbots.
Assessed and Annotated Vowel Lengths in Spoken Icelandic Sentences for L1 and L2 Speakers: A Resource for Pronunciation Training
Caitlin Laura Richter | Kolbrún Friðriksdóttir | Kormákur Logi Bergsson | Erik Anders Maher | Ragnheiður María Benediktsdóttir | Jon Gudnason
We introduce a dataset of time-aligned phonetic transcriptions focusing on vowel length (quantity) in Icelandic. Ultimately, this aims to support computer assisted pronunciation training (CAPT) software, to automatically assess length and possible errors in Icelandic learners’ pronunciations. The dataset contains a range of long and short vowel targets, including the first acoustic description of quantity in non-native Icelandic. Evaluations assess how manual annotations and automatic forced alignment characterise quantity contrasts. Initial analyses also imply partial acquisition of phonologically conditioned quantity alternations by non-native speakers.
pdf
bib
abs
The BRAGE Benchmark: Evaluating Zero-shot Learning Capabilities of Large Language Models for Norwegian Customer Service Dialogues
Mike Riess
|
Tollef Emil Jørgensen
This study explores the capabilities of open-weight Large Language Models in a zero-shot learning setting, testing their ability to classify the content of customer service dialogues in Norwegian from a single instruction, named the BRAGE benchmark. By comparing results against widely used downstream tasks such as question answering and named entity recognition, we find that (1) instruction-tuned models greatly exceed base models on the benchmark, (2) both English and multilingual instruction models outperform the tested Norwegian models of similar sizes, and (3) the difference between base and instruction models is less pronounced than in other generative tasks, suggesting that BRAGE is a challenging benchmark requiring precise and generalizable instruction-tuning.
pdf
bib
abs
Mixed Feelings: Cross-Domain Sentiment Classification of Patient Feedback
Egil Rønningstad
|
Lilja Charlotte Storset
|
Petter Mæhlum
|
Lilja Øvrelid
|
Erik Velldal
Sentiment analysis of patient feedback from the public health domain can aid decision makers in evaluating the provided services. The current paper focuses on free-text comments in patient surveys about general practitioners and psychiatric healthcare, annotated with four sentence-level polarity classes (positive, negative, mixed, and neutral), while also attempting to alleviate data scarcity by leveraging general-domain sources in the form of reviews. For several different architectures, we compare in-domain and out-of-domain effects, as well as the effects of training joint multi-domain models.
pdf
bib
abs
The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective
Javier de la Rosa
|
Vladislav Mikhailov
|
Lemei Zhang
|
Freddy Wetjen
|
David Samuel
|
Peng Liu
|
Rolv-Arild Braaten
|
Petter Mæhlum
|
Magnus Breder Birkenes
|
Andrey Kutuzov
|
Tita Enstad
|
Hans Christian Farsethås
|
Svein Arne Brygfjeld
|
Jon Atle Gulla
|
Stephan Oepen
|
Erik Velldal
|
Wilfred Østgulen
|
Lilja Øvrelid
|
Aslak Sira Myhre
The use of copyrighted materials in training language models raises critical legal and ethical questions. This paper presents a framework for, and the results of, empirically assessing the impact of publisher-controlled copyrighted corpora on the performance of generative large language models (LLMs) for Norwegian. Evaluating on a diverse set of tasks, we find that adding both books and newspapers to the data mixture of LLMs tends to improve their performance, while the addition of fiction works seems to be detrimental. Our experiments could inform the creation of a compensation scheme for authors whose works contribute to AI development.
pdf
bib
abs
Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks
Dan Saattrup Nielsen
|
Kenneth Enevoldsen
|
Peter Schneider-Kamp
This paper explores the performance of encoder and decoder language models on multilingual Natural Language Understanding (NLU) tasks, with a broad focus on Germanic languages. Building upon the ScandEval benchmark, initially restricted to evaluating encoder models, we extend the evaluation framework to include decoder models. We introduce a method for evaluating decoder models on NLU tasks and apply it to the languages Danish, Swedish, Norwegian, Icelandic, Faroese, German, Dutch, and English. Through a series of experiments and analyses, we also address research questions regarding the comparative performance of encoder and decoder models, the impact of NLU task types, and the variation across language resources. Our findings reveal that encoder models can achieve significantly better NLU performance than decoder models despite having orders of magnitude fewer parameters. Additionally, we investigate the correlation between decoders and task performance via a UMAP analysis, shedding light on the unique capabilities of decoder and encoder models. This study contributes to a deeper understanding of language model paradigms in NLU tasks and provides valuable insights for model selection and evaluation in multilingual settings.
pdf
bib
abs
Small Languages, Big Models: A Study of Continual Training on Languages of Norway
David Samuel
|
Vladislav Mikhailov
|
Erik Velldal
|
Lilja Øvrelid
|
Lucas Georges Gabriel Charpentier
|
Andrey Kutuzov
|
Stephan Oepen
Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves downstream performance as well as inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
pdf
bib
abs
Rethinking Low-Resource MT: The Surprising Effectiveness of Fine-Tuned Multilingual Models in the LLM Age
Barbara Scalvini
|
Iben Nyholm Debess
|
Annika Simonsen
|
Hafsteinn Einarsson
This study challenges the current paradigm shift in machine translation, where large language models (LLMs) are gaining prominence over traditional neural machine translation models, with a focus on English-to-Faroese translation. We compare the performance of various models, including fine-tuned multilingual models, LLMs (GPT-SW3, Llama 3.1), and closed-source models (Claude 3.5, GPT-4). Our findings show that a fine-tuned NLLB model outperforms most LLMs, including some larger models, in both automatic and human evaluations. We also demonstrate the effectiveness of using LLM-generated synthetic data for fine-tuning. While closed-source models like Claude 3.5 perform best overall, the competitive performance of smaller, fine-tuned models suggests a more nuanced approach to low-resource machine translation. Our results highlight the potential of specialized multilingual models and the importance of language-specific knowledge. We discuss implications for resource allocation in low-resource settings and suggest future directions for improving low-resource machine translation, including targeted data creation and more comprehensive evaluation methodologies.
pdf
bib
abs
Prompt Engineering Enhances Faroese MT, but Only Humans Can Tell
Barbara Scalvini
|
Annika Simonsen
|
Iben Nyholm Debess
|
Hafsteinn Einarsson
This study evaluates GPT-4’s English-to-Faroese translation capabilities, comparing it with multilingual models on FLORES-200 and Sprotin datasets. We propose a prompt optimization strategy using Semantic Textual Similarity (STS) to improve translation quality. Human evaluation confirms the effectiveness of STS-based few-shot example selection, though automated metrics fail to capture these improvements. Our findings advance LLM applications for low-resource language translation while highlighting the need for better evaluation methods in this context.
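To make the STS-based few-shot selection concrete, here is a hypothetical sketch: the k translation pairs whose source side is most similar to the input sentence are chosen as prompt examples. The model name, pool, and Faroese strings are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of STS-based few-shot example selection for translation prompting.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Placeholder pool of (English, Faroese) translation pairs.
pool = [
    ("The weather is nice today.", "Veðrið er gott í dag."),
    ("Where is the harbour?", "Hvar er havnin?"),
    ("I would like some coffee.", "Eg vil fegin fáa kaffi."),
]

def select_few_shot(source_sentence, pool, k=2):
    """Return the k pool entries with the most similar English side."""
    emb_pool = model.encode([src for src, _ in pool], convert_to_tensor=True)
    emb_query = model.encode(source_sentence, convert_to_tensor=True)
    scores = util.cos_sim(emb_query, emb_pool)[0]
    top = scores.topk(k=min(k, len(pool))).indices.tolist()
    return [pool[i] for i in top]

examples = select_few_shot("Where can I buy coffee?", pool)
# `examples` would then be formatted into the few-shot translation prompt.
```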
pdf
bib
abs
Interactive maps for corpus-based dialectology
Yves Scherrer
|
Olli Kuparinen
Traditional data collection methods in dialectology rely on structured surveys, whose results can easily be presented on printed or digital maps. In recent years, however, corpora of transcribed dialect speech have become a valuable alternative data source for data-driven linguistic analysis. For example, topic models can be advantageously used to discover both general dialectal variation patterns and specific linguistic features that are most characteristic of certain dialects. Multilingual (or rather, multilectal) language modeling tasks can also be used to learn speaker-specific embeddings. In connection with this paper, we introduce a website that presents the results of two recent studies in the form of interactive maps, allowing visitors to explore the effects of various parameter settings. The website covers two tasks (topic models and speaker embeddings) and three language areas (Finland, Norway, and German-speaking Switzerland). It is available at https://www.corcodial.net/.
pdf
bib
abs
Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings
Carolin M. Schuster
|
Maria-Alexandra Roman
|
Shashwat Ghatiwala
|
Georg Groh
Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI); however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions, we investigate gender bias in contextual embeddings, across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuitiveness and their use for exposing and visualizing bias.
pdf
bib
abs
Entailment Progressions: A Robust Approach to Evaluating Reasoning Within Larger Discourse
Rishabh Shastry
|
Patricia Chiril
|
Joshua Charney
|
David Uminsky
Textual entailment, or the ability to deduce whether a proposed hypothesis is logically supported by a given premise, has historically been applied to the evaluation of language modelling efficiency in tasks like question answering and text summarization. However, we hypothesize that these zero-shot entailment evaluations can be extended to the task of evaluating discourse within larger textual narratives. In this paper, we propose a simple but effective method that sequentially evaluates changes in textual entailment between sentences within a larger text, in an approach we denote as “Entailment Progressions”. These entailment progressions aim to capture the inference relations between sentences as an underlying component capable of distinguishing texts generated from various models and procedures. Our results suggest that entailment progressions can be used to effectively distinguish between machine-generated and human-authored texts across multiple established benchmark corpora and our own EP4MGT dataset. Additionally, our method displays robust performance when evaluated on paraphrased texts, a technique that has historically degraded the performance of well-established metrics for distinguishing between machine-generated and human-authored texts.
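A minimal sketch of the "entailment progression" idea, using an off-the-shelf MNLI model to label consecutive sentence pairs; this illustrates the general method, not the authors' exact setup.

```python
# Compute the sequence of NLI labels between consecutive sentences of a text.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entailment_progression(sentences):
    """Return the NLI label for each (sentence_i, sentence_i+1) pair."""
    progression = []
    for premise, hypothesis in zip(sentences, sentences[1:]):
        out = nli({"text": premise, "text_pair": hypothesis})
        out = out[0] if isinstance(out, list) else out  # normalize output shape
        progression.append(out["label"])  # ENTAILMENT / NEUTRAL / CONTRADICTION
    return progression

text = [
    "The committee approved the budget on Monday.",
    "The budget was approved.",
    "Funding will therefore start next quarter.",
]
print(entailment_progression(text))  # e.g. ['ENTAILMENT', 'NEUTRAL']
```

The resulting label (or probability) sequences can then serve as features for distinguishing machine-generated from human-authored texts.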
pdf
bib
abs
Generative AI for Technical Writing: Comparing Human and LLM Assessments of Generated Content
Karen de Souza
|
Alexandre Nikolaev
|
Maarit Koponen
Large language models (LLMs) have recently gained significant attention for their capabilities in natural language processing (NLP), particularly generative artificial intelligence (AI). LLMs can also be useful tools for software documentation technical writers. We present an assessment of technical documentation content generated by three different LLMs using retrieval-augmented generation (RAG) with product documentation as a knowledge base. The LLM-generated responses were analyzed in three ways: 1) manual error analysis by a technical writer, 2) automatic assessment using deterministic metrics (BLEU, ROUGE, token overlap), and 3) evaluation of correctness using an LLM as a judge. The results of these assessments were compared using network analysis and linear regression models to investigate statistical relationships, model preferences, and the distribution of human and LLM scores. The analyses show that human quality evaluation is more closely related to the LLM correctness judgments than to the deterministic metrics, even when using different analysis frameworks.
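For reference, the deterministic metrics named in this abstract can be computed with standard libraries, as in the sketch below; the texts are invented examples, not the paper's data.

```python
# BLEU, ROUGE, and token overlap for one generated answer vs. a reference.
import sacrebleu
from rouge_score import rouge_scorer

reference = "Open the settings menu and enable automatic updates."
candidate = "Enable automatic updates from the settings menu."

bleu = sacrebleu.sentence_bleu(candidate, [reference]).score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

# Jaccard-style token overlap on lowercased whitespace tokens.
ref_tokens = set(reference.lower().split())
cand_tokens = set(candidate.lower().split())
token_overlap = len(ref_tokens & cand_tokens) / len(ref_tokens | cand_tokens)

print(f"BLEU={bleu:.1f}  ROUGE-L={rouge['rougeL'].fmeasure:.2f}  "
      f"overlap={token_overlap:.2f}")
```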
pdf
bib
abs
MC-19: A Corpus of 19th Century Icelandic Texts
Steinþór Steingrímsson
|
Einar Freyr Sigurðsson
|
Atli Jasonarson
We present MC-19, a new Icelandic historical corpus containing texts from the period 1800–1920. We describe approaches for enhancing a corpus of historical texts by preparing them so that they can be processed using state-of-the-art NLP tools. We train encoder-decoder models to reduce the number of OCR errors while leaving other orthographic variation intact. We generate a separate modern-spelling layer by normalizing the spelling to comply with modern spelling rules, using a statistical modernization ruleset as well as a dictionary of the most common words. This allows the texts to be PoS-tagged and lemmatized using available tools, facilitating usage of the corpus by researchers and language technologists. The published version of the corpus contains over 270 million tokens.
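A toy sketch of a rule-plus-dictionary modernization layer in the spirit described here; the rules and dictionary entries are invented examples, not the actual MC-19 resources.

```python
# Dictionary lookups for frequent word forms override ordered rewrite rules.
import re

# Ordered regex rewrite rules (historical -> modern spelling), illustrative only.
RULES = [
    (re.compile(r"je"), "é"),  # hypothetical rule
    (re.compile(r"z"), "s"),   # hypothetical rule
]

# Dictionary of frequent historical forms mapped to modern spelling.
DICTIONARY = {"ad": "að"}

def modernize(token):
    lower = token.lower()
    if lower in DICTIONARY:
        return DICTIONARY[lower]
    for pattern, replacement in RULES:
        lower = pattern.sub(replacement, lower)
    return lower

print([modernize(t) for t in "ad zegja".split()])  # ['að', 'segja']
```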
pdf
bib
abs
Surface-Level Morphological Segmentation of Low-resource Inuktitut Using Pre-trained Large Language Models
Mathias Stenlund
|
Hemanadhan Myneni
|
Morris Riedel
Segmenting words at morpheme boundaries, instead of relying on language-independent segmentation algorithms like Byte-Pair Encoding (BPE), has been shown to benefit downstream Natural Language Processing (NLP) task performance. This is, however, challenging for polysynthetic languages like Inuktitut due to their high morpheme-to-word ratio and the lack of appropriately sized annotated datasets. In this work, we demonstrate the potential of using pre-trained Large Language Models (LLMs) for surface-level morphological segmentation of Inuktitut by treating it as a binary classification task. We fine-tune on tasks derived from automatically annotated Inuktitut words written in Inuktitut syllabics. Our approach shows good potential when compared to previous neural approaches. We share our best model to encourage further studies on downstream NLP tasks for Inuktitut written in syllabics.
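The binary-classification framing can be illustrated as follows: each character position is labeled 1 if a morpheme boundary follows it, else 0. The segmented example word is schematic, not real annotated Inuktitut.

```python
# Turn a morpheme-segmented word into per-character binary boundary labels.

def boundary_labels(segmented_word, sep="-"):
    """'qanga-ukiaq' -> characters of 'qangaukiaq' plus boundary labels."""
    chars, labels = [], []
    morphemes = segmented_word.split(sep)
    for i, morpheme in enumerate(morphemes):
        for j, ch in enumerate(morpheme):
            chars.append(ch)
            is_last_char = (j == len(morpheme) - 1)
            is_last_morpheme = (i == len(morphemes) - 1)
            labels.append(1 if is_last_char and not is_last_morpheme else 0)
    return chars, labels

chars, labels = boundary_labels("qanga-ukiaq")
print(list(zip(chars, labels)))  # boundary label 1 after the fifth character
```

A classifier is then trained to predict these labels from the unsegmented character sequence.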
pdf
bib
abs
The Devil’s in the Details: the Detailedness of Classes Influences Personal Information Detection and Labeling
Maria Irena Szawerna
|
Simon Dobnik
|
Ricardo Muñoz Sánchez
|
Elena Volodina
In this paper, we experiment with the effect of different levels of granularity of Personally Identifiable Information (PII) annotation, understood as (i) the number of classes and (ii) the classes’ semantic depth in the sense of hypernym and hyponym relations, on the automatic detection and labeling of such information. We fine-tune a Swedish BERT model on a corpus of Swedish learner essays annotated with a total of six PII tagsets at varying levels of granularity. We also investigate whether the presence of grammatical and lexical correction annotation in the tokens and class prevalence have an effect on predictions. We observe that the fewer total categories there are, the better the overall results, but a more diverse annotation leads to fewer misclassifications for tokens containing correction annotation. We also note that the classes’ internal diversity has an effect on labeling. We conclude from the results that while labeling based on the detailed annotation is difficult because of the number of classes, models trained on such annotation likely rely more on the semantic content captured by contextual word embeddings than on the mere form of the tokens, making them more robust against nonstandard language.
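One way to vary the "number of classes" dimension discussed here is to coarsen a fine-grained tagset; the tag names below are invented for illustration, not the paper's actual tagsets.

```python
# Map fine-grained PII labels onto a coarser tagset before training/evaluation.
COARSE_MAP = {
    "FIRSTNAME": "NAME", "LASTNAME": "NAME",
    "CITY": "LOCATION", "COUNTRY": "LOCATION",
    "PHONE": "CONTACT", "EMAIL": "CONTACT",
}

def coarsen(labels):
    """Replace fine-grained labels with coarse ones; keep 'O' unchanged."""
    return [COARSE_MAP.get(label, label) for label in labels]

print(coarsen(["O", "FIRSTNAME", "LASTNAME", "O", "CITY"]))
# ['O', 'NAME', 'NAME', 'O', 'LOCATION']
```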
pdf
bib
abs
Braxen 1.0
Christina Tånnander
|
Jens Edlund
With this paper, we release a Swedish pronunciation lexicon resource, Braxen 1.0, the result of almost 20 years of development carried out at the Swedish Agency for Accessible Media (MTM). The lexicon originated with a basic word list but has continuously been expanded with new entries, mainly acquired from university textbooks and news text. Braxen consists of around 850 000 entries, of which around 150 000 are proper names. The lexicon is released under the CC BY 4.0 license and is accessible for public use.
pdf
bib
abs
Temporal Relation Classification: An XAI Perspective
Sofia Elena Terenziani
Temporal annotations are used to identify and mark up temporal information, offering insight into how it is expressed through linguistic properties in text. This study investigates various discriminative pre-trained language models of differing sizes on a temporal relation classification task. We define valid reasoning strategies based on the linguistic principles that guide commonly used temporal annotations. Using a combination of saliency-based and counterfactual explanations, we examine whether the models’ decisions are in line with these strategies. Our findings suggest that the selected models do not rely on the expected linguistic cues for processing temporal information effectively.
pdf
bib
abs
Benchmarking Abstractive Summarisation: A Dataset of Human-authored Summaries of Norwegian News Articles
Samia Touileb
|
Vladislav Mikhailov
|
Marie Ingeborg Kroka
|
Lilja Øvrelid
|
Erik Velldal
We introduce a dataset of high-quality human-authored summaries of news articles in Norwegian. The dataset is intended for benchmarking the abstractive summarisation capabilities of generative language models. Each document in the dataset is provided with three different candidate gold-standard summaries written by native Norwegian speakers, and all summaries are provided in both of the written variants of Norwegian – Bokmål and Nynorsk. The paper details the data creation effort as well as an evaluation of existing open LLMs for Norwegian on the dataset. We also provide insights from a manual human evaluation comparing human-authored to model-generated summaries. Our results indicate that the dataset provides a challenging LLM benchmark for Norwegian summarisation capabilities.
pdf
bib
abs
Efficient Elicitation of Fictitious Nursing Notes from Volunteer Healthcare Professionals
Jesper Vaaben Bornerup
|
Christian Hardmeier
Reliable automatic solutions to extract structured information from free-text nursing notes could bring important efficiency gains in healthcare, but their development is hampered by the sensitivity and limited availability of example data. We describe a method for eliciting fictitious nursing documentation and associated structured documentation from volunteers and a resulting dataset of 397 Danish notes collected and annotated through a custom web application from 98 participating nurses. After some manual refinement, we obtained a high-quality dataset containing nurse notes with relevant entities identified. We describe the implementation and limitations of our approach as well as initial experiments in a named entity tagging setup.
pdf
bib
abs
Analyzing the Effect of Linguistic Instructions on Paraphrase Generation
Teemu Vahtola
|
Songbo Hu
|
Mathias Creutz
|
Ivan Vulić
|
Anna Korhonen
|
Jörg Tiedemann
Recent work has demonstrated that large language models can often generate fluent and linguistically correct text, adhering to given instructions. However, to what extent can they execute complex instructions requiring knowledge of fundamental linguistic concepts and elaborate semantic reasoning? Our study connects an established linguistic theory of paraphrasing with LLM-based practice to analyze which specific types of paraphrases LLMs can accurately produce and where they still struggle. To this end, we investigate a method of analyzing paraphrases generated by LLMs prompted with a comprehensive set of systematic linguistic instructions. We conduct a case study using GPT-4, which has shown strong performance across various language generation tasks, and we believe that other LLMs may face similar challenges in comparable scenarios. We examine GPT-4 from a linguistic perspective to explore its potential contributions to linguistic research regarding paraphrasing, systematically assessing how accurately the model generates paraphrases that adhere to specified transformation rules. Our results suggest that GPT-4 frequently prioritizes simple lexical or syntactic alternations, often disregarding the transformation guidelines if they overly complicate the primary task.
pdf
bib
abs
SweClinEval: A Benchmark for Swedish Clinical Natural Language Processing
Thomas Vakili
|
Martin Hansson
|
Aron Henriksson
The lack of benchmarks in certain domains and for certain languages makes it difficult to track progress regarding the state of the art of NLP in those areas, potentially impeding progress in important, specialized domains. Here, we introduce the first Swedish benchmark for clinical NLP: SweClinEval. The first iteration of the benchmark consists of six clinical NLP tasks, encompassing both document-level classification and named entity recognition tasks, with real clinical data. We evaluate nine different encoder models, both Swedish and multilingual. The results show that domain-adapted models outperform generic models on sequence-level classification tasks, while certain larger generic models outperform the clinical models on named entity recognition tasks. We describe how the benchmark can be managed despite limited possibilities to share sensitive clinical data, and discuss plans for extending the benchmark in future iterations.
pdf
bib
abs
Dialectal treebanks and their relation with the standard variety: The case of East Cretan and Standard Modern Greek
Socrates Vakirtzian
|
Vivian Stamou
|
Yannis Kazos
|
Stella Markantonatou
We report on the development of the first treebank and parser for Eastern Cretan in the framework of Universal Dependencies (UD). Eastern Cretan is a living but under-resourced dialect of Modern Greek. We have worked on the transcription of oral material and relied on active annotation and knowledge transfer from GUD, a treebank of Standard Modern Greek. Along with its other phonological and morphosyntactic differences from Standard Modern Greek, Eastern Cretan (like other varieties of Modern Greek) makes heavy use of euphonics and voicing phenomena that have so far not been covered by the UD annotation guidelines. We have provided annotation guidelines for Eastern Cretan euphonics and voicing and included them in the models. Knowledge transfer from the treebank of Standard Modern Greek to the dialectal models helped to initiate annotation via an active annotation procedure.
pdf
bib
abs
Danoliteracy of Generative Large Language Models
Søren Vejlgaard Holm
|
Lars Kai Hansen
|
Martin Carsten Nielsen
The language technology moonshot moment of Generative Large Language Models (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at ρ ≈ 0.8, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing model results across scenarios, we find one strong underlying factor that explains 95% of scenario performance variance for GLLMs in Danish, suggesting a g factor of model consistency in language adaptation.
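A schematic version of the two analyses mentioned in this abstract, under the assumption of a models-by-scenarios score matrix: Spearman correlation of benchmark scores with human feedback, and the share of variance explained by the first principal component. All data below is random placeholder data, not the benchmark's results.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
scores = rng.random((10, 8))  # 10 models x 8 scenarios (placeholder)
human = scores.mean(axis=1) + rng.normal(0, 0.05, 10)  # fake human signal

# Rank correlation between mean benchmark score and human feedback.
rho, _ = spearmanr(scores.mean(axis=1), human)

# Share of variance explained by the first principal component.
centered = scores - scores.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values[0] ** 2 / np.sum(singular_values ** 2)

print(f"Spearman rho: {rho:.2f}, first-factor variance share: {explained:.2f}")
```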
pdf
bib
abs
NorEventGen: generative event extraction from Norwegian news
Huiling You
|
Samia Touileb
|
Erik Velldal
|
Lilja Øvrelid
In this work, we approach event extraction from Norwegian news text using a generation-based approach that formulates the task as text-to-structure generation. We present experiments assessing the effect of different modeling configurations and provide an analysis of the model predictions and typical system errors. Finally, we apply our system to a large corpus of raw news texts and analyze the resulting distribution of event structures in a fairly representative snapshot of the Norwegian news landscape.
pdf
bib
abs
SnakModel: Lessons Learned from Training an Open Danish Large Language Model
Mike Zhang
|
Max Müller-Eberstein
|
Elisa Bassignana
|
Rob van der Goot
We present SnakModel, a Danish large language model (LLM) based on Llama2-7B, which we continuously pre-train on 13.6B Danish words and further tune on 3.7M Danish instructions. As best practices for creating LLMs for smaller language communities have yet to be established, we examine the effects of early modeling and training decisions on downstream performance throughout the entire training pipeline, including (1) the creation of a strictly curated corpus of Danish text from diverse sources; (2) the language modeling and instruction-tuning process itself, including the analysis of intermediate training dynamics and ablations across different hyperparameters; (3) an evaluation on eight language- and culture-specific tasks. Across these experiments, SnakModel achieves the highest overall performance, outperforming multiple contemporary Llama2-7B-based models. By making SnakModel, the majority of our pre-training corpus, and the associated code available under open licenses, we hope to foster further research and development in Danish Natural Language Processing and establish training guidelines for languages with similar resource constraints.
pdf
bib
abs
Got Compute, but No Data: Lessons From Post-training a Finnish LLM
Elaine Zosa
|
Ville Komulainen
|
Sampo Pyysalo
As LLMs gain more popularity as chatbots and general assistants, methods have been developed to enable LLMs to follow instructions and align with human preferences. These methods have found success in the field, but their effectiveness has not been demonstrated outside of high-resource languages. In this work, we discuss our experiences in post-training an LLM for instruction-following in English and Finnish. We use a multilingual LLM to translate instruction and preference datasets from English to Finnish. We perform instruction tuning and preference optimization in English and Finnish and evaluate the instruction-following capabilities of the model in both languages. Our results show that with a few hundred Finnish instruction samples we can obtain competitive performance in Finnish instruction-following. We also find that although preference optimization in English offers some cross-lingual benefits, we obtain our best results by using preference data from both languages. We release our model, datasets, and recipes under open licenses at https://huggingface.co/LumiOpen/Poro-34B-chat-OpenAssistant.