MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling

Rob van der Goot; Anette Jensen; Emil Allerslev Schledermann; Mikkel Wildner Kildeberg; Nicolaj Larsen; Mike Zhang; Elisa Bassignana

Abstract

Current language models (LMs) mostly use subwords as input units, derived from statistical co-occurrences of characters. Relatedly, previous work has shown that modeling morphemes can improve the performance of Natural Language Processing (NLP) models. However, morphemes are challenging to obtain, as annotated data is lacking for most languages. In this work, we release a wide-coverage Danish morphological segmentation evaluation set. We evaluate a range of unsupervised token segmenters and measure the downstream effect of using morphemes as input units for transformer-based LMs. Our results show that popular subword algorithms perform poorly on this task, scoring at most an F1 of 57.6 compared to 68.0 for an unsupervised morphological segmenter (Morfessor). Furthermore, we evaluate a range of segmenters on the task of language modeling.

Anthology ID: 2025.nodalida-1.23
Volume: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Month: March
Year: 2025
Address: Tallinn, Estonia
Editors: Richard Johansson, Sara Stymne
Venue: NoDaLiDa
Publisher: University of Tartu Library
Pages: 223–229
URL: https://aclanthology.org/2025.nodalida-1.23/
Cite (ACL): Rob van der Goot, Anette Jensen, Emil Allerslev Schledermann, Mikkel Wildner Kildeberg, Nicolaj Larsen, Mike Zhang, and Elisa Bassignana. 2025. MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 223–229, Tallinn, Estonia. University of Tartu Library.
Cite (Informal): MorSeD: Morphological Segmentation of Danish and its Effect on Language Modeling (Goot et al., NoDaLiDa 2025)
PDF: https://aclanthology.org/2025.nodalida-1.23.pdf
