Neural Text Normalization for Luxembourgish Using Real-Life Variation Data

Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank


Abstract
Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
Anthology ID:
2025.vardial-1.9
Volume:
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
115–127
URL:
https://aclanthology.org/2025.vardial-1.9/
Cite (ACL):
Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, and Barbara Plank. 2025. Neural Text Normalization for Luxembourgish Using Real-Life Variation Data. In Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 115–127, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data (Lutgen et al., VarDial 2025)
PDF:
https://aclanthology.org/2025.vardial-1.9.pdf