LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish

Cedric Lothritz, Bertrand Lebichot, Kevin Allix, Lisa Veiber, Tegawende Bissyande, Jacques Klein, Andrey Boytsov, Clément Lefebvre, Anne Goujon


Abstract
Pre-trained Language Models such as BERT have become ubiquitous in NLP, where they achieve state-of-the-art performance on most NLP tasks. While these models are readily available for English and other widely spoken languages, they remain scarce for low-resource languages such as Luxembourgish. In this paper, we present LuxemBERT, a BERT model for the Luxembourgish language that we create using the following approach: we augment the pre-training dataset with text data from a closely related language that we partially translate using a simple and straightforward method. We are then able to produce the LuxemBERT model, which we show to be effective for various NLP tasks: it outperforms both a simple baseline built with the available Luxembourgish text data and the multilingual mBERT model, which is currently the only other option for transformer-based language models in Luxembourgish. Furthermore, we present datasets for various downstream NLP tasks that we created for this study and will make available to researchers on request.
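To make the augmentation idea concrete, here is a minimal sketch of what dictionary-based partial translation of a closely related language (e.g. German) into Luxembourgish might look like. This is an illustrative assumption, not the authors' implementation: the lexicon, function names, and token-level substitution strategy are all hypothetical stand-ins for whatever translation resource and method the paper actually uses.

```python
# Hypothetical sketch of partial-translation data augmentation.
# German tokens that appear in an assumed German->Luxembourgish word
# lexicon are swapped for their Luxembourgish equivalents, producing
# mixed-language text that enlarges the Luxembourgish pre-training corpus.

LEXICON = {  # toy lexicon; a real resource would cover far more entries
    "ich": "ech",
    "nicht": "net",
    "und": "an",
    "klein": "kleng",
}

def partially_translate(sentence: str, lexicon: dict[str, str]) -> str:
    """Replace each whitespace token with its lexicon entry when one exists."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())

if __name__ == "__main__":
    german = "ich wohne nicht in einem kleinen haus"
    print(partially_translate(german, LEXICON))
    # -> "ech wohne net in einem kleinen haus"
    # Inflected forms absent from the lexicon (e.g. "kleinen") stay unchanged,
    # so the output is only a partial translation by design.
```

Even such a simple token-level substitution leaves unmatched words in the source language, which is consistent with the paper's framing of the method as "simple and practical" rather than full machine translation.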
Anthology ID: 2022.lrec-1.543
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 5080–5089
URL: https://aclanthology.org/2022.lrec-1.543
Cite (ACL): Cedric Lothritz, Bertrand Lebichot, Kevin Allix, Lisa Veiber, Tegawende Bissyande, Jacques Klein, Andrey Boytsov, Clément Lefebvre, and Anne Goujon. 2022. LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5080–5089, Marseille, France. European Language Resources Association.
Cite (Informal): LuxemBERT: Simple and Practical Data Augmentation in Language Model Pre-Training for Luxembourgish (Lothritz et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.543.pdf