LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings

Fred Philippy; Siwen Guo; Jacques Klein; Tegawendé Bissyandé

Abstract

Sentence embedding models play a key role in various Natural Language Processing tasks, such as topic modeling, document clustering, and recommendation systems. However, these models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. To address this issue, we compile a relatively small but high-quality human-generated cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence embedding model for Luxembourgish with strong cross-lingual capabilities. Additionally, we present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages than relying solely on high-resource language pairs. Furthermore, recognizing the lack of sentence embedding benchmarks for low-resource languages, we create a paraphrase detection benchmark specifically for Luxembourgish, aiming to partially fill this gap and promote further research.

Anthology ID: 2025.coling-main.753
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 11369–11379
URL: https://aclanthology.org/2025.coling-main.753/
Cite (ACL): Fred Philippy, Siwen Guo, Jacques Klein, and Tegawendé Bissyandé. 2025. LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings. In Proceedings of the 31st International Conference on Computational Linguistics, pages 11369–11379, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings (Philippy et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.753.pdf
