Andreea Ioana Tudor
2025
Mapping Cross-Lingual Sentence Representations for Low-Resource Language Pairs Using Pre-trained Language Models
Tsegaye Misikir Tashu | Andreea Ioana Tudor
Proceedings of the First Workshop on Language Models for Low-Resource Languages
In this work, we explore different linear mapping techniques for learning cross-lingual document representations from pre-trained multilingual large language models for low-resource languages. Three mapping techniques, namely Linear Concept Approximation (LCA), Linear Concept Compression (LCC), and Neural Concept Approximation (NCA), and four multilingual language models, mBERT, mT5, XLM-R, and ErnieM, were used to extract embeddings. The inter-lingual representations were created by mapping the monolingual representations extracted from the multilingual language models. The experimental results showed that LCA and LCC significantly outperform NCA, with ErnieM achieving the highest alignment quality. Performance varied across language pairs, influenced by linguistic similarity and data availability, with the Amharic-English pair yielding particularly high scores. The results demonstrate the utility of LCA and LCC in enabling cross-lingual tasks for low-resource languages.
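The sketch below illustrates the general idea of linearly mapping one monolingual embedding space onto another; it is not the paper's exact LCA/LCC formulation, and the random embedding matrices, dimensions, and cosine-similarity check are illustrative assumptions only.

```python
import numpy as np

# Illustrative data: X holds source-language sentence embeddings, Y the
# parallel target-language embeddings (same row order), e.g. extracted
# from a multilingual encoder such as mBERT or XLM-R.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))   # source-language embeddings
Y = rng.normal(size=(1000, 768))   # aligned target-language embeddings

# Learn a linear map W projecting source embeddings into the target space
# by ordinary least squares: W = argmin ||X W - Y||_F^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Project source sentences and score alignment by cosine similarity of each
# projected vector with its parallel target vector.
X_proj = X @ W
cos = np.sum(X_proj * Y, axis=1) / (
    np.linalg.norm(X_proj, axis=1) * np.linalg.norm(Y, axis=1)
)
print(f"mean cosine similarity after mapping: {cos.mean():.3f}")
```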