@inproceedings{omotoso-etal-2025-improving,
title = "Improving {BGE}-{M}3 Multilingual Dense Embeddings for {N}igerian Low Resource Languages",
author = "Omotoso, Abdulmatin and
Shopeju, Habeeb and
Joshua, Adejumobi Monjolaoluwa and
Oni, Shiloh",
editor = "Zhang, Chen and
Allaway, Emily and
Shen, Hua and
Miculicich, Lesly and
Li, Yinqiao and
M'hamdi, Meryem and
Limkonchotiwat, Peerat and
Bai, Richard He and
T.y.s.s., Santosh and
Han, Sophia Simeng and
Thapa, Surendrabikram and
Rim, Wiem Ben",
booktitle = "Proceedings of the 9th Widening NLP Workshop",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.winlp-main.33/",
pages = "224--229",
ISBN = "979-8-89176-351-7",
abstract = "Multilingual dense embedding models such as Multilingual E5, LaBSE, and BGE-M3 have shown promising results on diverse benchmarks for information retrieval in low-resource languages. But their result on low resource languages is not up to par with other high resource languages. This work improves the performance of BGE-M3 through contrastive fine-tuning; the model was selected because of its superior performance over other multilingual embedding models across MIRACL, MTEB, and SEB benchmarks. To fine-tune this model, we curated a comprehensive dataset comprising Yor{\`u}b{\'a} (32.9k rows), Igbo (18k rows) and Hausa (85k rows) from mainly news sources. We further augmented our multilingual dataset with English queries and mapped it to each of the Yoruba, Igbo, and Hausa documents, enabling cross-lingual semantic training. We evaluate on two settings: the Wura test set and the MIRACL benchmark. On Wura, the fine-tuned BGE-M3 raises mean reciprocal rank (MRR) to 0.9201 for Yor{\`u}b{\'a}, 0.8638 for Igbo, 0.9230 for Hausa, and 0.8617 for English queries matched to local documents, surpassing the BGE-M3 baselines of 0.7846, 0.7566, 0.8575, and 0.7377, respectively. On MIRACL (Yor{\`u}b{\'a} subset), the fine-tuned model attains 0.5996 MRR, slightly surpassing base BGE-M3 (0.5952) and outperforming ML-E5-large (0.5632) and LaBSE (0.4468)."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="omotoso-etal-2025-improving">
<titleInfo>
<title>Improving BGE-M3 Multilingual Dense Embeddings for Nigerian Low Resource Languages</title>
</titleInfo>
<name type="personal">
<namePart type="given">Abdulmatin</namePart>
<namePart type="family">Omotoso</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Habeeb</namePart>
<namePart type="family">Shopeju</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Adejumobi</namePart>
<namePart type="given">Monjolaoluwa</namePart>
<namePart type="family">Joshua</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shiloh</namePart>
<namePart type="family">Oni</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 9th Widening NLP Workshop</title>
</titleInfo>
<name type="personal">
<namePart type="given">Chen</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Emily</namePart>
<namePart type="family">Allaway</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hua</namePart>
<namePart type="family">Shen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lesly</namePart>
<namePart type="family">Miculicich</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yinqiao</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Meryem</namePart>
<namePart type="family">M’hamdi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Peerat</namePart>
<namePart type="family">Limkonchotiwat</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Richard</namePart>
<namePart type="given">He</namePart>
<namePart type="family">Bai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Santosh</namePart>
<namePart type="family">T.y.s.s.</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sophia</namePart>
<namePart type="given">Simeng</namePart>
<namePart type="family">Han</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Surendrabikram</namePart>
<namePart type="family">Thapa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wiem</namePart>
<namePart type="given">Ben</namePart>
<namePart type="family">Rim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-351-7</identifier>
</relatedItem>
<abstract>Multilingual dense embedding models such as Multilingual E5, LaBSE, and BGE-M3 have shown promising results on diverse benchmarks for information retrieval in low-resource languages. However, their performance on these languages still lags behind that on high-resource languages. This work improves the performance of BGE-M3 through contrastive fine-tuning; the model was selected for its superior performance over other multilingual embedding models across the MIRACL, MTEB, and SEB benchmarks. To fine-tune the model, we curated a comprehensive dataset comprising Yorùbá (32.9k rows), Igbo (18k rows), and Hausa (85k rows), drawn mainly from news sources. We further augmented this multilingual dataset with English queries mapped to the Yorùbá, Igbo, and Hausa documents, enabling cross-lingual semantic training. We evaluate in two settings: the Wura test set and the MIRACL benchmark. On Wura, the fine-tuned BGE-M3 raises mean reciprocal rank (MRR) to 0.9201 for Yorùbá, 0.8638 for Igbo, 0.9230 for Hausa, and 0.8617 for English queries matched to local documents, surpassing the BGE-M3 baselines of 0.7846, 0.7566, 0.8575, and 0.7377, respectively. On MIRACL (Yorùbá subset), the fine-tuned model attains 0.5996 MRR, slightly surpassing base BGE-M3 (0.5952) and outperforming ML-E5-large (0.5632) and LaBSE (0.4468).</abstract>
<identifier type="citekey">omotoso-etal-2025-improving</identifier>
<location>
<url>https://aclanthology.org/2025.winlp-main.33/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>224</start>
<end>229</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Improving BGE-M3 Multilingual Dense Embeddings for Nigerian Low Resource Languages
%A Omotoso, Abdulmatin
%A Shopeju, Habeeb
%A Joshua, Adejumobi Monjolaoluwa
%A Oni, Shiloh
%Y Zhang, Chen
%Y Allaway, Emily
%Y Shen, Hua
%Y Miculicich, Lesly
%Y Li, Yinqiao
%Y M’hamdi, Meryem
%Y Limkonchotiwat, Peerat
%Y Bai, Richard He
%Y T.y.s.s., Santosh
%Y Han, Sophia Simeng
%Y Thapa, Surendrabikram
%Y Rim, Wiem Ben
%S Proceedings of the 9th Widening NLP Workshop
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-351-7
%F omotoso-etal-2025-improving
%X Multilingual dense embedding models such as Multilingual E5, LaBSE, and BGE-M3 have shown promising results on diverse benchmarks for information retrieval in low-resource languages. However, their performance on these languages still lags behind that on high-resource languages. This work improves the performance of BGE-M3 through contrastive fine-tuning; the model was selected for its superior performance over other multilingual embedding models across the MIRACL, MTEB, and SEB benchmarks. To fine-tune the model, we curated a comprehensive dataset comprising Yorùbá (32.9k rows), Igbo (18k rows), and Hausa (85k rows), drawn mainly from news sources. We further augmented this multilingual dataset with English queries mapped to the Yorùbá, Igbo, and Hausa documents, enabling cross-lingual semantic training. We evaluate in two settings: the Wura test set and the MIRACL benchmark. On Wura, the fine-tuned BGE-M3 raises mean reciprocal rank (MRR) to 0.9201 for Yorùbá, 0.8638 for Igbo, 0.9230 for Hausa, and 0.8617 for English queries matched to local documents, surpassing the BGE-M3 baselines of 0.7846, 0.7566, 0.8575, and 0.7377, respectively. On MIRACL (Yorùbá subset), the fine-tuned model attains 0.5996 MRR, slightly surpassing base BGE-M3 (0.5952) and outperforming ML-E5-large (0.5632) and LaBSE (0.4468).
%U https://aclanthology.org/2025.winlp-main.33/
%P 224-229
Markdown (Informal)
[Improving BGE-M3 Multilingual Dense Embeddings for Nigerian Low Resource Languages](https://aclanthology.org/2025.winlp-main.33/) (Omotoso et al., WiNLP 2025)
ACL
Abdulmatin Omotoso, Habeeb Shopeju, Adejumobi Monjolaoluwa Joshua, and Shiloh Oni. 2025. Improving BGE-M3 Multilingual Dense Embeddings for Nigerian Low Resource Languages. In Proceedings of the 9th Widening NLP Workshop, pages 224–229, Suzhou, China. Association for Computational Linguistics.