Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval

Kidist Amde Mekonnen; Yosef Worku Alemneh; Maarten de Rijke

doi:10.18653/v1/2025.findings-acl.543

Abstract

Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13× smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
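The abstract reports results as MRR@10, Recall@10, and relative improvement over a baseline. As a minimal sketch of how such figures are commonly computed (not the authors' evaluation code, whose exact protocol is not given here), the following uses hypothetical function names and made-up toy rankings and relevance labels:

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        rel = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Average fraction of each query's relevant passages found in the top k."""
    total = 0.0
    counted = 0
    for qid, ranked in rankings.items():
        rel = relevant.get(qid, set())
        if not rel:
            continue  # assumption: queries without labels are skipped
        total += len(rel & set(ranked[:k])) / len(rel)
        counted += 1
    return total / counted if counted else 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Toy data (hypothetical): three queries, one relevant passage each.
rankings = {
    "q1": ["p3", "p1", "p7"],   # relevant passage at rank 2
    "q2": ["p2", "p9", "p4"],   # relevant passage at rank 1
    "q3": ["p5", "p6", "p8"],   # relevant passage not retrieved
}
relevant = {"q1": {"p1"}, "q2": {"p2"}, "q3": {"p0"}}

mrr = mrr_at_k(rankings, relevant)
recall = recall_at_k(rankings, relevant)
```

On this toy input the first relevant passages sit at ranks 2 and 1 and are missing for the third query, so MRR@10 is (1/2 + 1 + 0) / 3 and Recall@10 is 2/3. A relative improvement compares two such scores as a ratio, which differs from an absolute difference in points.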

Anthology ID: 2025.findings-acl.543
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10428–10445
URL: https://aclanthology.org/2025.findings-acl.543/
DOI: 10.18653/v1/2025.findings-acl.543
Cite (ACL): Kidist Amde Mekonnen, Yosef Worku Alemneh, and Maarten de Rijke. 2025. Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10428–10445, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval (Mekonnen et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.543.pdf
