AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages

Odunayo Ogundepo, Xinyu Zhang, Shuo Sun, Kevin Duh, Jimmy Lin


Abstract
Language diversity in NLP is critical in enabling the development of tools for a wide range of users. However, there are limited resources for building such tools for many languages, particularly those spoken in Africa.For search, most existing datasets feature few or no African languages, directly impacting researchers’ ability to build and improve information access capabilities in those languages. Motivated by this, we created AfriCLIRMatrix, a test collection for cross-lingual information retrieval research in 15 diverse African languages. In total, our dataset contains 6 million queries in English and 23 million relevance judgments automatically mined from Wikipedia inter-language links, covering many more African languages than any existing information retrieval test collection. In addition, we release BM25, dense retrieval, and sparse–dense hybrid baselines to provide a starting point for the development of future systems. We hope that these efforts can spur additional work in search for African languages.AfriCLIRMatrix can be downloaded at https://github.com/castorini/africlirmatrix.
Anthology ID:
2022.emnlp-main.597
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8721–8728
Language:
URL:
https://aclanthology.org/2022.emnlp-main.597
DOI:
10.18653/v1/2022.emnlp-main.597
Bibkey:
Cite (ACL):
Odunayo Ogundepo, Xinyu Zhang, Shuo Sun, Kevin Duh, and Jimmy Lin. 2022. AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8721–8728, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
AfriCLIRMatrix: Enabling Cross-Lingual Information Retrieval for African Languages (Ogundepo et al., EMNLP 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.emnlp-main.597.pdf