Low Resource Question Answering: An Amharic Benchmarking Dataset

Tilahun Abedissa Taffa, Ricardo Usbeck, Yaregal Assabie


Abstract
Question Answering (QA) systems return concise answers or answer lists from natural language text, using a given context document. Considerable resources go into curating QA datasets to advance the development of robust QA models. While QA datasets abound for languages such as English, the situation is different for low-resource languages like Amharic: there is no published or publicly available Amharic QA dataset. Hence, to foster further research in low-resource QA, we present the first publicly available benchmarking Amharic Question Answering Dataset (Amh-QuAD). We crowdsource 2,628 question-answer pairs from over 378 Amharic Wikipedia articles. Using the training set, we fine-tune an XLM-R-based language model and introduce a new reader model. Leveraging the newly fine-tuned reader, we run a baseline model to spark interest in open-domain Amharic QA research. The best-performing baseline QA achieves F-scores of 80.3 and 81.34 in the retriever-reader and reading-comprehension settings, respectively.
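The reading-comprehension baseline described above amounts to fine-tuning an XLM-R encoder as an extractive reader on the Amh-QuAD training split. The sketch below shows what such a setup can look like with the Hugging Face transformers and datasets libraries. It is illustrative only, not the authors' released pipeline: the file name amh_quad_train.json, the SQuAD-style fields (question, context, answers), the xlm-roberta-base checkpoint, and the hyperparameters are all assumptions made for the example.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForQuestionAnswering,
                          TrainingArguments, Trainer, default_data_collator)

model_name = "xlm-roberta-base"  # generic multilingual checkpoint; the paper's reader is XLM-R-based
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Illustrative file name and SQuAD-style schema; Amh-QuAD's actual distribution format may differ.
raw = load_dataset("json", data_files={"train": "amh_quad_train.json"})

def preprocess(examples):
    # Tokenize question/context pairs and map answer character spans
    # to start/end token positions via the tokenizer's offset mappings.
    tokenized = tokenizer(
        examples["question"], examples["context"],
        truncation="only_second", max_length=384,
        return_offsets_mapping=True, padding="max_length",
    )
    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answer = examples["answers"][i]              # {"text": [...], "answer_start": [...]}
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = tokenized.sequence_ids(i)
        # Locate the context portion of the packed input.
        ctx_start = sequence_ids.index(1)
        ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
        if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
            # Answer was truncated away: point both labels at the CLS token.
            start_positions.append(0)
            end_positions.append(0)
        else:
            tok = ctx_start
            while tok <= ctx_end and offsets[tok][0] <= start_char:
                tok += 1
            start_positions.append(tok - 1)
            tok = ctx_end
            while tok >= ctx_start and offsets[tok][1] >= end_char:
                tok -= 1
            end_positions.append(tok + 1)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized.pop("offset_mapping")
    return tokenized

train = raw["train"].map(preprocess, batched=True,
                         remove_columns=raw["train"].column_names)

args = TrainingArguments("amh-quad-xlmr-reader", learning_rate=3e-5,
                         per_device_train_batch_size=8, num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train,
        data_collator=default_data_collator).train()

The retriever-reader baseline mentioned in the abstract would pair such a reader with a passage retriever over Amharic Wikipedia; that component is not sketched here.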
Anthology ID:
2024.rail-1.14
Volume:
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Rooweither Mabuya, Muzi Matfunjwa, Mmasibidi Setaka, Menno van Zaanen
Venues:
RAIL | WS
Publisher:
ELRA and ICCL
Pages:
124–132
URL:
https://aclanthology.org/2024.rail-1.14
Cite (ACL):
Tilahun Abedissa Taffa, Ricardo Usbeck, and Yaregal Assabie. 2024. Low Resource Question Answering: An Amharic Benchmarking Dataset. In Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024, pages 124–132, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Low Resource Question Answering: An Amharic Benchmarking Dataset (Taffa et al., RAIL-WS 2024)
PDF:
https://aclanthology.org/2024.rail-1.14.pdf