Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya

Fitsum Gaim, Wonsuk Yang, Hancheol Park, Jong Park


Abstract
Question-Answering (QA) has seen significant advances recently, achieving near human-level performance on some benchmarks. However, these advances focus on high-resourced languages such as English, while the task remains unexplored for most other languages, mainly due to the lack of annotated datasets. This work presents a native QA dataset for an East African language, Tigrinya. The dataset contains 10.6K question-answer pairs spanning 572 paragraphs extracted from 290 news articles on various topics. We discuss the dataset construction method, which is applicable to constructing similar resources for related languages. We present comprehensive experiments and analyses of several resource-efficient approaches to QA, including monolingual, cross-lingual, and multilingual setups, along with comparisons against machine-translated silver data. Our strong baseline models reach an F1 score of 76%, while the estimated human performance is 92%, indicating that the benchmark presents a good challenge for future work. We make the dataset, models, and leaderboard publicly available.
Anthology ID:
2023.acl-long.661
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
11857–11870
URL:
https://aclanthology.org/2023.acl-long.661
DOI:
10.18653/v1/2023.acl-long.661
Cite (ACL):
Fitsum Gaim, Wonsuk Yang, Hancheol Park, and Jong Park. 2023. Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11857–11870, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya (Gaim et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.661.pdf
Video:
https://aclanthology.org/2023.acl-long.661.mp4