TeQuAD: Telugu Question Answering Dataset

Rakesh Vemula, Mani Nuthi, Manish Srivastava


Abstract
Recent state-of-the-art models and new datasets have advanced many areas of Natural Language Processing. Machine Reading Comprehension (MRC) tasks in particular have improved with the help of datasets like SQuAD (Stanford Question Answering Dataset). However, large, high-quality datasets are still not available for low-resource languages like Telugu to support progress in MRC. In this paper, we present TeQuAD, a Telugu Question Answering Dataset of 82k parallel triples created by translating triples from SQuAD. We also introduce a few methods for creating similar Question Answering datasets for low-resource languages. We then present the performance of our models, which outperform baseline models in Monolingual and Cross-Lingual Machine Reading Comprehension (CLMRC) setups; the best of them achieves an F1 score of 83% and an Exact Match (EM) score of 61%.
Anthology ID:
2022.icon-main.36
Volume:
Proceedings of the 19th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2022
Address:
New Delhi, India
Editors:
Md. Shad Akhtar, Tanmoy Chakraborty
Venue:
ICON
Publisher:
Association for Computational Linguistics
Pages:
300–307
URL:
https://aclanthology.org/2022.icon-main.36
Cite (ACL):
Rakesh Vemula, Mani Nuthi, and Manish Srivastava. 2022. TeQuAD:Telugu Question Answering Dataset. In Proceedings of the 19th International Conference on Natural Language Processing (ICON), pages 300–307, New Delhi, India. Association for Computational Linguistics.
Cite (Informal):
TeQuAD:Telugu Question Answering Dataset (Vemula et al., ICON 2022)
PDF:
https://aclanthology.org/2022.icon-main.36.pdf