MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering

Ghatage Ruturaj, Kulkarni Aditya Ashutosh, Patil Rajlaxmi, Endait Sharvi, Joshi Raviraj


Abstract
Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their accessibility. This research addresses the lack of efficient QnA datasets in low-resource languages by translating the English Stanford Question Answering Dataset (SQuAD) using a robust data curation approach. We introduce MahaSQuAD, the first full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. We address the challenges of maintaining context and handling linguistic nuances to keep the translations accurate. A QnA dataset cannot be converted into a low-resource language by translation alone, because the translated answer must also be mapped to its span in the translated passage. To address this challenge, we also present a generic approach for translating SQuAD into any low-resource language. We thus offer a scalable way to bridge linguistic and cultural gaps in question-answering systems for low-resource languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP.
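The span-mapping problem mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not detail; it assumes a simple fuzzy match in which every token window of the translated passage is scored against the translated answer and the most similar one is kept. The function name `find_answer_span`, the `slack` parameter, and the example strings are hypothetical.

```python
from difflib import SequenceMatcher


def find_answer_span(context: str, answer: str, slack: int = 2):
    """Return (start, end, score) for the character span of `context`
    most similar to `answer`, or None if no window can be formed.

    Illustrative fuzzy matcher only; not the procedure used for MahaSQuAD.
    """
    # Record the character offsets of each whitespace-separated token.
    tokens = []
    pos = 0
    for word in context.split():
        start = context.index(word, pos)
        end = start + len(word)
        tokens.append((start, end))
        pos = end

    n_answer = len(answer.split())
    if not tokens or n_answer == 0:
        return None

    best = None
    # Try windows whose token count is within `slack` of the answer's.
    for size in range(max(1, n_answer - slack), n_answer + slack + 1):
        for i in range(len(tokens) - size + 1):
            start, end = tokens[i][0], tokens[i + size - 1][1]
            score = SequenceMatcher(None, context[start:end], answer).ratio()
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical example: the separately translated answer is spelled
# slightly differently from the wording in the translated passage.
context = "The Godavari river originates near Nashik city"
answer = "near Nasik"
span = find_answer_span(context, answer)
if span is not None:
    print(context[span[0]:span[1]], round(span[2], 3))
```

A character-level similarity is used here because a translated answer often differs from the passage wording only by inflection or spelling, in which case an exact substring search would fail. In practice a score threshold would be needed to discard samples with no good match.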
Anthology ID:
2023.icon-1.45
Volume:
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2023
Address:
Goa University, Goa, India
Editors:
D. Pawar Jyoti, Lalitha Devi Sobha
Venue:
ICON
SIG:
SIGLEX
Publisher:
NLP Association of India (NLPAI)
Pages:
497–505
URL:
https://aclanthology.org/2023.icon-1.45
Cite (ACL):
Ghatage Ruturaj, Kulkarni Aditya Ashutosh, Patil Rajlaxmi, Endait Sharvi, and Joshi Raviraj. 2023. MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 497–505, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal):
MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering (Ruturaj et al., ICON 2023)
PDF:
https://aclanthology.org/2023.icon-1.45.pdf