MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering

Ruturaj Ghatage, Aditya Ashutosh Kulkarni, Rajlaxmi Patil, Sharvi Endait, Raviraj Joshi


Abstract
Question-answering systems have transformed information retrieval, but linguistic and cultural boundaries limit their accessibility. This work addresses the lack of effective QnA datasets in low-resource languages by translating the Stanford Question Answering Dataset (SQuAD), an English dataset, using a robust data curation approach. We introduce MahaSQuAD, the first full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. We address the challenges of preserving context and handling linguistic nuances to keep the translations accurate. Because a QnA dataset cannot be converted into a low-resource language by translation alone, a robust method is needed to map each translated answer to its span in the translated passage. To address this, we also present a generic approach for translating SQuAD into any low-resource language. We thus offer a scalable way to bridge the linguistic and cultural gaps that low-resource languages face in question answering. The datasets and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP.
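The abstract does not spell out how a translated answer is mapped back to its span, so the following is only a minimal sketch of one plausible way to do it, not the authors' method. It assumes the passage and the answer were translated independently, slides word windows over the translated passage, and keeps the window most similar to the translated answer. The function name `locate_answer_span`, the `slack` parameter, and the use of `difflib.SequenceMatcher` as the similarity score are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def locate_answer_span(context: str, answer: str, slack: int = 2):
    """Return (start_char, end_char, score) for the context window closest to answer.

    Hypothetical helper, not taken from the paper: it scores word windows of
    the translated context against the independently translated answer.
    """
    # Record the character offsets of every whitespace-delimited token.
    tokens = []
    pos = 0
    for word in context.split():
        start = context.index(word, pos)
        end = start + len(word)
        tokens.append((start, end))
        pos = end

    n_answer = max(1, len(answer.split()))
    best = (-1, -1, 0.0)
    # Try window lengths close to the answer's word count (assumed tolerance).
    for size in range(max(1, n_answer - slack), n_answer + slack + 1):
        for i in range(len(tokens) - size + 1):
            start = tokens[i][0]
            end = tokens[i + size - 1][1]
            score = SequenceMatcher(None, context[start:end], answer).ratio()
            if score > best[2]:
                best = (start, end, score)
    return best


if __name__ == "__main__":
    # English placeholder text stands in for a translated passage and answer.
    passage = "MahaSQuAD is a Marathi question answering dataset released in 2023"
    start, end, score = locate_answer_span(passage, "Marathi question answerin")
    print(passage[start:end], round(score, 3))
```

A character-level similarity is used here because a translated answer rarely matches the translated passage exactly. In practice, one would likely also set a minimum score below which a sample is discarded rather than assigned a doubtful span.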
Anthology ID:
2023.icon-1.45
Volume:
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2023
Address:
Goa University, Goa, India
Editors:
Jyoti D. Pawar, Sobha Lalitha Devi
Venue:
ICON
SIG:
SIGLEX
Publisher:
NLP Association of India (NLPAI)
Pages:
497–505
URL:
https://aclanthology.org/2023.icon-1.45
Cite (ACL):
Ruturaj Ghatage, Aditya Ashutosh Kulkarni, Rajlaxmi Patil, Sharvi Endait, and Raviraj Joshi. 2023. MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), pages 497–505, Goa University, Goa, India. NLP Association of India (NLPAI).
Cite (Informal):
MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering (Ghatage et al., ICON 2023)
PDF:
https://aclanthology.org/2023.icon-1.45.pdf