SMASH at Qur’an QA 2022: Creating Better Faithful Data Splits for Low-resourced Question Answering Scenarios

Amr Keleg; Walid Magdy

Abstract

The Qur’an QA 2022 shared task aims at assessing the possibility of building systems that can extract answers to religious questions given relevant passages from the Holy Qur’an. This paper describes SMASH’s system that was used to participate in this shared task. Our experiments reveal a data leakage issue among the different splits of the dataset. This leakage problem hinders the reliability of using the models’ performance on the development dataset as a proxy for the ability of the models to generalize to new unseen samples. After creating better faithful splits from the original dataset, the basic strategy of fine-tuning a language model pretrained on classical Arabic text yielded the best performance on the new evaluation split. The results achieved by the model suggest that the small-scale dataset is not enough to fine-tune large transformer-based language models in a way that generalizes well. Conversely, we believe that further attention could be paid to the type of questions that are being used to train the models given the sensitivity of the data.

Anthology ID:: 2022.osact-1.17
Volume:: Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Month:: June
Year:: 2022
Address:: Marseille, France
Editors:: Hend Al-Khalifa, Tamer Elsayed, Hamdy Mubarak, Abdulmohsen Al-Thubaity, Walid Magdy, Kareem Darwish
Venue:: OSACT
SIG:: SIGARAB
Publisher:: European Language Resources Association
Pages:: 136–145
URL:: https://aclanthology.org/2022.osact-1.17/
Cite (ACL):: Amr Keleg and Walid Magdy. 2022. SMASH at Qur’an QA 2022: Creating Better Faithful Data Splits for Low-resourced Question Answering Scenarios. In Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, pages 136–145, Marseille, France. European Language Resources Association.
Cite (Informal):: SMASH at Qur’an QA 2022: Creating Better Faithful Data Splits for Low-resourced Question Answering Scenarios (Keleg & Magdy, OSACT 2022)
PDF:: https://aclanthology.org/2022.osact-1.17.pdf
