PersianSciQA: A New Dataset for Bridging the Language Gap in Scientific Question Answering

Safoura Aghadavoud Jolfaei, Azadeh Mohebi, Zahra Hemmat


Abstract
The shortage of specialized datasets hinders the development of Natural Language Processing (NLP) for scientific texts in low-resource languages such as Persian. To address this, we introduce PersianSciQA, a large-scale resource of 39,809 question-answer snippet pairs. Each pair contains a question and an answer snippet drawn from a scientific engineering abstract in IranDoc’s ‘Ganj’ repository, linked by an LLM-assigned relevance score (0-3) that measures how relevant the question is to the content of the accompanying answer snippet. The dataset was generated using a two-stage prompting methodology and refined through a rigorous cleaning pipeline, including text normalization and semantic deduplication. Human validation of 1,000 instances by two NLP researchers confirmed the dataset’s quality and a substantial LLM-human agreement (Cohen’s kappa coefficient κ=0.6642). To demonstrate its value, we establish baseline benchmarks and show that fine-tuning on PersianSciQA dramatically improves a state-of-the-art model, achieving a Spearman correlation of 0.895 on a blind test set. PersianSciQA provides a crucial new resource to facilitate research in information retrieval and question answering within the Persian scientific domain.
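The two statistics quoted in the abstract, Cohen's kappa and the Spearman correlation, can be illustrated with a minimal pure-Python sketch. The score lists below are toy values invented for illustration, not data from PersianSciQA, and the function names are hypothetical. The sketch assumes unweighted kappa over the 0-3 labels; the abstract does not say whether a weighted variant was used.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy relevance scores on the 0-3 scale (assumed values, illustration only).
llm_scores = [3, 2, 0, 1, 3, 0, 2, 1]
human_scores = [3, 2, 0, 1, 2, 0, 2, 0]

print("kappa:", round(cohen_kappa(llm_scores, human_scores), 4))
print("spearman:", round(spearman(llm_scores, human_scores), 4))
```

On the toy lists the raters agree on 6 of 8 items and chance agreement is 0.25, so kappa is (0.75 - 0.25) / 0.75, about 0.667. In practice a library routine such as scikit-learn's `cohen_kappa_score` or SciPy's `spearmanr` would be the more usual choice.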
Anthology ID:
2025.ranlp-1.4
Volume:
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month:
September
Year:
2025
Address:
Varna, Bulgaria
Editors:
Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue:
RANLP
Publisher:
INCOMA Ltd., Shoumen, Bulgaria
Pages:
32–37
URL:
https://aclanthology.org/2025.ranlp-1.4/
Cite (ACL):
Safoura Aghadavoud Jolfaei, Azadeh Mohebi, and Zahra Hemmat. 2025. PersianSciQA: A New Dataset for Bridging the Language Gap in Scientific Question Answering. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 32–37, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal):
PersianSciQA: A New Dataset for Bridging the Language Gap in Scientific Question Answering (Aghadavoud Jolfaei et al., RANLP 2025)
PDF:
https://aclanthology.org/2025.ranlp-1.4.pdf