Safoura Aghadavoud Jolfaei


2025

PersianSciQA: A New Dataset for Bridging the Language Gap in Scientific Question Answering
Safoura Aghadavoud Jolfaei | Azadeh Mohebi | Zahra Hemmat
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The shortage of specialized datasets hinders the development of Natural Language Processing (NLP) for scientific texts in low-resource languages such as Persian. To address this, we introduce PersianSciQA, a large-scale resource of 39,809 question-answer snippet pairs. Each pair consists of a question and an answer snippet drawn from a scientific engineering abstract in IranDoc’s ‘Ganj’ repository, linked by an LLM-assigned relevance score (0-3) that measures how relevant the question is to the content of the accompanying answer snippet. The dataset was generated using a two-stage prompting methodology and refined through a rigorous cleaning pipeline, including text normalization and semantic deduplication. Human validation of 1,000 instances by two NLP researchers confirmed the dataset’s quality and showed substantial LLM-human agreement (Cohen’s kappa, κ=0.6642). To demonstrate its value, we establish baseline benchmarks and show that fine-tuning on PersianSciQA substantially improves a state-of-the-art model, which reaches a Spearman correlation of 0.895 on a blind test set. PersianSciQA provides a new resource to facilitate research in information retrieval and question answering within the Persian scientific domain.
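The abstract reports two statistics, Cohen's kappa for LLM-human agreement and Spearman correlation for model evaluation. As a rough illustration of what these numbers measure, the sketch below computes both in plain Python on toy relevance scores in the 0-3 range. The function names (`cohen_kappa`, `average_ranks`, `spearman`) and all data are hypothetical and chosen for illustration only; this is not the authors' evaluation code, and whether they used an unweighted or weighted kappa is not stated here, so the unweighted form is an assumption.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    sx = sqrt(sum((p - mx) ** 2 for p in rx))
    sy = sqrt(sum((q - my) ** 2 for q in ry))
    return cov / (sx * sy)


# Toy data only: hypothetical LLM and human relevance scores (0-3).
llm_scores = [3, 2, 0, 1, 3, 2, 0, 1]
human_scores = [3, 2, 0, 2, 3, 1, 0, 1]
print("kappa:", cohen_kappa(llm_scores, human_scores))
print("spearman:", spearman(llm_scores, human_scores))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few score values dominate the data. Spearman compares rankings rather than raw values, so it suits ordinal relevance scores like the 0-3 scale described above.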