SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Gayane Ghazaryan, Erik Arakelyan, Isabelle Augenstein, Pasquale Minervini


Abstract
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose SynDARin, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain human-curated paragraphs between English and the target language. We use the English data as context to generate synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English human-curated paragraphs forms the final QA dataset. The method maintains content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with 1.2K samples for the Armenian language. The human evaluation shows that 98% of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out ~70% of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy, with some models performing close to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in a low-resource language.
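The data flow described in the abstract (English MC items, automatic translation, validation, pairing with the human-curated target-language paragraph) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `MCItem`, `translate_item`, `is_valid_translation`, and `build_dataset` are hypothetical, the translator is passed in as a plain callable, and the validation rule shown here (non-empty fields, distinct options, translated answer found in the target paragraph) is an assumed stand-in for whatever checks the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MCItem:
    """One multiple-choice question: text, answer options, index of the correct option."""
    question: str
    options: List[str]
    answer_index: int


def translate_item(item: MCItem, translate: Callable[[str], str]) -> MCItem:
    """Translate the question and every option with a caller-supplied translator."""
    return MCItem(
        question=translate(item.question),
        options=[translate(option) for option in item.options],
        answer_index=item.answer_index,
    )


def is_valid_translation(item: MCItem, target_paragraph: str) -> bool:
    """Assumed validation rule (hypothetical, not taken from the paper)."""
    # Reject items where any translated field came back empty.
    if any(not text.strip() for text in [item.question, *item.options]):
        return False
    # Reject items whose options collapsed into duplicates after translation.
    normalised = [option.strip().lower() for option in item.options]
    if len(set(normalised)) != len(normalised):
        return False
    # Require the translated answer to appear in the human-curated paragraph.
    return normalised[item.answer_index] in target_paragraph.lower()


def build_dataset(
    samples: List[Tuple[List[MCItem], str]],
    translate: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Pair each validated, translated item with its target-language paragraph."""
    dataset = []
    for english_items, target_paragraph in samples:
        for english_item in english_items:
            translated = translate_item(english_item, translate)
            if is_valid_translation(translated, target_paragraph):
                dataset.append({
                    "context": target_paragraph,
                    "question": translated.question,
                    "options": translated.options,
                    "answer_index": translated.answer_index,
                })
    return dataset
```

Passing the translator as a callable keeps the sketch independent of any particular machine translation service; a real validation step would likely need a stronger criterion than substring matching.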
Anthology ID:
2025.coling-main.430
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
6459–6466
URL:
https://aclanthology.org/2025.coling-main.430/
Cite (ACL):
Gayane Ghazaryan, Erik Arakelyan, Isabelle Augenstein, and Pasquale Minervini. 2025. SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6459–6466, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages (Ghazaryan et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.430.pdf