Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal


Abstract
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage, cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasize task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
Anthology ID:
2025.mrl-main.20
Volume:
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Month:
November
Year:
2025
Address:
Suzhuo, China
Editors:
David Ifeoluwa Adelani, Catherine Arnett, Duygu Ataman, Tyler A. Chang, Hila Gonen, Rahul Raja, Fabian Schmidt, David Stap, Jiayi Wang
Venues:
MRL | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
285–321
Language:
URL:
https://aclanthology.org/2025.mrl-main.20/
DOI:
Bibkey:
Cite (ACL):
Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, and Shubham Agarwal. 2025. Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 285–321, Suzhuo, China. Association for Computational Linguistics.
Cite (Informal):
Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages (Rachamalla et al., MRL 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.mrl-main.20.pdf