SansGPT: Advancing Generative Pre-Training in Sanskrit

Rhugved Pankaj Chaudhari, Bhakti Jadhav, Pushpak Bhattacharyya, Malhar Kulkarni


Abstract
Over the past decade, significant progress has been made in digitizing Sanskrit texts and advancing computational analysis of the language. However, efforts to advance NLP for semantically complex downstream tasks such as Semantic Analogy Prediction and Named Entity Recognition remain limited. This gap is mainly due to the absence of a robust pre-trained Sanskrit model built on large-scale Sanskrit text data, since building one demands considerable computational resources and data preparation. In this paper, we introduce SansGPT, a generative pre-trained model that has been trained on a large corpus of Sanskrit texts and is designed to facilitate fine-tuning and development for downstream NLP tasks. We aim for this model to serve as a catalyst for advancing NLP research in Sanskrit. Additionally, we developed a custom tokenizer optimized for Sanskrit text, enabling effective tokenization of compound words and making it better suited for generative tasks. Our data collection and cleaning process covered a wide array of available Sanskrit literature to give the training data broad representation. We further demonstrate the model's efficacy by fine-tuning it on Semantic Analogy Prediction and Simile Element Extraction, achieving accuracies of approximately 95.8% and 92.8%, respectively.
Anthology ID:
2024.icon-1.50
Volume:
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2024
Address:
AU-KBC Research Centre, Chennai, India
Editors:
Sobha Lalitha Devi, Karunesh Arora
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
432–441
URL:
https://aclanthology.org/2024.icon-1.50/
Cite (ACL):
Rhugved Pankaj Chaudhari, Bhakti Jadhav, Pushpak Bhattacharyya, and Malhar Kulkarni. 2024. SansGPT: Advancing Generative Pre-Training in Sanskrit. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), pages 432–441, AU-KBC Research Centre, Chennai, India. NLP Association of India (NLPAI).
Cite (Informal):
SansGPT: Advancing Generative Pre-Training in Sanskrit (Chaudhari et al., ICON 2024)
PDF:
https://aclanthology.org/2024.icon-1.50.pdf