N-gram-Based Preprocessing for Sandhi Reversion in Vedic Sanskrit

Yuzuki Tsukagoshi, Ikki Ohmukai


Abstract
This study addresses the challenges that sandhi poses for the computational analysis of Vedic Sanskrit, an older layer of the Sanskrit language. Sandhi is a phonological phenomenon in which sounds at or around word boundaries change or fuse, which obscures those boundaries and complicates text processing. To improve the accuracy of sandhi reversion for Vedic, we developed a transformer-based model with a novel n-gram preprocessing strategy. We created character-based n-gram texts of varying lengths (n = 2, 3, 4, 5, 6) from the Rigveda, the oldest Vedic text, and trained models on these texts to perform machine translation from post-sandhi to pre-sandhi forms. The model trained on 5-gram text achieved the highest accuracy. This is likely because a 5-gram spans the maximum phonemic context in which Vedic sandhi occurs, making it the most effective for the task. These findings suggest that, by leveraging the inherent characteristics of phonological change in a language, even simple preprocessing methods such as n-gram segmentation can significantly improve the accuracy of complex linguistic tasks.
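The character-based n-gram segmentation mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the exact procedure, so everything here is an assumption made for illustration: the overlapping sliding window with stride 1, the `_` word-boundary marker, the NFC normalization step, and the function names `char_ngrams` and `to_ngram_text` are all hypothetical and are not taken from the paper.

```python
import unicodedata
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return overlapping character n-grams (sliding window, stride 1).

    Assumption: the paper may segment differently (e.g. non-overlapping chunks).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if len(text) < n:
        # Text shorter than n: keep it as a single (short) unit, if non-empty.
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def to_ngram_text(line: str, n: int, boundary: str = "_") -> str:
    """Turn one line of transliterated text into a space-separated n-gram sequence.

    Assumption: whitespace is replaced by a visible boundary marker so that
    word boundaries remain recoverable after segmentation.
    """
    # Compose diacritics so each transliterated letter is one code point
    # wherever a precomposed form exists.
    normalized = unicodedata.normalize("NFC", line)
    marked = boundary.join(normalized.split())
    return " ".join(char_ngrams(marked, n))


if __name__ == "__main__":
    # Simplified, diacritic-free input used purely for illustration.
    for n in (2, 3, 4, 5, 6):
        print(n, to_ngram_text("agnim ile", n))
```

With n = 5, the input `agnim ile` becomes `agnim gnim_ nim_i im_il m_ile`. Under these assumptions, a pair of such n-gram sequences (post-sandhi source, pre-sandhi target) could be fed to a sequence-to-sequence model as the translation units.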
Anthology ID:
2024.nlp4dh-1.26
Volume:
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Month:
November
Year:
2024
Address:
Miami, USA
Editors:
Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni
Venue:
NLP4DH
Publisher:
Association for Computational Linguistics
Pages:
275–279
URL:
https://aclanthology.org/2024.nlp4dh-1.26
Cite (ACL):
Yuzuki Tsukagoshi and Ikki Ohmukai. 2024. N-gram-Based Preprocessing for Sandhi Reversion in Vedic Sanskrit. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities, pages 275–279, Miami, USA. Association for Computational Linguistics.
Cite (Informal):
N-gram-Based Preprocessing for Sandhi Reversion in Vedic Sanskrit (Tsukagoshi & Ohmukai, NLP4DH 2024)
PDF:
https://aclanthology.org/2024.nlp4dh-1.26.pdf