Sanskrit Segmentation revisited

Sriram Krishnan, Amba Kulkarni


Abstract
Computational analysis of Sanskrit texts requires proper segmentation in the initial stages. Various tools have been developed for Sanskrit text segmentation. Of these, Gérard Huet’s Reader in the Sanskrit Heritage Engine analyzes the input text and segments it based on word parameters: phases such as iic, ifc, Pr, and Subst, and the sandhi (or transition) that takes place between the end of a word and the beginning of the next word. It lists all possible solutions, differentiating them with the help of the phases. The phases and their analyses are useful in the domain of sentential parsers. In segmentation, though, they are not used beyond deciding whether the words formed with the phases are morphologically valid. This paper modifies the above segmenter by ignoring the phase details (except in a few cases), and also proposes a probability function that prioritizes the list of solutions so that the most valid solutions appear at the top.
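The abstract does not spell out the probability function, so the following is only a minimal sketch of the general idea of prioritizing candidate segmentations. It assumes a unigram model with add-one smoothing over word frequencies; the function name `rank_segmentations`, the toy counts, and the candidate splits are all hypothetical and are not taken from the paper or from the Sanskrit Heritage Engine.

```python
import math


def rank_segmentations(candidates, word_counts):
    """Sort candidate segmentations from most to least probable.

    Assumption: a unigram model with add-one smoothing. This is an
    illustrative stand-in, not the probability function from the paper.
    """
    total = sum(word_counts.values())
    vocab = len(word_counts)

    def log_prob(segmentation):
        # Sum of log-probabilities equals the log of the product of word
        # probabilities; unseen words get a small non-zero probability.
        return sum(
            math.log((word_counts.get(word, 0) + 1) / (total + vocab + 1))
            for word in segmentation
        )

    return sorted(candidates, key=log_prob, reverse=True)


# Hypothetical corpus frequencies and candidate splits, for illustration only.
toy_counts = {"rāma": 50, "ālayaḥ": 10, "rāmā": 2}
toy_candidates = [
    ["rāmā", "ālayaḥ"],
    ["rāma", "alayaḥ"],
    ["rāma", "ālayaḥ"],
]
ranked = rank_segmentations(toy_candidates, toy_counts)
```

With these toy counts, the split made of the two most frequent words is ranked first. A product of word probabilities tends to favor candidates with fewer segments, so a real scoring function would likely need further information, such as how often a given sandhi transition occurs.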
Anthology ID:
2019.icon-1.12
Volume:
Proceedings of the 16th International Conference on Natural Language Processing
Month:
December
Year:
2019
Address:
International Institute of Information Technology, Hyderabad, India
Editors:
Dipti Misra Sharma, Pushpak Bhattacharya
Venue:
ICON
Publisher:
NLP Association of India
Pages:
105–114
URL:
https://aclanthology.org/2019.icon-1.12
Cite (ACL):
Sriram Krishnan and Amba Kulkarni. 2019. Sanskrit Segmentation revisited. In Proceedings of the 16th International Conference on Natural Language Processing, pages 105–114, International Institute of Information Technology, Hyderabad, India. NLP Association of India.
Cite (Informal):
Sanskrit Segmentation revisited (Krishnan & Kulkarni, ICON 2019)
PDF:
https://aclanthology.org/2019.icon-1.12.pdf