Neelamadhav Gantayat
2018
SandhiKosh: A Benchmark Corpus for Evaluating Sanskrit Sandhi Tools
Shubham Bhardwaj | Neelamadhav Gantayat | Nikhil Chaturvedi | Rahul Garg | Sumeet Agarwal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Sanskrit Sandhi Splitting using seq2(seq)^2
Rahul Aralikatte | Neelamadhav Gantayat | Naveen Panwar | Anush Sankaran | Senthil Mani
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
In Sanskrit, small words (morphemes) are combined to form compound words through a process known as Sandhi. Sandhi splitting is the process of splitting a given compound word into its constituent morphemes. Although rules governing word splitting exist in the language, it is highly challenging to identify the location of the splits in a compound word. Though existing Sandhi splitting systems incorporate these pre-defined splitting rules, they have low accuracy, as the same compound word might be broken down in multiple ways to provide syntactically correct splits. In this research, we propose a novel deep learning architecture called Double Decoder RNN (DD-RNN), which (i) predicts the location of the split(s) with 95% accuracy, and (ii) predicts the constituent words (learning the Sandhi splitting rules) with 79.5% accuracy, outperforming the state of the art by 20%. Additionally, we show the generalization capability of our deep learning model by showing competitive results on the problem of Chinese word segmentation as well.