Sandhi Splitting in Tamil and Telugu: A Sequence-to-Sequence Approach Leveraging Transformer Models

Priyanka Dasari, Mupparapu Sohan Gupta, Nagaraju Vuppala, Pruthwik Mishra, Parameswari Krishnamurthy


Abstract
Dravidian languages like Tamil and Telugu are agglutinative: they form wordforms by combining two or more elements into a single string, with morpho-phonemic changes at the point of concatenation known as sandhi. This linguistic feature adds complexity to automatic language processing, making the pre-processing of sandhi words essential for NLP applications. We developed extensive sandhi-annotated corpora of 15K for Telugu and Tamil, focusing on the systematic application of sandhi rules, which explain word formation patterns by showing how lexical and functional categories combine to create composite non-compound words. We implemented compact sequence-to-sequence transformer networks for automatic sandhi processing. To evaluate our models, we manually annotated the Telugu and Tamil IN22-Conv Benchmark datasets with sandhi annotations. Our experiments aim to enhance language processing tasks such as machine translation in morphologically rich languages.
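To make the sequence-to-sequence framing concrete, the following is a minimal sketch of how one sandhi-splitting example might be encoded as a character-level source/target pair. It is not the authors' implementation: the function names, the `+` delimiter, and the code-point tokenization are assumptions made for illustration, and the Telugu pair is a common textbook example rather than an item from the paper's corpora.

```python
# Hypothetical helpers for illustration; not taken from the paper.

def to_seq2seq_example(fused, parts, sep="+"):
    """Turn a fused wordform and its components into (source, target) token lists.

    Tokens are Unicode code points; the target joins components with ' + '.
    The delimiter and the tokenization are assumptions, not the paper's scheme.
    """
    source = list(fused)
    target = list(f" {sep} ".join(parts))
    return source, target


def decode_split(target_tokens, sep="+"):
    """Recover the component words from a predicted target token sequence."""
    return [part.strip() for part in "".join(target_tokens).split(sep)]


# Textbook Telugu example: the final vowel of the first word elides before a vowel.
src, tgt = to_seq2seq_example("రాముడతడు", ["రాముడు", "అతడు"])
print("".join(tgt))
print(decode_split(tgt))
```

A transformer encoder-decoder trained on such pairs would read the source tokens and generate the target tokens; `decode_split` then turns a generated sequence back into its component words.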
Anthology ID:
2025.chipsal-1.9
Volume:
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Kengatharaiyer Sarveswaran, Ashwini Vaidya, Bal Krishna Bal, Sana Shams, Surendrabikram Thapa
Venues:
CHiPSAL | WS
Publisher:
International Committee on Computational Linguistics
Pages:
93–103
URL:
https://aclanthology.org/2025.chipsal-1.9/
Cite (ACL):
Priyanka Dasari, Mupparapu Sohan Gupta, Nagaraju Vuppala, Pruthwik Mishra, and Parameswari Krishnamurthy. 2025. Sandhi Splitting in Tamil and Telugu: A Sequence-to-Sequence Approach Leveraging Transformer Models. In Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025), pages 93–103, Abu Dhabi, UAE. International Committee on Computational Linguistics.
Cite (Informal):
Sandhi Splitting in Tamil and Telugu: A Sequence-to-Sequence Approach Leveraging Transformer Models (Dasari et al., CHiPSAL 2025)
PDF:
https://aclanthology.org/2025.chipsal-1.9.pdf