Nagaraju Vuppala


2025

Sandhi Splitting in Tamil and Telugu: A Sequence-to-Sequence Approach Leveraging Transformer Models
Priyanka Dasari | Mupparapu Sohan Gupta | Nagaraju Vuppala | Pruthwik Mishra | Parameswari Krishnamurthy
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

Dravidian languages such as Tamil and Telugu are agglutinative: they form wordforms by combining two or more elements into a single string, with morpho-phonemic changes at the point of concatenation known as sandhi. This linguistic feature adds complexity to automatic language processing, making the pre-processing of sandhi words essential for NLP applications. We developed extensive sandhi-annotated corpora of 15K for Telugu and Tamil, focusing on the systematic application of sandhi rules, which explain word-formation patterns by showing how lexical and functional categories combine to create composite non-compound words. We implemented compact sequence-to-sequence transformer networks for automatic sandhi processing. To evaluate our models, we manually added sandhi annotations to the Telugu and Tamil IN22-Conv benchmark datasets. Our experiments aim to enhance language processing tasks such as machine translation in morphologically rich languages.
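A concrete picture of the approach the abstract describes: sandhi splitting treated as character-level sequence-to-sequence transduction, where the model reads a fused surface form and emits its components with an explicit separator. The following is a minimal PyTorch sketch, not the authors' code; the transliterated Telugu example (ramudu + atanu -> ramudatanu, utva sandhi) and all hyperparameters are illustrative assumptions.

```python
# Minimal character-level seq2seq Transformer sketch for sandhi
# splitting. Illustrative only: hyperparameters and the toy training
# pair are assumptions, not values from the paper.
import torch
import torch.nn as nn

PAD, BOS, EOS = 0, 1, 2  # reserved character ids

class SandhiSplitter(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(512, d_model)  # learned positions
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            dim_feedforward=256, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def _embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.embed(ids) + self.pos(pos)

    def forward(self, src, tgt):
        # causal mask: the decoder only attends to past characters
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt.size(1)).to(src.device)
        h = self.transformer(
            self._embed(src), self._embed(tgt), tgt_mask=tgt_mask,
            src_key_padding_mask=(src == PAD),
            tgt_key_padding_mask=(tgt == PAD))
        return self.out(h)

# Toy usage: learn to split "ramudatanu" -> "ramudu+atanu" (utva
# sandhi, transliterated); "+" marks the split point in the output.
pairs = [("ramudatanu", "ramudu+atanu")]
chars = sorted({c for s, t in pairs for c in s + t})
stoi = {c: i + 3 for i, c in enumerate(chars)}  # ids 0-2 reserved

def encode(s):
    return torch.tensor([[BOS] + [stoi[c] for c in s] + [EOS]])

model = SandhiSplitter(vocab_size=len(stoi) + 3)
src, tgt = encode(pairs[0][0]), encode(pairs[0][1])
logits = model(src, tgt[:, :-1])  # teacher forcing: shift target right
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1),
    ignore_index=PAD)
loss.backward()  # one training step; optimizer update omitted
```

Emitting the split components as a single string with a separator keeps the task a plain string-to-string transduction, so the same compact network architecture can be trained to split sandhi forms or, with the pairs reversed, to join them.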

2023

Transformer-based Context Aware Morphological Analyzer for Telugu
Priyanka Dasari | Abhijith Chelpuri | Nagaraju Vuppala | Mounika Marreddy | Parameshwari Krishnamurthy | Radhika Mamidi
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

This paper addresses the challenges Indian languages face in leveraging deep learning for natural language processing (NLP) due to the scarcity of resources, annotated datasets, and Transformer-based architectures. We specifically focus on Telugu and construct a Telugu morph analyzer dataset comprising 10,000 sentences. Furthermore, we assess the performance of established multi-lingual Transformer models (mBERT, XLM-R, IndicBERT) and mono-lingual Transformer models trained from scratch on an extensive Telugu corpus of 8,015,588 sentences (BERT-Te). Our findings demonstrate the efficacy of Transformer-based representations pretrained on Telugu data in improving the performance of the Telugu morph analyzer, surpassing existing multi-lingual approaches. This highlights the necessity of developing dedicated corpora, annotated datasets, and machine learning models in a mono-lingual setting. We present benchmark results for the Telugu morph analyzer achieved through simple fine-tuning on our dataset.
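The fine-tuning the abstract reports can be pictured as token classification: a pretrained Transformer encoder with a linear head predicting one morphological tag per word. The sketch below uses the Hugging Face Transformers API; since BERT-Te is not named here as a public checkpoint, mBERT (bert-base-multilingual-cased) stands in, and the tag set and example sentence are hypothetical.

```python
# Minimal token-classification fine-tuning sketch for a morph
# analyzer. The checkpoint, tag set, and example are stand-in
# assumptions, not the paper's actual BERT-Te setup.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["NOUN+nom.sg", "VERB+past.3sg"]  # hypothetical morph tags
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

words = ["రాముడు", "వచ్చాడు"]  # "Ramudu came" -- illustrative input
word_tags = [0, 1]             # word-level indices into `labels`

enc = tok(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to subwords: the first subword of each word
# carries the tag; special tokens and continuation subwords get -100
# so the cross-entropy loss ignores them.
aligned, prev = [], None
for wid in enc.word_ids():
    aligned.append(-100 if wid is None or wid == prev else word_tags[wid])
    prev = wid
enc["labels"] = torch.tensor([aligned])

out = model(**enc)   # forward pass returns loss and per-token logits
out.loss.backward()  # one fine-tuning step; optimizer update omitted
```

Tagging only the first subword of each word is the standard way to reconcile word-level morphological labels with subword tokenization; the context awareness the title refers to comes from the encoder's self-attention over the whole sentence.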