Parameshwari Krishnamurthy


2023

This paper addresses the challenges faced by Indian languages in leveraging deep learning for natural language processing (NLP) due to limited resources, annotated datasets, and Transformer-based architectures. We specifically focus on Telugu and aim to construct a Telugu morph analyzer dataset comprising 10,000 sentences. Furthermore, we assess the performance of established multi-lingual Transformer models (m-Bert, XLM-R, IndicBERT) and mono-lingual Transformer models trained from scratch on an extensive Telugu corpus comprising 80,15,588 sentences (BERT-Te). Our findings demonstrate the efficacy of Transformer-based representations pretrained on Telugu data in improving the performance of the Telugu morph analyzer, surpassing existing multi-lingual approaches. This highlights the necessity of developing dedicated corpora, annotated datasets, and machine learning models in a mono-lingual setting. We present benchmark results for the Telugu morph analyzer achieved through simple fine-tuning on our dataset.