Abhijith Chelpuri
2024
Towards Enhancing Knowledge Accessibility for Low-Resource Indian Languages: A Template Based Approach
Srijith Padakanti | Akhilesh Aravapalli | Abhijith Chelpuri | Radhika Mamidi
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
In today’s digital age, access to knowledge and information is crucial for societal growth. Although widespread resources like Wikipedia exist, there is still a linguistic barrier to break down for low-resource languages. In India, millions of individuals still lack access to reliable information from Wikipedia because they are proficient only in their regional language. To address this gap, our work focuses on enhancing the content and digital footprint of multiple Indian languages. The primary objective of our work is to improve knowledge accessibility by generating a substantial volume of high-quality Wikipedia articles in Telugu, a widely spoken language in India with around 95.7 million native speakers. Our work not only creates Wikipedia articles but also ensures that each article meets the necessary quality standards, such as a minimum word count, inclusion of images for reference, and an infobox. Our work also adheres to the five core principles of Wikipedia. We streamline our article generation process, leveraging NLP techniques such as translation, transliteration, and template generation, and incorporating human intervention when necessary. Our contribution is a collection of 8,929 articles in the movie domain, now ready to be published on Telugu Wikipedia.
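The sketch below illustrates the general shape of a template-based generation step for the movie domain, assuming hypothetical helper functions (translate_to_telugu, transliterate_to_telugu) and an illustrative template string; it is not the paper's actual pipeline or field schema.

```python
# Minimal sketch of template-based article generation, under assumed helpers.
# `translate_to_telugu` and `transliterate_to_telugu` are placeholders for the
# translation/transliteration components mentioned in the abstract.

def translate_to_telugu(text: str) -> str:
    """Placeholder for a machine-translation call; identity stub here."""
    return text

def transliterate_to_telugu(name: str) -> str:
    """Placeholder for transliterating proper nouns into Telugu script."""
    return name

# Illustrative prose template for a movie article; field names are assumptions.
MOVIE_TEMPLATE = (
    "{title_te} {year}లో విడుదలైన తెలుగు చిత్రం. "
    "ఈ చిత్రానికి {director_te} దర్శకత్వం వహించారు."
)

def generate_article(record: dict) -> str:
    """Fill the template from a structured record about one movie."""
    return MOVIE_TEMPLATE.format(
        title_te=transliterate_to_telugu(record["title"]),
        year=record["year"],
        director_te=transliterate_to_telugu(record["director"]),
    )

if __name__ == "__main__":
    sample = {"title": "Mayabazar", "year": 1957, "director": "K. V. Reddy"}
    print(generate_article(sample))
```

In practice, the generated draft would then pass the quality checks described above (word count, images, infobox) and human review before publication.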
2023
Transformer-based Context Aware Morphological Analyzer for Telugu
Priyanka Dasari | Abhijith Chelpuri | Nagaraju Vuppala | Mounika Marreddy | Parameshwari Krishnamurthy | Radhika Mamidi
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
This paper addresses the challenges Indian languages face in leveraging deep learning for natural language processing (NLP) due to limited resources, annotated datasets, and Transformer-based architectures. We specifically focus on Telugu and construct a Telugu morph analyzer dataset comprising 10,000 sentences. Furthermore, we assess the performance of established multilingual Transformer models (mBERT, XLM-R, IndicBERT) and monolingual Transformer models trained from scratch on an extensive Telugu corpus of 8,015,588 sentences (BERT-Te). Our findings demonstrate the efficacy of Transformer-based representations pretrained on Telugu data in improving the performance of the Telugu morph analyzer, surpassing existing multilingual approaches. This highlights the necessity of developing dedicated corpora, annotated datasets, and machine learning models in a monolingual setting. We present benchmark results for the Telugu morph analyzer achieved through simple fine-tuning on our dataset.
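A minimal sketch of the fine-tuning setup implied by the abstract, framed as token classification with Hugging Face Transformers: the checkpoint name, label set, and toy sentence below are placeholders, not the paper's BERT-Te model, its tagset, or its 10,000-sentence dataset.

```python
# Sketch: fine-tune a pretrained Transformer as a word-level morph tagger.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder; a Telugu-pretrained checkpoint would be swapped in
LABELS = ["NOUN+NOM", "VERB+PAST", "O"]      # illustrative morph tags, not the paper's tagset

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(LABELS))

def encode(words, word_labels):
    """Tokenize a pre-split sentence and align word-level labels to subwords
    (only the first subword of each word keeps a label; the rest get -100)."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    return_tensors="pt")
    aligned, prev = [], None
    for wid in enc.word_ids(batch_index=0):
        if wid is None or wid == prev:
            aligned.append(-100)  # special tokens / continuation subwords: ignored by the loss
        else:
            aligned.append(LABELS.index(word_labels[wid]))
        prev = wid
    enc["labels"] = torch.tensor([aligned])
    return enc

# One toy training step; a real setup would iterate over the annotated corpus.
batch = encode(["ramu", "vachadu"], ["NOUN+NOM", "VERB+PAST"])
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch).loss
loss.backward()
optimizer.step()
print(float(loss))
```

The same loop works for any of the compared checkpoints (mBERT, XLM-R, IndicBERT, or a monolingual Telugu model) by changing MODEL_NAME, which mirrors the comparison reported in the paper.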