Siva Subrahamanyam Varma Kusampudi
2021
Corpus Creation and Language Identification in Low-Resource Code-Mixed Telugu-English Text
Siva Subrahamanyam Varma Kusampudi
|
Anudeep Chaluvadi
|
Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Code-Mixing (CM) is a common phenomenon in multilingual societies. CM plays a significant role in technology and medical fields where terminologies in the native language are not available or known. Language Identification (LID) of the CM data will help solve NLP tasks such as Spell Checking, Named Entity Recognition, Part-Of-Speech tagging, and Semantic Parsing. In the current era of machine learning, a common problem to the above-mentioned tasks is the availability of Learning data to train models. In this paper, we introduce two Telugu-English CM manually annotated datasets (Twitter dataset and Blog dataset). The Twitter dataset contains more romanization variability and misspelled words than the blog dataset. We compare across various classification models and perform extensive bench-marking using both Classical and Deep Learning Models for LID compared to existing models. We propose two architectures for language classification (Telugu and English) in CM data: (1) Word Level Classification (2) Sentence Level word-by-word Classification and compare these approaches presenting two strong baselines for LID on these datasets.
Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization
Siva Subrahamanyam Varma Kusampudi
|
Preetham Sathineni
|
Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
In a multilingual society, people communicate in more than one language, leading to Code-Mixed data. Sentimental analysis on Code-Mixed Telugu-English Text (CMTET) poses unique challenges. The unstructured nature of the Code-Mixed Data is due to the informal language, informal transliterations, and spelling errors. In this paper, we introduce an annotated dataset for Sentiment Analysis in CMTET. Also, we report an accuracy of 80.22% on this dataset using novel unsupervised data normalization with a Multilayer Perceptron (MLP) model. This proposed data normalization technique can be extended to any NLP task involving CMTET. Further, we report an increase of 2.53% accuracy due to this data normalization approach in our best model.