Part-of-Speech (POS) is one of the essential tasks for many Natural Language Processing (NLP) applications. There has been a significant amount of work done in POS tagging for resource-rich languages. POS tagging is an essential phase of text analysis in understanding the semantics and context of language. These tags are useful for higher-level tasks such as building parse trees, which can be used for Named Entity Recognition, Coreference resolution, Sentiment Analysis, and Question Answering. There has been work done on code-mixed social media corpus but not on POS tagging of Kannada-English code-mixed data. Here, we present Kannada-English code- mixed social media corpus annotated with corresponding POS tags. We also experimented with machine learning classification models CRF, Bi-LSTM, and Bi-LSTM-CRF models on our corpus.
Emotion prediction is a critical task in the field of Natural Language Processing (NLP). There has been a significant amount of work done in emotion prediction for resource-rich languages. There has been work done on code-mixed social media corpus but not on emotion prediction of Kannada-English code-mixed Twitter data. In this paper, we analyze the problem of emotion prediction on corpus obtained from code-mixed Kannada-English extracted from Twitter annotated with their respective ‘Emotion’ for each tweet. We experimented with machine learning prediction models using features like Character N-Grams, Word N-Grams, Repetitive characters, and others on SVM and LSTM on our corpus, which resulted in an accuracy of 30% and 32% respectively.
Named Entity Recognition(NER) is one of the important tasks in Natural Language Processing(NLP) and also is a subtask of Information Extraction. In this paper we present our work on NER in Telugu-English code-mixed social media data. Code-Mixing, a progeny of multilingualism is a way in which multilingual people express themselves on social media by using linguistics units from different languages within a sentence or speech context. Entity Extraction from social media data such as tweets(twitter) is in general difficult due to its informal nature, code-mixed data further complicates the problem due to its informal, unstructured and incomplete information. We present a Telugu-English code-mixed corpus with the corresponding named entity tags. The named entities used to tag data are Person(‘Per’), Organization(‘Org’) and Location(‘Loc’). We experimented with the machine learning models Conditional Random Fields(CRFs), Decision Trees and BiLSTMs on our corpus which resulted in a F1-score of 0.96, 0.94 and 0.95 respectively.