Sumukh S


2023

Event Annotation and Detection in Kannada-English Code-Mixed Social Media Data
Sumukh S | Abhinav Appidi | Manish Shrivastava
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a good amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To overcome this, tools such as language identifiers, part-of-speech (POS) taggers and named entity recognition (NER) systems have been developed for analysing code-mixed data. Another such tool is event detection, an important information retrieval task that can be used to identify critical facts occurring in the vast streams of unstructured text data available. While event detection from text is a hard problem on its own, social media data adds to the difficulty with its informal nature, and code-mixed (Kannada-English) data further complicates the problem due to its word-level mixing, lack of structure and incomplete information. In this work, we address this problem: we propose guidelines for the annotation of events in Kannada-English CM data and provide baselines for the task with careful feature selection.

2022

“Kanglish alli names!” Named Entity Recognition for Kannada-English Code-Mixed Social Media Data
Sumukh S | Manish Shrivastava
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a good amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To overcome this, tools such as language identifiers and part-of-speech (POS) taggers have been developed for analysing code-mixed data. Another such tool is Named Entity Recognition (NER), an important Natural Language Processing (NLP) task, which is not only a subtask of Information Extraction but is also needed for downstream NLP tasks such as semantic role labeling. While entity extraction from social media data is generally difficult due to its informal nature, code-mixed data further complicates the problem because the text is informal, unstructured and often incomplete. In this work, we present the first corpus of Kannada-English code-mixed social media data annotated with named entity tags for NER. We provide strong baselines on our corpus with machine learning classification models such as CRF, Bi-LSTM, and Bi-LSTM-CRF, using word, character, and lexical features.

2020

Detection and Annotation of Events in Kannada
Suhan Prabhu | Ujwal Narayan | Alok Debnath | Sumukh S | Manish Shrivastava
Proceedings of the 16th Joint ACL-ISO Workshop on Interoperable Semantic Annotation

In this paper, we provide basic guidelines for the detection and linguistic analysis of events in Kannada. Kannada is a morphologically rich, resource-poor Dravidian language spoken in southern India. As most information retrieval and extraction tasks are resource-intensive, very little work has been done on Kannada NLP, with almost no effort in discourse analysis or in dataset creation for representing events or other semantic annotations in text. We linguistically analyze what constitutes an event in this language and the challenges of discourse-level annotation and representation, which arise from the rich derivational morphology of the language that allows free word order, numerous multi-word expressions, adverbial participle constructions and constraints on subject-verb relations. This paper is one of the first attempts at large-scale discourse-level annotation for Kannada, which can be used for semantic annotation and corpus development for other tasks in the language.