“Kanglish alli names!” Named Entity Recognition for Kannada-English Code-Mixed Social Media Data

Sumukh S, Manish Shrivastava


Abstract
Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a good amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To overcome this, tools such as language identifiers and part-of-speech (POS) taggers for analysing code-mixed data have been developed. One such tool is Named Entity Recognition (NER), an important Natural Language Processing (NLP) task, which is not only a subtask of Information Extraction but is also needed for downstream NLP tasks such as semantic role labeling. While entity extraction from social media data is generally difficult due to its informal nature, code-mixed data further complicates the problem because it is informal, unstructured, and incomplete. In this work, we present the first corpus of Kannada-English code-mixed social media data annotated with named entity tags for NER. We provide strong baselines with machine learning classification models such as CRF, Bi-LSTM, and Bi-LSTM-CRF on our corpus with word, character, and lexical features.
Anthology ID:
2022.wnut-1.17
Volume:
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
154–161
URL:
https://aclanthology.org/2022.wnut-1.17
Cite (ACL):
Sumukh S and Manish Shrivastava. 2022. “Kanglish alli names!” Named Entity Recognition for Kannada-English Code-Mixed Social Media Data. In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022), pages 154–161, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
“Kanglish alli names!” Named Entity Recognition for Kannada-English Code-Mixed Social Media Data (S & Shrivastava, WNUT 2022)
PDF:
https://aclanthology.org/2022.wnut-1.17.pdf