Navya Jose
2021
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and Kannada
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Navya Jose | Anand Kumar M | Thomas Mandl | Prasanna Kumar Kumaresan | Rahul Ponnusamy | Hariharan R L | John P. McCrae | Elizabeth Sherly
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, offensive language identification in the under-resourced Tamil, Malayalam and Kannada languages is essential. As user-generated content is more often code-mixed and not well studied for under-resourced languages, it is imperative to create resources and conduct benchmarking studies to encourage research in under-resourced Dravidian languages. We created a shared task on offensive language detection in Dravidian languages. We summarize here the dataset for this challenge, which is openly available at https://competitions.codalab.org/competitions/27654, and present an overview of the methods and the results of the competing systems.
2020
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
Bharathi Raja Chakravarthi | Navya Jose | Shardul Suryawanshi | Elizabeth Sherly | John Philip McCrae
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific to this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff’s alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis in Malayalam-English code-mixed texts.