2024
SetFit: A Robust Approach for Offensive Content Detection in Tamil-English Code-Mixed Conversations Using Sentence Transfer Fine-tuning
Kathiravan Pannerselvam | Saranya Rajiakodi | Sajeetha Thavareesan | Sathiyaraj Thangasamy | Kishore Ponnusamy
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Code-mixed languages are increasingly prevalent on social media and online platforms, presenting significant challenges in offensive content detection for natural language processing (NLP) systems. Our study explores how effectively the Sentence Transfer Fine-tuning (SetFit) method, combined with logistic regression, detects offensive content in a Tamil-English code-mixed dataset. We compare our model’s performance with five other NLP models: Multilingual BERT (mBERT), LSTM, BERT, IndicBERT, and Language-agnostic BERT Sentence Embeddings (LaBSE). Our model, SetFit, outperforms all five in accuracy, achieving 89.72%. These results suggest the sentence transformer model’s substantial potential for detecting offensive content in code-mixed languages. Our study provides valuable insights into the sentence transformer model’s ability to identify various types of offensive material in Tamil-English online conversations, paving the way for more advanced NLP systems tailored to code-mixed languages.
Findings of the Shared Task on Hate and Offensive Language Detection in Telugu Codemixed Text (HOLD-Telugu)@DravidianLangTech 2024
Premjith B | Bharathi Raja Chakravarthi | Prasanna Kumar Kumaresan | Saranya Rajiakodi | Sai Karnati | Sai Mangamuru | Chandu Janakiram
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
This paper examines the submissions of the teams that participated in the shared task on Hate and Offensive Language Detection in Telugu Codemixed Text (HOLD-Telugu), organized as part of DravidianLangTech 2024. The shared task encourages researchers and academicians to build models that identify harmful content in Telugu code-mixed social media text. The dataset was created by gathering YouTube comments and annotating them manually. A total of 23 teams participated and submitted results. The rank list was created by evaluating the submitted results with the macro F1-score.
Findings of the Shared Task on Multimodal Social Media Data Analysis in Dravidian Languages (MSMDA-DL)@DravidianLangTech 2024
Premjith B | Jyothish G | Sowmya V | Bharathi Raja Chakravarthi | K Nandhini | Rajeswari Natarajan | Abirami Murugappan | Bharathi B | Saranya Rajiakodi | Rahul Ponnusamy | Jayanth Mohan | Mekapati Reddy
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
This paper presents the findings of the shared task on multimodal sentiment analysis, abusive language detection, and hate speech detection in Dravidian languages. Through this shared task, researchers worldwide can submit models for three crucial social media data analysis challenges in Dravidian languages: sentiment analysis, abusive language detection, and hate speech detection. The aim is to build models for fine-grained sentiment analysis of multimodal data in Tamil and Malayalam, and for identifying abusive and hate content in multimodal data in Tamil. Three modalities make up the multimodal data: text, audio, and video. YouTube videos were gathered to create the datasets for the tasks. Thirty-nine teams took part in the competition; however, only two submitted results. The macro F1-score was used to assess the submissions.
Overview of Third Shared Task on Homophobia and Transphobia Detection in Social Media Comments
Bharathi Raja Chakravarthi | Prasanna Kumaresan | Ruba Priyadharshini | Paul Buitelaar | Asha Hegde | Hosahalli Shashirekha | Saranya Rajiakodi | Miguel Ángel García | Salud María Jiménez-Zafra | José García-Díaz | Rafael Valencia-García | Kishore Ponnusamy | Poorvi Shetty | Daniel García-Baena
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
This paper provides a comprehensive summary of the “Homophobia and Transphobia Detection in Social Media Comments” shared task, held at LT-EDI@EACL 2024. The objective of this task was to develop systems capable of identifying instances of homophobia and transphobia within social media comments. The challenge covered ten languages: English, Tamil, Malayalam, Telugu, Kannada, Gujarati, Hindi, Marathi, Spanish, and Tulu. Each comment in the dataset was annotated with one of three categories. The shared task attracted significant interest, with over 60 teams participating through the CodaLab platform. The participants’ predictions were evaluated with the macro F1 score.
Overview of Shared Task on Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes
Bharathi Raja Chakravarthi | Saranya Rajiakodi | Rahul Ponnusamy | Kathiravan Pannerselvam | Anand Kumar Madasamy | Ramachandran Rajalakshmi | Hariharan LekshmiAmmal | Anshid Kizhakkeparambil | Susminu S Kumar | Bhuvaneswari Sivagnanam | Charmathi Rajkumar
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
This paper offers a detailed overview of the first shared task on “Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes,” organized as part of the LT-EDI@EACL 2024 conference. The task was to classify misogynistic content and troll memes on online platforms, focusing specifically on memes in Tamil and Malayalam. A total of 52 teams registered for the competition, with four submitting systems for the Tamil meme classification task and three for the Malayalam task. The outcomes of this shared task are significant, providing insights into the current state of misogynistic content in digital memes and highlighting the effectiveness of various computational approaches in identifying such detrimental content. The top-performing model achieved a macro F1 score of 0.73 in Tamil and 0.87 in Malayalam.
Overview of Shared Task on Caste and Migration Hate Speech Detection
Saranya Rajiakodi | Bharathi Raja Chakravarthi | Rahul Ponnusamy | Prasanna Kumaresan | Sathiyaraj Thangasamy | Bhuvaneswari Sivagnanam | Charmathi Rajkumar
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
We present an overview of the first shared task on “Caste and Migration Hate Speech Detection.” The shared task was organized as part of LT-EDI@EACL 2024. Systems must make a binary decision: whether or not a text is caste/migration hate speech. The dataset presented in this shared task is in Tamil, which is one of the under-resourced languages. A total of 51 teams participated in this task, of which 15 submitted results. To the best of our knowledge, this is the first shared task conducted on textual hate speech detection concerning caste and migration. In this study, we present a systematic analysis of all the participants’ contributions, along with the statistics of the dataset, which consists of social media comments in Tamil. We also provide a comprehensive analysis of the participants’ methodologies and their findings.
2023
CSSCUTN@DravidianLangTech: Abusive comments Detection in Tamil and Telugu
Kathiravan Pannerselvam | Saranya Rajiakodi | Rahul Ponnusamy | Sajeetha Thavareesan
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
Code-mixing is the word- or phrase-level interchange of two or more languages within a sentence, in conversation or in written text. This phenomenon is widespread on social media platforms, and understanding the underlying abusive comments in a code-mixed sentence is a complex challenge. We present our system submitted to the DravidianLangTech Shared Task on Abusive Comment Detection in Tamil and Telugu. Our approach involves building a multiclass abusive comment detection model that recognizes 8 different labels. The provided samples are code-mixed Tamil-English text, where Tamil is represented in romanised form. We focused on the multiclass classification subtask, using Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR) classifiers. Our method ranked ninth among all competing systems for the classification of abusive comments in code-mixed text. Our proposed classifier achieves an accuracy of 0.99 and an F1-score of 0.99 on a balanced dataset using TF-IDF with SVM. It can be used effectively to detect abusive comments in Tamil-English code-mixed text.