Kogilavani Shanmugavadivel - ACL Anthology

Kogilavani Shanmugavadivel

2025

Overview of the Shared Task on Fake News Detection in Dravidian Languages-DravidianLangTech@NAACL 2025
Malliga Subramanian | Premjith B | Kogilavani Shanmugavadivel | Santhiya Pandiyan | Balasubramanian Palani | Bharathi Raja Chakravarthi
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Detecting and mitigating fake news on social media is critical for preventing misinformation, protecting democratic processes, preventing public distress, mitigating hate speech, reducing financial fraud, maintaining information reliability, etc. This paper summarizes the findings of the shared task “Fake News Detection in Dravidian Languages—DravidianLangTech@NAACL 2025.” The goal of this task is to detect fake content in social media posts in Malayalam. It consists of two subtasks: the first focuses on binary classification (Fake or Original), while the second categorizes the fake news into five types—False, Half True, Mostly False, Partly False, and Mostly True. In Task 1, 22 teams submitted machine learning techniques like SVM, Naïve Bayes, and SGD, as well as BERT-based architectures. Among these, XLM-RoBERTa had the highest macro F1 score of 89.8%. For Task 2, 11 teams submitted models using LSTM, GRU, XLM-RoBERTa, and SVM. XLM-RoBERTa once again outperformed other models, attaining the highest macro F1 score of 68.2%.

BlueRay@DravidianLangTech-2025: Fake News Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Aiswarya M | Aruna T | Jeevaananth S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rise of fake news presents significant issues, particularly for underrepresented languages. This study tackles fake news identification in Dravidian languages with two subtasks: binary classification of YouTube comments and multi-class classification of Malayalam news into five groups. Text preprocessing, vectorization, and transformer-based embeddings are all part of the methodology, including baseline comparisons utilizing classic machine learning, deep learning, and transfer learning models. In Task 1, our solution placed 17th, displaying acceptable binary classification performance. In Task 2, we finished in eighth place by effectively identifying nuanced categories of Malayalam news, demonstrating the efficacy of transformer-based models.

InnovationEngineers@DravidianLangTech 2025: Enhanced CNN Models for Detecting Misogyny in Tamil Memes Using Image and Text Classification
Kogilavani Shanmugavadivel | Malliga Subramanian | Pooja Sree M | Palanimurugan V | Roshini Priya K
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rise of misogynistic memes on social media posed challenges to civil discourse. This paper aimed to detect misogyny in Dravidian language memes using a multimodal deep learning approach. We integrated Bidirectional Encoder Representations from Transformers (BERT), Long Short-Term Memory (LSTM), EfficientNet, and a Vision Language Model (VLM) to analyze textual and visual information. EfficientNet extracted image features, LSTM captured sequential text patterns, and BERT learned language-specific embeddings. Among these, VLM achieved the highest accuracy of 85.0% and an F1-score of 70.8, effectively capturing visual-textual relationships. Validated on a curated dataset, our method outperformed baselines in precision, recall, and F1-score. Our approach ranked 12th out of 118 participants for the Tamil language, highlighting its competitive performance. This research emphasizes the importance of multimodal models in detecting harmful content. Future work can explore improved feature fusion techniques to enhance classification accuracy.

KEC_AI_ZEROWATTS@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Naveenram C E | Vishal Rs | Srinesh S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Hate speech detection in code-mixed Dravidian languages presents significant challenges due to the multilingual and unstructured nature of the data. In this work, we participated in the shared task to detect hate speech in Tamil, Malayalam, and Telugu using both text and audio data. We explored various machine learning models, including Logistic Regression, Ridge Classifier, Random Forest, and Convolutional Neural Networks (CNN). For Tamil text data, Logistic Regression achieved the highest macro-F1 score of 0.97, while Ridge Classifier performed best for audio with 0.75. In Malayalam, Random Forest excelled for text with 0.97, and CNN for audio with 0.69. For Telugu, Ridge Classifier achieved 0.89 for text and CNN 0.87 for audio. These results demonstrate the efficacy of our multimodal approach in addressing the complexity of hate speech detection across the Dravidian languages. Among 145 teams, our submissions ranked 11th for Tamil, 6th for Malayalam, and 8th for Telugu.

Beyond_Tech@DravidianLangTech 2025: Political Multiclass Sentiment Analysis using Machine Learning and Neural Network
Kogilavani Shanmugavadivel | Malliga Subramanian | Sanjai R | Mohammed Sameer | Motheeswaran K
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Research on political sentiment is essential for comprehending public opinion in the digital age, as social media and news platforms are often the sites of discussions. To categorize political remarks into sentiments like positive, negative, neutral, opinionated, substantiated, and sarcastic, this study offers a multiclass sentiment analysis approach. We trained models, such as Random Forest and a Feedforward Neural Network, after preprocessing and feature extraction from a large dataset of political texts using Natural Language Processing approaches. The Random Forest model, which performed well at identifying more complex attitudes like sarcasm and opinionated utterances, had the greatest accuracy of 84%, followed closely by the Feedforward Neural Network model, which had 83%. These results highlight how well political discourse can be analyzed by combining deep learning and traditional machine learning techniques. There is also room for improvement by adding external metadata and using sophisticated models like BERT for better sentiment classification.

The_Deathly_Hallows@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Vasantharan K | Prethish G A | Santhosh S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The DravidianLangTech@NAACL 2025 shared task focused on multimodal hate speech detection in Tamil, Telugu, and Malayalam using social media text and audio. Our approach integrated advanced preprocessing, feature extraction, and deep learning models. For text, preprocessing steps included normalization, tokenization, stopword removal, and data augmentation. Feature extraction was performed using TF-IDF, Count Vectorizer, BERT-base-multilingual-cased, XLM-Roberta-Base, and XLM-Roberta-Large, with the latter achieving the best performance. The models attained training accuracies of 83% (Tamil), 88% (Telugu), and 85% (Malayalam). For audio, Mel Frequency Cepstral Coefficients (MFCCs) were extracted and enhanced with augmentation techniques such as noise addition, time-stretching, and pitch-shifting. A CNN-based model achieved training accuracies of 88% (Tamil), 88% (Telugu), and 93% (Malayalam). Macro F1 scores ranked Tamil 3rd (0.6438), Telugu 15th (0.1559), and Malayalam 12th (0.3016). Our study highlights the effectiveness of text-audio fusion in hate speech detection and underscores the importance of preprocessing, multimodal techniques, and feature augmentation in addressing hate speech on social media.

KEC_AI_BRIGHTRED@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Nishdharani P | Santhiya E | Yaswanth Raj E
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Hate speech detection in multilingual settings presents significant challenges due to linguistic variations and speech patterns across different languages. This study proposes a fusion-based approach that integrates audio and text features to enhance classification accuracy in Tamil, Telugu, and Malayalam. We extract Mel-Frequency Cepstral Coefficients and their delta variations for speech representation, while text-based features contribute additional linguistic insights. Several models were evaluated, including BiLSTM, Capsule Networks with Attention, Capsule-GRU, ConvLSTM-BiLSTM, and Multinomial Naïve Bayes, to determine the most effective architecture. Experimental results demonstrate that Random Forest performs best for text classification, while CNN achieves the highest accuracy for audio classification. The model was evaluated using the Macro F1 score and ranked ninth in Tamil with a score of 0.3018, ninth in Telugu with a score of 0.251, and thirteenth in Malayalam with a score of 0.2782 in the Multimodal Social Media Data Analysis in Dravidian Languages shared task at DravidianLangTech@NAACL 2025. By leveraging feature fusion and optimized model selection, this approach provides a scalable and effective framework for multilingual hate speech detection, contributing to improved content moderation on social media platforms.
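
As an illustration of the "delta variations" mentioned in the abstract above, the following is a minimal sketch of the commonly used regression-style delta over a sequence of per-frame coefficients. The abstract does not say which formula, window size, or library the authors used, so the function name `delta`, the `window=2` default, the edge-padding choice, and the toy `frames` data are all assumptions made for this example.

```python
def delta(frames, window=2):
    """Regression-style delta features (assumed formula, not from the paper).

    frames: list of frames, each a list of coefficients (e.g. MFCCs).
    Returns a list with the same shape. Frames beyond either end are
    treated as copies of the first/last frame (edge padding).
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, num_frames - 1)][k]
                prv = frames[max(t - n, 0)][k]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Hypothetical toy input: 6 frames, 2 coefficients, each a linear ramp.
frames = [[float(t), 2.0 * t] for t in range(6)]
d1 = delta(frames)   # first-order deltas
d2 = delta(d1)       # second-order ("delta-delta") features
```

On a linear ramp the interior deltas equal the slope of each coefficient, which is a quick sanity check for the implementation.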

KEC-Elite-Analysts@LT-EDI 2025: Leveraging Deep Learning for Racial Hoax Detection in Code-Mixed Hindi-English Tweets
Malliga Subramanian | Aruna A | Amudhavan M | Jahaganapathi S | Kogilavani Shanmugavadivel
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

Detecting misinformation in code-mixed languages, particularly Hindi-English, poses significant challenges in natural language processing due to the linguistic diversity found on social media. This paper focuses on racial hoax detection—false narratives that target specific communities—within Hindi-English YouTube comments. We evaluate the effectiveness of several machine learning models, including Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Multi-Layer Perceptron, using a dataset of 5,105 annotated comments. Model performance is assessed using accuracy, precision, recall, and F1-score. Experimental results indicate that neural and ensemble models consistently outperform traditional classifiers. Future work will explore the use of transformer-based architectures and data augmentation techniques to enhance detection in low-resource, code-mixed scenarios.

Findings of the Shared Task Caste and Migration Hate Speech Detection
Saranya Rajiakodi | Bharathi Raja Chakravarthi | Rahul Ponnusamy | Shunmuga Priya Muthusamy Chinnan | Prasanna Kumar Kumaresan | Sathiyaraj Thangasamy | Bhuvaneswari Sivagnanam | Balasubramanian Palani | Kogilavani Shanmugavadivel | Abirami Murugappan | Charmathi Rajkumar
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

Hate speech targeting caste and migration communities is a growing concern in online platforms, particularly in linguistically diverse regions. By focusing on Tamil-language text content, this task provides a unique opportunity to tackle caste- or migration-related hate speech detection in Tamil, a low-resource language, contributing to a safer digital space. We present the results and main findings of the shared task on caste and migration hate speech detection. The task is a binary classification determining whether a text is caste/migration-related hate speech or not. The task attracted 17 participating teams, experimenting with a wide range of methodologies from traditional machine learning to advanced multilingual transformers. The top performing system achieved a macro F1-score of 0.88105, using an ensemble of fine-tuned transformer models including XLM-R and MuRIL. Our analysis highlights the effectiveness of multilingual transformers in low-resource settings, of ensemble learning, and of techniques informed by cultural and sociopolitical context.

KEC_TECH_TITANS@DravidianLangTech 2025: Abusive Text Detection in Tamil and Malayalam Social Media Comments Using Machine Learning
Malliga Subramanian | Kogilavani Shanmugavadivel | Deepiga P | Dharshini S | Ananthakumar S | Praveenkumar C
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Social media platforms have become a breeding ground for hostility and toxicity, with abusive language targeting women becoming a pervasive issue. This paper addresses the detection of abusive content in Tamil and Malayalam social media comments using machine learning models. We experimented with GRU, LSTM, Bidirectional LSTM, CNN, FastText, and XGBoost models, evaluating their performance on a code-mixed dataset of Tamil and Malayalam comments collected from YouTube. Our findings demonstrate that FastText and CNN models yielded the best performance among the evaluated classifiers, achieving F1-scores of 0.73 each. This study contributes to the ongoing research on abusive text detection for under-resourced languages and highlights the need for robust, scalable solutions to combat online toxicity.

TEAM_STRIKERS@DravidianLangTech2025: Misogyny Meme Detection in Tamil Using Multimodal Deep Learning
Kogilavani Shanmugavadivel | Malliga Subramanian | Mohamed Arsath H | Ramya K | Ragav R
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This study focuses on detecting misogynistic content in memes under the title “Misogynistic Meme Detection Using Multimodal Deep Learning.” Through an analysis of both textual and visual components of memes, specifically in Tamil, the study seeks to detect misogynistic rhetoric directed towards women. Preprocessing and vectorizing text data using methods like TF-IDF, GloVe, Word2Vec, and transformer-based embeddings like BERT are all part of the textual analysis process. Deep learning models like ResNet and EfficientNet are used to extract significant image attributes for the visual component. To improve classification performance, these characteristics are then combined in a multimodal framework employing hybrid architectures such as CNN-LSTM, GRU-EfficientNet, and ResNet-BERT. The classification of memes as misogynistic or non-misogynistic is done using sophisticated machine learning and deep learning approaches. Model performance is evaluated using metrics like Accuracy, Precision, Recall, F1-Score, and Macro Average F1-Score. This study shows how multimodal deep learning can effectively detect and counteract negative narratives about women in digital media by combining natural language processing with image classification.

KEC-Elite-Analysts@DravidianLangTech 2025: Deciphering Emotions in Tamil-English and Code-Mixed Social Media Tweets
Malliga Subramanian | Aruna A | Anbarasan T | Amudhavan M | Jahaganapathi S | Kogilavani Shanmugavadivel
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Sentiment analysis in code-mixed languages, particularly Tamil-English, is a growing challenge in natural language processing (NLP) due to the prevalence of multilingual communities on social media. This paper explores various machine learning and transformer-based models, including Logistic Regression, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), BERT, and mBERT, for sentiment classification of Tamil-English code-mixed text. The models are evaluated on a shared task dataset provided by DravidianLangTech@NAACL 2025, with performance measured through accuracy, precision, recall, and F1-score. Our results demonstrate that transformer-based models, particularly mBERT, outperform traditional classifiers in identifying sentiment polarity. Future work aims to address the challenges posed by code-switching and class imbalance through advanced model architectures and data augmentation techniques.

KEC_AI_DATA_DRIFTERS@DravidianLangTech 2025: Fake News Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Vishali K S | Priyanka B | Naveen Kumar K
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Detecting fake news in Malayalam poses significant challenges due to linguistic diversity, code-mixing, and the limited availability of structured datasets. We participated in the Fake News Detection in Dravidian Languages shared task, classifying news and social media posts into binary and multi-class categories. Our experiments used traditional ML models (Support Vector Machine (SVM), Random Forest, Logistic Regression, and Naive Bayes) and transfer learning models (Multilingual BERT (mBERT) and XLNet). In binary classification, SVM achieved the highest macro-F1 score of 0.97, while in multi-class classification, it also outperformed other models with a macro-F1 score of 0.98. Random Forest ranked second in both tasks. Despite their advanced capabilities, mBERT and XLNet exhibited lower precision due to data limitations. Our approach enhances fake news detection and NLP solutions for low-resource languages.

The_Deathly_Hallows@DravidianLangTech 2025: AI Content Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Vasantharan K | Prethish G A | Vijayakumaran S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The DravidianLangTech@NAACL 2025 shared task focused on Detecting AI-generated Product Reviews in Dravidian Languages, aiming to address the challenge of distinguishing AI-generated content from human-written reviews in Tamil and Malayalam. As AI-generated text becomes more prevalent, ensuring the authenticity of online product reviews is crucial for maintaining consumer trust and preventing misinformation. In this study, we explore various feature extraction techniques, including TF-IDF, Count Vectorizer, and transformer-based embeddings such as BERT-Base-Multilingual-Cased and XLM-RoBERTa-Large, to build a robust classification model. Our approach achieved F1-scores of 0.9298 for Tamil and 0.8797 for Malayalam, ranking 8th in Tamil and 11th in Malayalam among all participants. The results highlight the effectiveness of transformer-based embeddings in differentiating AI-generated and human-written content. This research contributes to the growing body of work on AI-generated content detection, particularly in underrepresented Dravidian languages, and provides insights into the challenges unique to these languages.
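
To make the TF-IDF feature extraction named in the abstract above concrete, here is a minimal sketch of one textbook variant (term frequency times `log(N / df)`). The abstract does not describe the authors' tokenizer, weighting variant, or tooling, so the `tfidf` function, the whitespace tokenization, and the English toy reviews are assumptions for illustration only; library implementations typically add smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This is an assumed textbook variant, not the paper's exact setup.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy reviews (not from the shared-task data).
reviews = [
    "good phone good battery",
    "good camera",
    "poor battery life",
]
vectors = tfidf(reviews)
```

Terms that occur in fewer documents receive a larger weight for the same in-document frequency, which is why "phone" outweighs "battery" in the first toy review.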

KEC_AI_GRYFFINDOR@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel | Malliga Subramanian | ShahidKhan S | Shri Sashmitha.s | Yashica S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

It is difficult to detect hate speech in code-mixed Dravidian languages because the data is multilingual and unstructured. In this research, we took part in the shared task to detect hate speech in text and audio data for Tamil, Malayalam, and Telugu. We tested different machine learning and deep learning models such as Logistic Regression, Ridge Classifier, Random Forest, and CNN. For Tamil, Logistic Regression gave the best macro-F1 score of 0.97 for text, whereas Ridge Classifier was the best for audio with a score of 0.75. For Malayalam, Random Forest gave the best F1-score of 0.97 for text, and CNN was the best for audio (F1-score: 0.69). For Telugu, Ridge Classifier gave the best F1-score of 0.89 for text, whereas CNN was the best for audio (F1-score: 0.87). Our findings indicate that a multimodal solution efficiently tackles the intricacy of hate speech detection in Dravidian languages. In this shared task, out of 145 teams, we attained the 12th rank for Tamil and the 7th rank for Malayalam and Telugu.

Team_Catalysts@DravidianLangTech 2025: Leveraging Political Sentiment Analysis using Machine Learning Techniques for Classifying Tamil Tweets
Kogilavani Shanmugavadivel | Malliga Subramanian | Subhadevi K | Sowbharanika Janani Sivakumar | Rahul K
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This work proposed a methodology for assessing political sentiments in Tamil tweets using machine learning models. The approach addressed linguistic challenges in Tamil text, including cleaning, normalization, tokenization, and class imbalance, through a robust preprocessing pipeline. Various models, including Random Forest, Logistic Regression, and CatBoost, were applied, with Random Forest achieving a macro F1-score of 0.2933 and securing 8th rank among 153 participants in the Codalab competition. This accomplishment highlights the effectiveness of machine learning models in handling the complexities of multilingual, code-mixed, and unstructured data in Tamil political discourse. The study also emphasized the importance of tailored preprocessing techniques to improve model accuracy and performance. It demonstrated the potential of computational linguistics and machine learning in understanding political discourse in low-resource languages like Tamil, contributing to advancements in regional sentiment analysis.

KECEmpower@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Malliga Subramanian | Kogilavani Shanmugavadivel | Indhuja V S | Kowshik P | Jayasurya S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The detection of abusive text targeting women, especially in Dravidian languages like Tamil and Malayalam, presents a unique challenge due to linguistic complexities and code-mixing on social media. This paper evaluates machine learning models such as Support Vector Machines (SVM), Logistic Regression (LR), and Random Forest Classifiers (RFC) for identifying abusive content. Code-mixed datasets sourced from platforms like YouTube are used to train and test the models. Performance is evaluated using accuracy, precision, recall, and F1-score metrics. Our findings show that SVM outperforms the other classifiers in accuracy and recall. However, challenges persist in detecting implicit abuse and addressing informal, culturally nuanced language. Future work will explore transformer-based models like BERT for better context understanding, along with data augmentation techniques to enhance model performance. Additionally, efforts will focus on expanding labeled datasets to improve abuse detection in these low-resource languages.

KECLinguAIsts@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
Malliga Subramanian | Rojitha R | Mithun Chakravarthy Y | Renusri R V | Kogilavani Shanmugavadivel
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

With the surge of AI-generated content in online spaces, ensuring the authenticity of product reviews has become a critical challenge. This paper addresses the task of detecting AI-generated product reviews in Dravidian languages, specifically Tamil and Malayalam, which present unique hurdles due to their complex morphology, rich syntactic structures, and code-mixed nature. We introduce a novel methodology combining machine learning classifiers with advanced multilingual transformer models to identify AI-generated reviews. Our approach not only accounts for the linguistic intricacies of these languages but also leverages domain-specific datasets to improve detection accuracy. For Tamil, we evaluate Logistic Regression, Random Forest, and XGBoost, while for Malayalam, we explore Logistic Regression, Multinomial Naive Bayes (MNB), and Support Vector Machines (SVM). Transformer-based models significantly outperform these traditional classifiers, demonstrating superior performance across multiple metrics.

KEC_TECH_TITANS@DravidianLangTech 2025:Sentiment Analysis for Low-Resource Languages: Insights from Tamil and Tulu using Deep Learning and Machine Learning Models
Malliga Subramanian | Kogilavani Shanmugavadivel | Dharshini S | Deepiga P | Praveenkumar C | Ananthakumar S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Sentiment analysis in Dravidian languages like Tamil and Tulu presents significant challenges due to their linguistic diversity and limited resources for natural language processing (NLP). This study explores sentiment classification for Tamil and Tulu, focusing on the complexities of handling both languages, which differ in script, grammar, and vocabulary. We employ a variety of machine learning and deep learning techniques, including traditional models like Support Vector Machines (SVM), and K-Nearest Neighbors (KNN), as well as advanced transformer-based models like BERT and multilingual BERT (mBERT). A key focus of this research is to evaluate the performance of these models on sentiment analysis tasks, considering metrics such as accuracy, precision, recall, and F1-score. The results show that transformer-based models, particularly mBERT, significantly outperform traditional machine learning models in both Tamil and Tulu sentiment classification. This study also highlights the need for further research into addressing challenges like language-specific nuances, dataset imbalance, and data augmentation techniques for improved sentiment analysis in under-resourced languages like Tamil and Tulu.

KEC_AI_VSS_run2@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Kogilavani Shanmugavadivel | Malliga Subramanian | Sathiyaseelan S | Suresh Babu K | Vasikaran S
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The increasing instances of abusive language against women on social media platforms have brought to the fore the need for effective content moderation systems, especially in low-resource languages like Tamil and Malayalam. This paper addresses the challenge of detecting gender-based abuse in YouTube comments using annotated datasets in these languages. Comments are classified into abusive and non-abusive categories. We applied the following machine learning algorithms for classification: Random Forest, Support Vector Machine, K-Nearest Neighbor, Gradient Boosting, and AdaBoost. SVM achieved a micro F1 score of 0.95 for Tamil, and Random Forest achieved 0.72 for Malayalam. Our system participated in the shared task on abusive comment detection and, out of 160 teams, ranked 13th for Malayalam and 34th for Tamil; these results indicate both the challenges and the potential of our approach in low-resource language processing. Our findings highlight the significance of tailored approaches to language-specific abuse detection.

2024

Code_Makers@DravidianLangTech-EACL 2024 : Sentiment Analysis in Code-Mixed Tamil using Machine Learning Techniques
Kogilavani Shanmugavadivel | Sowbharanika J S | Navbila K | Malliga Subramanian
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rising importance of sentiment analysis in online community research is addressed in our project, which focuses on the surge of code-mixed writing in multilingual social media. Targeting sentiments in texts combining Tamil and English, our supervised learning approach, particularly the Decision Tree algorithm, proves essential for effective sentiment classification. Notably, Decision Tree (accuracy: 0.99, average F1 score: 0.39) and Random Forest (accuracy: 0.99, macro average F1 score: 0.35) exhibit high accuracy, while SVM (accuracy: 0.78, macro average F1 score: 0.68), Logistic Regression (accuracy: 0.75, macro average F1 score: 0.62), and KNN (accuracy: 0.73, macro average F1 score: 0.26) also demonstrate commendable results. These findings showcase the project’s efficacy, offering promise for linguistic research and technological advancements. Securing the 8th rank emphasizes its recognition in the field.
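
The abstract above reports accuracy and macro-averaged F1 side by side, and the two can differ widely. The sketch below shows how both metrics are computed and how they can diverge when one class dominates. Class imbalance is offered only as one possible explanation: the abstract does not state the cause of its own gap, and the labels and predictions here are hypothetical toy data, not the shared-task dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy set: the classifier always predicts
# the majority class and never finds the two minority classes.
y_true = ["Positive"] * 18 + ["Negative", "Mixed"]
y_pred = ["Positive"] * 20

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Because macro F1 weights every class equally, a model that ignores minority classes can keep a high accuracy while its macro F1 drops sharply.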

Overview of the Second Shared Task on Fake News Detection in Dravidian Languages: DravidianLangTech@EACL 2024
Malliga Subramanian | Bharathi Raja Chakravarthi | Kogilavani Shanmugavadivel | Santhiya Pandiyan | Prasanna Kumar Kumaresan | Balasubramanian Palani | Premjith B | Vanaja K | Mithunja S | Devika K | Hariprasath S.b | Haripriya B | Vigneshwar E
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rise of online social media has revolutionized communication, offering users a convenient way to share information and stay updated on current events. However, this surge in connectivity has also led to the proliferation of misinformation, commonly known as fake news. This misleading content, often disguised as legitimate news, poses a significant challenge as it can distort public perception and erode trust in reliable sources. This shared task consists of two subtasks, Task 1 and Task 2. Task 1 aims to classify a given social media text as original or fake. The goal of Task 2, FakeDetect-Malayalam, is to encourage participants to develop effective models capable of accurately detecting and classifying fake news articles in the Malayalam language into different categories: False, Half True, Mostly False, Partly False, and Mostly True. For this shared task, 33 participants submitted their results.

MIT-KEC-NLP@DravidianLangTech-EACL 2024: Offensive Content Detection in Kannada and Kannada-English Mixed Text Using Deep Learning Techniques
Kogilavani Shanmugavadivel | Sowbarnigaa K S | Mehal Sakthi M S | Subhadevi K | Malliga Subramanian
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This study presents a strong methodology for detecting offensive content in multilingual text, with a focus on Kannada and Kannada-English mixed comments. The first step in data preprocessing is to work with a dataset containing Kannada comments, which is backed by Google Translate for Kannada-English translation. Following tokenization and sequence labeling, BIO tags are assigned to indicate the existence and bounds of objectionable spans within the text. A Bidirectional LSTM (BiLSTM) neural network model is trained on the annotated data and achieves a macro F1 score of 61.0 in recognizing objectionable content. Data preparation, model architecture definition, and iterative training with Kannada and Kannada-English text are all part of the training process. On a fresh dataset, the trained model accurately predicts offensive spans, emphasizing comments in the aforementioned languages. Recorded predictions, which include offensive span indices, are organized into a database.
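
The BIO tagging step described in the abstract above can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and spans given as character offsets with an exclusive end; the abstract does not specify the authors' tokenizer or span format, and the `bio_tags` helper and the English placeholder sentence are hypothetical.

```python
import re


def bio_tags(text, spans):
    """Assign B/I/O tags to whitespace tokens.

    spans: list of (start, end) character offsets, end exclusive.
    A token overlapping a span gets "B" if it is the first token of
    that span, "I" if it continues the same span, and "O" otherwise.
    """
    tokens, tags = [], []
    prev_span = None
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        current = None
        for idx, (span_start, span_end) in enumerate(spans):
            if tok_start < span_end and tok_end > span_start:
                current = idx
                break
        if current is None:
            tag = "O"
        elif current == prev_span:
            tag = "I"
        else:
            tag = "B"
        tokens.append(match.group())
        tags.append(tag)
        prev_span = current
    return tokens, tags


# Hypothetical placeholder sentence with one marked span.
text = "this comment has a bad phrase here"
start = text.index("bad phrase")
tokens, tags = bio_tags(text, [(start, start + len("bad phrase"))])
```

The resulting token/tag pairs are the kind of sequence-labeling input a BiLSTM tagger is usually trained on, and predicted tags can be mapped back to character spans through the token offsets.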

InnovationEngineers@DravidianLangTech-EACL 2024: Sentimental Analysis of YouTube Comments in Tamil by using Machine Learning
Kogilavani Shanmugavadivel | Malliga Subramanian | Palanimurugan V | Pavul chinnappan D
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

There is opportunity for machine learning and natural language processing research because of the growing volume of textual data. Although there has been little research done on trend extraction from YouTube comments, sentiment analysis is an intriguing issue because of the poor consistency and quality of the material found there. The purpose of this work is to use machine learning techniques and algorithms to do sentiment analysis on YouTube comments pertaining to popular themes. The findings demonstrate that sentiment analysis is capable of giving a clear picture of how actual events affect public opinion. This study aims to make it easier for academics to find high-quality sentiment analysis research publications. Data normalisation methods are used to clean an annotated corpus of 1500 citation sentences for the study. For classification, a system is built utilising one machine learning algorithm from among K-Nearest Neighbour (KNN), Naïve Bayes, SVC (Support Vector Machine), and RandomForest. Metrics like the f1-score and correctness score are used to assess the correctness of the system.

Beyond Tech@DravidianLangTech2024 : Fake News Detection in Dravidian Languages Using Machine Learning
Kogilavani Shanmugavadivel | Malliga Subramanian | Sanjai R | Mohammed Sameer B | Motheeswaran K
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

In the digital age, identifying fake news is essential when fake information travels quickly via social media platforms. This project employs machine learning techniques, including Random Forest, Logistic Regression, and Decision Tree, to distinguish between real and fake news. With the rise of news consumption on social media, it becomes essential to authenticate information shared on platforms like YouTube comments. The research emphasizes the need to stop spreading harmful rumors and focuses on authenticating news articles. The proposed model utilizes machine learning and natural language processing, specifically Support Vector Machines, to aggregate and determine the authenticity of news. To address the challenges of detecting fake news, this paper describes the Machine Learning (ML) models submitted to the “Fake News Detection in Dravidian Languages” shared task at DravidianLangTech@EACL 2024. Four different models are described, namely Naive Bayes, Support Vector Machine (SVM), Random Forest, and Decision Tree.

KEC_HAWKS@DravidianLangTech 2024 : Detecting Malayalam Fake News using Machine Learning Models
Malliga Subramanian | Jayanthjr J R | Muthu Karuppan P | Keerthibala T | Kogilavani Shanmugavadivel
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The proliferation of fake news in the Malayalam language across digital platforms has emerged as a pressing issue. By employing Recurrent Neural Networks (RNNs), a type of machine learning model, we aim to distinguish between Original and Fake News in Malayalam and achieved 9th rank in Task 1. RNNs are chosen for their ability to understand the sequence of words in a sentence, which is important in languages like Malayalam. Our main goal is to develop better models that can spot fake news effectively. We analyze various features to understand what contributes most to this accuracy. By doing so, we hope to provide a reliable method for identifying and combating fake news in the Malayalam language.

KEC-AI-NLP@LT-EDI-2024:Homophobia and Transphobia Detection in Social Media Comments using Machine Learning
Kogilavani Shanmugavadivel | Malliga Subramanian | Shri R | Srigha S | Samyuktha K | Nithika K
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Our work addresses the growing concern of abusive comments in online platforms, particularly focusing on the identification of Homophobia and Transphobia in social media comments. The goal is to categorize comments into three classes: Homophobia, Transphobia, and non-anti LGBT+ comments. Utilizing machine learning techniques and a deep learning model, our work involves training on an English dataset with a designated training set and testing on a validation set. This approach aims to contribute to the understanding and detection of Homophobia and Transphobia within the realm of social media interactions. Our team participated in the shared task organized by LTEDI@EACL 2024 and secured seventh rank in the task of Homophobia/Transphobia Detection in social media comments in Tamil with a macro-F1 score of 0.315. Also, our run was submitted for the English language and secured eighth rank with a macro-F1 score of 0.369. The run submitted for the Malayalam language secured fourth rank with a macro-F1 score of 0.883 using the Random Forest model.

KEC AI DSNLP@LT-EDI-2024:Caste and Migration Hate Speech Detection using Machine Learning Techniques
Kogilavani Shanmugavadivel | Malliga Subramanian | Aiswarya M | Aruna T | Jeevaananth S
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Commonly used language defines “hate speech” as objectionable statements that may jeopardize societal harmony by singling out a group or a person based on fundamental traits (including gender, caste, or religion). Using machine learning techniques, our research focuses on identifying hate speech in social media comments. Using a variety of machine learning methods, we created machine learning models to detect hate speech. An approximate Macro F1 of 0.60 was attained by the created models.

KEC_AI_MIRACLE_MAKERS@LT-EDI-2024: Stress Identification in Dravidian Languages using Machine Learning Techniques
Kogilavani Shanmugavadivel | Malliga Subramanian | Monika J | Monishaa S | Rishibalan B
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Identifying whether an individual is stressed or not stressed is our shared task topic. We have used several machine learning models for identifying stress. This paper presents our system submission for Tasks 1 and 2 for both the Tamil and Telugu datasets, focusing on using supervised approaches. For the Tamil dataset, we got the highest accuracy for the Support Vector Machine model with an f1-score of 0.98, and for the Telugu dataset, we got the highest accuracy for the Random Forest algorithm with an f1-score of 0.99. By using this model, the Stress Identification System will be helpful for individuals to improve their mental health in an optimistic manner.

2023

Team-KEC@LT-EDI: Detecting Signs of Depression from Social Media Text
Malliga S | Kogilavani Shanmugavadivel | Arunaa S | Gokulkrishna R | Chandramukhii A
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

The rise of social media has led to a drastic surge in the dissemination of hostile and toxic content, fostering an alarming proliferation of hate speech, inflammatory remarks, and abusive language. The exponential growth of social media has facilitated the widespread circulation of hostile and toxic content, giving rise to an unprecedented influx of hate speech, incendiary language, and abusive rhetoric. The study utilized different techniques to represent the text data in a numerical format. Word embedding techniques aim to capture the semantic and syntactic information of the text data, which is essential in text classification tasks. The study utilized various techniques such as CNN, BERT, and N-gram to classify social media posts into depression and non-depression categories. Text classification tasks often rely on deep learning techniques such as Convolutional Neural Networks (CNN), while the BERT model, which is pre-trained, has shown exceptional performance in a range of natural language processing tasks. To assess the effectiveness of the suggested approaches, the research employed multiple metrics, including accuracy, precision, recall, and F1-score. The outcomes of the investigation indicate that the suggested techniques can identify symptoms of depression with an average accuracy rate of 56%.

Overview of the shared task on Detecting Signs of Depression from Social Media Text
Kayalvizhi Sampath | Durairaj Thenmozhi | Bharathi Raja Chakravarthi | Jerin Mahibha C | Kogilavani Shanmugavadivel | Pratik Anil Rahood
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Social media has become a vital platform for personal communication. Its widespread use as a primary means of public communication offers an exciting opportunity for early detection and management of mental health issues. People often share their emotions on social media, but understanding the true depth of their feelings can be challenging. Depression, a prevalent problem among young people, is of particular concern due to its link with rising suicide rates. Identifying depression levels in social media texts is crucial for timely support and prevention of negative outcomes. However, it’s a complex task because human emotions are dynamic and can change significantly over time. The DepSign-LT-EDI@RANLP 2023 shared task aims to classify social media text into three depression levels: “Not Depressed,” “Moderately Depressed,” and “Severely Depressed.” This overview covers task details, dataset, methodologies used, and results analysis. Roberta-based models emerged as top performers, with the best result achieving an impressive macro F1-score of 0.584 among 31 participating teams.

KEC_AI_NLP_DEP @ LT-EDI : Detecting Signs of Depression From Social Media Texts
Kogilavani Shanmugavadivel | Malliga Subramanian | Vasantharan K | Prethish Ga | Sankar S | Sabari S
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

The goal of this study is to use machine learning approaches to detect depression indications in social media articles. Data gathering, pre-processing, feature extraction, model training, and performance evaluation are all aspects of the research. The collection consists of social media messages classified into three categories: not depressed, somewhat depressed, and severely depressed. The study contributes to the growing field of social media data-driven mental health analysis by stressing the use of feature extraction algorithms for obtaining relevant information from text data. The use of social media communications to detect depression has the potential to increase early intervention and help for people at risk. Several feature extraction approaches, such as TF-IDF, Count Vectorizer, and Hashing Vectorizer, are used to quantitatively represent textual data. These features are used to train and evaluate a wide range of machine learning models, including Logistic Regression, Random Forest, Decision Tree, Gaussian Naive Bayes, and Multinomial Naive Bayes. To assess the performance of the models, metrics such as accuracy, precision, recall, F1 score, and the confusion matrix are utilized. The Random Forest model with Count Vectorizer had the greatest accuracy on the development dataset, coming in at 92.99 percent. And with a macro F1-score of 0.362, we came in 19th position in the shared task. The findings show that machine learning is effective in detecting depression markers in social media articles.

Overview of Shared-task on Abusive Comment Detection in Tamil and Telugu
Ruba Priyadharshini | Bharathi Raja Chakravarthi | Malliga Subramanian | Subalalitha Chinnaudayar Navaneethakrishnan | Kogilavani Shanmugavadivel | Premjith B | Abirami Murugappan | Prasanna Kumar Kumaresan | Karnati Sai Prashanth | Mangamuru Sai Rishith Reddy | Janakiram Chandu
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

This paper discusses the submissions to the shared task on abusive comment detection in Tamil and Telugu codemixed social media text conducted as part of the third Workshop on Speech and Language Technologies for Dravidian Languages at RANLP 2023. The task encourages researchers to develop models to detect content containing abusive information in Tamil and Telugu codemixed social media text. The task has three subtasks: abusive comment detection in Tamil, Tamil-English, and Telugu-English. The dataset for all the tasks was developed by collecting comments from YouTube. The submitted models were evaluated using macro F1-score, and the rank list was prepared accordingly.
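Several of the shared tasks listed on this page rank systems by macro F1-score. The sketch below shows, under simplifying assumptions, how that metric can be computed: F1 is calculated separately for each class and the per-class values are averaged without weighting, so rare classes count as much as frequent ones. The label names and predictions are invented for illustration, and this is not the organizers' scoring script.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels: two abusive comments, four non-abusive ones.
y_true = ["abusive", "abusive", "none", "none", "none", "none"]
y_pred = ["abusive", "none", "none", "none", "none", "none"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the minority class has an F1 of 2/3 and the majority class 8/9, so the macro average (7/9) is lower than the accuracy (5/6). This is one reason macro F1 is a common choice for imbalanced datasets.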

KEC_AI_NLP@DravidianLangTech: Sentiment Analysis in Code Mixture Language
Kogilavani Shanmugavadivel | Malliga Subramanian | VetriVendhan S | Pramoth Kumar M | Karthickeyan S | Kavin Vishnu N
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Sentiment Analysis is a process that involves analyzing digital text to determine the emotional tone, such as positive, negative, neutral, or unknown. Sentiment Analysis of code-mixed languages presents challenges in natural language processing due to the complexity of code-mixed data, which combines vocabulary and grammar from multiple languages and creates unique structures. The scarcity of annotated data and the unstructured nature of code-mixed data are major challenges. To address these challenges, we explored various techniques, including Machine Learning models such as Decision Trees, Random Forests, Logistic Regression, and Gaussian Naïve Bayes, Deep Learning model, such as Long Short-Term Memory (LSTM), and Transfer Learning model like BERT, were also utilized. In this work, we obtained the dataset from the DravidianLangTech shared task by participating in a competition and accessing train, development and test data for Tamil Language. The results demonstrated promising performance in sentiment analysis of code-mixed text. Among all the models, deep learning model LSTM provides best accuracy of 0.61 for Tamil language.

Overview of the shared task on Fake News Detection from Social Media Text
Malliga Subramanian | Bharathi Raja Chakravarthi | Kogilavani Shanmugavadivel | Santhiya Pandiyan | Prasanna Kumar Kumaresan | Balasubramanian Palani | Muskaan Singh | Sandhiya Raja | Vanaja | Mithunajha S
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

VEL@LT-EDI: Detecting Homophobia and Transphobia in Code-Mixed Spanish Social Media Comments
Prasanna Kumar Kumaresan | Kishore Kumar Ponnusamy | Kogilavani Shanmugavadivel | Subalalitha Chinnaudayar Navaneethakrishnan | Ruba Priyadharshini | Bharathi Raja Chakravarthi
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Our research aims to address the task of detecting homophobia and transphobia in social media code-mixed comments written in Spanish. Code-mixed text in social media often violates strict grammar rules and incorporates non-native scripts, posing challenges for identification. To tackle this problem, we perform pre-processing by removing unnecessary content and establishing a baseline for detecting homophobia and transphobia. Furthermore, we explore the effectiveness of various traditional machine-learning models with feature extraction and pre-trained transformer model techniques. Our best configurations achieve macro F1 scores of 0.84 on the test set and 0.82 on the development set for Spanish, demonstrating promising results in detecting instances of homophobia and transphobia in code-mixed comments.

KEC_AI_NLP@DravidianLangTech: Abusive Comment Detection in Tamil Language
Kogilavani Shanmugavadivel | Malliga Subramanian | Shri Durga R | Srigha S | Sree Harene J S | Yasvanth Bala P
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Our work aims to identify negative comments associated with the categories Counter-speech, Xenophobia, Homophobia, Transphobia, Misandry, Misogyny, and None-of-the-above. In order to identify these categories from the given dataset, we propose three different kinds of models: traditional machine learning techniques, a deep learning model, and a transfer learning model called BERT. On the Tamil dataset, we train the models with the Train dataset and test them with the Validation data. Our team participated in the shared task organised by DravidianLangTech and secured 4th rank in the task of abusive comment detection in Tamil with a macro-F1 score of 0.35. Also, our run was submitted for abusive comment detection in code-mixed languages (Tamil-English) and secured 6th rank with a macro-F1 score of 0.42.

2022

Overview of Abusive Comment Detection in Tamil-ACL 2022
Ruba Priyadharshini | Bharathi Raja Chakravarthi | Subalalitha Chinnaudayar Navaneethakrishnan | Thenmozhi Durairaj | Malliga Subramanian | Kogilavani Shanmugavadivel | Siddhanth U Hegde | Prasanna Kumar Kumaresan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

The social media is one of the significant digital platforms that create a huge impact in peoples of all levels. The comments posted on social media is powerful enough to even change the political and business scenarios in very few hours. They also tend to attack a particular individual or a group of individuals. This shared task aims at detecting the abusive comments involving, Homophobia, Misandry, Counter-speech, Misogyny, Xenophobia, Transphobic. The hope speech is also identified. A dataset collected from social media tagged with the above said categories in Tamil and Tamil-English code-mixed languages are given to the participants. The participants used different machine learning and deep learning algorithms. This paper presents the overview of this task comprising the dataset details and results of the participants.

Findings of the Shared Task on Multi-task Learning in Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | CN Subalalitha | Sangeetha Sivanesan | Malliga Subramanian | Kogilavani Shanmugavadivel | Parameswari Krishnamurthy | Adeep Hande | Siddhanth U Hegde | Roshan Nayak | Swetha Valli
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

We present our findings from the first shared task on Multi-task Learning in Dravidian Languages at the second Workshop on Speech and Language Technologies for Dravidian Languages. In this task, a sentence in any of three Dravidian Languages is required to be classified into two closely related tasks namely Sentiment Analysis (SA) and Offensive Language Identification (OLI). The task spans over three Dravidian Languages, namely, Kannada, Malayalam, and Tamil. It is one of the first shared tasks that focuses on Multi-task Learning for closely related tasks, especially for a very low-resourced language family such as the Dravidian language family. In total, 55 people signed up to participate in the task, and due to the intricate nature of the task, especially in its first iteration, 3 submissions have been received.

Transformers at SemEval-2022 Task 5: A Feature Extraction based Approach for Misogynous Meme Detection
Shankar Mahadevan | Sean Benhur | Roshan Nayak | Malliga Subramanian | Kogilavani Shanmugavadivel | Kanchana Sivanraju | Bharathi Raja Chakravarthi
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Social media is an idea created to make the world smaller and more connected. Recently, it has become a hub of fake news and sexist memes that target women. Social Media should ensure proper women’s safety and equality. Filtering such information from social media is of paramount importance to achieving this goal. In this paper, we describe the system developed by our team for SemEval-2022 Task 5: Multimedia Automatic Misogyny Identification. We propose a multimodal training methodology that achieves good performance on both the subtasks, ranking 4th for Subtask A (0.718 macro F1-score) and 9th for Subtask B (0.695 macro F1-score) while exceeding the baseline results by good margins.

Findings of the Shared Task on Emotion Analysis in Tamil
Anbukkarasi Sampath | Thenmozhi Durairaj | Bharathi Raja Chakravarthi | Ruba Priyadharshini | Subalalitha Cn | Kogilavani Shanmugavadivel | Sajeetha Thavareesan | Sathiyaraj Thangasamy | Parameswari Krishnamurthy | Adeep Hande | Sean Benhur | Kishore Ponnusamy | Santhiya Pandiyan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper presents the overview of the shared task on emotional analysis in Tamil. The result of the shared task is presented at the workshop. This paper presents the dataset used in the shared task, task description, and the methodology used by the participants and the evaluation results of the submission. This task is organized as two Tasks. Task A is carried with 11 emotions annotated data for social media comments in Tamil and Task B is organized with 31 fine-grained emotion annotated data for social media comments in Tamil. For conducting experiments, training and development datasets were provided to the participants and results are evaluated for the unseen data. Totally we have received around 24 submissions from 13 teams. For evaluating the models, Precision, Recall, micro average metrics are used.

Co-authors

Santhiya Pandiyan 4

Thenmozhi Durairaj 3

Vasantharan K 3

Subalalitha Chinnaudayar Navaneethakrishnan 3

Praveenkumar C 2

Motheeswaran K 2

Parameswari Krishnamurthy 2

Abirami Murugappan 2

Jeevaananth S 2

Jahaganapathi S 2

Ananthakumar S 2

Sathiyaraj Thangasamy 2

Siddhanth U Hegde 2

Palanimurugan V 2

Hariprasath .s.b 1

Chandramukhii A 1

Jerin Mahibha C 1

Janakiram Chandu 1

Shunmuga Priya Muthusamy Chinnan 1

Subalalitha Cn 1

Pavul chinnappan D 1

Naveenram C E 1

Yaswanth Raj E 1

Mohamed Arsath H 1

Jayanthjr J R 1

Sowbharanika J S 1

Sree Harene J S 1

Roshini Priya K 1

Naveen Kumar K 1

Suresh Babu K 1

Sowbarnigaa K S 1

Muthu Karuppan P 1

Pramoth Kumar M 1

Mehal Sakthi M S 1

Shankar Mahadevan 1

Kavin Vishnu N 1

Nishdharani P 1

Yasvanth Bala P 1

Rahul Ponnusamy 1

Kishore Kumar Ponnusamy 1

Kishore Ponnusamy 1

Gokulkrishna R 1

Pratik Anil Rahood 1

Sandhiya Raja 1

Saranya Rajiakodi 1

Charmathi Rajkumar 1

VetriVendhan S 1

Karthickeyan S 1

Vijayakumaran S 1

Sathiyaseelan S 1

Karnati Sai Prashanth 1

Mangamuru Sai Rishith Reddy 1

Mohammed Sameer 1

Mohammed Sameer B 1

Kayalvizhi Sampath 1

Anbukkarasi Sampath 1

Shri Sashmitha.s 1

Muskaan Singh 1

Bhuvaneswari Sivagnanam 1

Sowbharanika Janani Sivakumar 1

Sangeetha Sivanesan 1

Kanchana Sivanraju 1

CN Subalalitha 1

Keerthibala T 1

Sajeetha Thavareesan 1

Mithun Chakravarthy Y 1

Venues