Thenmozhi Durairaj - ACL Anthology

Thenmozhi Durairaj

Also published as: Durairaj Thenmozhi

2025

Hydrangea@DravidianLanTech2025: Abusive language Identification from Tamil and Malayalam Text using Transformer Models
Shanmitha Thirumoorthy | Thenmozhi Durairaj | Ratnavel Rajalakshmi
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Abusive language toward women on the Internet has always been perceived as a danger to free speech and safe online spaces. In this paper, we discuss three transformer-based models (BERT, XLM-RoBERTa, and DistilBERT) for identifying gender-abusive comments in Tamil and Malayalam YouTube content. We fine-tune and compare these models using a dataset provided by the DravidianLangTech 2025 shared task on identifying abusive content from social media. Of the three models, XLM-RoBERTa performed best, reaching F1 scores of 0.7708 for Tamil and 0.6876 for Malayalam. BERT followed with scores of 0.7658 (Tamil) and 0.6671 (Malayalam). DistilBERT's performance varied across the two languages. The large difference in performance between the models, especially for Malayalam, indicates that working with low-resource languages is difficult and that the choice of model is critical when applying abusive language detection. These findings can inform effective content moderation systems in linguistically diverse contexts and, more generally, help promote safe online spaces for women in South Indian language communities.

Overview of the Shared Task on Detecting AI Generated Product Reviews in Dravidian Languages: DravidianLangTech@NAACL 2025
Premjith B | Nandhini Kumaresh | Bharathi Raja Chakravarthi | Thenmozhi Durairaj | Balasubramanian Palani | Sajeetha Thavareesan | Prasanna Kumar Kumaresan
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The detection of AI-generated product reviews is critical due to the increased use of large language models (LLMs) and their capability to generate convincing sentences. AI-generated reviews can affect consumers and businesses because they influence trust and decision-making. This paper presents an overview of the shared task on “Detecting AI-generated product reviews in Dravidian Languages”, organized as part of DravidianLangTech@NAACL 2025. The task involves two subtasks, one in Malayalam and another in Tamil, both of which are binary classifications where a review is to be classified as human-generated or AI-generated. The dataset was curated by collecting comments from YouTube videos. Various machine learning and deep learning-based models, ranging from SVM to transformer-based architectures, were employed by the participants.

Overview of the Shared Task on Detecting Racial Hoaxes in Code-Mixed Hindi-English Social Media Data
Bharathi Raja Chakravarthi | Prasanna Kumar Kumaresan | Shanu Dhawale | Saranya Rajiakodi | Sajeetha Thavareesan | Subalalitha Chinnaudayar Navaneethakrishnan | Thenmozhi Durairaj
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

The widespread use of social media has made it easier for false information to proliferate, particularly racially motivated hoaxes that can encourage violence and hatred. Such content is frequently shared in code-mixed languages in multilingual nations like India, which presents special difficulties for automated detection systems because of the casual language, erratic grammar, and rich cultural background. The shared task on detecting racial hoaxes in code-mixed social media data aims to identify racial hoaxes in Hindi-English data. It is a binary classification task with more than 5,000 labeled instances. A total of 11 teams participated in the task, and the results were evaluated using the macro-F1 score. The team that employed XLM-RoBERTa secured the first position in the task.

JAS@DravidianLangTech 2025: Abusive Tamil Text targeting Women on Social Media
B Saathvik | Janeshvar Sivakumar | Thenmozhi Durairaj
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This paper presents our submission for Abusive Comment Detection in Tamil - DravidianLangTech@NAACL 2025. The aim is to classify whether a given comment is abusive towards women. Google’s MuRIL (Khanuja et al., 2021), a transformer-based multilingual model, is fine-tuned using the provided dataset to build the classification model. The dataset is preprocessed, tokenised, and formatted for model training. The model is trained and evaluated using accuracy, F1-score, precision, and recall. Our approach achieved an evaluation accuracy of 77.76% and an F1-score of 77.65%. The lack of large, high-quality datasets for low-resource languages has also been acknowledged.

NLP_goats@DravidianLangTech 2025: Detecting Fake News in Dravidian Languages: A Text Classification Approach
Srihari V K | Vijay Karthick Vaidyanathan | Thenmozhi Durairaj
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The advent and expansion of social media have transformed global communication. Despite its numerous advantages, it has also created an avenue for the rapid spread of fake news, which can impact people’s decision-making and judgment. This study explores detecting fake news as part of the DravidianLangTech@NAACL 2025 shared task, focusing on two key tasks. The aim of Task 1 is to classify Malayalam social media posts as either original or fake, and Task 2 categorizes Malayalam-language news articles into five levels of truthfulness: False, Half True, Mostly False, Partly False, and Mostly True. We accomplished the tasks using transformer models such as M-BERT and classifiers such as Naive Bayes. Our results were promising, with M-BERT achieving the better results. We achieved a macro-F1 score of 0.83 for distinguishing between fake and original content in Task 1 and a score of 0.54 for classifying news articles in Task 2, ranking us 11th and 4th, respectively.

Overview of the Shared Task on Sentiment Analysis in Tamil and Tulu
Durairaj Thenmozhi | Bharathi Raja Chakravarthi | Asha Hegde | Hosahalli Lakshmaiah Shashirekha | Rajeswari Natarajan | Sajeetha Thavareesan | Ratnasingam Sakuntharaj | Krishnakumari Kalyanasundaram | Charmathi Rajkumar | Poorvi Shetty | Harshitha S Kumar
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Sentiment analysis is an essential task for interpreting subjective opinions and emotions in textual data, with significant implications across commercial and societal applications. This paper provides an overview of the shared task on Sentiment Analysis in Tamil and Tulu, organized as part of DravidianLangTech@NAACL 2025. The task comprises two components, one addressing Tamil and the other focusing on Tulu, both designed as multi-class classification challenges in which the sentiment of a given text must be categorized as positive, negative, neutral, or unknown. The dataset was assembled by aggregating user-generated content from social media platforms such as YouTube and Twitter, ensuring linguistic diversity and real-world applicability. Participants applied a variety of computational approaches, ranging from traditional machine learning models to deep learning models, pre-trained language models, and other feature representation techniques, to tackle the challenges posed by linguistic code-mixing, orthographic variations, and resource scarcity in these low-resource languages.

Overview on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments: DravidianLangTech@NAACL 2025
Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Sathiyaraj Thangasamy | Ratnasingam Sakuntharaj | Prasanna Kumar Kumaresan | Kishore Kumar Ponnusamy | Arunaggiri Pandian Karunanidhi | Rohan R
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Political multiclass detection is the task of classifying text into one of seven predefined political classes. In this paper, we report an overview of the findings of the “Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments” shared task conducted at the DravidianLangTech@NAACL 2025 workshop. The participants were provided with annotated Twitter comments, which were split into training, development, and unlabelled test datasets. A total of 139 participants registered for this shared task, and 25 teams finally submitted their results. The performance of the submitted systems was evaluated and ranked in terms of the macro-F1 score.
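Several of the shared tasks listed on this page rank systems by macro-F1. As a rough illustration of what that metric computes, here is a minimal sketch in plain Python. It is not taken from any of the organizers' evaluation scripts; the function name `macro_f1` and the toy labels are assumptions made for illustration only.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to the average, a system that ignores a rare class is penalized more heavily than it would be under accuracy or a weighted F1, which is one plausible reason the metric is favoured for imbalanced social media datasets.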

NLP_goats@DravidianLangTech 2025: Towards Safer Social Media: Detecting Abusive Language Directed at Women in Dravidian Languages
Vijay Karthick Vaidyanathan | Srihari V K | Thenmozhi Durairaj
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Social media in the present world is an essential communication platform for information sharing. But its emergence has led to an increase in online abuse, in particular against women, in the form of abusive and offensive messages. Such abuse reflects social inequalities, and the importance of detecting abusive language is highlighted by the profound psychological and social impact it has on the victims. This work, part of DravidianLangTech@NAACL 2025, aims at developing an automated system for detecting abusive content directed towards women in Tamil and Malayalam, two of the Dravidian languages. Based on a dataset of YouTube comments about sensitive issues, the study uses multilingual BERT (mBERT) to distinguish abusive comments from non-abusive ones. We achieved F1 scores of 0.75 in Tamil and 0.68 in Malayalam, placing us 13th and 9th, respectively.

NLP_goats@DravidianLangTech 2025: Detecting AI Written Reviews for Consumer Trust
Srihari V K | Vijay Karthick Vaidyanathan | Mugilkrishna D U | Thenmozhi Durairaj
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rise of AI-generated content has introduced challenges in distinguishing machine-generated text from human-written text, particularly in low-resource languages. The identification of artificial intelligence (AI)-based reviews is of significant importance to preserve trust and authenticity on online platforms. The Shared Task on Detecting AI-Generated Product Reviews in Dravidian languages deals with the task of detecting AI-generated and human-written reviews in Tamil and Malayalam. To solve this problem, we specifically fine-tuned mBERT for binary classification. Our system achieved 10th place in Tamil with a macro F1-score of 0.90 and 28th place in Malayalam with a macro F1-score of 0.68, as reported by the NAACL 2025 organizers. The findings demonstrate the complexity involved in the separation of AI-derived text from human-authored writing, with a call for continued advances in detection methods.

shimig@DravidianLangTech2025: Stratification of Abusive content on Women in Social Media
Gersome Shimi | Jerin Mahibha C | Thenmozhi Durairaj
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The social network is a trending medium for interaction and sharing content globally. The content is sensitive since it can create an impact and change the trends of stakeholders’ thoughts as well as behavior. When the content is targeted towards women, it may be abusive or non-abusive, and the identification is a tedious task. The content posted on social networks can be in English, code-mixed text, or any low-resource language. The shared task Abusive Tamil and Malayalam Text targeting Women on Social Media was conducted as part of DravidianLangTech@NAACL 2025. The task is to identify content given in Tamil, Malayalam, or code-mixed text as abusive or non-abusive. The task is accomplished for the South Indian languages Tamil and Malayalam using a pretrained transformer model, BERT base multilingual cased, which achieved accuracies of 0.765 and 0.677, respectively.

2024

WordWizards@DravidianLangTech 2024: Sentiment Analysis in Tamil and Tulu using Sentence Embedding
Shreedevi Balaji | Akshatha Anbalagan | Priyadharshini T | Niranjana A | Durairaj Thenmozhi
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Sentiment Analysis of Dravidian Languages has begun to garner attention recently, as there is a growing need to analyze emotional responses and subjective opinions present in social media text. As this data is code-mixed and there are not many solutions for code-mixed text, we present a solution to the DravidianLangTech 2024 Sentiment Analysis in Tamil and Tulu task. To understand the sentiment of social media text, we used pre-trained transformer models and feature extraction vectorizers to classify the data. Our results placed us 11th in the rankings for the Tamil task and 8th for the Tulu task, with F1 scores of 0.12 and 0.30, respectively.

WordWizards@DravidianLangTech 2024: Fake News Detection in Dravidian Languages using Cross-lingual Sentence Embeddings
Akshatha Anbalagan | Priyadharshini T | Niranjana A | Shreedevi Balaji | Durairaj Thenmozhi
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The proliferation of fake news in digital media has become a significant societal concern, impacting public opinion, trust, and decision-making. This project focuses on the development of machine learning models for the detection of fake news. Leveraging a dataset containing both genuine and deceptive news articles, the proposed models employ natural language processing techniques, feature extraction, and classification algorithms. This paper provides a solution to Fake News Detection in Dravidian Languages - DravidianLangTech 2024. There are two subtasks. The goal of Task 1 is to classify a given social media text as original or fake. For this, we propose an approach based on a supervised machine learning model, the Support Vector Machine (SVM). The SVM classifier achieved a macro F1 score of 0.78 on the test data and rank 11. Task 2 classifies fake news articles in the Malayalam language into different categories, namely False, Half True, Mostly False, Partly False, and Mostly True. For this task we used Naive Bayes, which achieved a macro F1-score of 0.3517 on the test data and rank 6.

Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
Bharathi Raja Chakravarthi | Bharathi B | Paul Buitelaar | Thenmozhi Durairaj | György Kovács | Miguel Ángel García Cumbreras
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Quartet@LT-EDI 2024: A Support Vector Machine Approach For Caste and Migration Hate Speech Detection
Shaun H | Samyuktaa Sivakumar | Rohan R | Nikilesh Jayaguptha | Durairaj Thenmozhi
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Hate speech refers to offensive remarks against a community or individual based on inherent characteristics. Hate speech against a community based on caste and nativity is unfortunately prevalent in society. Especially with social media platforms being a very popular tool for communication and sharing ideas, people post hate speech against castes or migrants on social media. The Shared Task LT-EDI 2024: Caste and Migration Hate Speech Detection was created with the objective of building an automatic classification system that detects and classifies hate speech posted on social media targeting communities belonging to a particular caste and migrants. Datasets in the Tamil language were provided along with the shared task. We experimented with several traditional models such as Naive Bayes, Support Vector Machine (SVM), Logistic Regression, Random Forest Classifier, and Decision Tree Classifier, of which Support Vector Machine yielded the best results, placing us 8th in the rank list released by the organizers.

Overview of Second Shared Task on Sentiment Analysis in Code-mixed Tamil and Tulu
Lavanya Sambath Kumar | Asha Hegde | Bharathi Raja Chakravarthi | Hosahalli Shashirekha | Rajeswari Natarajan | Sajeetha Thavareesan | Ratnasingam Sakuntharaj | Thenmozhi Durairaj | Prasanna Kumar Kumaresan | Charmathi Rajkumar
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Sentiment Analysis (SA) in Dravidian code-mixed text is a hot research area right now. In this regard, the “Second Shared Task on SA in Code-mixed Tamil and Tulu” at DravidianLangTech (EACL-2024) is organized. Two tasks, namely SA in Tamil-English and Tulu-English code-mixed data, make up this shared task. In total, 64 teams registered for the shared task, out of which 19 and 17 systems were received for Tamil and Tulu, respectively. The performance of the systems submitted by the participants was evaluated based on the macro F1-score. The best method obtained macro F1-scores of 0.260 and 0.584 for code-mixed Tamil and Tulu texts, respectively.

Quartet@LT-EDI 2024: A SVM-ResNet50 Approach For Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes
Shaun H | Samyuktaa Sivakumar | Rohan R | Nikilesh Jayaguptha | Durairaj Thenmozhi
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Meme is a very popular term prevailing across almost all social media platforms in recent days. A meme can be a combination of text and image whose sole purpose is to be funny and entertain people. Memes can sometimes promote misogynistic content expressing hatred, contempt, or prejudice against women. Task 1 of the Shared Task LT-EDI 2024: Multitask Meme Classification: Unraveling Misogynistic and Trolls in Online Memes was created to classify social media memes as “Misogynistic” or “Non-Misogynistic”. The task encompassed Tamil and Malayalam datasets. We separately classified the textual data using Multinomial Naive Bayes and the pictorial data using the ResNet50 model. The results from both were combined to yield an overall result. We were ranked 2nd for both languages in this task.

Quartet@LT-EDI 2024: Support Vector Machine Based Approach For Homophobia/Transphobia Detection In Social Media Comments
Shaun H | Samyuktaa Sivakumar | Rohan R | Nikilesh Jayaguptha | Durairaj Thenmozhi
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Homophobia and transphobia are terms used to describe fear or hatred towards people who are attracted to the same sex or people whose psychological gender differs from their biological sex. People use social media to exert this behaviour. The increased amount of abusive content negatively affects people in many ways. It makes the environment toxic and unpleasant for LGBTQ+ people. The paper describes a classification model for classifying content into 3 categories: homophobic, transphobic, and non-homophobic/transphobic. We used several traditional models, namely Support Vector Machine, Random Classifier, Logistic Regression, and K-Nearest Neighbour, to achieve this. The macro average F1 scores for Malayalam, Telugu, English, Marathi, Kannada, Tamil, Gujarati, and Hindi are 0.88, 0.94, 0.96, 0.78, 0.93, 0.77, 0.94, and 0.47, and the ranks for these languages are 5, 6, 9, 6, 8, 6, 6, and 4, respectively.

2023

Overview of the shared task on Detecting Signs of Depression from Social Media Text
Kayalvizhi Sampath | Durairaj Thenmozhi | Bharathi Raja Chakravarthi | Jerin Mahibha C | Kogilavani Shanmugavadivel | Pratik Anil Rahood
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Social media has become a vital platform for personal communication. Its widespread use as a primary means of public communication offers an exciting opportunity for early detection and management of mental health issues. People often share their emotions on social media, but understanding the true depth of their feelings can be challenging. Depression, a prevalent problem among young people, is of particular concern due to its link with rising suicide rates. Identifying depression levels in social media texts is crucial for timely support and prevention of negative outcomes. However, it’s a complex task because human emotions are dynamic and can change significantly over time. The DepSign-LT-EDI@RANLP 2023 shared task aims to classify social media text into three depression levels: “Not Depressed,” “Moderately Depressed,” and “Severely Depressed.” This overview covers task details, dataset, methodologies used, and results analysis. Roberta-based models emerged as top performers, with the best result achieving an impressive macro F1-score of 0.584 among 31 participating teams.

Brainstormers_msec at SemEval-2023 Task 10: Detection of sexism related comments in social media using deep learning
C. Jerin Mahibha | C. M Swaathi | R. Jeevitha | R. Princy Martina | Durairaj Thenmozhi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Social media is the medium through which people share their thoughts and opinions. This has both pros and cons, which depend on the type of information being conveyed. If any information conveyed over social media hurts or affects a person, such information can be removed, as it may disturb their mental health and decrease their self-confidence. During the last decade, hateful and sexist content towards women has been increasingly spread on social networks. Exposure to sexist speech has serious consequences for women’s lives and limits their freedom of speech. Sexism is expressed in very different forms: it includes subtle stereotypes and attitudes that, although frequently unnoticed, are extremely harmful for both women and society. Sexist comments have a major impact on the women subjected to them. We as a team participated in the shared task Explainable Detection of Online Sexism (EDOS) at SemEval 2023 and have proposed a model which identifies sexist comments and their type from English social media posts using the data set shared for the task. Different transformer models such as BERT, DistilBERT, and RoBERTa are used by the proposed model for implementing all three tasks shared by EDOS. Using the BERT model, macro F1 scores of 0.8073, 0.5876, and 0.3729 are achieved for Task A, Task B, and Task C, respectively.

Findings of the Shared Task on Sentiment Analysis in Tamil and Tulu Code-Mixed Text
Asha Hegde | Bharathi Raja Chakravarthi | Hosahalli Lakshmaiah Shashirekha | Rahul Ponnusamy | Subalalitha Chinnaudayar Navaneethakrishnan | Lavanya Sambath Kumar | Durairaj Thenmozhi | Martha Karunakar | Shreya Sriram | Sarah Aymen
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

In recent years, there has been a growing focus on Sentiment Analysis (SA) of code-mixed Dravidian languages. However, the majority of social media text in these languages is code-mixed, presenting a unique challenge. Despite this, there is currently a lack of research on SA specifically tailored for code-mixed Dravidian languages, highlighting the need for further exploration and development in this domain. In this view, the “Sentiment Analysis in Tamil and Tulu - DravidianLangTech” shared task at Recent Advances in Natural Language Processing (RANLP) 2023 was organized. The shared task consists of two language tracks, code-mixed Tamil and Tulu; Tulu text is explored here for SA in the public domain for the first time. We describe the task, its organization, and the submitted systems, followed by the results. 57 research teams registered for the shared task, and we received 27 systems each for code-mixed Tamil and Tulu texts. The performance of the systems developed by participants was evaluated in terms of macro average F1 score. The top systems for code-mixed Tamil and Tulu texts scored macro average F1 scores of 0.32 and 0.542, respectively. The high quality and substantial quantity of submissions demonstrate significant interest in the analysis of code-mixed Dravidian languages. However, the current state of the art in this domain indicates the need for further advancements to effectively address the challenges posed by code-mixed Dravidian language SA.

2022

Overview of Abusive Comment Detection in Tamil-ACL 2022
Ruba Priyadharshini | Bharathi Raja Chakravarthi | Subalalitha Chinnaudayar Navaneethakrishnan | Thenmozhi Durairaj | Malliga Subramanian | Kogilavani Shanmugavadivel | Siddhanth U Hegde | Prasanna Kumar Kumaresan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Social media is one of the significant digital platforms that create a huge impact on people at all levels. The comments posted on social media are powerful enough to change political and business scenarios within a few hours. They also tend to attack a particular individual or a group of individuals. This shared task aims at detecting abusive comments involving Homophobia, Misandry, Counter-speech, Misogyny, Xenophobia, and Transphobic content. Hope speech is also identified. A dataset collected from social media and tagged with the above categories in Tamil and Tamil-English code-mixed languages was given to the participants. The participants used different machine learning and deep learning algorithms. This paper presents an overview of the task, comprising the dataset details and the results of the participants.

Overview of The Shared Task on Homophobia and Transphobia Detection in Social Media Comments
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Durairaj Thenmozhi | John Philip McCrae | Paul Buitelaar | Rahul Ponnusamy | Prasanna Kumar Kumaresan
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Homophobia and Transphobia Detection is the task of identifying homophobia, transphobia, and non-anti-LGBT+ content from the given corpus. Homophobia and transphobia are both toxic languages directed at LGBTQ+ individuals that are described as hate speech. This paper summarizes our findings on the “Homophobia and Transphobia Detection in social media comments” shared task held at LT-EDI 2022 - ACL 2022. This shared task focused on three sub-tasks for Tamil, English, and Tamil-English (code-mixed) languages. It received 10 systems for Tamil, 13 systems for English, and 11 systems for Tamil-English. The best systems for Tamil, English, and Tamil-English scored 0.570, 0.870, and 0.610, respectively, on average macro F1-score.

SSN_NLP_MLRG at SemEval-2022 Task 4: Ensemble Learning strategies to detect Patronizing and Condescending Language
Kalaivani Adaikkan | Thenmozhi Durairaj
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In this paper, we describe our efforts at SemEval 2022 Shared Task 4 on Patronizing and Condescending Language (PCL) Detection. This is the first shared task on PCL detection, which aims to identify and categorize PCL towards vulnerable communities. The shared task consists of two subtasks: PCL detection (Subtask A), a binary classification task, and identification of the PCL categories that express the condescension (Subtask B), a multi-label text classification task. For PCL detection, we propose ensemble strategies that combine BERT, RoBERTa, DistilBERT, RoBERTa-large, and ALBERT; the official result for Subtask A is a macro F1 score of 0.5172 on the test set, an improvement over the baseline score. For PCL category identification, we propose a multi-label classification model that ensembles various BERT-based models; the official result for Subtask B is a macro F1 score of 0.2117 on the test set, also an improvement over the baseline score.

scubeMSEC@LT-EDI-ACL2022: Detection of Depression using Transformer Models
Sivamanikandan S | Santhosh V | Sanjaykumar N | Jerin Mahibha C | Thenmozhi Durairaj
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Social media platforms play a major role in our day-to-day life and are considered a virtual friend by many users, who use social media to share their feelings all day. Often, the content shared by users on social media reflects their inner life. Nowadays people love to share daily life incidents, such as happy or unhappy moments, and their feelings on social media; it makes them feel complete and has become a habit for many users. Social media provides a new chance to identify the feelings of a person through their posts. The aim of the shared task is to develop a model capable of analyzing the grammatical markers related to onset and permanent symptoms of depression. We as a team participated in the shared task Detecting Signs of Depression from Social Media Text at LT-EDI 2022 - ACL 2022 and have proposed a model which predicts depression from English social media posts using the data set shared for the task. The prediction is based on the labels Moderate, Severe, and Not Depressed. We implemented this using different transformer models, namely DistilBERT, RoBERTa, and ALBERT, with which we achieved Macro F1 scores of 0.337, 0.457, and 0.387, respectively. Our code is publicly available on GitHub.

PANDAS@TamilNLP-ACL2022: Emotion Analysis in Tamil Text using Language Agnostic Embeddings
Divyasri K | Gayathri G L | Krithika Swaminathan | Thenmozhi Durairaj | Bharathi B | Senthil Kumar B
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

As the world around us continues to become increasingly digital, it has been acknowledged that there is a growing need for emotion analysis of social media content. The task of identifying the emotion in a given text has many practical applications ranging from screening public health to business and management. In this paper, we propose a language-agnostic model that focuses on emotion analysis in Tamil text. Our experiments yielded an F1-score of 0.010.

SSNCSE_NLP@LT-EDI-ACL2022:Hope Speech Detection for Equality, Diversity and Inclusion using sentence transformers
Bharathi B | Dhanya Srinivasan | Josephine Varsha | Thenmozhi Durairaj | Senthil Kumar B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

In recent times, applications have been developed to regulate and control the spread of negativity and toxicity on online platforms. The world is filled with serious problems like political & religious conflicts, wars, pandemics, and offensive hate speech is the last thing we desire. Our task was to classify a text into ‘Hope Speech’ and ‘Non-Hope Speech’. We searched for datasets acquired from YouTube comments that offer support, reassurance, inspiration, and insight, and the ones that don’t. The datasets were provided to us by the LTEDI organizers in English, Tamil, Spanish, Kannada, and Malayalam. To successfully identify and classify them, we employed several machine learning transformer models such as m-BERT, MLNet, BERT, XLMRoberta, and XLM_MLM. The observed results indicate that BERT and m-BERT obtained the best results among all the techniques, gaining weighted F1-scores of 0.92, 0.71, 0.76, 0.87, and 0.83 for English, Tamil, Spanish, Kannada, and Malayalam, respectively. This paper depicts our work for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LTEDI 2021.

SSNCSE_NLP@LT-EDI-ACL2022: Speech Recognition for Vulnerable Individuals in Tamil using pre-trained XLSR models
Dhanya Srinivasan | Bharathi B | Thenmozhi Durairaj | Senthil Kumar B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Automatic speech recognition is a tool used to transform human speech into a written form. It is used in a variety of avenues, such as voice commands, customer service, and more. It has emerged as an essential tool in the digitisation of daily life. It has been known to be of vital importance in making the lives of elderly and disabled people much easier. In this paper we describe an automatic speech recognition model, determined by using three pre-trained models, fine-tuned from the Facebook XLSR Wav2Vec2 model, which was trained using the Common Voice Dataset. The best model for speech recognition in Tamil is determined by finding the word error rate of the data. This work explains the submission made by SSNCSE_NLP in the shared task organized by LT-EDI at ACL 2022. A word error rate of 39.4512 is achieved.
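
The abstract above selects the best model by word error rate (WER). As a minimal sketch of how that metric is commonly computed, the snippet below counts word-level substitutions, deletions, and insertions with a standard edit-distance table and divides by the reference length. The function name `word_error_rate`, the whitespace tokenisation, and the example sentences are assumptions made for illustration; this is not the evaluation script used in the shared task.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length."""
    ref = reference.split()   # assumption: plain whitespace tokenisation
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the shared-task data.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The function returns a fraction; multiplying by 100 gives a percentage. Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.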

Findings of the Shared Task on Detecting Signs of Depression from Social Media
Kayalvizhi S | Thenmozhi Durairaj | Bharathi Raja Chakravarthi | Jerin Mahibha C
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Social media is considered as a platform where users express themselves. The rise of social media as one of humanity’s most important public communication platforms presents a potential prospect for early identification and management of mental illness. Depression is one such illness that can lead to a variety of emotional and physical problems. It is necessary to measure the level of depression from the social media text to treat them and to avoid the negative consequences. Detecting levels of depression is a challenging task since it involves the mindset of the people which can change periodically. The aim of the DepSign-LT-EDI@ACL-2022 shared task is to classify the social media text into three levels of depression namely “Not Depressed”, “Moderately Depressed”, and “Severely Depressed”. This overview presents a description on the task, the data set, methodologies used and an analysis on the results of the submissions. The models that were submitted as a part of the shared task had used a variety of technologies from traditional machine learning algorithms to deep learning models. It could be observed from the result that the transformer based models have outperformed the other models. Among the 31 teams who had submitted their results for the shared task, the best macro F1-score of 0.583 was obtained using transformer based model.

GetSmartMSEC at SemEval-2022 Task 6: Sarcasm Detection using Contextual Word Embedding with Gaussian model for Irony Type Identification
Diksha Krishnan | Jerin Mahibha C | Thenmozhi Durairaj
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Sarcasm refers to the use of words that have different literal and intended meanings. It represents the usage of words that are opposite of what is literally said, especially in order to insult, mock, criticise or irritate someone. These types of statements may be funny or amusing to others but may hurt or annoy the person towards whom they are intended. Identification of sarcastic phrases from social media posts finds its application in different domains like sentiment analysis, opinion mining, author profiling, and harassment detection. We propose a model for the English track of the shared task iSarcasmEval - Intended Sarcasm Detection in English and Arabic at SemEval-2022, using ELMo embeddings for Subtasks A and C, and TF-IDF vectors with a Gaussian Naive Bayes classifier for Subtask B. The proposed model resulted in an F1 score of 0.2012 for sarcastic texts in Subtask A, and macro-F1 scores of 0.0387 and 0.2794 for Subtasks B and C respectively.

PANDAS@Abusive Comment Detection in Tamil Code-Mixed Data Using Custom Embeddings with LaBSE
Gayathri G L | Krithika Swaminathan | Divyasri K | Thenmozhi Durairaj | Bharathi B
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Abusive language has lately been prevalent in comments on various social media platforms. The increasing hostility observed on the internet calls for the creation of a system that can identify and flag such acerbic content, to prevent conflict and mental distress. This task becomes more challenging when low-resource languages like Tamil, as well as the often-observed Tamil-English code-mixed text, are involved. The approach used in this paper for the classification model includes different methods of feature extraction and the use of traditional classifiers. We propose a novel method of combining language-agnostic sentence embeddings with the TF-IDF vector representation that uses a curated corpus of words as vocabulary, to create a custom embedding, which is then passed to an SVM classifier. Our experimentation yielded an accuracy of 52% and an F1-score of 0.54.

Findings of the Shared Task on Emotion Analysis in Tamil
Anbukkarasi Sampath | Thenmozhi Durairaj | Bharathi Raja Chakravarthi | Ruba Priyadharshini | Subalalitha Cn | Kogilavani Shanmugavadivel | Sajeetha Thavareesan | Sathiyaraj Thangasamy | Parameswari Krishnamurthy | Adeep Hande | Sean Benhur | Kishore Ponnusamy | Santhiya Pandiyan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper presents the overview of the shared task on emotion analysis in Tamil. The result of the shared task is presented at the workshop. This paper presents the dataset used in the shared task, the task description, the methodology used by the participants, and the evaluation results of the submissions. The shared task is organized as two tasks. Task A uses social media comments in Tamil annotated with 11 emotions, and Task B uses social media comments in Tamil annotated with 31 fine-grained emotions. For conducting experiments, training and development datasets were provided to the participants, and results are evaluated on unseen data. In total, we received around 24 submissions from 13 teams. For evaluating the models, Precision, Recall, and micro average metrics are used.

Co-authors

Senthil Kumar B 3

Nikilesh Jayaguptha 3

Subalalitha Chinnaudayar Navaneethakrishnan 3

Ruba Priyadharshini 3

Ratnasingam Sakuntharaj 3

Kogilavani Shanmugavadivel 3

Samyuktaa Sivakumar 3

Vijay Karthick Vaidyanathan 3

Akshatha Anbalagan 2

Shreedevi Balaji 2

Paul Buitelaar 2

Rajeswari Natarajan 2

Rahul Ponnusamy 2

Saranya Rajiakodi 2

Charmathi Rajkumar 2

Lavanya Sambath Kumar 2

Hosahalli Lakshmaiah Shashirekha 2

Dhanya Srinivasan 2

Krithika Swaminathan 2

Priyadharshini T 2

Sathiyaraj Thangasamy 2

Kalaivani Adaikkan 1

Subalalitha Cn 1

Shanu Dhawale 1

Miguel Ángel García-Cumbreras 1

Krishnakumari Kalyanasundaram 1

Martha Karunakar 1

Arunaggiri Pandian Karunanidhi 1

György Kovács 1

Parameswari Krishnamurthy 1

Diksha Krishnan 1

Harshitha S Kumar 1

Nandhini Kumaresh 1

C. Jerin Mahibha 1

R. Princy Martina 1

John Philip McCrae 1

Sanjaykumar N 1

Balasubramanian Palani 1

Santhiya Pandiyan 1

Kishore Ponnusamy 1

Kishore Kumar Ponnusamy 1

Pratik Anil Rahood 1

Ratnavel Rajalakshmi 1

Sivamanikandan S 1

Kayalvizhi Sampath 1

Anbukkarasi Sampath 1

Hosahalli Shashirekha 1

Poorvi Shetty 1

Gersome Shimi 1

Janeshvar Sivakumar 1

Shreya Sriram 1

Malliga Subramanian 1

Shanmitha Thirumoorthy 1

Mugilkrishna D U 1

Siddhanth U Hegde 1

Josephine Varsha 1

Venues