Advaitha Vetagiri


2024

MULTILATE: A Synthetic Dataset on AI-Generated MULTImodaL hATE Speech
Advaitha Vetagiri | Eisha Halder | Ayanangshu Das Majumder | Partha Pakray | Amitava Das
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

One of the pressing challenges society faces today is the rapid proliferation of online hate speech, exacerbated by the rise of AI-generated multimodal hate content. This new form of synthetically produced hate speech presents unprecedented challenges in detection and moderation. In response to the growing presence of such harmful content across social media platforms, this research introduces a groundbreaking solution:

Detecting Hate Speech and Fake Narratives in Code-Mixed Hinglish Social Media Text
Advaitha Vetagiri | Partha Pakray
Proceedings of the 21st International Conference on Natural Language Processing (ICON): Shared Task on Decoding Fake Narratives in Spreading Hateful Stories (Faux-Hate)

The increasing prevalence of hate speech and fake narratives on social media platforms poses significant societal challenges. This study addresses these issues through the development of robust machine learning models for two tasks: (1) detecting hate speech and fake narratives (Task A) and (2) predicting the target and severity of hateful content (Task B) in code-mixed Hindi-English text. We propose four separate CNN-BiLSTM models, one tailored to each subtask. The models were evaluated using validation and 5-fold cross-validation datasets, achieving F1-scores of 74% and 79% for hate and fake detection, respectively, and 63% and 54% for target and severity prediction, and achieved 65% and 57% in testing. The results highlight the models’ effectiveness in handling the nuances of code-mixed text while underscoring the challenges of underrepresented classes. This work contributes to the ongoing effort to develop automated tools for detecting and mitigating harmful content online, paving the way for safer and more inclusive digital spaces.

Findings of WMT 2024 Shared Task on Low-Resource Indic Languages Translation
Partha Pakray | Santanu Pal | Advaitha Vetagiri | Reddi Krishna | Arnab Kumar Maji | Sandeep Dash | Lenin Laitonjam | Lyngdoh Sarah | Riyanka Manna
Proceedings of the Ninth Conference on Machine Translation

This paper presents the results of the low-resource Indic language translation task, organized in conjunction with the Ninth Conference on Machine Translation (WMT) 2024. In this edition, participants were challenged to develop machine translation models for four distinct language pairs: English–Assamese, English–Mizo, English–Khasi, and English–Manipuri. The task utilized the enriched IndicNE-Corp1.0 dataset, which includes an extensive collection of parallel and monolingual corpora for northeastern Indic languages. The evaluation was conducted through a comprehensive suite of automatic metrics (BLEU, TER, RIBES, METEOR, and ChrF), supplemented by meticulous human assessment to measure the translation systems’ performance and accuracy. This initiative aims to drive advancements in low-resource machine translation and make a substantial contribution to the growing body of knowledge in this dynamic field.

2023

Multilingual Multimodal Text Detection in Indo-Aryan Languages
Nihar Jyoti Basisth | Eisha Halder | Tushar Sachan | Advaitha Vetagiri | Partha Pakray
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Multi-language text detection and recognition in complex visual scenes is an essential yet challenging task. Traditional pipelines relying on optical character recognition (OCR) often fail to generalize across different languages, fonts, orientations and imaging conditions. This work proposes a novel approach using the YOLOv5 object detection model architecture for multi-language text detection in images and videos. We curate and annotate a new dataset of over 4,000 scene text images across 4 Indian languages and use specialized data augmentation techniques to improve model robustness. Transfer learning from a base YOLOv5 model pretrained on COCO is combined with tailored optimization strategies for multi-language text detection. Our approach achieves state-of-the-art performance, with over 90% accuracy on multi-language text detection across all four languages in our test set. We demonstrate the effectiveness of fine-tuning YOLOv5 for generalized multi-language text extraction across diverse fonts, scales, orientations, and visual contexts. Our approach’s high accuracy and generalizability could enable numerous applications involving multilingual text processing from imagery and video.

CNLP-NITS at SemEval-2023 Task 10: Online sexism prediction, PREDHATE!
Advaitha Vetagiri | Prottay Adhikary | Partha Pakray | Amitava Das
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Online sexism is a rising issue that threatens women’s safety, fosters hostile situations, and upholds social inequities. To address this problem, we describe SemEval-2023 Task 10, which calls for English-language models that can precisely identify and categorize sexist content on internet forums and social platforms such as Gab and Reddit, as well as provide explainability. The problem is divided into three hierarchically organized subtasks: binary sexism detection, sexism by category, and sexism by fine-grained vector. The dataset consists of 20,000 labelled entries. For Task A, pretrained models were used: a combination of a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory (BiLSTM) network, called CNN-BiLSTM, and a Generative Pretrained Transformer 2 (GPT-2) model. The GPT-2 model was also used for Tasks B and C. Experimental configurations are provided. According to our findings, the GPT-2 model performs better than the CNN-BiLSTM model for Task A, while GPT-2 is highly accurate for Tasks B and C on the training, validation and testing splits of the training data provided in the task. Our proposed models allow researchers to create more precise and understandable models for identifying and categorizing sexist content in online forums, thereby empowering users and moderators.