Advaitha Vetagiri

2025

This study presents the results of the low-resource Indic language translation task organized in collaboration with the Tenth Conference on Machine Translation (WMT) 2025. In this workshop, participants were required to build machine translation models for seven language pairs, grouped into two categories. Category 1 covers pairs with a moderate amount of training data: English–Assamese, English–Mizo, English–Khasi, English–Manipuri, and English–Nyishi. Category 2 covers pairs with very limited training data: English–Bodo and English–Kokborok. The task leverages the enriched IndicNE-Corp1.0 dataset, which consists of an extensive collection of parallel and monolingual corpora for northeastern Indic languages. Participant results were evaluated using automatic machine translation metrics, including BLEU, TER, ROUGE-L, ChrF, and METEOR. In addition to those metrics, this year’s evaluation also includes cosine similarity, which compares semantic representations of sentences to measure the performance and accuracy of the models. This work aims to promote innovation and advancements in low-resource Indic language translation.
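As a rough illustration of the cosine similarity measure mentioned in this abstract, the sketch below scores a hypothesis against a reference sentence. It assumes toy bag-of-words count vectors in place of the semantic sentence representations the task refers to, which the abstract does not specify; the function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Toy sentence representation: lowercase token counts (an assumption,
    standing in for whatever semantic embedding the task actually used)."""
    return Counter(sentence.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # define similarity with an empty vector as zero
    return dot / (norm_u * norm_v)


# Hypothetical system output and reference translation.
hypothesis = bag_of_words("the river is wide")
reference = bag_of_words("the river is very wide")
score = cosine_similarity(hypothesis, reference)
print(f"cosine similarity: {score:.4f}")
```

With dense sentence embeddings the same formula applies unchanged; only the vector construction step would differ.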

2024

This paper presents the results of the low-resource Indic language translation task, organized in conjunction with the Ninth Conference on Machine Translation (WMT) 2024. In this edition, participants were challenged to develop machine translation models for four distinct language pairs: English–Assamese, English–Mizo, English–Khasi, and English–Manipuri. The task utilized the enriched IndicNE-Corp1.0 dataset, which includes an extensive collection of parallel and monolingual corpora for northeastern Indic languages. The evaluation was conducted through a comprehensive suite of automatic metrics (BLEU, TER, RIBES, METEOR, and ChrF), supplemented by meticulous human assessment to measure the translation systems’ performance and accuracy. This initiative aims to drive advancements in low-resource machine translation and make a substantial contribution to the growing body of knowledge in this dynamic field.
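To make one of the listed automatic metrics concrete, here is a minimal sentence-level sketch of a character n-gram F-score in the spirit of ChrF. It is a simplification under stated assumptions (whitespace is dropped, n runs from 1 to 6, beta is 2, and precision and recall are averaged over n-gram orders); reference implementations differ in such details, and the task's exact scoring configuration is not given in the abstract. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-beta score between two sentences."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either sentence
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("the cat sat", "the cat sat"))
print(simple_chrf("the cat sat", "the hat sat"))
```

Character-level matching gives partial credit for near-miss word forms, which is one reason such metrics are often reported for morphologically rich languages.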

MULTILATE: A Synthetic Dataset on AI-Generated MULTImodaL hATE Speech
Advaitha Vetagiri | Eisha Halder | Ayanangshu Das Majumder | Partha Pakray | Amitava Das
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

One of the pressing challenges society faces today is the rapid proliferation of online hate speech, exacerbated by the rise of AI-generated multimodal hate content. This new form of synthetically produced hate speech presents unprecedented challenges in detection and moderation. In response to the growing presence of such harmful content across social media platforms, this research introduces a groundbreaking solution:

Detecting Hate Speech and Fake Narratives in Code-Mixed Hinglish Social Media Text
Advaitha Vetagiri | Partha Pakray
Proceedings of the 21st International Conference on Natural Language Processing (ICON): Shared Task on Decoding Fake Narratives in Spreading Hateful Stories (Faux-Hate)

The increasing prevalence of hate speech and fake narratives on social media platforms poses significant societal challenges. This study addresses these issues through the development of robust machine learning models for two tasks: (1) detecting hate speech and fake narratives (Task A) and (2) predicting the target and severity of hateful content (Task B) in code-mixed Hindi-English text. We propose four separate CNN-BiLSTM models, one tailored to each subtask. The models were evaluated using validation and 5-fold cross-validation datasets, achieving F1-scores of 74% and 79% for hate and fake detection, respectively, and 63% and 54% for target and severity prediction; testing results reached 65% and 57%. The results highlight the models’ effectiveness in handling the nuances of code-mixed text while underscoring the challenges of under-represented classes. This work contributes to the ongoing effort to develop automated tools for detecting and mitigating harmful content online, paving the way for safer and more inclusive digital spaces.
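The evaluation protocol named in this abstract, 5-fold cross-validation scored by F1, can be sketched as follows. This is only an illustration under assumptions: the folds are contiguous and unshuffled, the F1 is the binary variant for a single positive class, and the labels and predictions are made up. The abstract does not say how the folds were drawn or which F1 averaging was used.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Distribute any remainder across the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (1 = hateful) and predictions for ten posts.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(len(labels), k=5)):
    # A real run would fit a model on train_idx here; this sketch only scores
    # the fixed predictions on each validation fold.
    fold_f1 = f1_score([labels[i] for i in val_idx], [predictions[i] for i in val_idx])
    print(f"fold {fold}: validation F1 = {fold_f1:.2f}")
```

For imbalanced label sets such as severity classes, a stratified split and macro-averaged F1 would likely be a better fit than this plain version.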

2023

Multilingual Multimodal Text Detection in Indo-Aryan Languages
Nihar Jyoti Basisth | Eisha Halder | Tushar Sachan | Advaitha Vetagiri | Partha Pakray
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Multi-language text detection and recognition in complex visual scenes is an essential yet challenging task. Traditional pipelines relying on optical character recognition (OCR) often fail to generalize across different languages, fonts, orientations and imaging conditions. This work proposes a novel approach using the YOLOv5 object detection model architecture for multi-language text detection in images and videos. We curate and annotate a new dataset of over 4,000 scene text images across 4 Indian languages and use specialized data augmentation techniques to improve model robustness. Transfer learning from a base YOLOv5 model pretrained on COCO is combined with tailored optimization strategies for multi-language text detection. Our approach achieves state-of-the-art performance, with over 90% accuracy on multi-language text detection across all four languages in our test set. We demonstrate the effectiveness of fine-tuning YOLOv5 for generalized multi-language text extraction across diverse fonts, scales, orientations, and visual contexts. Our approach’s high accuracy and generalizability could enable numerous applications involving multilingual text processing from imagery and video.

CNLP-NITS at SemEval-2023 Task 10: Online sexism prediction, PREDHATE!
Advaitha Vetagiri | Prottay Adhikary | Partha Pakray | Amitava Das
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Online sexism is a rising issue that threatens women’s safety, fosters hostile situations, and upholds social inequities. To address this problem, we describe our work on SemEval-2023 Task 10, which calls for English-language models that can precisely identify and categorize sexist content on internet forums and social platforms such as Gab and Reddit, as well as provide explainability. The problem is divided into three hierarchically organized subtasks: binary sexism detection, sexism by category, and sexism by fine-grained vector. The dataset consists of 20,000 labelled entries. For Task A, pretrained models were used: a Convolutional Neural Network (CNN) combined with a Bidirectional Long Short-Term Memory (BiLSTM) network, called CNN-BiLSTM, and Generative Pretrained Transformer 2 (GPT-2). The GPT-2 model was also used for Tasks B and C, and we provide the experimental configurations. According to our findings, the GPT-2 model performs better than the CNN-BiLSTM model for Task A, while GPT-2 is highly accurate for Tasks B and C on the training, validation and testing splits of the training data provided in the task. Our proposed models allow researchers to create more precise and understandable models for identifying and categorizing sexist content in online forums, thereby empowering users and moderators.