Sourav Saha


2024

BnPC: A Gold Standard Corpus for Paraphrase Detection in Bangla, and its Evaluation
Sourav Saha | Zeshan Ahmed Nobin | Mufassir Ahmad Chowdhury | Md. Shakirul Hasan Khan Mobin | Mohammad Ruhul Amin | Sudipta Kar
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

2023

Vio-Lens: A Novel Dataset of Annotated Social Network Posts Leading to Different Forms of Communal Violence and its Evaluation
Sourav Saha | Jahedul Alam Junaed | Maryam Saleki | Arnab Sen Sharma | Mohammad Rashidujjaman Rifat | Mohamed Rahouti | Syed Ishtiaque Ahmed | Nabeel Mohammed | Mohammad Ruhul Amin
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

This paper presents a computational approach for creating a dataset on communal violence in the context of Bangladesh and West Bengal, India, along with its benchmark evaluation. In recent years, social media has been used as a weapon by factions of different religions and backgrounds to incite hatred, resulting in physical communal violence and causing death and destruction. To prevent such abusive use of online platforms, we propose a framework for classifying online posts using an adaptive question-based approach. We collected more than 168,000 YouTube comments from a set of manually selected videos known for inciting violence in Bangladesh and West Bengal. Applying unsupervised and then semi-supervised topic modeling methods to this unstructured data, we discovered the major word clusters to interpret the related topics of peace and violence. The topic words were then used to select 20,142 posts related to peace and violence, of which we annotated a total of 6,046 posts. Finally, we applied different modeling techniques based on linguistic features and sentence transformers to benchmark the labeled dataset, with the best-performing model reaching a macro F1 score of ~71%.

BLP-2023 Task 1: Violence Inciting Text Detection (VITD)
Sourav Saha | Jahedul Alam Junaed | Maryam Saleki | Mohamed Rahouti | Nabeel Mohammed | Mohammad Ruhul Amin
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

We present a comprehensive technical description of the outcome of the BLP shared task on Violence Inciting Text Detection (VITD). In recent years, social media has become a tool for groups of various religions and backgrounds to spread hatred, leading to physical violence with devastating consequences. To address this challenge, the VITD shared task was initiated, aiming to classify the level of violence incitement in various texts. The competition garnered significant interest, with a total of 27 teams consisting of 88 participants successfully submitting their systems to the CodaLab leaderboard. During the post-workshop phase, we received 16 system papers on VITD from those participants. In this paper, we discuss the VITD baseline performance, analyze the errors of the submitted models, and provide a comprehensive summary of the computational techniques applied by the participating teams.

garNER at SemEval-2023: Simplified Knowledge Augmentation for Multilingual Complex Named Entity Recognition
Md Zobaer Hossain | Averie Ho Zoen So | Silviya Silwal | H. Andres Gonzalez Gongora | Ahnaf Mozib Samin | Jahedul Alam Junaed | Aritra Mazumder | Sourav Saha | Sabiha Tahsin Soha
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper presents our solution, garNER, to the SemEval-2023 MultiCoNER task. We propose a knowledge augmentation approach that directly queries entities from the Wikipedia API and appends the summaries of the entities to the input sentence. These entities are either retrieved from the labeled training set (Gold Entity) or from off-the-shelf entity taggers (Entity Extractor). Ensemble methods are then applied across multiple models to get the final prediction. Our analysis shows that the added contexts are beneficial only when they are relevant to the target named entities, and detrimental when they are irrelevant.