Sumegh Roychowdhury

2025

*The comparison between discriminative and generative classifiers has intrigued researchers since [Efron (1975)’s](https://www.jstor.org/stable/2285453) seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures for text classification: Auto-regressive, Masked Language Modeling, Discrete Diffusion, and Encoders. Our study reveals that the classical “two regimes” phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.*

2024

Exploring Ordinality in Text Classification: A Comparative Study of Explicit and Implicit Techniques
Siva Rajesh Kasa | Aniket Goel | Karan Gupta | Sumegh Roychowdhury | Pattisapu Priyatam | Anish Bhanushali | Prasanna Srinivasa Murthy
Findings of the Association for Computational Linguistics: ACL 2024

Ordinal Classification (OC) is a widely encountered challenge in Natural Language Processing (NLP), with applications in domains such as sentiment analysis and rating prediction. Previous approaches to OC have primarily focused on modifying existing loss functions or creating novel ones that explicitly account for the ordinal nature of labels. However, with the advent of Pre-trained Language Models (PLMs), it became possible to tackle ordinality through the implicit semantics of the labels as well. This paper provides a comprehensive theoretical and empirical examination of both these approaches. We also offer strategic recommendations regarding the most effective approach to adopt based on specific settings.

2023

Data-Efficient Methods For Improving Hate Speech Detection
Sumegh Roychowdhury | Vikram Gupta
Findings of the Association for Computational Linguistics: EACL 2023

Scarcity of large-scale datasets, especially for resource-impoverished languages, motivates exploration of data-efficient methods for hate speech detection. Hateful intents are expressed explicitly (use of cuss, swear, abusive words) and implicitly (indirect and contextual). In this work, we advance implicit and explicit hate speech detection using an input-level data augmentation technique, task reformulation using entailment, and cross-learning across five languages. Our proposed data augmentation technique, EasyMix, improves performance across all English datasets by ~1% and across multilingual datasets by ~1-9%. We also observe substantial gains of ~2-8% by reformulating hate speech detection as an entailment problem. We further probe the contextual models and observe that higher layers encode implicit hate while lower layers focus on explicit hate, highlighting the importance of token-level understanding for explicit and context-level understanding for implicit hate speech detection. Code and dataset splits: https://anonymous.4open.science/r/data_efficient_hatedetect/

2022

CRUSH: Contextually Regularized and User anchored Self-supervised Hate speech Detection
Souvic Chakraborty | Parag Dutta | Sumegh Roychowdhury | Animesh Mukherjee
Findings of the Association for Computational Linguistics: NAACL 2022

The last decade has witnessed a surge in the interaction of people through social networking platforms. While there are several positive aspects of these social platforms, their proliferation has led them to become a breeding ground for cyber-bullying and hate speech. Recent advances in NLP have often been used to mitigate the spread of such hateful content. Since the task of hate speech detection is usually applicable in the context of social networks, we introduce CRUSH, a framework for hate speech detection using user-anchored self-supervision and contextual regularization. Our proposed approach secures a ~1-12% improvement in test set metrics over the best-performing previous approaches on two types of tasks and multiple popular English-language social networking datasets.

Representation Learning for Conversational Data using Discourse Mutual Information Maximization
Bishal Santra | Sumegh Roychowdhury | Aishik Mandal | Vasu Gurram | Atharva Naik | Manish Gupta | Pawan Goyal
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Although many pretrained models exist for text or images, there have been relatively few attempts to train representations specifically for dialog understanding. Prior works usually relied on finetuned representations based on generic text representation models like BERT or GPT-2. But such language modeling pretraining objectives do not take the structural information of conversational text into consideration. Although generative dialog models can learn structural features too, we argue that structure-unaware word-by-word generation is not suitable for effective conversation modeling. We empirically demonstrate that such representations do not perform consistently across various dialog understanding tasks. Hence, we propose a structure-aware, mutual-information-based loss function, DMI (Discourse Mutual Information), for training dialog-representation models, which additionally captures the inherent uncertainty in response prediction. Extensive evaluation on nine diverse dialog modeling tasks shows that our proposed DMI-based models outperform strong baselines by significant margins.

2019

IIT-KGP at MEDIQA 2019: Recognizing Question Entailment using Sci-BERT stacked with a Gradient Boosting Classifier
Prakhar Sharma | Sumegh Roychowdhury
Proceedings of the 18th BioNLP Workshop and Shared Task

Official system description paper of Team IIT-KGP, ranked 1st in the development phase and 3rd in the testing phase of the MEDIQA 2019 Recognizing Question Entailment (RQE) shared task at the BioNLP workshop, ACL 2019. The number of people turning to the Internet to search for a diverse range of health-related subjects continues to grow, and with this multitude of information available, duplicate questions are becoming more frequent and finding the most appropriate answers becomes problematic. This issue is important for question answering platforms, as it complicates the retrieval of all information relevant to the same topic, particularly when questions similar in essence are expressed differently. Answering a given medical question by retrieving similar questions that are already answered by human experts seems to be a promising solution. In this paper, we present our novel approach to detect question entailment by determining the type of question asked rather than focusing on the type of the ailment given. This methodology makes the approach robust to examples that have different ailment names which are synonyms of each other. It also enables us to check entailment at a much more fine-grained level. QSpider is a staged system consisting of the state-of-the-art model Sci-BERT, used as a multi-class classifier aimed at capturing both question types and semantic relations, stacked with a Gradient Boosting Classifier that checks for entailment. QSpider achieves an accuracy of 68.4% on the test set, outperforming the baseline model (54.1%) by 14.3 percentage points.

IIT-KGP at COIN 2019: Using pre-trained Language Models for modeling Machine Comprehension
Prakhar Sharma | Sumegh Roychowdhury
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

In this paper, we describe our system for COIN 2019 Shared Task 1: Commonsense Inference in Everyday Narrations. We show the power of leveraging state-of-the-art pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) and XLNet over other commonsense knowledge base resources such as ConceptNet and NELL for modeling machine comprehension. We used an ensemble of BERT-Large and XLNet-Large. Experimental results show that our model gives substantial improvements over the baseline and other systems incorporating knowledge bases. We placed 2nd on the final test set leaderboard with an accuracy of 90.5%.