Somnath Basu Roy Chowdhury

Also published as: Somnath Basu Roy Chowdhury

2025

Exploring Safety-Utility Trade-Offs in Personalized Language Models
Anvesh Rao Vijjini | Somnath Basu Roy Chowdhury | Snigdha Chaturvedi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

As large language models (LLMs) become increasingly integrated into daily applications, it is essential to ensure they function fairly across diverse user demographics. In this work, we show that LLMs suffer from personalization bias, where their performance is impacted when they are personalized to a user’s identity. We quantify personalization bias by evaluating the performance of LLMs along two axes - safety and utility. We measure safety by examining how benign LLM responses are to unsafe prompts. We measure utility by evaluating the LLM’s performance on various tasks, including general knowledge, mathematical abilities, programming, and reasoning skills. We find that various LLMs, ranging from open-source models like Llama-3.1 and Mistral to API-based ones like GPT-3.5 and GPT-4o, exhibit significant variance in performance in terms of safety and utility when personalized with different user identities. Finally, we discuss several strategies to mitigate personalization bias and investigate the origin of personalization bias.

2023

pdf bib abs

Unsupervised Opinion Summarization Using Approximate Geodesics
Somnath Basu Roy Chowdhury | Nicholas Monath | Kumar Dubey | Amr Ahmed | Snigdha Chaturvedi
Findings of the Association for Computational Linguistics: EMNLP 2023

Opinion summarization is the task of creating summaries capturing popular opinions from user reviews. In this paper, we introduce Geodesic Summarizer (GeoSumm), a novel system to perform unsupervised extractive opinion summarization. GeoSumm consists of an encoder-decoder based representation learning model that generates topical representations of texts. These representations capture the underlying semantics of the text as a distribution over learnable latent units. GeoSumm generates these topical representations by performing dictionary learning over pre-trained text representations at multiple layers of the decoder. We then use these topical representations to quantify the importance of review sentences using a novel approximate geodesic distance-based scoring mechanism. We use the importance scores to identify popular opinions in order to compose general and aspect-specific summaries. Our proposed model, GeoSumm, achieves strong performance on three opinion summarization datasets. We perform additional experiments to analyze the functioning of our model and showcase the generalization ability of GeoSumm across different domains.

pdf bib abs

Unsupervised Opinion Summarization Using Approximate Geodesics
Somnath Basu Roy Chowdhury | Nicholas Monath | Kumar Dubey | Amr Ahmed | Snigdha Chaturvedi
Proceedings of the 4th New Frontiers in Summarization Workshop

Opinion summarization is the task of creating summaries capturing popular opinions from user reviews.In this paper, we introduce Geodesic Summarizer (GeoSumm), a novel system to perform unsupervised extractive opinion summarization. GeoSumm consists of an encoder-decoder based representation learning model that generates topical representations of texts. These representations capture the underlying semantics of the text as a distribution over learnable latent units. GeoSumm generates these topical representations by performing dictionary learning over pre-trained text representations at multiple layers of the decoder. We then use these topical representations to quantify the importance of review sentences using a novel approximate geodesic distance-based scoring mechanism. We use the importance scores to identify popular opinions in order to compose general and aspect-specific summaries. Our proposed model, GeoSumm, achieves strong performance on three opinion summarization datasets. We perform additional experiments to analyze the functioning of our model and showcase the generalization ability of GeoSumm across different domains.

pdf bib abs

Aspect-aware Unsupervised Extractive Opinion Summarization
Haoyuan Li | Somnath Basu Roy Chowdhury | Snigdha Chaturvedi
Findings of the Association for Computational Linguistics: ACL 2023

Extractive opinion summarization extracts sentences from users’ reviews to represent the prevalent opinions about a product or service. However, the extracted sentences can be redundant and may miss some important aspects, especially for centroid-based extractive summarization models (Radev et al., 2004). To alleviate these issues, we introduce TokenCluster– a method for unsupervised extractive opinion summarization that automatically identifies the aspects described in the review sentences and then extracts sentences based on their aspects. It identifies the underlying aspects of the review sentences using roots of noun phrases and adjectives appearing in them. Empirical evaluation shows that TokenCluster improves aspect coverage in summaries and achieves strong performance on multiple opinion summarization datasets, for both general and aspect-specific summarization. We also perform extensive ablation and human evaluation studies to validate the design choices of our method. The implementation of our work is available at https://github.com/leehaoyuan/TokenCluster

2022

pdf bib abs

Learning Fair Representations via Rate-Distortion Maximization
Somnath Basu Roy Chowdhury | Snigdha Chaturvedi
Transactions of the Association for Computational Linguistics, Volume 10

Text representations learned by machine learning models often encode undesirable demographic information of the user. Predictive models based on these representations can rely on such information, resulting in biased decisions. We present a novel debiasing technique, Fairness-aware Rate Maximization (FaRM), that removes protected information by making representations of instances belonging to the same protected attribute class uncorrelated, using the rate-distortion function. FaRM is able to debias representations with or without a target task at hand. FaRM can also be adapted to remove information about multiple protected attributes simultaneously. Empirical evaluations show that FaRM achieves state-of-the-art performance on several datasets, and learned representations leak significantly less protected attribute information against an attack by a non-linear probing network.

pdf bib abs

Unsupervised Extractive Opinion Summarization Using Sparse Coding
Somnath Basu Roy Chowdhury | Chao Zhao | Snigdha Chaturvedi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Opinion summarization is the task of automatically generating summaries that encapsulate information expressed in multiple user reviews. We present Semantic Autoencoder (SemAE) to perform extractive opinion summarization in an unsupervised manner. SemAE uses dictionary learning to implicitly capture semantic information from the review text and learns a latent representation of each sentence over semantic units. Our extractive summarization algorithm leverages the representations to identify representative opinions among hundreds of reviews. SemAE is also able to perform controllable summarization to generate aspect-specific summaries using only a few samples. We report strong performance on SPACE and AMAZON datasets and perform experiments to investigate the functioning of our model.

pdf bib abs

Read Top News First: A Document Reordering Approach for Multi-Document News Summarization
Chao Zhao | Tenghao Huang | Somnath Basu Roy Chowdhury | Muthu Kumar Chandrasekaran | Kathleen McKeown | Snigdha Chaturvedi
Findings of the Association for Computational Linguistics: ACL 2022

A common method for extractive multi-document news summarization is to re-formulate it as a single-document summarization problem by concatenating all documents as a single meta-document. However, this method neglects the relative importance of documents. We propose a simple approach to reorder the documents according to their relative importance before concatenating and summarizing them. The reordering makes the salient content easier to learn by the summarization model. Experiments show that our approach outperforms previous state-of-the-art methods with more complex architectures.

2021

pdf bib abs

Adversarial Scrubbing of Demographic Information for Text Classification
Somnath Basu Roy Chowdhury | Sayan Ghosh | Yiyuan Li | Junier Oliva | Shashank Srivastava | Snigdha Chaturvedi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Contextual representations learned by language models can often encode undesirable attributes, like demographic associations of the users, while being trained for an unrelated target task. We aim to scrub such undesirable attributes and learn fair representations while maintaining performance on the target task. In this paper, we present an adversarial learning framework “Adversarial Scrubber” (AdS), to debias contextual representations. We perform theoretical analysis to show that our framework converges without leaking demographic information under certain conditions. We extend previous evaluation techniques by evaluating debiasing performance using Minimum Description Length (MDL) probing. Experimental evaluations on 8 datasets show that AdS generates representations with minimal information about demographic attributes while being maximally informative about the target task.

pdf bib abs

Does Commonsense help in detecting Sarcasm?
Somnath Basu Roy Chowdhury | Snigdha Chaturvedi
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Sarcasm detection is important for several NLP tasks such as sentiment identification in product reviews, user feedback, and online forums. It is a challenging task requiring a deep understanding of language, context, and world knowledge. In this paper, we investigate whether incorporating commonsense knowledge helps in sarcasm detection. For this, we incorporate commonsense knowledge into the prediction process using a graph convolution network with pre-trained language model embeddings as input. Our experiments with three sarcasm detection datasets indicate that the approach does not outperform the baseline model. We perform an exhaustive set of experiments to analyze where commonsense support adds value and where it hurts classification. Our implementation is publicly available at: https://github.com/brcsomnath/commonsense-sarcasm.

pdf bib abs

Is Everything in Order? A Simple Way to Order Sentences
Somnath Basu Roy Chowdhury | Faeze Brahman | Snigdha Chaturvedi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The task of organizing a shuffled set of sentences into a coherent text has been used to evaluate a machine’s understanding of causal and temporal relations. We formulate the sentence ordering task as a conditional text-to-marker generation problem. We present Reorder-BART (Re-BART) that leverages a pre-trained Transformer-based model to identify a coherent order for a given set of shuffled sentences. The model takes a set of shuffled sentences with sentence-specific markers as input and generates a sequence of position markers of the sentences in the ordered text. Re-BART achieves the state-of-the-art performance across 7 datasets in Perfect Match Ratio (PMR) and Kendall’s tau. We perform evaluations in a zero-shot setting, showcasing that our model is able to generalize well across other datasets. We additionally perform several experiments to understand the functioning and limitations of our framework.

2019

pdf bib abs

Instance-based Inductive Deep Transfer Learning by Cross-Dataset Querying with Locality Sensitive Hashing
Somnath Basu Roy Chowdhury | Annervaz M | Ambedkar Dukkipati
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

Supervised learning models are typically trained on a single dataset and the performance of these models rely heavily on the size of the dataset i.e., the amount of data available with ground truth. Learning algorithms try to generalize solely based on the data that it is presented with during the training. In this work, we propose an inductive transfer learning method that can augment learning models by infusing similar instances from different learning tasks in Natural Language Processing (NLP) domain. We propose to use instance representations from a source dataset, without inheriting anything else from the source learning model. Representations of the instances of source and target datasets are learned, retrieval of relevant source instances is performed using soft-attention mechanism and locality sensitive hashing and then augmented into the model during training on the target dataset. Therefore, while learning from a training data, we also simultaneously exploit and infuse relevant local instance-level information from an external data. Using this approach we have shown significant improvements over the baseline for three major news classification datasets. Experimental evaluations also show that the proposed approach reduces dependency on labeled data by a significant margin for comparable performance. With our proposed cross dataset learning procedure we show that one can achieve competitive/better performance than learning from a single dataset.

2018

pdf bib abs

Learning beyond Datasets: Knowledge Graph Augmented Neural Networks for Natural Language Processing
Annervaz K M | Somnath Basu Roy Chowdhury | Ambedkar Dukkipati
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Machine Learning has been the quintessential solution for many AI problems, but learning models are heavily dependent on specific training data. Some learning models can be incorporated with prior knowledge using a Bayesian setup, but these learning models do not have the ability to access any organized world knowledge on demand. In this work, we propose to enhance learning models with world knowledge in the form of Knowledge Graph (KG) fact triples for Natural Language Processing (NLP) tasks. Our aim is to develop a deep learning model that can extract relevant prior support facts from knowledge graphs depending on the task using attention mechanism. We introduce a convolution-based model for learning representations of knowledge graph entity and relation clusters in order to reduce the attention space. We show that the proposed method is highly scalable to the amount of prior information that has to be processed and can be applied to any generic NLP task. Using this method we show significant improvement in performance for text classification with 20Newsgroups (News20) & DBPedia datasets, and natural language inference with Stanford Natural Language Inference (SNLI) dataset. We also demonstrate that a deep learning model can be trained with substantially less amount of labeled training data, when it has access to organized world knowledge in the form of a knowledge base.