Sopan Khosla - ACL Anthology

Sopan Khosla

2025

FairGen: Controlling Sensitive Attributes for Fair Generations in Diffusion Models via Adaptive Latent Guidance
Mintong Kang | Vinayshekhar Bannihatti Kumar | Shamik Roy | Abhishek Kumar | Sopan Khosla | Balakrishnan Murali Narayanaswamy | Rashmi Gangadharaiah
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Text-to-image diffusion models often exhibit biases toward specific demographic groups, such as generating more males than females when prompted to generate images of engineers, raising ethical concerns and limiting their adoption. In this paper, we tackle the challenge of mitigating generation bias towards any target attribute value (e.g., “male” for “gender”) in diffusion models while preserving generation quality. We propose FairGen, an adaptive latent guidance mechanism which controls the generation distribution during inference. In FairGen, a latent guidance module dynamically adjusts the diffusion process to enforce specific attributes, while a memory module tracks the generation statistics and steers latent guidance to align with the targeted fair distribution of the attribute values. Further, given the limitations of existing datasets in comprehensively assessing bias in diffusion models, we introduce a holistic bias evaluation benchmark HBE, covering diverse domains and incorporating complex prompts across various applications. Extensive evaluations on HBE and Stable Bias datasets demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction (e.g., 68.5% gender bias reduction on Stable Diffusion 2). Ablation studies highlight FairGen’s ability to flexibly and precisely control generation distribution at any user-specified granularity, ensuring adaptive and targeted bias mitigation.

2024

Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA
Dhruv Agarwal | Rajarshi Das | Sopan Khosla | Rashmi Gangadharaiah
Findings of the Association for Computational Linguistics: NAACL 2024

We present BYOKG, a universal question-answering (QA) system that can operate on any knowledge graph (KG), requires no human-annotated training data, and can be ready to use within a day—attributes that are out-of-scope for current KGQA systems. BYOKG draws inspiration from the remarkable ability of humans to comprehend information present in an unseen KG through exploration—starting at random nodes, inspecting the labels of adjacent nodes and edges, and combining them with their prior world knowledge. Exploration in BYOKG leverages an LLM-backed symbolic agent that generates a diverse set of query-program exemplars, which are then used to ground a retrieval-augmented reasoning procedure to synthesize programs for arbitrary questions. BYOKG is effective over both small- and large-scale graphs, showing dramatic gains in zero-shot QA accuracy of 27.89 and 59.88 F1 on GrailQA and MetaQA, respectively. We further find that performance of BYOKG reliably improves with continued exploration as well as improvements in the base LLM, notably outperforming a state-of-the-art fine-tuned model by 7.08 F1 on a sub-sampled zero-shot split of GrailQA. Lastly, we verify our universality claim by evaluating BYOKG on a domain-specific materials science KG and show that it improves zero-shot performance by 46.33 F1.

Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems
Koji Inoue | Yahui Fu | Agnes Axelsson | Atsumoto Ohashi | Brielen Madureira | Yuki Zenimoto | Biswesh Mohapatra | Armand Stricker | Sopan Khosla
Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems

2023

Information Extraction and Program Synthesis from Goal-Oriented Dialogue
Sopan Khosla
Proceedings of the 19th Annual Meeting of the Young Reseachers' Roundtable on Spoken Dialogue Systems

My research interests broadly lie in the area of Information Extraction from Spoken Dialogue, with a spacial focus on state modeling, anaphora resolution, program synthesis & planning, and intent classification in goal-oriented conversations. My aim is to create embedded dialogue systems that can interact with humans in a collaborative setup to solve tasks in a digital/non-digital environment. Most of the goal-oriented conversations usually involve experts and a laypersons. The aim for the expert is to consider all the information provided by the layperson, identify the underlying set of issues or intents, and prescribe solutions. While human experts are very good at extracting such information, AI agents (that build up most of the automatic dialog systems today) not so much. Most of the existing assistants (or chatbots) only consider individual utterances and do not ground them in the context of the dialogue. My work in this direction has focused on making these systems more effective at extracting the most relevant information from the dialogue to help the human user reach their end-goal.

GrailQA++: A Challenging Zero-Shot Benchmark for Knowledge Base Question Answering
Ritam Dutt | Sopan Khosla | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the Third Workshop on NLP for Medical Conversations
Sopan Khosla
Proceedings of the Third Workshop on NLP for Medical Conversations

Exploring the Reasons for Non-generalizability of KBQA systems
Sopan Khosla | Ritam Dutt | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP

Recent research has demonstrated impressive generalization capabilities of several Knowledge Base Question Answering (KBQA) models on the GrailQA dataset. We inspect whether these models can generalize to other datasets in a zero-shot setting. We notice a significant drop in performance and investigate the causes for the same. We observe that the models are dependent not only on the structural complexity of the questions, but also on the linguistic styles of framing a question. Specifically, the linguistic dimensions corresponding to explicitness, readability, coherence, and grammaticality have a significant impact on the performance of state-of-the-art KBQA models. Overall our results showcase the brittleness of such models and the need for creating generalizable systems.

2022

Evaluating the Practical Utility of Confidence-score based Techniques for Unsupervised Open-world Classification
Sopan Khosla | Rashmi Gangadharaiah
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Open-world classification in dialog systems require models to detect open intents, while ensuring the quality of in-domain (ID) intent classification. In this work, we revisit methods that leverage distance-based statistics for unsupervised out-of-domain (OOD) detection. We show that despite their superior performance on threshold-independent metrics like AUROC on test-set, threshold values chosen based on the performance on a validation-set do not generalize well to the test-set, thus resulting in substantially lower performance on ID or OOD detection accuracy and F1-scores. Our analysis shows that this lack of generalizability can be successfully mitigated by setting aside a hold-out set from validation data for threshold selection (sometimes achieving relative gains as high as 100%). Extensive experiments on seven benchmark datasets show that this fix puts the performance of these methods at par with, or sometimes even better than, the current state-of-the-art OOD detection techniques.

Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Juntao Yu | Sopan Khosla | Ramesh Manuvinakurike | Lori Levin | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rose
Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

The CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Juntao Yu | Sopan Khosla | Ramesh Manuvinakurike | Lori Levin | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

The CODI-CRAC 2022 Shared Task on Anaphora Resolution in Dialogues is the second edition of an initiative focused on detecting different types of anaphoric relations in conversations of different kinds. Using five conversational datasets, four of which have been newly annotated with a wide range of anaphoric relations: identity, bridging references and discourse deixis, we defined multiple tasks focusing individually on these key relations. The second edition of the shared task maintained the focus on these relations and used the same datasets as in 2021, but new test data were annotated, the 2021 data were checked, and new subtasks were added. In this paper, we discuss the annotation schemes, the datasets, the evaluation scripts used to assess the system performance on these tasks, and provide a brief summary of the participating systems and the results obtained across 230 runs from three teams, with most submissions achieving significantly better results than our baseline methods.

The Universal Anaphora Scorer
Juntao Yu | Sopan Khosla | Nafise Sadat Moosavi | Silviu Paun | Sameer Pradhan | Massimo Poesio
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The aim of the Universal Anaphora initiative is to push forward the state of the art in anaphora and anaphora resolution by expanding the aspects of anaphoric interpretation which are or can be reliably annotated in anaphoric corpora, producing unified standards to annotate and encode these annotations, deliver datasets encoded according to these standards, and developing methods for evaluating models carrying out this type of interpretation. Such expansion of the scope of anaphora resolution requires a comparable expansion of the scope of the scorers used to evaluate this work. In this paper, we introduce an extended version of the Reference Coreference Scorer (Pradhan et al., 2014) that can be used to evaluate the extended range of anaphoric interpretation included in the current Universal Anaphora proposal. The UA scorer supports the evaluation of identity anaphora resolution and of bridging reference resolution, for which scorers already existed but not integrated in a single package. It also supports the evaluation of split antecedent anaphora and discourse deixis, for which no tools existed. The proposed approach to the evaluation of split antecedent anaphora is entirely novel; the proposed approach to the evaluation of discourse deixis leverages the encoding of discourse deixis proposed in Universal Anaphora to enable the use for discourse deixis of the same metrics already used for identity anaphora. The scorer was tested in the recent CODI-CRAC 2021 Shared Task on Anaphora Resolution in Dialogues.

Benchmarking the Covariate Shift Robustness of Open-world Intent Classification Approaches
Sopan Khosla | Rashmi Gangadharaiah
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Task-oriented dialog systems deployed in real-world applications are often challenged by out-of-distribution queries. These systems should not only reliably detect utterances with unsupported intents (semantic shift), but also generalize to covariate shift (supported intents from unseen distributions). However, none of the existing benchmarks for open-world intent classification focus on the second aspect, thus only performing a partial evaluation of intent detection techniques. In this work, we propose two new datasets ( and ) that include utterances useful for evaluating the robustness of open-world models to covariate shift. Along with the i.i.d. test set, both datasets contain a new cov-test set that, along with out-of-scope utterances, contains in-scope utterances sampled from different distributions not seen during training. This setting better mimics the challenges faced in real-world applications. Evaluating several open-world classifiers on the new datasets reveals that models that perform well on the test set struggle to generalize to the cov-test. Our datasets fill an important gap in the field, offering a more realistic evaluation scenario for intent classification in task-oriented dialog systems.

2021

FanfictionNLP: A Text Processing Pipeline for Fanfiction
Michael Yoder | Sopan Khosla | Qinlan Shen | Aakanksha Naik | Huiming Jin | Hariharan Muralidharan | Carolyn Rosé
Proceedings of the Third Workshop on Narrative Understanding

Fanfiction presents an opportunity as a data source for research in NLP, education, and social science. However, answering specific research questions with this data is difficult, since fanfiction contains more diverse writing styles than formal fiction. We present a text processing pipeline for fanfiction, with a focus on identifying text associated with characters. The pipeline includes modules for character identification and coreference, as well as the attribution of quotes and narration to those characters. Additionally, the pipeline contains a novel approach to character coreference that uses knowledge from quote attribution to resolve pronouns within quotes. For each module, we evaluate the effectiveness of various approaches on 10 annotated fanfiction stories. This pipeline outperforms tools developed for formal fiction on the tasks of character coreference and quote attribution

Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Sopan Khosla | Ramesh Manuvinakurike | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

The CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Sopan Khosla | Juntao Yu | Ramesh Manuvinakurike | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

In this paper, we provide an overview of the CODI-CRAC 2021 Shared-Task: Anaphora Resolution in Dialogue. The shared task focuses on detecting anaphoric relations in different genres of conversations. Using five conversational datasets, four of which have been newly annotated with a wide range of anaphoric relations: identity, bridging references and discourse deixis, we defined multiple subtasks focusing individually on these key relations. We discuss the evaluation scripts used to assess the system performance on these subtasks, and provide a brief summary of the participating systems and the results obtained across ?? runs from 5 teams, with most submissions achieving significantly better results than our baseline methods.

Counterfactuals to Control Latent Disentangled Text Representations for Style Transfer
Sharmila Reddy Nangi | Niyati Chhaya | Sopan Khosla | Nikhil Kaushik | Harshit Nyati
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Disentanglement of latent representations into content and style spaces has been a commonly employed method for unsupervised text style transfer. These techniques aim to learn the disentangled representations and tweak them to modify the style of a sentence. In this paper, we propose a counterfactual-based method to modify the latent representation, by posing a ‘what-if’ scenario. This simple and disciplined approach also enables a fine-grained control on the transfer strength. We conduct experiments with the proposed methodology on multiple attribute transfer tasks like Sentiment, Formality and Excitement to support our hypothesis.

Team JARS: DialDoc Subtask 1 - Improved Knowledge Identification with Supervised Out-of-Domain Pretraining
Sopan Khosla | Justin Lovelace | Ritam Dutt | Adithya Pratapa
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

In this paper, we discuss our submission for DialDoc subtask 1. The subtask requires systems to extract knowledge from FAQ-type documents vital to reply to a user’s query in a conversational setting. We experiment with pretraining a BERT-based question-answering model on different QA datasets from MRQA, as well as conversational QA datasets like CoQA and QuAC. Our results show that models pretrained on CoQA and QuAC perform better than their counterparts that are pretrained on MRQA datasets. Our results also indicate that adding more pretraining data does not necessarily result in improved performance. Our final model, which is an ensemble of AlBERT-XL pretrained on CoQA and QuAC independently, with the chosen answer having the highest average probability score, achieves an F1-Score of 70.9% on the official test-set.

Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques
Kundan Krishna | Sopan Khosla | Jeffrey Bigham | Zachary C. Lipton
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines to leverage deep summarization models to generate these notes based on transcripts of conversations between physicians and patients. After exploring a spectrum of methods across the extractive-abstractive spectrum, we propose Cluster2Sent, an algorithm that (i) extracts important utterances relevant to each summary section; (ii) clusters together related utterances; and then (iii) generates one summary sentence per cluster. Cluster2Sent outperforms its purely abstractive counterpart by 8 ROUGE-1 points, and produces significantly more factual and coherent sentences as assessed by expert human evaluators. For reproducibility, we demonstrate similar benefits on the publicly available AMI dataset. Our results speak to the benefits of structuring summaries into sections and annotating supporting evidence when constructing summarization corpora.

Evaluating the Impact of a Hierarchical Discourse Representation on Entity Coreference Resolution Performance
Sopan Khosla | James Fiacco | Carolyn Rosé
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent work on entity coreference resolution (CR) follows current trends in Deep Learning applied to embeddings and relatively simple task-related features. SOTA models do not make use of hierarchical representations of discourse structure. In this work, we leverage automatically constructed discourse parse trees within a neural approach and demonstrate a significant improvement on two benchmark entity coreference-resolution datasets. We explore how the impact varies depending upon the type of mention.

2020

Using Type Information to Improve Entity Coreference Resolution
Sopan Khosla | Carolyn Rose
Proceedings of the First Workshop on Computational Approaches to Discourse

Coreference resolution (CR) is an essential part of discourse analysis. Most recently, neural approaches have been proposed to improve over SOTA models from earlier paradigms. So far none of the published neural models leverage external semantic knowledge such as type information. This paper offers the first such model and evaluation, demonstrating modest gains in accuracy by introducing either gold standard or predicted types. In the proposed approach, type information serves both to (1) improve mention representation and (2) create a soft type consistency check between coreference candidate mentions. Our evaluation covers two different grain sizes of types over four different benchmark corpora.

LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification
Sopan Khosla | Rishabh Joshi | Ritam Dutt | Alan W Black | Yulia Tsvetkov
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper we describe our submission for the task of Propaganda Span Identification in news articles. We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda. The ”multi-granular” model incorporates linguistic knowledge at various levels of text granularity, including word, sentence and document level syntactic, semantic and pragmatic affect features, which significantly improve model performance, compared to its language-agnostic variant. To facilitate better representation learning, we also collect a corpus of 10k news articles, and use it for fine-tuning the model. The final model is a majority-voting ensemble which learns different propaganda class boundaries by leveraging different subsets of incorporated knowledge.

MedFilter: Improving Extraction of Task-relevant Utterances through Integration of Discourse Structure and Ontological Knowledge
Sopan Khosla | Shikhar Vashishth | Jill Fain Lehman | Carolyn Rose
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Information extraction from conversational data is particularly challenging because the task-centric nature of conversation allows for effective communication of implicit information by humans, but is challenging for machines. The challenges may differ between utterances depending on the role of the speaker within the conversation, especially when relevant expertise is distributed asymmetrically across roles. Further, the challenges may also increase over the conversation as more shared context is built up through information communicated implicitly earlier in the dialogue. In this paper, we propose the novel modeling approach MedFilter, which addresses these insights in order to increase performance at identifying and categorizing task-relevant utterances, and in so doing, positively impacts performance at a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations where MedFilter is used to identify medically relevant contributions to the discussion (achieving a 10% improvement over SOTA baselines in terms of area under the PR curve). Identifying task-relevant utterances benefits downstream medical processing, achieving improvements of 15%, 105%, and 23% respectively for the extraction of symptoms, medications, and complaints.

2018

EmotionX-AR: CNN-DCNN autoencoder based Emotion Classifier
Sopan Khosla
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

In this paper, we model emotions in EmotionLines dataset using a convolutional-deconvolutional autoencoder (CNN-DCNN) framework. We show that adding a joint reconstruction loss improves performance. Quantitative evaluation with jointly trained network, augmented with linguistic features, reports best accuracies for emotion prediction; namely joy, sadness, anger, and neutral emotion in text.

Aff2Vec: Affect–Enriched Distributional Word Representations
Sopan Khosla | Niyati Chhaya | Kushal Chawla
Proceedings of the 27th International Conference on Computational Linguistics

Human communication includes information, opinions and reactions. Reactions are often captured by the affective-messages in written as well as verbal communications. While there has been work in affect modeling and to some extent affective content generation, the area of affective word distributions is not well studied. Synsets and lexica capture semantic relationships across words. These models, however, lack in encoding affective or emotional word interpretations. Our proposed model, Aff2Vec, provides a method for enriched word embeddings that are representative of affective interpretations of words. Aff2Vec outperforms the state-of-the-art in intrinsic word-similarity tasks. Further, the use of Aff2Vec representations outperforms baseline embeddings in downstream natural language understanding tasks including sentiment analysis, personality detection, and frustration prediction.

Venues