Ankita Gupta

2025

Supervised fine-tuning (SFT) on benign data can paradoxically erode a language model’s safety alignment, a phenomenon known as catastrophic forgetting of safety behaviors. Although prior work shows that randomly adding safety examples can reduce harmful output, the principles that make certain examples more effective than others remain poorly understood. This paper investigates the hypothesis that the effectiveness of a safety example is governed by two key factors: its instruction-response behavior (e.g., refusal vs. explanation) and its semantic diversity across harm categories. We systematically evaluate sampling strategies based on these axes and find that structured, diversity-aware sampling significantly improves model safety. Our method reduces harmfulness by up to 41% while adding only 0.05% more data to the fine-tuning set.

Automated main concept generation for narrative discourse assessment in aphasia
Ankita Gupta | Marisa Hudspeth | Polly Stokes | Jacquie Kurland | Brendan O’Connor
Findings of the Association for Computational Linguistics: ACL 2025

We present an interesting application of narrative understanding in the clinical assessment of aphasia, where story retelling tasks are used to evaluate a patient’s communication abilities. This clinical setting provides a framework to help operationalize narrative discourse analysis and an application-focused evaluation method for narrative understanding systems. In particular, we highlight the use of main concepts (MCs)—a list of statements that capture a story’s gist—for aphasic discourse analysis. We then propose automatically generating MCs from novel stories, which experts can edit manually, thus enabling wider adaptation of current assessment tools. We further develop a prompt ensemble method using large language models (LLMs) to automatically generate MCs for a novel story. We evaluate our method on an existing narrative summarization dataset to establish its intrinsic validity. We further apply it to a set of stories that have been annotated with MCs through extensive analysis of retells from non-aphasic and aphasic participants (Kurland et al., 2021, 2025). Our results show that our proposed method can generate most of the gold-standard MCs for stories from this dataset. Finally, we release this dataset of stories with annotated MCs to spur more research in this area.

𝛿-Stance: A Large-Scale Real World Dataset of Stances in Legal Argumentation
Ankita Gupta | Douglas Rice | Brendan O’Connor
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present 𝛿-Stance, a large-scale dataset of stances involved in legal argumentation. 𝛿-Stance contains stance-annotated argument pairs, semi-automatically mined from millions of examples of U.S. judges citing precedent in context using citation signals. The dataset aims to facilitate work on the legal argument stance classification task, which involves assessing whether a case summary strengthens or weakens a legal argument (polarity) and to what extent (intensity). To assess the complexity of this task, we evaluate various existing NLP methods, including zero-shot prompting proprietary large language models (LLMs), and supervised fine-tuning of smaller open-weight language models (LMs) on 𝛿-Stance. Our findings reveal that although prompting proprietary LLMs can help predict stance polarity, supervised model fine-tuning on 𝛿-Stance is necessary to distinguish intensity. We further find that alternative strategies such as domain-specific pretraining and zero-shot prompting using masked LMs remain insufficient. Beyond our dataset’s utility for the legal domain, we further find that fine-tuning small LMs on 𝛿-Stance improves their performance in other domains. Finally, we study how temporal changes in signal definition can impact model performance, highlighting the importance of careful data curation for downstream tasks by considering the historical and sociocultural context. We publish the associated dataset to foster further research on legal argument reasoning.

2024

Harnessing Toulmin’s theory for zero-shot argument explication
Ankita Gupta | Ethan Zuckerman | Brendan O’Connor
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

To better analyze informal arguments on public forums, we propose the task of argument explication, which makes explicit a text’s argumentative structure and implicit reasoning by outputting triples of propositions ⟨claim, reason, warrant⟩. The three slots, or argument components, are derived from the widely known Toulmin (1958) model of argumentation. While prior research applies Toulmin or related theories to annotate datasets and train supervised models, we develop an effective method to prompt generative large language models (LMs) to output explicitly named argument components proposed by Toulmin by prompting with the theory name (e.g., ‘According to Toulmin model’). We evaluate the outputs’ coverage and validity through a human study and automatic evaluation based on prior argumentation datasets and perform robustness checks over alternative LMs, prompts, and argumentation theories. Finally, we conduct a proof-of-concept case study to extract an interpretable argumentation (hyper)graph from a large corpus of critical public comments on whether to allow the COVID-19 vaccine for children, suggesting future directions for corpus analysis and argument visualization.

Question-answering for domain-specific applications has recently attracted much interest due to the latest advancements in large language models (LLMs). However, accurately assessing the performance of these applications remains a challenge, mainly due to the lack of suitable benchmarks that effectively simulate real-world scenarios. To address this challenge, we introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products to help evaluate the performance of existing models on domain-specific product QA tasks. Additionally, we propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task. Our experiments demonstrated that inducing domain knowledge through query reformulation allowed for increased retrieval and generative performance when compared to standard RAG-QA methods. This improvement, however, is slight, and thus illustrates the challenge posed by the datasets introduced.

NarrativeTime: Dense Temporal Annotation on a Timeline
Anna Rogers | Marzena Karpinska | Ankita Gupta | Vladislav Lialin | Gregory Smelkov | Anna Rumshisky
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLINKs. To compare with the previous SOTA in dense temporal annotation, we perform full re-annotation of the classic TimeBankDense corpus (American English), which shows comparable agreement with a significant increase in density. We contribute TimeBankNT corpus (with each text fully annotated by two expert annotators), extensive annotation guidelines, open-source tools for annotation and conversion to TimeML format, and baseline results.

Evaluating Robustness of Open Dialogue Summarization Models in the Presence of Naturally Occurring Variations
Ankita Gupta | Chulaka Gunasekara | Hui Wan | Jatin Ganhotra | Sachindra Joshi | Marina Danilevsky
Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)

Dialogue summarization involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations). In this study, we systematically investigate the impact of such variations on state-of-the-art open dialogue summarization models whose details are publicly known (e.g., architectures, weights, and training corpora). To simulate real-life variations, we introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges (e.g., repetitions, greetings). We perform our analysis along three dimensions of robustness: consistency, saliency, and faithfulness, which aim to capture different aspects of performance of a summarization model. We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible, particularly to dialogue-level perturbations. We also validate our findings via human evaluation. Finally, we investigate whether the robustness of fine-tuned models can be improved by training them with a fraction of perturbed data. We find that this approach does not yield consistent performance gains, warranting further research. Overall, our work highlights robustness challenges in current open encoder-decoder summarization models and provides insights for future research.

2023

Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreferences and have been collected via complex and lengthy guidelines that are curated for linguistic experts. These concerns have sparked a growing interest among researchers to curate a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets. Surprisingly, we find that reasonable quality annotations were already achievable (90% agreement between the crowd and expert annotations) even without extensive training. On carefully analyzing the remaining disagreements, we identify the presence of linguistic cases that our annotators unanimously agree upon but lack unified treatments (e.g., generic pronouns, appositives) in existing datasets. We propose the research community should revisit these phenomena when curating future unified annotation guidelines.

2022

Examining Political Rhetoric with Epistemic Stance Detection
Ankita Gupta | Su Lin Blodgett | Justin H Gross | Brendan O’Connor
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

Participants in political discourse employ rhetorical strategies—such as hedging, attributions, or denials—to display varying degrees of belief commitments to claims proposed by themselves or others. Traditionally, political scientists have studied these epistemic phenomena through labor-intensive manual content analysis. We propose to help automate such work through epistemic stance prediction, drawn from research in computational semantics, to distinguish at the clausal level what is asserted, denied, or only ambivalently suggested by the author or other mentioned entities (belief holders). We first develop a simple RoBERTa-based model for multi-source stance predictions that outperforms more complex state-of-the-art modeling. Then we demonstrate its novel application to political science by conducting a large-scale analysis of the Mass Market Manifestos corpus of U.S. political opinion books, where we characterize trends in cited belief holders—respected allies and opposed bogeymen—across U.S. political ideologies.

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTScore is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.

2020

This paper describes the details of our system (Solomon) and the results of our participation in SemEval 2020 Task 11, “Detection of Propaganda Techniques in News Articles”. We participated in the “Technique Classification” (TC) task, which is a multi-class classification task. To address the TC task, we fine-tuned a RoBERTa-based transformer architecture on the propaganda dataset. The predictions of RoBERTa were further fine-tuned by class-dependent-minority-class classifiers. A special classifier, which employs a dynamically adapted Least Common Sub-sequence algorithm, is used to adapt to the intricacies of the repetition class. Compared to the other participating systems, our submission is ranked 4th on the leaderboard.

2019

Vernon-fenwick at SemEval-2019 Task 4: Hyperpartisan News Detection using Lexical and Semantic Features
Vertika Srivastava | Ankita Gupta | Divya Prakash | Sudeep Kumar Sahoo | Rohit R.R | Yeon Hyang Kim
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our submission for SemEval-2019 Task 4: Hyperpartisan News Detection. Hyperpartisan news articles are sharply polarized and extremely biased (one-sided). They show blind beliefs, opinions, and unreasonable adherence to a party, idea, faction, or person. Through this task, we aim to develop an automated system that can be used to detect hyperpartisan news and serve as a prescreening technique for fake news detection. The proposed system jointly uses a rich set of handcrafted textual and semantic features. Our system achieved 2nd rank on the primary metric (82.0% accuracy) and 1st rank on the secondary metric (82.1% F1-score), among all participating teams. Comparison with the best performing system on the leaderboard shows that our system is behind by only 0.2% absolute difference in accuracy.

SolomonLab at SemEval-2019 Task 8: Question Factuality and Answer Veracity Prediction in Community Forums
Ankita Gupta | Sudeep Kumar Sahoo | Divya Prakash | Rohit R.R | Vertika Srivastava | Yeon Hyang Kim
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe our system for SemEval-2019, Task 8 on “Fact-Checking in Community Question Answering Forums (cQA)”. cQA forums are very prevalent nowadays, as they provide an effective means for communities to share knowledge. Unfortunately, this shared information is not always factual and fact-verified. In this task, we aim to identify factual questions posted on cQA and verify the veracity of answers to these questions. Our approach relies on data augmentation and aggregates cues from several dimensions such as semantics, linguistics, syntax, writing style and evidence obtained from trusted external sources. In subtask A, our submission is ranked 3rd, with an accuracy of 83.14%. Our current best solution stands 1st on the leaderboard with 88% accuracy. In subtask B, our present solution is ranked 2nd, with 58.33% MAP score.

MSIT_SRIB at MEDIQA 2019: Knowledge Directed Multi-task Framework for Natural Language Inference in Clinical Domain
Sahil Chopra | Ankita Gupta | Anupama Kaushik
Proceedings of the 18th BioNLP Workshop and Shared Task

In this paper, we present the Biomedical Multi-Task Deep Neural Network (Bio-MTDNN) on the NLI task of the MediQA 2019 challenge. Bio-MTDNN utilizes a “transfer learning”-based paradigm where not only are the source and target domains different, but the source and target tasks are also varied, although related. Further, Bio-MTDNN integrates knowledge from external sources such as clinical databases (UMLS), enhancing its performance on the clinical domain. Our proposed method outperformed the official baseline and other prior models (such as ESIM and InferSent on the dev set) by a considerable margin, as evident from our experimental results.
