Ankita Gupta


2024

pdf bib
Evaluating Robustness of Open Dialogue Summarization Models in the Presence of Naturally Occurring Variations
Ankita Gupta | Chulaka Gunasekara | Hui Wan | Jatin Ganhotra | Sachindra Joshi | Marina Danilevsky
Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)

Dialogue summarization involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations). In this study, we systematically investigate the impact of such variations on state-of-the-art open dialogue summarization models whose details are publicly known (e.g., architectures, weights, and training corpora). To simulate real-life variations, we introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges (e.g., repetitions, greetings). We perform our analysis along three dimensions of robustness: consistency, saliency, and faithfulness, which aim to capture different aspects of performance of a summarization model. We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible, particularly to dialogue-level perturbations. We also validate our findings via human evaluation. Finally, we investigate whether the robustness of fine-tuned models can be improved by training them with a fraction of perturbed data. We find that this approach does not yield consistent performance gains, warranting further research. Overall, our work highlights robustness challenges in current open encoder-decoder summarization models and provides insights for future research.

pdf bib
Harnessing Toulmin’s theory for zero-shot argument explication
Ankita Gupta | Ethan Zuckerman | Brendan O’Connor
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

To better analyze informal arguments on public forums, we propose the task of argument explication, which makes explicit a text’s argumentative structure and implicit reasoning by outputting triples of propositions ⟨claim, reason warrant⟩. The three slots, or argument components, are derived from the widely known Toulmin (1958) model of argumentation. While prior research applies Toulmin or related theories to annotate datasets and train supervised models, we develop an effective method to prompt generative large language models (LMs) to output explicitly named argument components proposed by Toulmin by prompting with the theory name (e.g., ‘According to Toulmin model’). We evaluate the outputs’ coverage and validity through a human study and automatic evaluation based on prior argumentation datasets and perform robustness checks over alternative LMs, prompts, and argumentation theories. Finally, we conduct a proof-of-concept case study to extract an interpretable argumentation (hyper)graph from a large corpus of critical public comments on whether to allow the COVID-19 vaccine for children, suggesting future directions for corpus analysis and argument visualization.

pdf bib
KaPQA: Knowledge-Augmented Product Question-Answering
Swetha Eppalapally | Daksh Dangi | Chaithra Bhat | Ankita Gupta | Ruiyi Zhang | Shubham Agarwal | Karishma Bagga | Seunghyun Yoon | Nedim Lipka | Ryan Rossi | Franck Dernoncourt
Proceedings of the 3rd Workshop on Knowledge Augmented Methods for NLP

Question-answering for domain-specific applications has recently attracted much interest due to the latest advancements in large language models (LLMs). However, accurately assessing the performance of these applications remains a challenge, mainly due to the lack of suitable benchmarks that effectively simulate real-world scenarios. To address this challenge, we introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products to help evaluate the performance of existing models on domain-specific product QA tasks. Additionally, we propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task. Our experiments demonstrated that inducing domain knowledge through query reformulation allowed for increased retrieval and generative performance when compared to standard RAG-QA methods. This improvement, however, is slight, and thus illustrates the challenge posed by the datasets introduced.

pdf bib
NarrativeTime: Dense Temporal Annotation on a Timeline
Anna Rogers | Marzena Karpinska | Ankita Gupta | Vladislav Lialin | Gregory Smelkov | Anna Rumshisky
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLINKs. To compare with the previous SOTA in dense temporal annotation, we perform full re-annotation of the classic TimeBankDense corpus (American English), which shows comparable agreement with a signigicant increase in density. We contribute TimeBankNT corpus (with each text fully annotated by two expert annotators), extensive annotation guidelines, open-source tools for annotation and conversion to TimeML format, and baseline results.

2023

pdf bib
ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution
Ankita Gupta | Marzena Karpinska | Wenlong Zhao | Kalpesh Krishna | Jack Merullo | Luke Yeh | Mohit Iyyer | Brendan O’Connor
Findings of the Association for Computational Linguistics: EACL 2023

Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreferences and have been collected via complex and lengthy guidelines that are curated for linguistic experts. These concerns have sparked a growing interest among researchers to curate a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets. Surprisingly, we find that reasonable quality annotations were already achievable (90% agreement between the crowd and expert annotations) even without extensive training. On carefully analyzing the remaining disagreements, we identify the presence of linguistic cases that our annotators unanimously agree upon but lack unified treatments (e.g., generic pronouns, appositives) in existing datasets. We propose the research community should revisit these phenomena when curating future unified annotation guidelines.

2022

pdf bib
DEMETR: Diagnosing Evaluation Metrics for Translation
Marzena Karpinska | Nishant Raj | Katherine Thai | Yixiao Song | Ankita Gupta | Mohit Iyyer
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTScore is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics

pdf bib
Examining Political Rhetoric with Epistemic Stance Detection
Ankita Gupta | Su Lin Blodgett | Justin Gross | Brendan O’connor
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)

Participants in political discourse employ rhetorical strategies—such as hedging, attributions, or denials—to display varying degrees of belief commitments to claims proposed by themselves or others. Traditionally, political scientists have studied these epistemic phenomena through labor-intensive manual content analysis. We propose to help automate such work through epistemic stance prediction, drawn from research in computational semantics, to distinguish at the clausal level what is asserted, denied, or only ambivalently suggested by the author or other mentioned entities (belief holders). We first develop a simple RoBERTa-based model for multi-source stance predictions that outperforms more complex state-of-the-art modeling. Then we demonstrate its novel application to political science by conducting a large-scale analysis of the Mass Market Manifestos corpus of U.S. political opinion books, where we characterize trends in cited belief holders—respected allies and opposed bogeymen—across U.S. political ideologies.

2020

pdf bib
Solomon at SemEval-2020 Task 11: Ensemble Architecture for Fine-Tuned Propaganda Detection in News Articles
Mayank Raj | Ajay Jaiswal | Rohit R.R | Ankita Gupta | Sudeep Kumar Sahoo | Vertika Srivastava | Yeon Hyang Kim
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes our system (Solomon) details and results of participation in the SemEval 2020 Task 11 ”Detection of Propaganda Techniques in News Articles”. We participated in Task ”Technique Classification” (TC) which is a multi-class classification task. To address the TC task, we used RoBERTa based transformer architecture for fine-tuning on the propaganda dataset. The predictions of RoBERTa were further fine-tuned by class-dependent-minority-class classifiers. A special classifier, which employs dynamically adapted Least Common Sub-sequence algorithm, is used to adapt to the intricacies of repetition class. Compared to the other participating systems, our submission is ranked 4th on the leaderboard.

2019

pdf bib
Vernon-fenwick at SemEval-2019 Task 4: Hyperpartisan News Detection using Lexical and Semantic Features
Vertika Srivastava | Ankita Gupta | Divya Prakash | Sudeep Kumar Sahoo | Rohit R.R | Yeon Hyang Kim
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our submission for SemEval-2019 Task 4: Hyperpartisan News Detection. Hyperpartisan news articles are sharply polarized and extremely biased (onesided). It shows blind beliefs, opinions and unreasonable adherence to a party, idea, faction or a person. Through this task, we aim to develop an automated system that can be used to detect hyperpartisan news and serve as a prescreening technique for fake news detection. The proposed system jointly uses a rich set of handcrafted textual and semantic features. Our system achieved 2nd rank on the primary metric (82.0% accuracy) and 1st rank on the secondary metric (82.1% F1-score), among all participating teams. Comparison with the best performing system on the leaderboard shows that our system is behind by only 0.2% absolute difference in accuracy.

pdf bib
SolomonLab at SemEval-2019 Task 8: Question Factuality and Answer Veracity Prediction in Community Forums
Ankita Gupta | Sudeep Kumar Sahoo | Divya Prakash | Rohit R.R | Vertika Srivastava | Yeon Hyang Kim
Proceedings of the 13th International Workshop on Semantic Evaluation

We describe our system for SemEval-2019, Task 8 on “Fact-Checking in Community Question Answering Forums (cQA)”. cQA forums are very prevalent nowadays, as they provide an effective means for communities to share knowledge. Unfortunately, this shared information is not always factual and fact-verified. In this task, we aim to identify factual questions posted on cQA and verify the veracity of answers to these questions. Our approach relies on data augmentation and aggregates cues from several dimensions such as semantics, linguistics, syntax, writing style and evidence obtained from trusted external sources. In subtask A, our submission is ranked 3rd, with an accuracy of 83.14%. Our current best solution stands 1st on the leaderboard with 88% accuracy. In subtask B, our present solution is ranked 2nd, with 58.33% MAP score.

pdf bib
MSIT_SRIB at MEDIQA 2019: Knowledge Directed Multi-task Framework for Natural Language Inference in Clinical Domain.
Sahil Chopra | Ankita Gupta | Anupama Kaushik
Proceedings of the 18th BioNLP Workshop and Shared Task

In this paper, we present Biomedical Multi-Task Deep Neural Network (Bio-MTDNN) on the NLI task of MediQA 2019 challenge. Bio-MTDNN utilizes “transfer learning” based paradigm where not only the source and target domains are different but also the source and target tasks are varied, although related. Further, Bio-MTDNN integrates knowledge from external sources such as clinical databases (UMLS) enhancing its performance on the clinical domain. Our proposed method outperformed the official baseline and other prior models (such as ESIM and Infersent on dev set) by a considerable margin as evident from our experimental results.