Swabha Swayamdipta - ACL Anthology

Swabha Swayamdipta

2025

Improving Language Model Personas via Rationalization with Psychological Scaffolds
Brihi Joshi | Xiang Ren | Swabha Swayamdipta | Rik Koncel-Kedziorski | Tim Paek
Findings of the Association for Computational Linguistics: EMNLP 2025

Language models prompted with a user description or persona have been used to predict the user’s preferences and opinions. However, existing approaches to building personas mostly rely on a user’s demographic attributes and/or prior judgments, but not on any underlying reasoning behind a user’s judgments. We introduce PB&J (Psychology of Behavior and Judgments), a framework that improves LM personas by incorporating potential rationales for why the user could have made a certain judgment. Our rationales are generated by a language model to explicitly reason about a user’s behavior on the basis of their experiences, personality traits, or beliefs. Our method employs psychological scaffolds: structured frameworks such as the Big 5 Personality Traits or Primal World Beliefs to help ground the generated rationales in existing theories. Experiments on public opinion and movie preference prediction tasks demonstrate that language model personas augmented with PB&J rationales consistently outperform personas conditioned only on user demographics and / or judgments, including those that use a model’s default chain-of-thought, which is not grounded in psychological theories. Additionally, our PB&J personas perform competitively with those using human-written rationales, suggesting the potential value of synthetic rationales guided by existing theories.

Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge
Xinyue Cui | Johnny Wei | Swabha Swayamdipta | Robin Jia
Findings of the Association for Computational Linguistics: ACL 2025

Data watermarking in language models injects traceable signals, such as specific token sequences or stylistic patterns, into copyrighted text, allowing copyright holders to track and verify training data ownership. Previous data watermarking techniques primarily focus on effective memorization after pretraining, while overlooking challenges that arise in other stages of the LLM pipeline, such as the risk of watermark filtering during data preprocessing, or potential forgetting through post-training, or verification difficulties due to API-only access. We propose a novel data watermarking approach that injects coherent and plausible yet fictitious knowledge into training data using generated passages describing a fictitious entity and its associated attributes. Our watermarks are designed to be memorized by the LLM through seamlessly integrating in its training data, making them harder to detect lexically during preprocessing. We demonstrate that our watermarks can be effectively memorized by LLMs, and that increasing our watermarks’ density, length, and diversity of attributes strengthens their memorization. We further show that our watermarks remain robust throughout LLM development, maintaining their effectiveness after continual pretraining and supervised finetuning. Finally, we show that our data watermarks can be evaluated even under API-only access via question answering.

Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts)
Maria Lomeli | Swabha Swayamdipta | Rui Zhang
Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 5: Tutorial Abstracts)

Evaluating Evaluation Metrics – The Mirage of Hallucination Detection
Atharva Kulkarni | Yuan Zhang | Joel Ruben Antony Moniz | Xiou Ge | Bo-Hsiang Tseng | Dhivya Piraviperumal | Swabha Swayamdipta | Hong Yu
Findings of the Association for Computational Linguistics: EMNLP 2025

Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.

ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations
Brihi Joshi | Keyu He | Sahana Ramnath | Sadra Sabouri | Kaitlyn Zhou | Souti Chattopadhyay | Swabha Swayamdipta | Xiang Ren
Findings of the Association for Computational Linguistics: ACL 2025

Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K “Why” questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an “educator” to assess model explanations’ fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.

Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)
Bryan Eikema | Raúl Vázquez | Jonathan Berant | Marie-Catherine de Marneffe | Barbara Plank | Artem Shelmanov | Swabha Swayamdipta | Jörg Tiedemann | Chrysoula Zerva | Wilker Aziz
Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)

Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge
Xinyue Cui | Johnny Wei | Swabha Swayamdipta | Robin Jia
Proceedings of the First Workshop on Large Language Model Memorization (L2M2)

Data watermarking in language models injects traceable signals, such as specific token sequences or stylistic patterns, into copyrighted text, allowing copyright holders to track and verify training data ownership. Previous data watermarking techniques primarily focus on effective memorization during pretraining, while overlooking challenges that arise in other stages of the LLM lifecycle, such as the risk of watermark filtering during data preprocessing and verification difficulties due to API-only access. To address these challenges, we propose a novel data watermarking approach that injects plausible yet fictitious knowledge into training data using generated passages describing a fictitious entity and its associated attributes. Our watermarks are designed to be memorized by the LLM through seamlessly integrating in its training data, making them harder to detect lexically during preprocessing. We demonstrate that our watermarks can be effectively memorized by LLMs, and that increasing our watermarks’ density, length, and diversity of attributes strengthens their memorization. We further show that our watermarks remain effective after continual pretraining and supervised finetuning. Finally, we show that our data watermarks can be evaluated even under API-only access via question answering.

2024

Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)
Raúl Vázquez | Hande Celikkanat | Dennis Ulmer | Jörg Tiedemann | Swabha Swayamdipta | Wilker Aziz | Barbara Plank | Joris Baan | Marie-Catherine de Marneffe
Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024)

Annotating FrameNet via Structure-Conditioned Language Generation
Xinyue Cui | Swabha Swayamdipta
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Despite the remarkable generative capabilities of language models in producing naturalistic language, their effectiveness on explicit manipulation and generation of linguistic structures remains understudied. In this paper, we investigate the task of generating new sentences preserving a given semantic structure, following the FrameNet formalism. We propose a framework to produce novel frame-semantically annotated sentences following an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Our generated frame-semantic structured annotations are effective at training data augmentation for frame-semantic role labeling in low-resource settings; however, we do not see benefits under higher resource settings. Our study concludes that while generating high-quality, semantically rich data might be within reach, the downstream utility of such generations remains to be seen, highlighting the outstanding challenges with automating linguistic annotation tasks.

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Yang (Trista) Cao | Isabel Papadimitriou | Anaelia Ovalle | Marcos Zampieri | Francis Ferraro | Swabha Swayamdipta
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh | Tejas Srinivasan | Swabha Swayamdipta
Findings of the Association for Computational Linguistics: EMNLP 2024

Human evaluation of generated language through pairwise preference judgments is pervasive. However, under common scenarios, such as when generations from a model pair are very similar, or when stochastic decoding results in large variations in generations, it results in inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models, and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability allows insights into which test benchmarks are more valuable for comparing models. Finally, we incorporate separability into ELO ratings, accounting for how suitable each test instance might be for reliably ranking LLMs. Overall, separability has implications for consistent, efficient and robust preference evaluation of LLMs with both human- and auto-raters.
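
The idea of measuring how distinguishable two sets of sampled generations are can be illustrated with a minimal sketch. This is not the paper's definition of separability: the token-overlap similarity, the averaging scheme, and the names `jaccard` and `separability_sketch` are all assumptions made for illustration.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings (assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean(values):
    """Arithmetic mean that returns 0.0 for an empty sequence."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def separability_sketch(gens_a, gens_b):
    """Average within-model similarity minus cross-model similarity.

    Higher values mean the two sets of generations are easier to tell apart.
    Hypothetical scoring rule, not the measure defined in the paper.
    """
    within_a = mean(jaccard(x, y) for x, y in combinations(gens_a, 2))
    within_b = mean(jaccard(x, y) for x, y in combinations(gens_b, 2))
    cross = mean(jaccard(x, y) for x, y in product(gens_a, gens_b))
    return (within_a + within_b) / 2 - cross


# Hypothetical samples from two models for the same test instance.
model_a = ["the cat sat", "the cat sat down"]
model_b = ["stocks fell sharply", "stocks fell today"]

distinct_score = separability_sketch(model_a, model_b)
same_score = separability_sketch(model_a, model_a)
```

Under this sketch, an instance where both models produce near-identical samples scores lower than one where each model's samples cluster apart, which matches the intuition that the former is a poor candidate for pairwise preference rating.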

OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants
Jaspreet Ranjit | Brihi Joshi | Rebecca Dorn | Laura Petry | Olga Koumoundouros | Jayne Bottarini | Peichen Liu | Eric Rice | Swabha Swayamdipta
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Warning: Contents of this paper may be upsetting. Public attitudes towards key societal issues, expressed on online media, are of immense value in policy and reform efforts, yet challenging to understand at scale. We study one such social issue: homelessness in the U.S., by leveraging the remarkable capabilities of large language models to assist social work experts in analyzing millions of posts from Twitter. We introduce a framing typology: Online Attitudes Towards Homelessness (OATH) Frames: nine hierarchical frames capturing critiques, responses and perceptions. We release annotations with varying degrees of assistance from language models, with immense benefits in scaling: 6.5× speedup in annotation time while only incurring a 3 point F1 reduction in performance with respect to the domain experts. Our experiments demonstrate the value of modeling OATH-Frames over existing sentiment and toxicity classifiers. Our large-scale analysis with predicted OATH-Frames on 2.4M posts on homelessness reveals key trends in attitudes across states, time periods and vulnerable populations, enabling new insights on the issue. Our work provides a general framework to understand nuanced public attitudes at scale, on issues beyond homelessness.

NeuroComparatives: Neuro-Symbolic Distillation of Comparative Knowledge
Phillip Howard | Junlin Wang | Vasudev Lal | Gadi Singer | Yejin Choi | Swabha Swayamdipta
Findings of the Association for Computational Linguistics: NAACL 2024

Comparative knowledge (e.g., steel is stronger and heavier than styrofoam) is an essential component of our world knowledge, yet understudied in prior literature. In this paper, we harvest the dramatic improvements in knowledge capabilities of language models into a large-scale comparative knowledge base. While the ease of acquisition of such comparative knowledge is much higher from extreme-scale models like GPT-4, compared to their considerably smaller and weaker counterparts such as GPT-2, not even the most powerful models are exempt from making errors. We thus ask: to what extent are models at different scales able to generate valid and diverse comparative knowledge? We introduce NeuroComparatives, a novel framework for comparative knowledge distillation overgenerated from language models such as GPT-variants and LLaMA, followed by stringent filtering of the generated knowledge. Our framework acquires comparative knowledge between everyday objects, producing a corpus of up to 8.8M comparisons over 1.74M entity pairs - 10X larger and 30% more diverse than existing resources. Moreover, human evaluations show that NeuroComparatives outperform existing resources in terms of validity (up to 32% absolute improvement). Our acquired NeuroComparatives lead to performance improvements on five downstream tasks. We find that neuro-symbolic manipulation of smaller models offers complementary benefits to the currently dominant practice of prompting extreme-scale language models for knowledge distillation.

Out-of-Distribution Detection through Soft Clustering with Non-Negative Kernel Regression
Aryan Gulati | Xingjian Dong | Carlos Hurtado | Sarath Shekkizhar | Swabha Swayamdipta | Antonio Ortega
Findings of the Association for Computational Linguistics: EMNLP 2024

As language models become more general purpose, increased attention needs to be paid to detecting out-of-distribution (OOD) instances, i.e., those not belonging to any of the distributions seen during training. Existing methods for detecting OOD data are computationally complex and storage-intensive. We propose a novel soft clustering approach for OOD detection based on non-negative kernel regression. Our approach greatly reduces computational and space complexities (up to 11× improvement in inference time and 87% reduction in storage requirements). It outperforms existing approaches by up to 4 AUROC points on four benchmarks. We also introduce an entropy-constrained version of our algorithm, leading to further reductions in storage requirements (up to 97% lower than comparable approaches) while retaining competitive performance. Our soft clustering approach for OOD detection highlights its potential for detecting tail-end phenomena in extreme-scale data settings. Our source code is available on GitHub.

2023

REV: Information-Theoretic Evaluation of Free-Text Rationales
Hanjie Chen | Faeze Brahman | Xiang Ren | Yangfeng Ji | Yejin Choi | Swabha Swayamdipta
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generating free-text rationales is a promising step towards explainable NLP, yet evaluating such rationales remains a challenge. Existing metrics have mostly focused on measuring the association between the rationale and a given label. We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label. We investigate this research problem from an information-theoretic perspective using conditional V-information (Hewitt et al., 2021). More concretely, we propose a metric called REV (Rationale Evaluation with conditional V-information), to quantify the amount of new, label-relevant information in a rationale beyond the information already available in the input or the label. Experiments across four benchmarks with reasoning tasks, including chain-of-thought, demonstrate the effectiveness of REV in evaluating rationale-label pairs, compared to existing metrics. We further demonstrate REV is consistent with human judgments on rationale evaluations and provides more sensitive measurements of new information in free-text rationales. When used alongside traditional performance metrics, REV provides deeper insights into models’ reasoning and prediction processes.
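
The notion of "new, label-relevant information" in the abstract can be sketched as a difference of log-likelihoods. This is only an illustration under stated assumptions: it supposes two evaluator probabilities are already available, one for the label given the rationale and one for the label given a baseline that adds no new information. The function name `rev_sketch` and the example probabilities are hypothetical, and the paper's actual evaluator models are not reproduced here.

```python
import math


def rev_sketch(p_label_given_rationale, p_label_given_baseline):
    """Log-likelihood gain of the label when the rationale is provided.

    Positive values mean the rationale made the label more predictable than
    the baseline did; zero means it added nothing. Hypothetical helper that
    assumes both probabilities come from separately trained evaluators.
    """
    return math.log(p_label_given_rationale) - math.log(p_label_given_baseline)


# Hypothetical evaluator outputs for one (input, rationale, label) triple.
informative = rev_sketch(0.8, 0.4)    # rationale doubles the label probability
uninformative = rev_sketch(0.4, 0.4)  # rationale changes nothing
```

A score near zero corresponds to a rationale that only restates what the input or label already conveys, which is the case the abstract argues existing metrics fail to separate out.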

We’re Afraid Language Models Aren’t Modeling Ambiguity
Alisa Liu | Zhaofeng Wu | Julian Michael | Alane Suhr | Peter West | Alexander Koller | Swabha Swayamdipta | Noah Smith | Yejin Choi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Ambiguity is an intrinsic feature of natural language. Managing ambiguity is a key part of human language understanding, allowing us to anticipate misunderstanding as communicators and revise our interpretations as listeners. As language models are increasingly employed as dialogue interfaces and writing aids, handling ambiguous language is critical to their success. We capture ambiguity in a sentence through its effect on entailment relations with another sentence, and collect AmbiEnt, a linguist-annotated benchmark of 1,645 examples with diverse kinds of ambiguity. We design a suite of tests based on AmbiEnt, presenting the first evaluation of pretrained LMs to recognize ambiguity and disentangle possible meanings. We find that the task remains extremely challenging, including for GPT-4, whose generated disambiguations are considered correct only 32% of the time in crowdworker evaluation, compared to 90% for disambiguations in our dataset. Finally, to illustrate the value of ambiguity-sensitive tools, we show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity. We encourage the field to rediscover the importance of ambiguity for NLP.

I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation
Chandra Bhagavatula | Jena D. Hwang | Doug Downey | Ronan Le Bras | Ximing Lu | Lianhui Qin | Keisuke Sakaguchi | Swabha Swayamdipta | Peter West | Yejin Choi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Commonsense capabilities of pre-trained language models dramatically improve with scale, leading many to believe that scale is the only winning recipe. But is it? Here, we investigate an alternative that a priori seems impossible: can smaller language models (e.g., GPT-2) win over models that are orders of magnitude larger and better (e.g., GPT-3), if powered with novel commonsense distillation algorithms? The key intellectual challenge is to design a learning algorithm that achieves a competitive level of commonsense acquisition, without relying on the benefits of scale. In particular, we study generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly. We introduce I2D2, a novel commonsense distillation framework that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale teacher model with two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model’s own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-tomic, that is the largest and highest quality available to date.

COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements
Xuhui Zhou | Hao Zhu | Akhila Yerukola | Thomas Davidson | Jena D. Hwang | Swabha Swayamdipta | Maarten Sap
Findings of the Association for Computational Linguistics: ACL 2023

Warning: This paper contains content that may be offensive or upsetting. Understanding the harms and offensiveness of statements requires reasoning about the social and situational context in which statements are made. For example, the utterance “your English is very good” may implicitly signal an insult when uttered by a white man to a non-white colleague, but uttered by an ESL teacher to their student would be interpreted as a genuine compliment. Such contextual factors have been largely ignored by previous approaches to toxic language detection. We introduce COBRA frames, the first context-aware formalism for explaining the intents, reactions, and harms of offensive or biased statements grounded in their social and situational context. We create COBRACORPUS, a dataset of 33k potentially offensive statements paired with machine-generated contexts and free-text explanations of offensiveness, implied biases, speaker intents, and listener reactions. To study the contextual dynamics of offensiveness, we train models to generate COBRA explanations, with and without access to the context. We find that explanations by context-agnostic models are significantly worse than by context-aware ones, especially in situations where the context inverts the statement’s offensiveness (29% accuracy drop). Our work highlights the importance and feasibility of contextualized NLP by modeling social factors.

2022

Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection
Maarten Sap | Swabha Swayamdipta | Laura Vianna | Xuhui Zhou | Yejin Choi | Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The perceived toxicity of language can vary based on someone’s identity and beliefs, but this variation is often ignored when collecting toxic language datasets, resulting in dataset and model biases. We seek to understand the *who*, *why*, and *what* behind biases in toxicity annotations. In two online studies with demographically and politically diverse participants, we investigate the effect of annotator identities (*who*) and beliefs (*why*), drawing from social psychology research about hate speech, free speech, racist beliefs, political leaning, and more. We disentangle *what* is annotated as toxic by considering posts with three characteristics: anti-Black language, African American English (AAE) dialect, and vulgarity. Our results show strong associations between annotator identity and beliefs and their ratings of toxicity. Notably, more conservative annotators and those who scored highly on our scale for racist beliefs were less likely to rate anti-Black language as toxic, but more likely to rate AAE as toxic. We additionally present a case study illustrating how a popular toxicity detection system’s ratings inherently reflect only specific beliefs and perspectives. Our findings call for contextualizing toxicity labels in social variables, which raises immense implications for toxic language annotation and detection.

Reframing Human-AI Collaboration for Generating Free-Text Explanations
Sarah Wiegreffe | Jack Hessel | Swabha Swayamdipta | Mark Riedl | Yejin Choi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Large language models are increasingly capable of generating fluent-appearing text with relatively little task-specific supervision. But can these models accurately explain classification decisions? We consider the task of generating free-text explanations using human-written examples in a few-shot manner. We find that (1) authoring higher quality prompts results in higher quality generations; and (2) surprisingly, in a head-to-head comparison, crowdworkers often prefer explanations generated by GPT-3 to crowdsourced explanations in existing datasets. Our human studies also show, however, that while models often produce factual, grammatical, and sufficient explanations, they have room to improve along axes such as providing novel information and supporting the label. We create a pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop. Despite the intrinsic subjectivity of acceptability judgments, we demonstrate that acceptability is partially correlated with various fine-grained attributes of explanations. Our approach is able to consistently filter GPT-3-generated explanations deemed acceptable by humans.

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
Alisa Liu | Swabha Swayamdipta | Noah A. Smith | Yejin Choi
Findings of the Association for Computational Linguistics: EMNLP 2022

A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI for natural language inference (NLI), our approach uses dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers. The resulting dataset, WANLI, consists of 107,885 NLI examples and presents unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI improves performance on eight out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI, compared to training on the 4x larger MultiNLI. Moreover, it continues to be more effective than MultiNLI augmented with other NLI datasets. Our results demonstrate the promise of leveraging natural language generation techniques and re-imagining the role of humans in the dataset creation process.
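
The dataset cartography step mentioned above relies on per-example training dynamics. A minimal sketch follows, assuming the usual cartography statistics: confidence as the mean probability assigned to the gold label across epochs, and variability as its standard deviation. The example data, the name `cartography_stats`, and the choice to rank by variability are illustrative assumptions, not the selection rule used to build WANLI.

```python
from statistics import mean, pstdev


def cartography_stats(gold_probs):
    """Return (confidence, variability) for one training example.

    gold_probs holds the model's probability of the gold label, recorded
    once per training epoch.
    """
    return mean(gold_probs), pstdev(gold_probs)


# Hypothetical per-epoch gold-label probabilities for three examples.
training_dynamics = {
    "easy": [0.90, 0.95, 0.97],
    "ambiguous": [0.20, 0.80, 0.40],
    "hard": [0.05, 0.10, 0.08],
}

stats = {name: cartography_stats(p) for name, p in training_dynamics.items()}

# Illustrative selection: pick the example whose predictions fluctuate most.
most_variable = max(stats, key=lambda name: stats[name][1])
```

Examples selected this way would then serve as seeds for generating new examples with similar patterns, followed by the filtering and human revision stages the abstract describes.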

NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation
Phillip Howard | Gadi Singer | Vasudev Lal | Yejin Choi | Swabha Swayamdipta
Findings of the Association for Computational Linguistics: EMNLP 2022

While counterfactual data augmentation offers a promising step towards robust generalization in natural language processing, producing a set of counterfactuals that offer valuable inductive bias for models remains a challenge. Most existing approaches for producing counterfactuals, manual or automated, rely on small perturbations via minimal edits, resulting in simplistic changes. We introduce NeuroCounterfactuals, designed as loose counterfactuals, allowing for larger edits which result in naturalistic generations containing linguistic diversity, while still bearing similarity to the original document. Our novel generative approach bridges the benefits of constrained decoding, with those of language model adaptation for sentiment steering. Training data augmentation with our generations results in both in-domain and out-of-domain improvements for sentiment classification, outperforming even manually curated counterfactuals, under select settings. We further present detailed analyses to show the advantages of NeuroCounterfactuals over approaches involving simple, minimal edits.

Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
Colin Cherry | Angela Fan | George Foster | Gholamreza (Reza) Haffari | Shahram Khadivi | Nanyun (Violet) Peng | Xiang Ren | Ehsan Shareghi | Swabha Swayamdipta
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Investigating the Benefits of Free-Form Rationales
Jiao Sun | Swabha Swayamdipta | Jonathan May | Xuezhe Ma
Findings of the Association for Computational Linguistics: EMNLP 2022

Free-form rationales aim to aid model interpretability by supplying the background knowledge that can help understand model decisions. Crowdsourced rationales are provided for commonsense QA instances in popular datasets such as CoS-E and ECQA, but their utility remains under-investigated. We present human studies which show that ECQA rationales indeed provide additional background information to understand a decision, while over 88% of CoS-E rationales do not. Inspired by this finding, we ask: can the additional context provided by free-form rationales benefit models, similar to human users? We investigate the utility of rationales as an additional source of supervision, by varying the quantity and quality of rationales during training. After controlling for instances where rationales leak the correct answer while not providing additional background knowledge, we find that incorporating only 5% of rationales during training can boost model performance by 47.22% for CoS-E and 57.14% for ECQA during inference. Moreover, we also show that rationale quality matters: compared to crowdsourced rationales, T5-generated rationales provide not only weaker supervision to models, but are also not helpful for humans in aiding model interpretability.

2021

Sister Help: Data Augmentation for Frame-Semantic Role Labeling
Ayush Pancholy | Miriam R L Petruck | Swabha Swayamdipta
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop

While FrameNet is widely regarded as a rich resource of semantics in natural language processing, a major criticism concerns its lack of coverage and the relative paucity of its labeled data compared to other commonly used lexical resources such as PropBank and VerbNet. This paper reports on a pilot study to address these gaps. We propose a data augmentation approach, which uses existing frame-specific annotation to automatically annotate other lexical units of the same frame which are unannotated. Our rule-based approach defines the notion of a **sister lexical unit** and generates frame-specific augmented data for training. We present experiments on frame-semantic role labeling which demonstrate the importance of this data augmentation: we obtain a large improvement to prior results on frame identification and argument identification for FrameNet, utilizing both full-text and lexicographic annotations under FrameNet. Our findings on data augmentation highlight the value of automatic resource creation for improved models in frame-semantic parsing.

DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Alisa Liu | Maarten Sap | Ximing Lu | Swabha Swayamdipta | Chandra Bhagavatula | Noah A. Smith | Yejin Choi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Despite recent advances in natural language generation, it remains challenging to control attributes of generated text. We propose DExperts: Decoding-time Experts, a decoding-time method for controlled text generation that combines a pretrained language model with “expert” LMs and/or “anti-expert” LMs in a product of experts. Intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts, and unlikely by the anti-experts. We apply DExperts to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. Moreover, because DExperts operates only on the output of the pretrained LM, it is effective with (anti-)experts of smaller size, including when operating on GPT-3. Our work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.
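
The ensemble intuition in this abstract can be illustrated with a minimal sketch. It assumes the product of experts is formed by adding, to the base model's next-token logits, a weighted difference between expert and anti-expert logits. The function names, the weight `alpha`, and the toy three-token vocabulary are hypothetical and are not taken from the listing above.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(base, expert, anti_expert, alpha=1.0):
    """Product of experts in log space (assumed form): base + alpha * (expert - anti)."""
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy next-token logits over a three-token vocabulary (made-up numbers).
base = [2.0, 2.0, 0.0]    # base LM rates tokens 0 and 1 equally
expert = [1.0, 3.0, 0.0]  # expert prefers token 1
anti = [3.0, 1.0, 0.0]    # anti-expert prefers token 0

probs_base = softmax(base)
probs_steered = softmax(combine_logits(base, expert, anti, alpha=1.0))
```

In this toy case token 1 gains probability mass because the expert favors it and the anti-expert does not, while token 0 loses mass for the opposite reason.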

Challenges in Automated Debiasing for Toxic Language Detection
Xuhui Zhou | Maarten Sap | Swabha Swayamdipta | Yejin Choi | Noah Smith
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.

Contrastive Explanations for Model Interpretability
Alon Jacovi | Swabha Swayamdipta | Shauli Ravfogel | Yanai Elazar | Yejin Choi | Yoav Goldberg
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Contrastive explanations clarify why an event occurred in contrast to another. They are inherently intuitive to humans to both produce and comprehend. We propose a method to produce contrastive explanations in the latent space, via a projection of the input representation, such that only the features that differentiate two potential decisions are captured. Our modification allows model behavior to consider only contrastive reasoning, and uncover which aspects of the input are useful for and against particular decisions. Our contrastive explanations can additionally answer for which label, and against which alternative label, is a given input feature useful. We produce contrastive explanations via both high-level abstract concept attribution and low-level input token/span attribution for two NLP classification benchmarks. Our findings demonstrate the ability of label-contrastive explanations to provide fine-grained interpretability of model decisions.

2020

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
Suchin Gururangan | Ana Marasović | Swabha Swayamdipta | Kyle Lo | Iz Beltagy | Doug Downey | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

Generative Data Augmentation for Commonsense Reasoning
Yiben Yang | Chaitanya Malaviya | Jared Fernandez | Swabha Swayamdipta | Ronan Le Bras | Ji-Ping Wang | Chandra Bhagavatula | Yejin Choi | Doug Downey
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent advances in commonsense reasoning depend on large-scale human-annotated training sets to achieve peak performance. However, manual curation of training sets is expensive and has been shown to introduce annotation artifacts that neural models can readily exploit and overfit to. We propose a novel generative data augmentation technique, G-DAUG^C, that aims to achieve more accurate and robust learning in a low-resource setting. Our approach generates synthetic examples using pretrained language models and selects the most informative and diverse set of examples for data augmentation. On experiments with multiple commonsense reasoning benchmarks, G-DAUG^C consistently outperforms existing data augmentation methods based on back-translation, establishing a new state-of-the-art on WinoGrande, CODAH, and CommonsenseQA. It also enhances out-of-distribution generalization, proving to be robust against adversaries or perturbations. Our analysis demonstrates that G-DAUG^C produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.

The Right Tool for the Job: Matching Model and Instance Complexities
Roy Schwartz | Gabriel Stanovsky | Swabha Swayamdipta | Jesse Dodge | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) “exit” from neural network calculations for simple instances, and late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their calibrated confidence scores to make early exit decisions. We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks. Our method presents a favorable speed/accuracy tradeoff in almost all cases, producing models which are up to five times faster than the state of the art, while preserving their accuracy. Our method also requires almost no additional training resources (in either time or parameters) compared to the baseline BERT model. Finally, our method alleviates the need for costly retraining of multiple models at different levels of efficiency; we allow users to control the inference speed/accuracy tradeoff using a single trained model, by setting a single variable at inference time. We publicly release our code.
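
The early-exit decision described in this abstract can be sketched as a loop over per-layer classifier outputs that stops at the first layer whose confidence clears a threshold. This is an illustrative sketch only: `early_exit`, the `threshold` parameter, and the toy probability lists are hypothetical, and the per-layer probabilities are assumed to be already calibrated.

```python
def early_exit(layer_outputs, threshold):
    """Return (predicted_label, exit_depth) for one instance.

    layer_outputs: iterable of class-probability lists, one per classifier,
        ordered from the shallowest layer to the deepest. A generator can be
        passed so that deeper layers are only computed when needed.
    threshold: confidence needed to stop early (the single inference-time knob).
    """
    result = None
    for depth, probs in enumerate(layer_outputs, start=1):
        label = max(range(len(probs)), key=probs.__getitem__)
        result = (label, depth)
        if probs[label] >= threshold:
            break  # confident enough: skip the remaining layers
    if result is None:
        raise ValueError("layer_outputs must contain at least one classifier output")
    return result


# Toy calibrated outputs from three classifiers (made-up numbers).
easy_instance = [[0.95, 0.05], [0.99, 0.01], [0.99, 0.01]]
hard_instance = [[0.55, 0.45], [0.40, 0.60], [0.20, 0.80]]

easy_result = early_exit(easy_instance, threshold=0.9)
hard_result = early_exit(hard_instance, threshold=0.9)
```

Lowering `threshold` makes more instances exit at shallow layers, trading accuracy for speed; if no layer is confident enough, the deepest classifier's prediction is used.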

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Swabha Swayamdipta | Roy Schwartz | Nicholas Lourie | Yizhong Wang | Hannaneh Hajishirzi | Noah A. Smith | Yejin Choi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps—a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example—the model’s confidence in the true class, and the variability of this confidence across epochs—obtained in a single run of training. Experiments on four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of “ambiguous” regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are “easy to learn” for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds “hard to learn”; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
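
The two per-example measures named in this abstract can be sketched from logged training dynamics. The sketch assumes that confidence is the mean, across epochs, of the probability the model assigns to the true class, and that variability is the population standard deviation of the same values. The function name and the toy probabilities are hypothetical.

```python
from statistics import fmean, pstdev


def data_map_coordinates(true_class_probs):
    """Compute (confidence, variability) for one training example.

    true_class_probs: probability assigned to the gold label at the end of
        each epoch, all collected during a single training run.
    """
    confidence = fmean(true_class_probs)    # assumed: mean over epochs
    variability = pstdev(true_class_probs)  # assumed: population std over epochs
    return confidence, variability


# Toy training dynamics for three examples over four epochs (made-up numbers).
dynamics = {
    "consistently_right": [0.90, 0.95, 0.97, 0.98],
    "fluctuating": [0.20, 0.80, 0.30, 0.70],
    "consistently_wrong": [0.05, 0.10, 0.08, 0.07],
}

coordinates = {name: data_map_coordinates(p) for name, p in dynamics.items()}
```

Plotting each example by these two coordinates gives the kind of map the abstract describes: high confidence with low variability, high variability, and low confidence with low variability fall in different regions.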

2019

Transfer Learning in Natural Language Processing
Sebastian Ruder | Matthew E. Peters | Swabha Swayamdipta | Thomas Wolf
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials

The classic supervised machine learning paradigm is based on learning in isolation, a single predictive model for a task using a single dataset. This approach requires a large number of training examples and performs best for well-defined and narrow tasks. Transfer learning refers to a set of methods that extend this approach by leveraging data from additional domains or tasks to train a model with better generalization properties. Over the last two years, the field of Natural Language Processing (NLP) has witnessed the emergence of several transfer learning methods and architectures which significantly improved upon the state-of-the-art on a wide range of NLP tasks. These improvements together with the wide availability and ease of integration of these methods are reminiscent of the factors that led to the success of pretrained word embeddings and ImageNet pretraining in computer vision, and indicate that these methods will likely become a common tool in the NLP landscape as well as an important research direction. We will present an overview of modern transfer learning methods in NLP, how models are pre-trained, what information the representations they learn capture, and review examples and case studies on how these models can be integrated and adapted in downstream NLP tasks.

Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)
Colin Cherry | Greg Durrett | George Foster | Reza Haffari | Shahram Khadivi | Nanyun Peng | Xiang Ren | Swabha Swayamdipta
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

2018

Learning Joint Semantic Parsers from Disjoint Data
Hao Peng | Sam Thomson | Swabha Swayamdipta | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present a new approach to learning a semantic parser from multiple datasets, even when the target semantic formalisms are drastically different and the underlying corpora do not overlap. We handle such “disjoint” data by treating annotations for unobserved formalisms as latent structured variables. Building on state-of-the-art baselines, we show improvements both in frame-semantic parsing and semantic dependency parsing by modeling them jointly.

Annotation Artifacts in Natural Language Inference Data
Suchin Gururangan | Swabha Swayamdipta | Omer Levy | Roy Schwartz | Samuel Bowman | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2017). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.

Polyglot Semantic Role Labeling
Phoebe Mulcaire | Swabha Swayamdipta | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Previous approaches to multilingual semantic dependency parsing treat languages independently, without exploiting the similarities between semantic structures across languages. We experiment with a new approach where we combine resources from different languages in the CoNLL 2009 shared task to build a single polyglot semantic dependency parser. Notwithstanding the absence of parallel data, and the dissimilarity in annotations between languages, our approach results in improvement in parsing performance on several languages over a monolingual baseline. Analysis of the polyglot models’ performance provides a new understanding of the similarities and differences between languages in the shared task.

Syntactic Scaffolds for Semantic Structures
Swabha Swayamdipta | Sam Thomson | Kenton Lee | Luke Zettlemoyer | Chris Dyer | Noah A. Smith
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce the syntactic scaffold, an approach to incorporating syntactic information into semantic tasks. Syntactic scaffolds avoid expensive syntactic processing at runtime, only making use of a treebank during training, through a multitask objective. We improve over strong baselines on PropBank semantics, frame semantics, and coreference resolution, achieving competitive performance on all three tasks.

Frame Semantics across Languages: Towards a Multilingual FrameNet
Collin F. Baker | Michael Ellsworth | Miriam R. L. Petruck | Swabha Swayamdipta
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts

2016

Greedy, Joint Syntactic-Semantic Parsing with Stack LSTMs
Swabha Swayamdipta | Miguel Ballesteros | Chris Dyer | Noah A. Smith
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

2014

A Dependency Parser for Tweets
Lingpeng Kong | Nathan Schneider | Swabha Swayamdipta | Archna Bhatia | Chris Dyer | Noah A. Smith
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

CMU: Arc-Factored, Discriminative Semantic Dependency Parsing
Sam Thomson | Brendan O’Connor | Jeffrey Flanigan | David Bamman | Jesse Dodge | Swabha Swayamdipta | Nathan Schneider | Chris Dyer | Noah A. Smith
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

The CMU Machine Translation Systems at WMT 2014
Austin Matthews | Waleed Ammar | Archna Bhatia | Weston Feely | Greg Hanneman | Eva Schlinger | Swabha Swayamdipta | Yulia Tsvetkov | Alon Lavie | Chris Dyer
Proceedings of the Ninth Workshop on Statistical Machine Translation

Co-authors

Chandra Bhagavatula 3

Archna Bhatia 2

George Foster 2

Suchin Gururangan 2

Phillip Howard 2

Jena D. Hwang 2

Shahram Khadivi 2

Ronan Le Bras 2

Miriam R. L. Petruck 2

Barbara Plank 2

Nathan Schneider 2

Jörg Tiedemann 2

Raúl Vázquez 2

Marie-Catherine de Marneffe 2

Collin F. Baker 1

Miguel Ballesteros 1

Jonathan Berant 1

Jayne Bottarini 1

Samuel Bowman 1

Faeze Brahman 1

Yang (Trista) Cao 1

Hande Celikkanat 1

Souti Chattopadhyay 1

Thomas Davidson 1

Xingjian Dong 1

Michael Ellsworth 1

Jared Fernandez 1

Francis Ferraro 1

Jeffrey Flanigan 1

Yoav Goldberg 1

Gholamreza Haffari 1

Gholamreza (Reza) Haffari 1

Hannaneh Hajishirzi 1

Greg Hanneman 1

Carlos Hurtado 1

Alexander Koller 1

Rik Koncel-Kedziorski 1

Lingpeng Kong 1

Olga Koumoundouros 1

Atharva Kulkarni 1

Nicholas Lourie 1

Chaitanya Malaviya 1

Ana Marasović 1

Austin Matthews 1

Julian Michael 1

Joel Ruben Antony Moniz 1

Phoebe Mulcaire 1

Antonio Ortega 1

Anaelia Ovalle 1

Brendan O’Connor 1

Ayush Pancholy 1

Isabel Papadimitriou 1

Nanyun (Violet) Peng 1

Matthew E. Peters 1

Dhivya Piraviperumal 1

Sahana Ramnath 1

Jaspreet Ranjit 1

Shauli Ravfogel 1

Sebastian Ruder 1

Sadra Sabouri 1

Keisuke Sakaguchi 1

Eva Schlinger 1

Ehsan Shareghi 1

Sarath Shekkizhar 1

Artem Shelmanov 1

Tejas Srinivasan 1

Gabriel Stanovsky 1

Bo-Hsiang Tseng 1

Yulia Tsvetkov 1

Sarah Wiegreffe 1

Akhila Yerukola 1

Marcos Zampieri 1

Chrysoula Zerva 1

Luke Zettlemoyer 1

Venues