Emily Allaway - ACL Anthology

Emily Allaway

2026

Analyzing LLM Instruction Optimization for Tabular Fact Verification
Xiaotang Du | Giwon Hong | Wai-Chung Kwan | Rohit Saxena | Ivan Titov | Pasquale Minervini | Emily Allaway
Findings of the Association for Computational Linguistics: EACL 2026

Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework—COPRO, MiPROv2, and SIMBA—across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.

2025

Proceedings of the 9th Widening NLP Workshop
Chen Zhang | Emily Allaway | Hua Shen | Lesly Miculicich | Yinqiao Li | Meryem M'hamdi | Peerat Limkonchotiwat | Richard He Bai | Santosh T.Y.S.S. | Sophia Simeng Han | Surendrabikram Thapa | Wiem Ben Rim
Proceedings of the 9th Widening NLP Workshop

VISaGE: Understanding Visual Generics and Exceptions
Stella Frank | Emily Allaway
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.

Evaluating Defeasible Reasoning in LLMs with DEFREASING
Emily Allaway | Kathleen McKeown
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

MGEN: Millions of Naturally Occurring Generics in Context
Gustavo Cilleruelo | Emily Allaway | Barry Haddow | Alexandra Birch
Proceedings of the Society for Computation in Linguistics 2025

Generics are puzzling. Can language models find the missing piece?
Gustavo Cilleruelo | Emily Allaway | Barry Haddow | Alexandra Birch
Proceedings of the 31st International Conference on Computational Linguistics

Generic sentences express generalisations about the world without explicit quantification. Although generics are central to everyday communication, building a precise semantic framework has proven difficult, in part because speakers use generics to generalise properties with widely different statistical prevalence. In this work, we study the implicit quantification and context-sensitivity of generics by leveraging language models as models of language. We create ConGen, a dataset of 2873 naturally occurring generic and quantified sentences in context, and define p-acceptability, a metric based on surprisal that is sensitive to quantification. Our experiments show generics are more context-sensitive than determiner quantifiers and about 20% of naturally occurring generics we analyze express weak generalisations. We also explore how human biases in stereotypes can be observed in language models.

2024

Exceptions, Instantiations, and Overgeneralization: Insights into How Language Models Process Generics
Emily Allaway | Chandra Bhagavatula | Jena D. Hwang | Kathleen McKeown | Sarah-Jane Leslie
Computational Linguistics, Volume 50, Issue 4 - December 2024

Large language models (LLMs) have garnered a great deal of attention for their exceptional generative performance on commonsense and reasoning tasks. In this work, we investigate LLMs’ capabilities for generalization using a particularly challenging type of statement: generics. Generics express generalizations (e.g., birds can fly) but do so without explicit quantification. They are notable because they generalize over their instantiations (e.g., sparrows can fly) yet hold true even in the presence of exceptions (e.g., penguins do not). For humans, these generic generalizations play a fundamental role in cognition, concept acquisition, and intuitive reasoning. We investigate how LLMs respond to and reason about generics. To this end, we first propose a framework grounded in pragmatics to automatically generate both exceptions and instantiations – collectively exemplars. We make use of focus—a pragmatic phenomenon that highlights meaning-bearing elements in a sentence—to capture the full range of interpretations of generics across different contexts of use. This allows us to derive precise logical definitions for exemplars and operationalize them to automatically generate exemplars from LLMs. Using our system, we generate a dataset of ∼370k exemplars across ∼17k generics and conduct a human validation of a sample of the generated data. We use our final generated dataset to investigate how LLMs reason about generics. Humans have a documented tendency to conflate universally quantified statements (e.g., all birds can fly) with generics. Therefore, we probe whether LLMs exhibit similar overgeneralization behavior in terms of quantification and in property inheritance. We find that LLMs do show evidence of overgeneralization, although they sometimes struggle to reason about exceptions. Furthermore, we find that LLMs may exhibit similar non-logical behavior to humans when considering property inheritance from generics.

2023

Beyond Denouncing Hate: Strategies for Countering Implied Biases and Stereotypes in Language
Jimin Mun | Emily Allaway | Akhila Yerukola | Laura Vianna | Sarah-Jane Leslie | Maarten Sap
Findings of the Association for Computational Linguistics: EMNLP 2023

Counterspeech, i.e., responses to counteract potential harms of hateful speech, has become an increasingly popular solution to address online hate speech without censorship. However, properly countering hateful language requires countering and dispelling the underlying inaccurate stereotypes implied by such language. In this work, we draw from psychology and philosophy literature to craft six psychologically inspired strategies to challenge the underlying stereotypical implications of hateful language. We first examine the convincingness of each of these strategies through a user study, and then compare their usages in both human- and machine-generated counterspeech datasets. Our results show that human-written counterspeech uses countering strategies that are more specific to the implied stereotype (e.g., counter examples to the stereotype, external factors about the stereotype’s origins), whereas machine-generated counterspeech uses less specific strategies (e.g., generally denouncing the hatefulness of speech). Furthermore, machine-generated counterspeech often employs strategies that humans deem less convincing compared to human-produced counterspeech. Our findings point to the importance of accounting for the underlying stereotypical implications of speech when generating counterspeech and for better machine reasoning about anti-stereotypical examples.

Penguins Don’t Fly: Reasoning about Generics through Instantiations and Exceptions
Emily Allaway | Jena D. Hwang | Chandra Bhagavatula | Kathleen McKeown | Doug Downey | Yejin Choi
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Generics express generalizations about the world (e.g., birds can fly) that are not universally true (e.g., newborn birds and penguins cannot fly). Commonsense knowledge bases, used extensively in NLP, encode some generic knowledge but rarely enumerate such exceptions, and knowing when a generic statement holds or does not hold true is crucial for developing a comprehensive understanding of generics. We present a novel framework informed by linguistic theory to generate exemplars—specific cases when a generic holds true or false. We generate ~19k exemplars for ~650 generics and show that our framework outperforms a strong GPT-3 baseline by 12.8 precision points. Our analysis highlights the importance of linguistic theory-based controllability for generating exemplars, the insufficiency of knowledge bases as a source of exemplars, and the challenges exemplars pose for the task of natural language inference.

2022

Mitigating Covertly Unsafe Text within Natural Language Systems
Alex Mei | Anisha Kabir | Sharon Levy | Melanie Subbiah | Emily Allaway | John Judge | Desmond Patton | Bruce Bimber | Kathleen McKeown | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2022

An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system’s information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.

SafeText: A Benchmark for Exploring Physical Safety in Language Models
Sharon Levy | Emily Allaway | Melanie Subbiah | Lydia Chilton | Desmond Patton | Kathleen McKeown | William Yang Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.

Legal and Political Stance Detection of SCOTUS Language
Noah Bergam | Emily Allaway | Kathleen McKeown
Proceedings of the Natural Legal Language Processing Workshop 2022

We analyze publicly available US Supreme Court documents using automated stance detection. In the first phase of our work, we investigate the extent to which the Court’s public-facing language is political. We propose and calculate two distinct ideology metrics of SCOTUS justices using oral argument transcripts. We then compare these language-based metrics to existing social scientific measures of the ideology of the Supreme Court and the public. Through this cross-disciplinary analysis, we find that justices who are more responsive to public opinion tend to express their ideology during oral arguments. This observation provides a new kind of evidence in favor of the attitudinal change hypothesis of Supreme Court justice behavior. As a natural extension of this political stance detection, we propose the more specialized task of legal stance detection with our new dataset SC-stance, which matches written opinions to legal questions. We find competitive performance on this dataset using language adapters trained on legal documents.

Seeded Hierarchical Clustering for Expert-Crafted Taxonomies
Anish Saha | Amith Ananthram | Emily Allaway | Heng Ji | Kathleen McKeown
Findings of the Association for Computational Linguistics: EMNLP 2022

Practitioners from many disciplines (e.g., political science) use expert-crafted taxonomies to make sense of large, unlabeled corpora. In this work, we study Seeded Hierarchical Clustering (SHC): the task of automatically fitting unlabeled data to such taxonomies using a small set of labeled examples. We propose HierSeed, a novel weakly supervised algorithm for this task that uses only a small set of labeled seed examples in a computation- and data-efficient manner. HierSeed assigns documents to topics by weighing document density against topic hierarchical structure. It outperforms unsupervised and supervised baselines for the SHC task on three real-world datasets.

Mapping the Multilingual Margins: Intersectional Biases of Sentiment Analysis Systems in English, Spanish, and Arabic
António Câmara | Nina Taneja | Tamjeed Azad | Emily Allaway | Richard Zemel
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

As natural language processing systems become more widespread, it is necessary to address fairness issues in their implementation and deployment to ensure that their negative impacts on society are understood and minimized. However, there is limited work that studies fairness using a multilingual and intersectional framework or on downstream tasks. In this paper, we introduce four multilingual Equity Evaluation Corpora, supplementary test sets designed to measure social biases, and a novel statistical framework for studying unisectional and intersectional social biases in natural language processing. We use these tools to measure gender, racial, ethnic, and intersectional social biases across five models trained on emotion regression tasks in English, Spanish, and Arabic. We find that many systems demonstrate statistically significant unisectional and intersectional social biases. We make our code and datasets available for download.

2021

Adversarial Learning for Zero-Shot Stance Detection on Social Media
Emily Allaway | Malavika Srikanth | Kathleen McKeown
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Stance detection on social media can help to identify and understand slanted news or commentary in everyday life. In this work, we propose a new model for zero-shot stance detection on Twitter that uses adversarial learning to generalize across topics. Our model achieves state-of-the-art performance on a number of unseen test topics with minimal computational costs. In addition, we extend zero-shot stance detection to topics not previously considered, highlighting future directions for zero-shot transfer.

Sequential Cross-Document Coreference Resolution
Emily Allaway | Shuai Wang | Miguel Ballesteros
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Relating entities and events in text is a key component of natural language understanding. Cross-document coreference resolution, in particular, is important for the growing interest in multi-document analysis tasks. In this work we propose a new model that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings and achieves competitive results for both entity and event coreference while providing strong evidence of the efficacy of both sequential models and higher-order inference in cross-document settings. Our model incrementally composes mentions into cluster representations and predicts links between a mention and the already constructed clusters, approximating a higher-order model. In addition, we conduct extensive ablation studies that provide new insights into the importance of various inputs and representation types in coreference.

A Unified Feature Representation for Lexical Connotations
Emily Allaway | Kathleen McKeown
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Ideological attitudes and stance are often expressed through subtle meanings of words and phrases. Understanding these connotations is critical to recognizing the cultural and emotional perspectives of the speaker. In this paper, we use distant labeling to create a new lexical resource representing connotation aspects for nouns and adjectives. Our analysis shows that it aligns well with human judgments. Additionally, we present a method for creating lexical representations that capture connotations within the embedding space and show that using the embeddings provides a statistically significant improvement on the task of stance detection when data is limited.

Does Putting a Linguist in the Loop Improve NLU Data Collection?
Alicia Parrish | William Huang | Omar Agha | Soo-Hwan Lee | Nikita Nangia | Alexia Warstadt | Karmanya Aggarwal | Emily Allaway | Tal Linzen | Samuel R. Bowman
Findings of the Association for Computational Linguistics: EMNLP 2021

Many crowdsourced NLP datasets contain systematic artifacts that are identified only after data collection is complete. Earlier identification of these issues should make it easier to create high-quality training and evaluation data. We attempt this by evaluating protocols in which expert linguists work ‘in the loop’ during data collection to identify and address these issues by adjusting task instructions and incentives. Using natural language inference as a test case, we compare three data collection protocols: (i) a baseline protocol with no linguist involvement, (ii) a linguist-in-the-loop intervention with iteratively-updated constraints on the writing task, and (iii) an extension that adds direct interaction between linguists and crowdworkers via a chatroom. We find that linguist involvement does not lead to increased accuracy on out-of-domain test sets compared to baseline, and adding a chatroom has no effect on the data. Linguist involvement does, however, lead to more challenging evaluation data and higher accuracy on some challenge sets, demonstrating the benefits of integrating expert analysis during data collection.

Human Rationales as Attribution Priors for Explainable Stance Detection
Sahil Jayaram | Emily Allaway
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

As NLP systems become better at detecting opinions and beliefs from text, it is important to ensure not only that models are accurate but also that they arrive at their predictions in ways that align with human reasoning. In this work, we present a method for imparting human-like rationalization to a stance detection model using crowdsourced annotations on a small fraction of the training data. We show that in a data-scarce setting, our approach can improve the reasoning of a state-of-the-art classifier—particularly for inputs containing challenging phenomena such as sarcasm—at no cost in predictive performance. Furthermore, we demonstrate that attention weights surpass a leading attribution method in providing faithful explanations of our model’s predictions, thus serving as a computationally cheap and reliable source of attributions for our model.

2020

Event-Guided Denoising for Multilingual Relation Learning
Amith Ananthram | Emily Allaway | Kathleen McKeown
Proceedings of the 28th International Conference on Computational Linguistics

General-purpose relation extraction has recently seen considerable gains in part due to a massively data-intensive distant supervision technique from Soares et al. (2019) that produces state-of-the-art results across many benchmarks. In this work, we present a methodology for collecting high-quality training data for relation extraction from unlabeled text that achieves a near-recreation of their zero-shot and few-shot results at a fraction of the training cost. Our approach exploits the predictable distributional structure of date-marked news articles to build a denoised corpus – the extraction process filters out low-quality examples. We show that a smaller multilingual encoder trained on this corpus performs comparably to the current state-of-the-art (when both receive little to no fine-tuning) on few-shot and standard relation benchmarks in English and Spanish despite using many fewer examples (50k vs. 300mil+).

Zero-Shot Stance Detection: A Dataset and Model using Generalized Topic Representations
Emily Allaway | Kathleen McKeown
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Stance detection is an important component of understanding hidden influences in everyday life. Since there are thousands of potential topics to take a stance on, most with little to no training data, we focus on zero-shot stance detection: classifying stance from no training examples. In this paper, we present a new dataset for zero-shot stance detection that captures a wider range of topics and lexical variation than in previous datasets. Additionally, we propose a new model for stance detection that implicitly captures relationships between topics using generalized topic representations and show that this model improves performance on a number of challenging linguistic phenomena.

2018

Event2Mind: Commonsense Inference on Events, Intents, and Reactions
Hannah Rashkin | Maarten Sap | Emily Allaway | Noah A. Smith | Yejin Choi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate a new commonsense inference task: given an event described in a short free-form text (“X drinks coffee in the morning”), a system reasons about the likely intents (“X wants to stay awake”) and reactions (“X feels alert”) of the event’s participants. To support this study, we construct a new crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations. We report baseline performance on this task, demonstrating that neural encoder-decoder models can successfully compose embedding representations of previously unseen events and reason about the likely intents and reactions of the event participants. In addition, we demonstrate how commonsense inference on people’s intents and reactions can help unveil the implicit gender inequality prevalent in modern movie scripts.

Co-authors

Gustavo Cilleruelo 2

Jena D. Hwang 2

Sarah-Jane Leslie 2

Desmond Patton 2

Melanie Subbiah 2

William Yang Wang 2

Karmanya Aggarwal 1

Richard He Bai 1

Miguel Ballesteros 1

Samuel R. Bowman 1

Lydia Chilton 1

António Câmara 1

Sophia Simeng Han 1

William Huang 1

Sahil Jayaram 1

Wai Chung Kwan 1

Peerat Limkonchotiwat 1

Lesly Miculicich Werlen 1

Pasquale Minervini 1

Meryem M’hamdi 1

Nikita Nangia 1

Alicia Parrish 1

Hannah Rashkin 1

Noah A. Smith 1

Malavika Srikanth 1

Santosh T.Y.S.S 1

Surendrabikram Thapa 1

Alexia Warstadt 1

Akhila Yerukola 1

Richard Zemel 1

Venues