Elliott Ash - ACL Anthology

Elliott Ash

2025

The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure
Yu Fan | Yang Tian | Shauli Ravfogel | Mrinmaya Sachan | Elliott Ash | Alexander Hoyle
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text’s source or language. These document confounders cause problems for many applications, but especially those that need to pool texts from different corpora. This paper shows that a debiasing algorithm that removes information about observed confounders from the encoder representations substantially reduces these biases at a minimal computational cost. Document similarity and clustering metrics improve across every embedding variant and task we evaluate—often dramatically. Interestingly, performance on out-of-distribution benchmarks is not impacted, indicating that the embeddings are not otherwise degraded.

Measuring scalar constructs in social science with LLMs
Hauke Licht | Rupak Sarkar | Patrick Y. Wu | Pranav Goel | Niklas Stoehr | Elliott Ash | Alexander Miserlis Hoyle
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Many constructs that characterize language, like its complexity or emotionality, have a naturally continuous semantic structure; a public speech is not just “simple” or “complex”, but exists on a continuum between extremes. Although large language models (LLMs) are an attractive tool for measuring scalar constructs, their idiosyncratic treatment of numerical outputs raises questions of how to best apply them. We address these questions with a comprehensive evaluation of LLM-based approaches to scalar construct measurement in social science. Using multiple datasets sourced from the political science literature, we evaluate four approaches: unweighted direct pointwise scoring, aggregation of pairwise comparisons, token-probability-weighted pointwise scoring, and finetuning. Our study finds that pairwise comparisons made by LLMs produce better measurements than simply prompting the LLM to directly output the scores, which suffers from bunching around arbitrary numbers. However, taking the weighted mean over the token probability of scores further improves the measurements over the two previous approaches. Finally, finetuning smaller models with as few as 1,000 training pairs can match or exceed the performance of prompted LLMs.

Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
Chenfei Xiong | Jingwei Ni | Yu Fan | Vilém Zouhar | Donya Rooein | Lorena Calvo-Bartolomé | Alexander Hoyle | Zhijing Jin | Mrinmaya Sachan | Markus Leippold | Dirk Hovy | Mennatallah El-Assady | Elliott Ash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists user in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. Extensive user study, qualitative and quantitative analyses prove the effectiveness of Co-DETECT.

DIRAS: Efficient LLM Annotation of Document Relevance for Retrieval Augmented Generation
Jingwei Ni | Tobias Schimanski | Meihong Lin | Mrinmaya Sachan | Elliott Ash | Markus Leippold
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Retrieval Augmented Generation (RAG) is widely employed to ground responses to queries on domain-specific documents. But do RAG implementations leave out important information when answering queries that need an integrated analysis of information (e.g., Tell me good news in the stock market today.)? To address these concerns, RAG developers need to annotate information retrieval (IR) data for their domain of interest, which is challenging because (1) domain-specific queries usually need nuanced definitions of relevance beyond shallow semantic relevance; and (2) human or GPT-4 annotation is costly and cannot cover all (query, document) pairs (i.e., annotation selection bias), thus harming the effectiveness in evaluating IR recall. To address these challenges, we propose DIRAS (**D**omain-specific **I**nformation **R**etrieval **A**nnotation with **S**calability), a manual-annotation-free schema that fine-tunes open-sourced LLMs to consider nuanced relevance definition and annotate (partial) relevance labels with calibrated relevance scores. Extensive evaluation shows that DIRAS enables smaller (8B) LLMs to achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and is helpful for real-world RAG development.

2024

AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators
Jingwei Ni | Minjing Shi | Dominik Stammbach | Mrinmaya Sachan | Elliott Ash | Markus Leippold
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the rise of generative AI, automated fact-checking methods to combat misinformation are becoming more and more important. However, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and what a claim is, and (2) the high cost of manual annotation. To address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. To address (2), we introduce AFaCTA (Automatic Factual Claim deTection Annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (LLMs). AFaCTA calibrates its annotation confidence with consistency along three predefined reasoning paths. Extensive evaluation and experiments in the domain of political speech reveal that AFaCTA can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. Our analyses also result in PoliClaim, a comprehensive claim detection dataset spanning diverse political topics.

Towards Faithful and Robust LLM Specialists for Evidence-Based Question-Answering
Tobias Schimanski | Jingwei Ni | Mathias Kraus | Elliott Ash | Markus Leippold
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Advances towards more faithful and traceable answers of Large Language Models (LLMs) are crucial for various research and practical endeavors. One avenue in reaching this goal is basing the answers on reliable sources. However, this Evidence-Based QA has proven to work insufficiently with LLMs in terms of citing the correct sources (source quality) and truthfully representing the information within sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data improves performance on both in- and out-of-distribution. Furthermore, we show that data quality, which can be drastically improved by proposed quality filters, matters more than quantity in improving Evidence-Based QA.

Where Do People Tell Stories Online? Story Detection Across Online Communities
Maria Antoniak | Joel Mire | Maarten Sap | Elliott Ash | Andrew Piper
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Story detection in online communities is a challenging task as stories are scattered across communities and interwoven with non-storytelling spans within a single text. We address this challenge by building and releasing the StorySeeker toolkit, including a richly annotated dataset of 502 Reddit posts and comments, a detailed codebook adapted to the social media context, and models to predict storytelling at the document and span levels. Our dataset is sampled from hundreds of popular English-language Reddit communities ranging across 33 topic categories, and it contains fine-grained expert annotations, including binary story labels, story spans, and event spans. We evaluate a range of detection methods using our data, and we identify the distinctive textual features of online storytelling, focusing on storytelling spans, which we introduce as a new task. We illuminate distributional characteristics of storytelling on a large community-centric social media platform, and we also conduct a case study on r/ChangeMyView, where storytelling is used as one of many persuasive strategies, illustrating that our data and models can be used for both inter- and intra-community research. Finally, we discuss implications of our tools and analyses for narratology and the study of online communities.

Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback
Maria Lerner | Florian Dorner | Elliott Ash | Naman Goel
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

There is a growing body of work on learning from human feedback to align various aspects of machine learning systems with human values and preferences. We consider the setting of fairness in content moderation, in which human feedback is used to determine how two comments — referencing different sensitive attribute groups — should be treated in comparison to one another. With a novel dataset collected from Prolific and MTurk, we find significant gaps in fairness preferences depending on the race, age, political stance, educational level, and LGBTQ+ identity of annotators. We also demonstrate that demographics mentioned in text have a strong influence on how users perceive individual fairness in moderation. Further, we find that differences also exist in downstream classifiers trained to predict human preferences. Finally, we observe that an ensemble, giving equal weight to classifiers trained on annotations from different demographics, performs better for different demographic intersections; compared to a single classifier that gives equal weight to each annotation.

LePaRD: A Large-Scale Dataset of Judicial Citations to Precedent
Robert Mahari | Dominik Stammbach | Elliott Ash | Alex Pentland
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present the Legal Passage Retrieval Dataset, LePaRD. LePaRD contains millions of examples of U.S. federal judges citing precedent in context. The dataset aims to facilitate work on legal passage retrieval, a challenging practice-oriented legal retrieval and reasoning task. Legal passage retrieval seeks to predict relevant passages from precedential court decisions given the context of a legal argument. We extensively evaluate various approaches on LePaRD, and find that classification-based retrieval appears to work best. Our best models only achieve a recall of 59% when trained on data corresponding to the 10,000 most-cited passages, underscoring the difficulty of legal passage retrieval. By publishing LePaRD, we provide a large-scale and high quality resource to foster further research on legal passage retrieval. We hope that research on this practice-oriented NLP task will help expand access to justice by reducing the burden associated with legal research via computational assistance. Warning: Extracts from judicial opinions may contain offensive language.

Aligning Large Language Models with Diverse Political Viewpoints
Dominik Stammbach | Philine Widmer | Eunjung Cho | Caglar Gulcehre | Elliott Ash
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models such as ChatGPT exhibit striking political biases. If users query them about political information, they often take a normative stance. To overcome this, we align LLMs with diverse political viewpoints from 100,000 comments written by candidates running for national parliament in Switzerland. Models aligned with this data can generate more accurate political viewpoints from Swiss parties, compared to commercial models such as ChatGPT. We also propose a procedure to generate balanced overviews summarizing multiple viewpoints using such models. The replication package contains all code and data.

The Empirical Variability of Narrative Perceptions of Social Media Texts
Joel Mire | Maria Antoniak | Elliott Ash | Andrew Piper | Maarten Sap
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Most NLP work on narrative detection has focused on prescriptive definitions of stories crafted by researchers, leaving open the questions: how do crowd workers perceive texts to be a story, and why? We investigate this by building StoryPerceptions, a dataset of 2,496 perceptions of storytelling in 502 social media texts from 255 crowd workers, including categorical labels along with free-text storytelling rationales, authorial intent, and more. We construct a fine-grained bottom-up taxonomy of crowd workers’ varied and nuanced perceptions of storytelling by open-coding their free-text rationales. Through comparative analyses at the label and code level, we illuminate patterns of disagreement among crowd workers and across other annotation contexts, including prescriptive labeling from researchers and LLM-based predictions. Notably, plot complexity, references to generalized or abstract actions, and holistic aesthetic judgments (such as a sense of cohesion) are especially important in disagreements. Our empirical findings broaden understanding of the types, relative importance, and contentiousness of features relevant to narrative detection, highlighting opportunities for future work on reader-contextualized models of narrative reception.

Swiss AI Initiative - Collecting Large Amounts of High-Quality Data for Training Large Language Models
Jan Deriu | Maud Ehrmann | Emanuela Boros | Maximilian Böther | Christiane Sibille | Ihor Protsenko | Marta Brucka | Imanol Schlag | Elliott Ash
Proceedings of the 9th edition of the Swiss Text Analytics Conference

2023

Uncovering and Categorizing Social Biases in Text-to-SQL
Yan Liu | Yan Gao | Zhe Su | Xiaokang Chen | Elliott Ash | Jian-Guang Lou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large pre-trained language models are acknowledged to carry social bias towards different demographics, which can further amplify existing stereotypes in our society and cause even more harm. Text-to-SQL is an important task, models of which are mainly adopted by administrative industries, where unfair decisions may lead to catastrophic consequences. However, existing Text-to-SQL models are trained on clean, neutral datasets, such as Spider and WikiSQL. This, to some extent, cover up social bias in models under ideal conditions, which nevertheless may emerge in real application scenarios. In this work, we aim to uncover and mitigate social bias in Text-to-SQL models. We summarize the categories of social bias that may occur in structural data for Text-to-SQL models. We build test benchmarks and reveal that models with similar task accuracy can contain social bias at very different rates. We show how to take advantage of our methodology to assess and mitigate social bias in the downstream Text-to-SQL task.

Revisiting Automated Topic Model Evaluation with Large Language Models
Dominik Stammbach | Vilém Zouhar | Alexander Hoyle | Mrinmaya Sachan | Elliott Ash
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Topic models help us make sense of large text collections. Automatically evaluating their output and determining the optimal number of topics are both longstanding challenges, with no effective automated solutions to date. This paper proposes using large language models (LLMs) for these tasks. We find that LLMs appropriately assess the resulting topics, correlating more strongly with human judgments than existing automated metrics. However, the setup of the evaluation task is crucial — LLMs perform better on coherence ratings of word sets than on intrustion detection. We find that LLMs can also assist us in guiding us towards a reasonable number of topics. In actual applications, topic models are typically used to answer a research question related to a collection of texts. We can incorporate this research question in the prompt to the LLM, which helps estimating the optimal number of topics.

The Law and NLP: Bridging Disciplinary Disconnects
Robert Mahari | Dominik Stammbach | Elliott Ash | Alex Pentland
Findings of the Association for Computational Linguistics: EMNLP 2023

Legal practice is intrinsically rooted in the fabric of language, yet legal practitioners and scholars have been slow to adopt tools from natural language processing (NLP). At the same time, the legal system is experiencing an access to justice crisis, which could be partially alleviated with NLP. In this position paper, we argue that the slow uptake of NLP in legal practice is exacerbated by a disconnect between the needs of the legal community and the focus of NLP researchers. In a review of recent trends in the legal NLP literature, we find limited overlap between the legal NLP community and legal academia. Our interpretation is that some of the most popular legal NLP tasks fail to address the needs of legal practitioners. We discuss examples of legal NLP tasks that promise to bridge disciplinary disconnects and highlight interesting areas for legal NLP research that remain underexplored.

2022

MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes
Nianlong Gu | Elliott Ash | Richard Hahnloser
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at each step with information on the current extraction history. When MemSum iteratively selects sentences into the summary, it considers a broad information set that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum obtains state-of-the-art test-set performance (ROUGE) in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum’s awareness of extraction history.

DocSCAN: Unsupervised Text Classification via Learning from Neighbors
Dominik Stammbach | Elliott Ash
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

Heroes, Villains, and Victims, and GPT-3: Automated Extraction of Character Roles Without Training Data
Dominik Stammbach | Maria Antoniak | Elliott Ash
Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)

This paper shows how to use large-scale pretrained language models to extract character roles from narrative texts without domain-specific training data. Queried with a zero-shot question-answering prompt, GPT-3 can identify the hero, villain, and victim in diverse domains: newspaper articles, movie plot summaries, and political speeches.

2021

Machine Extraction of Tax Laws from Legislative Texts
Elliott Ash | Malka Guillot | Luyang Han
Proceedings of the Natural Legal Language Processing Workshop 2021

Using a corpus of compiled codes from U.S. states containing labeled tax law sections, we train text classifiers to automatically tag tax-law documents and, further, to identify the associated revenue source (e.g. income, property, or sales). After evaluating classifier performance in held-out test data, we apply them to an historical corpus of U.S. state legislation to extract the flow of relevant laws over the years 1910 through 2010. We document that the classifiers are effective in the historical corpus, for example by automatically detecting establishments of state personal income taxes. The trained models with replication code are published at https://github.com/luyang521/tax-classification.

2019

Proceedings of the Natural Legal Language Processing Workshop 2019
Nikolaos Aletras | Elliott Ash | Leslie Barrett | Daniel Chen | Adam Meyers | Daniel Preotiuc-Pietro | David Rosenberg | Amanda Stent
Proceedings of the Natural Legal Language Processing Workshop 2019

Co-authors

Venues