Orion Weller - ACL Anthology

Orion Weller

2026

CacheNotes: Task-Aware Key-Value Cache Compression for Reasoning-Intensive Knowledge Tasks
Giulio Corallo | Orion Weller | Fabio Petroni | Paolo Papotti
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Integrating external knowledge into Large Language Models (LLMs) iscrucial for many real-world applications, yet current methods like Retrieval-Augmented Generation (RAG) face limitations with broad, multi-source queries, while long-context models are computationally prohibitive.We introduce CacheNotes: Task-Aware Key-Value Cache Compression. Given a task description and a corpus, CacheNotes first generates a sequence of Compression-Planning-Tokens (CPTs), an offline task-focused distillation pass that identifies and organizes key information from the corpus. These CPTs are then used to guide a one-time compression of the corpus into a compact, reusable KV cache, which is then used alone at inference time to efficiently answer diverse, reasoning-intensive queries, eliminating repeated retrieval or context expansion.Experiments on LongBench show that, on Question-Answering tasks at a 20× compression, CacheNotes outperforms RAG by over 8 F1 points and reduces latency by over 4×. On RULER, it surpasses previous query-agnostic compression methods by 55 points, narrowing the gap to query-aware compression approaches. Additional results on real-world enterprise and synthetic datasets demonstrate its strong performance on multi-hop and broad-coverage queries.

2025

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Orion Weller | Benjamin Chang | Sean MacAvaney | Kyle Lo | Arman Cohan | Benjamin Van Durme | Dawn Lawrie | Luca Soldaini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions – also known as narratives – developed for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contains hundreds to thousands of labeled documents per query, making them suitable for our exploration. Through this process, we can measure how well IR models follow instructions, through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements after fine-tuning on our training set.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner | Antoine Chaffin | Benjamin Clavié | Orion Weller | Oskar Hallström | Said Taghadouini | Alexis Gallagher | Raja Biswas | Faisal Ladhak | Tom Aarsen | Griffin Thomas Adams | Jeremy Howard | Iacopo Poli
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

CLERC: A Dataset for U. S. Legal Case Retrieval and Retrieval-Augmented Analysis Generation
Abe Bohan Hou | Orion Weller | Guanghui Qin | Eugene Yang | Dawn Lawrie | Nils Holzenberger | Andrew Blair-Stanek | Benjamin Van Durme
Findings of the Association for Computational Linguistics: NAACL 2025

Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligence systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to create a colossal dataset. supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset **CLERC** (Case Law Evaluation and Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and to (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.

2024

Defending Against Disinformation Attacks in Open-Domain Question Answering
Orion Weller | Aleem Khan | Nathaniel Weir | Dawn Lawrie | Benjamin Van Durme
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

Recent work in open-domain question answering (ODQA) has shown that adversarial poisoning of the search collection can cause large drops in accuracy for production systems. However, little to no work has proposed methods to defend against these attacks. To do so, we rely on the intuition that redundant information often exists in large corpora. To find it, we introduce a method that uses query augmentation to search for a diverse set of passages that could answer the original question but are less likely to have been poisoned. We integrate these new passages into the model through the design of a novel confidence method, comparing the predicted answer to its appearance in the retrieved contexts (what we call Confidence from Answer Redundancy, i.e. CAR). Together these methods allow for a simple but effective way to defend against poisoning attacks that provides gains of nearly 20% exact match across varying levels of data poisoning/knowledge conflicts.

“According to . . . ”: Prompting Language Models Improves Quoting from Pre-Training Data
Orion Weller | Marc Marone | Nathaniel Weir | Dawn Lawrie | Daniel Khashabi | Benjamin Van Durme
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) may hallucinate and generate fake information, despite pre-training on factual data. Inspired by the journalistic device of “according to sources”, we propose according-to prompting: directing LLMs to ground responses against previously observed text. To quantify this grounding, we propose a novel evaluation metric (QUIP-Score) that measures the extent to which model-produced answers are directly found in underlying text corpora. We illustrate with experiments on three corpora (Wikipedia, PubMed, and the U.S. legal tax code) that these prompts improve grounding under our metrics, with the additional benefit of often improving end-task performance. Furthermore, prompts that ask the model to decrease grounding (or to ground to other corpora) indeed decrease QUIP-Score, indicating the ability of LLMs to increase or decrease grounded generations on request.

NevIR: Negation in Neural Information Retrieval
Orion Weller | Dawn Lawrie | Benjamin Van Durme
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Negation is a common everyday phenomena and has been a consistent area of weakness for language models (LMs). Although the Information Retrieval (IR) community has adopted LMs as the backbone of modern IR architectures, there has been little to no research in understanding how negation impacts neural IR. We therefore construct a straightforward benchmark on this theme: asking IR models to rank two documents that differ only by negation. We show that the results vary widely according to the type of IR architecture: cross-encoders perform best, followed by late-interaction models, and in last place are bi-encoder and sparse neural architectures. We find that most current information retrieval models do not consider negation, performing similarly or worse than randomly ranking. We show that although the obvious approach of continued fine-tuning on a dataset of contrastive documents containing negations increases performance (as does model size), there is still a large gap between machine and human performance.

When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
Orion Weller | Kyle Lo | David Wadden | Dawn Lawrie | Benjamin Van Durme | Arman Cohan | Luca Soldaini
Findings of the Association for Computational Linguistics: EACL 2024

Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find that there exists a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they add additional noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset significantly differs from training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.

Overview of the Fourth Workshop on Scholarly Document Processing
Tirthankar Ghosal | Amanpreet Singh | Anita De Waard | Philipp Mayr | Aakanksha Naik | Orion Weller | Yoonjoo Lee | Zejiang Shen | Yanxia Qin
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

The workshop on Scholarly Document Processing (SDP) started in 2020 to accelerate research, inform policy and educate the public on natural language processing for scientific text. The fourth iteration of the workshop, SDP24 was held at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL24) as a hybrid event. The SDP workshop saw a great increase in interest, with 57 submissions, of which 28 were accepted. The program consisted of a research track, four invited talks and two shared tasks: 1) DAGPap24: Detecting automatically generated scientific papers and 2) Context24: Multimodal Evidence and Grounding Context Identification for Scientific Claims. The program was geared towards NLP, information extraction, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.

Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)
Tirthankar Ghosal | Amanpreet Singh | Anita Waard | Philipp Mayr | Aakanksha Naik | Orion Weller | Yoonjoo Lee | Shannon Shen | Yanxia Qin
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic
Nathaniel Weir | Kate Sanders | Orion Weller | Shreya Sharma | Dongwei Jiang | Zhengping Jiang | Bhavana Dalvi Mishra | Oyvind Tafjord | Peter Jansen | Peter Clark | Benjamin Van Durme
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what _valid decompositional entailment_ is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic entailment engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency than prior decompositional entailment datasets, suggesting that RDTE is a significant step forward in the long-standing problem of forming a clear protocol for discerning entailment. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.

2023

When Do Decompositions Help for Machine Reading?
Kangda Wei | Dawn Lawrie | Benjamin Van Durme | Yunmo Chen | Orion Weller
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Answering complex questions often requires multi-step reasoning in order to obtain the final answer. Most research into decompositions of complex questions involves open-domain systems, which have shown success in using these decompositions for improved retrieval. In the machine reading setting, however, work to understand when decompositions are helpful is understudied. We conduct experiments on decompositions in machine reading to unify recent work in this space, using a range of models and datasets. We find that decompositions can be helpful in zero or limited-data settings, giving several points of improvement in exact match. However, we also show that when models are given access to around a few hundred or more examples, decompositions are not helpful (and can actually be detrimental). Thus, our analysis implies that models can learn decompositions implicitly even with limited data.

2022

End-to-End Speech Translation for Code Switched Speech
Orion Weller | Matthias Sperber | Telmo Pires | Hendra Setiawan | Christian Gollan | Dominic Telaar | Matthias Paulik
Findings of the Association for Computational Linguistics: ACL 2022

Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages. CS can pose significant accuracy challenges to NLP, due to the often monolingual nature of the underlying systems. In this work, we focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation. To evaluate model performance on this task, we create a novel ST corpus derived from existing public data sets. We explore various ST architectures across two dimensions: cascaded (transcribe then translate) vs end-to-end (jointly transcribe and translate) and unidirectional (source -> target) vs bidirectional (source <-> target). We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.

When to Use Multi-Task Learning vs Intermediate Fine-Tuning for Pre-Trained Encoder Transfer Learning
Orion Weller | Kevin Seppi | Matt Gardner
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Transfer learning (TL) in natural language processing (NLP) has seen a surge of interest in recent years, as pre-trained models have shown an impressive ability to transfer to novel tasks. Three main strategies have emerged for making use of multiple supervised datasets during fine-tuning: training on an intermediate task before training on the target task (STILTs), using multi-task learning (MTL) to train jointly on a supplementary task and the target task (pairwise MTL), or simply using MTL to train jointly on all available datasets (MTL-ALL). In this work, we compare all three TL methods in a comprehensive analysis on the GLUE dataset suite. We find that there is a simple heuristic for when to use one of these techniques over the other: pairwise MTL is better than STILTs when the target task has fewer instances than the supporting task and vice versa. We show that this holds true in more than 92% of applicable cases on the GLUE dataset and validate this hypothesis with experiments varying dataset size. The simplicity and effectiveness of this heuristic is surprising and warrants additional exploration by the TL community. Furthermore, we find that MTL-ALL is worse than the pairwise methods in almost every case. We hope this study will aid others as they choose between TL methods for NLP tasks.

Pretrained Models for Multilingual Federated Learning
Orion Weller | Marc Marone | Vladimir Braverman | Dawn Lawrie | Benjamin Van Durme
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Since the advent of Federated Learning (FL), research has applied these methods to natural language processing (NLP) tasks. Despite a plethora of papers in FL for NLP, no previous works have studied how multilingual text impacts FL algorithms. Furthermore, multilingual text provides an interesting avenue to examine the impact of non-IID text (e.g. different languages) on FL in naturally occurring data. We explore three multilingual language tasks, language modeling, machine translation, and text classification using differing federated and non-federated learning algorithms. Our results show that using pretrained models reduces the negative effects of FL, helping them to perform near or better than centralized (no privacy) learning, even when using non-IID partitioning.

2021

Streaming Models for Joint Speech Recognition and Translation
Orion Weller | Matthias Sperber | Christian Gollan | Joris Kluivers
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Using end-to-end models for speech translation (ST) has increasingly been the focus of the ST community. These models condense the previously cascaded systems by directly converting sound waves into translated text. However, cascaded models have the advantage of including automatic speech recognition output, useful for a variety of practical ST systems that often display transcripts to the user alongside the translations. To bridge this gap, recent work has shown initial progress into the feasibility for end-to-end models to produce both of these outputs. However, all previous work has only looked at this problem from the consecutive perspective, leaving uncertainty on whether these approaches are effective in the more challenging streaming setting. We develop an end-to-end streaming ST model based on a re-translation approach and compare against standard cascading approaches. We also introduce a novel inference method for the joint case, interleaving both transcript and translation in generation and removing the need to use separate decoders. Our evaluation across a range of metrics capturing accuracy, latency, and consistency shows that our end-to-end models are statistically similar to cascading models, while having half the number of parameters. We also find that both systems provide strong translation quality at low latency, keeping 99% of consecutive quality at a lag of just under a second.

Exploring the Relationship Between Algorithm Performance, Vocabulary, and Run-Time in Text Classification
Wilson Fearn | Orion Weller | Kevin Seppi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Text classification is a significant branch of natural language processing, and has many applications including document classification and sentiment analysis. Unsurprisingly, those who do text classification are concerned with the run-time of their algorithms, many of which depend on the size of the corpus’ vocabulary due to their bag-of-words representation. Although many studies have examined the effect of preprocessing techniques on vocabulary size and accuracy, none have examined how these methods affect a model’s run-time. To fill this gap, we provide a comprehensive study that examines how preprocessing techniques affect the vocabulary size, model performance, and model run-time, evaluating ten techniques over four models and two datasets. We show that some individual methods can reduce run-time with no loss of accuracy, while some combinations of methods can trade 2-5% of the accuracy for up to a 65% reduction of run-time. Furthermore, some combinations of preprocessing techniques can even provide a 15% reduction in run-time while simultaneously improving model accuracy.

2020

Can Humor Prediction Datasets be used for Humor Generation? Humorous Headline Generation via Style Transfer
Orion Weller | Nancy Fulda | Kevin Seppi
Proceedings of the Second Workshop on Figurative Language Processing

Understanding and identifying humor has been increasingly popular, as seen by the number of datasets created to study humor. However, one area of humor research, humor generation, has remained a difficult task, with machine generated jokes failing to match human-created humor. As many humor prediction datasets claim to aid in generative tasks, we examine whether these claims are true. We focus our experiments on the most popular dataset, included in the 2020 SemEval’s Task 7, and teach our model to take normal text and “translate” it into humorous text. We evaluate our model compared to humorous human generated headlines, finding that our model is preferred equally in A/B testing with the human edited versions, a strong success for humor generation, and is preferred over an intelligent random baseline 72% of the time. We also show that our model is assumed to be human written comparable with that of the human edited headlines and is significantly better than random, indicating that this dataset does indeed provide potential for future humor generation systems.

You Don’t Have Time to Read This: An Exploration of Document Reading Time Prediction
Orion Weller | Jordan Hildebrandt | Ilya Reznik | Christopher Challis | E. Shannon Tass | Quinn Snell | Kevin Seppi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Predicting reading time has been a subject of much previous work, focusing on how different words affect human processing, measured by reading time. However, previous work has dealt with a limited number of participants as well as word level only predictions (i.e. predicting the time to read a single word). We seek to extend these works by examining whether or not document level predictions are effective, given additional information such as subject matter, font characteristics, and readability metrics. We perform a novel experiment to examine how different features of text contribute to the time it takes to read, distributing and collecting data from over a thousand participants. We then employ a large number of machine learning methods to predict a user’s reading time. We find that despite extensive research showing that word level reading time can be most effectively predicted by neural networks, larger scale text can be easily and most accurately predicted by one factor, the number of words.

Learning from Task Descriptions
Orion Weller | Nicholas Lourie | Matt Gardner | Matthew E. Peters
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Typically, machine learning systems solve new tasks by training on thousands of examples. In contrast, humans can solve new tasks by reading some instructions, with perhaps an example or two. To take a step toward closing this gap, we introduce a framework for developing NLP systems that solve new tasks after reading their descriptions, synthesizing prior work in this area. We instantiate this frame- work with a new English language dataset, ZEST, structured for task-oriented evaluation on unseen tasks. Formulating task descriptions as questions, we ensure each is general enough to apply to many possible inputs, thus comprehensively evaluating a model’s ability to solve each task. Moreover, the dataset’s structure tests specific types of systematic generalization. We find that the state-of-the-art T5 model achieves a score of 12% on ZEST, leaving a significant challenge for NLP researchers.

The rJokes Dataset: a Large Scale Humor Collection
Orion Weller | Kevin Seppi
Proceedings of the Twelfth Language Resources and Evaluation Conference

Humor is a complicated language phenomenon that depends upon many factors, including topic, date, and recipient. Because of this variation, it can be hard to determine what exactly makes a joke humorous, leading to difficulties in joke identification and related tasks. Furthermore, current humor datasets are lacking in both joke variety and size, with almost all current datasets having less than 100k jokes. In order to alleviate this issue we compile a collection of over 550,000 jokes posted over an 11 year period on the Reddit r/Jokes subreddit (an online forum), providing a large scale humor dataset that can easily be used for a myriad of tasks. This dataset also provides quantitative metrics for the level of humor in each joke, as determined by subreddit user feedback. We explore this dataset through the years, examining basic statistics, most mentioned entities, and sentiment proportions. We also introduce this dataset as a task for future work, where models learn to predict the level of humor in a joke. On that task we provide strong state-of-the-art baseline models and show room for future improvement. We hope that this dataset will not only help those researching computational humor, but also help social scientists who seek to understand popular culture through humor.

2019

Humor Detection: A Transformer Gets the Last Laugh
Orion Weller | Kevin Seppi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Much previous work has been done in attempting to identify humor in text. In this paper we extend that capability by proposing a new task: assessing whether or not a joke is humorous. We present a novel way of approaching this problem by building a model that learns to identify humorous jokes based on ratings gleaned from Reddit pages, consisting of almost 16,000 labeled instances. Using these ratings to determine the level of humor, we then employ a Transformer architecture for its advantages in learning from sentence context. We demonstrate the effectiveness of this approach and show results that are comparable to human performance. We further demonstrate our model’s increased capabilities on humor identification problems, such as the previously created datasets for short jokes and puns. These experiments show that this method outperforms all previous work done on these tasks, with an F-measure of 93.1% for the Puns dataset and 98.6% on the Short Jokes dataset.

Co-authors

Tirthankar Ghosal 2

Christian Gollan 2

Aakanksha Naik 2

Amanpreet Singh 2

Luca Soldaini 2

Matthias Sperber 2

Griffin Thomas Adams 1

Andrew Blair-Stanek 1

Vladimir Braverman 1

Antoine Chaffin 1

Christopher Challis 1

Benjamin Chang 1

Benjamin Clavié 1

Giulio Corallo 1

Bhavana Dalvi 1

Anita De Waard 1

Alexis Gallagher 1

Oskar Hallström 1

Jordan Hildebrandt 1

Nils Holzenberger 1

Abe Bohan Hou 1

Jeremy Howard 1

Dongwei Jiang 1

Zheng Ping Jiang 1

Daniel Khashabi 1

Joris Kluivers 1

Faisal Ladhak 1

Nicholas Lourie 1

Sean MacAvaney 1

Paolo Papotti 1

Matthias Paulik 1

Matthew E. Peters 1

Fabio Petroni 1

Hendra Setiawan 1

Shreya Sharma 1

Oyvind Tafjord 1

Said Taghadouini 1

E. Shannon Tass 1

Dominic Telaar 1

Benjamin Warner 1

Venues