Michael Bendersky - ACL Anthology

Michael Bendersky

2025

MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu | Zhuohao Ni | Nicholas Crispino | Zihao Yu | Michael Bendersky | Beliz Gunel | Ruoxi Jia | Xin Liu | Lingjuan Lyu | Dawn Song | Chenguang Wang
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)

We present a novel visual instruction tuning strategy to improve the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing mixtures of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary vision-language data in various controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach is able to perform on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as low as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables transfer of instruction following ability and domain knowledge across modalities while being more efficient than the vision-language approach.

2024

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
Zhuowan Li | Cheng Li | Mingyang Zhang | Qiaozhu Mei | Michael Bendersky
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Retrieval Augmented Generation (RAG) has been a powerful tool for Large Language Models (LLMs) to efficiently process overly lengthy contexts. However, recent LLMs like Gemini-1.5 and GPT-4 show exceptional capabilities to understand long contexts directly. We conduct a comprehensive comparison between RAG and long-context (LC) LLMs, aiming to leverage the strengths of both. We benchmark RAG and LC across various public datasets using three latest LLMs. Results reveal that when resourced sufficiently, LC consistently outperforms RAG in terms of average performance. However, RAG’s significantly lower cost remains a distinct advantage. Based on this observation, we propose Self-Route, a simple yet effective method that routes queries to RAG or LC based on model self-reflection. Self-Route significantly reduces the computation cost while maintaining a comparable performance to LC. Our findings provide a guideline for long-context applications of LLMs using RAG and LC.

It’s All Relative! – A Synthetic Query Generation Approach for Improving Zero-Shot Relevance Prediction
Aditi Chaudhary | Karthik Raman | Michael Bendersky
Findings of the Association for Computational Linguistics: NAACL 2024

Large language models (LLMs) have shown promising ability to generate synthetic query-document pairs by prompting with as few as 8 demonstrations. This has enabled building better IR models, especially for tasks with no training data. Typically, such synthetic query generation (QGen) approaches condition on an input context (e.g. a text document) and generate a query relevant to that context, or condition the QGen additionally on the relevance label (e.g. relevant vs irrelevant) to generate queries across relevance buckets. However, we find that such QGen approaches are sub-optimal as they require the model to reason about the desired label and the input from a handful of examples. In this work, we propose to reduce this burden of LLMs by generating queries simultaneously for different labels. We hypothesize that instead of asking the model to generate, say, an irrelevant query given an input context, asking the model to generate an irrelevant query relative to a relevant query is a much simpler task. Extensive experimentation across nine IR datasets shows that synthetic queries generated in such a fashion translates to better downstream performance.

Explanation-aware Soft Ensemble Empowers Large Language Model In-context Learning
Yue Yu | Jiaming Shen | Tianqi Liu | Zhen Qin | Jing Nathan Yan | Jialu Liu | Chao Zhang | Michael Bendersky
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks with a few demonstration examples via in-context learning. Common strategies to boost such “in-context” learning ability are to ensemble multiple model decoded results and require the model to generate an explanation along with the prediction. However, these models often treat different class predictions equally and neglect the potential discrepancy between the explanations and predictions. To fully unleash the power of explanations, we propose EASE, an Explanation-Aware Soft Ensemble framework to empower in-context learning with LLMs. We design two techniques, explanation-guided ensemble, and soft probability aggregation, to mitigate the effect of unreliable explanations and improve the consistency between explanations and final predictions. Experiments on seven natural language understanding tasks and four varying-size LLMs demonstrate the effectiveness of our proposed framework.

LaMP: When Large Language Models Meet Personalization
Alireza Salemi | Sheshera Mysore | Michael Bendersky | Hamed Zamani
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper highlights the importance of personalization in large language models and introduces the LaMP benchmark — a novel benchmark for training and evaluating language models for producing personalized outputs. LaMP offers a comprehensive evaluation framework with diverse language tasks and multiple entries for each user profile. It consists of seven personalized tasks, spanning three text classification and four text generation tasks. We additionally propose two retrieval augmentation approaches that retrieve personal items from each user profile for personalizing language model outputs. To this aim, we study various retrieval models, including term matching, semantic matching, and time-aware methods. Extensive experiments on LaMP for zero-shot and fine-tuned language models demonstrate the efficacy of the proposed retrieval augmentation approach and highlight the impact of personalization in various natural language tasks.

PRewrite: Prompt Rewriting with Reinforcement Learning
Weize Kong | Spurthi Hombaiah | Mingyang Zhang | Qiaozhu Mei | Michael Bendersky
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Prompt engineering is critical for the development of LLM-based applications. However, it is usually done manually in a “trial and error” fashion that can be time consuming, ineffective, and sub-optimal. Even for the prompts which seemingly work well, there is always a lingering question: can the prompts be made better with further modifications?To address these problems, we investigate automated prompt engineering in this paper. Specifically, we propose PRewrite, an automated method to rewrite an under-optimized prompt to a more effective prompt. We instantiate the prompt rewriter using an LLM. The rewriter LLM is trained using reinforcement learning to optimize the performance on a given downstream task. We conduct experiments on diverse benchmark datasets, which demonstrates the effectiveness of PRewrite.

Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
Zhen Qin | Rolf Jagerman | Kai Hui | Honglei Zhuang | Junru Wu | Le Yan | Jiaming Shen | Tianqi Liu | Jialu Liu | Donald Metzler | Xuanhui Wang | Michael Bendersky
Findings of the Association for Computational Linguistics: NAACL 2024

Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets.We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP).Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL 2019&2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably with the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, while outperforming other LLM-based solutions, such as InstructGPT which has 175B parameters, by over 10% for all ranking metrics. By using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines and outperforms the blackbox commercial ChatGPT solution by 4.2% and pointwise LLM-based solutions by more than 10% on average NDCG@10.Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity.

Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing
Le Yan | Zhen Qin | Honglei Zhuang | Rolf Jagerman | Xuanhui Wang | Michael Bendersky | Harrie Oosterhuis
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The powerful generative abilities of large language models (LLMs) show potential in generating relevance labels for search applications. Previous work has found that directly asking about relevancy, such as "*How relevant is document A to query Q?*”, results in suboptimal ranking. Instead, the pairwise-ranking prompting (PRP) approach produces promising ranking performance through asking about pairwise comparisons, e.g., "*Is document A more relevant than document B to query Q?*”. Thus, while LLMs are effective at their ranking ability, this is not reflected in their relevance label generation.In this work, we propose a post-processing method to consolidate the relevance labels generated by an LLM with its powerful ranking abilities. Our method takes both LLM generated relevance labels and pairwise preferences. The labels are then altered to satisfy the pairwise preferences of the LLM, while staying as close to the original values as possible. Our experimental results indicate that our approach effectively balances label accuracy and ranking performance. Thereby, our work shows it is possible to combine both the ranking and labeling abilities of LLMs through post-processing.

Take One Step at a Time to Know Incremental Utility of Demonstration: An Analysis on Reranking for Few-Shot In-Context Learning
Kazuma Hashimoto | Karthik Raman | Michael Bendersky
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs). Only a few demonstrations enable LLMs to be used as blackbox for new tasks. Previous studies have shown that using LLMs’ outputs as labels is effective in training models to select demonstrations. Such a label is expected to estimate utility of a demonstration in ICL; however, it has not been well understood how different labeling strategies affect results on target tasks. This paper presents an analysis on different utility functions by focusing on LLMs’ output probability given ground-truth output, and task-specific reward given LLMs’ prediction. Unlike the previous work, we introduce a novel labeling method, incremental utility, which estimates how much incremental knowledge is brought into the LLMs by a demonstration. We conduct experiments with instruction-tuned LLMs on binary/multi-class classification, segmentation, and translation across Arabic, English, Finnish, Japanese, and Spanish. Our results show that (1) the probability is effective when the probability values are distributed across the whole value range (on the classification tasks), and (2) the downstream metric is more robust when nuanced reward values are provided with long outputs (on the segmentation and translation tasks). We then show that the proposed incremental utility further helps ICL by contrasting how the LLMs perform with and without the demonstrations.

MLT-DR: Multi-Lingual/Task Demonstration RetrievalAn Attempt towards Generalized Retriever for In-Context Learning
Kazuma Hashimoto | Arjun Reddy Akula | Karthik Raman | Michael Bendersky
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

This paper presents Multi-Lingual/Task Demonstration Retrieval (MLT-DR) for in-context learning with Large Language Models (LLMs).Our goal is to investigate how dense demonstration retrieval models are generalized across languages and tasks.We first convert 81 tasks into a common format, covering various languages, task types, and domains.For 8 English-based tasks among them, we use machine translation to create synthetic multi/cross-lingual tasks, by translating the examples into non-English languages to explicitly cover more than 130 languages.We then use an instruction-tuned LLM to estimate utility of demonstrations for all the tasks to train the demonstration retrieval models.In our experiments, we show an interesting counterintuitive observation; to compute embeddings of demonstrations, using both the input and ground-truth output hurts the generalization ability of the retriever on unseen tasks whose output space is quite different from those in the seen task set.We also examine that our retriever robustly works even with LLMs that we did not touch during the development of the models.The retrieval models’ checkpoints are publicly available at URL-available-upon-publication.

Bridging the Preference Gap between Retrievers and LLMs
Zixuan Ke | Weize Kong | Cheng Li | Mingyang Zhang | Qiaozhu Mei | Michael Bendersky
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have demonstrated superior results across a wide range of tasks, and Retrieval-augmented Generation (RAG) is an effective way to enhance the performance by locating relevant information and placing it into the context window of the LLM. However, the relationship between retrievers and LLMs in a RAG is still under-investigated. Most existing work treats the retriever and the LLM as independent components and leaves a gap between retrieving human-”friendly” information and assembling a LLM-”friendly” context. In this work, we examine a novel bridge mechanism. We validate the ranking and selection assumptions of retrievers in the context of RAG and propose a framework that chains together supervised and reinforcement learning to train a bridge model that optimizes the connection between the retriever and the LLM. Empirical results demonstrate the effectiveness of our method in both question-answering and personalized generation tasks.

Beyond Yes and No: Improving Zero-Shot LLM Rankers via Scoring Fine-Grained Relevance Labels
Honglei Zhuang | Zhen Qin | Kai Hui | Junru Wu | Le Yan | Xuanhui Wang | Michael Bendersky
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Zero-shot text rankers powered by recent LLMs achieve remarkable ranking performance by simply prompting. Existing prompts for pointwise LLM rankers mostly ask the model to choose from binary relevance labels like “Yes” and “No”. However, the lack of intermediate relevance label options may cause the LLM to provide noisy or biased answers for documents that are partially relevant to the query. We propose to incorporate fine-grained relevance labels into the prompt for LLM rankers, enabling them to better differentiate among documents with different levels of relevance to the query and thus derive a more accurate ranking. We study two variants of the prompt template, coupled with different numbers of relevance levels. Our experiments on 8 BEIR data sets show that adding fine-grained relevance labels significantly improves the performance of LLM rankers.

Multilingual Fine-Grained News Headline Hallucination Detection
Jiaming Shen | Tianqi Liu | Jialu Liu | Zhen Qin | Jay Pavagadhi | Simon Baumgartner | Michael Bendersky
Findings of the Association for Computational Linguistics: EMNLP 2024

The popularity of automated news headline generation has surged with advancements in pre-trained language models. However, these models often suffer from the “hallucination” problem, where the generated headline is not fully supported by its source article. Efforts to address this issue have predominantly focused on English, using over-simplistic classification schemes that overlook nuanced hallucination types. In this study, we introduce the first multilingual, fine-grained news headline hallucination detection dataset that contains over 11 thousand <article, headline> pairs in 5 languages, each annotated with detailed hallucination types by experts. We conduct extensive experiments on this dataset under two settings. First, we implement several supervised fine-tuning approaches as preparatory solutions and demonstrate this dataset’s challenges and utilities. Second, we test various large language models’ in-context learning abilities and propose two novel techniques, language-dependent demonstration selection and coarse-to-fine prompting, to boost the few-shot hallucination detection performance in terms of the example-F1 metric. We release this dataset to foster further research in multilingual, fine-grained headline hallucination detection.

Predicting Text Preference Via Structured Comparative Reasoning
Jing Nathan Yan | Tianqi Liu | Justin Chiu | Jiaming Shen | Zhen Qin | Yue Yu | Charumathi Lakshmanan | Yair Kurzion | Alexander Rush | Jialu Liu | Michael Bendersky
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Comparative reasoning plays a crucial role in predicting text preferences; however, large language models (LLMs) often demonstrate inconsistencies in their reasoning, leading to incorrect preference predictions. While approaches like Chain-of-Thought improve accuracy in many settings, they struggle to consistently distinguish the similarities and differences of complex texts. We introduce SC², a model that prompts LLMs to predict text preferences by generating structured intermediate comparisons. SC² begins by proposing aspects for comparison, followed by generating textual comparisons under each aspect. We select consistent comparisons with a pairwise comparator that ensures each comparison of a given aspect clearly distinguishes differences between texts, significantly reducing hallucination and improving consistency. Our empirical studies across various NLP tasks, including summarization, retrieval, and automatic rating, demonstrate that SC²‘s enhanced performance in text preference prediction is significant.

PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs
Rongzhi Zhang | Jiaming Shen | Tianqi Liu | Haorui Wang | Zhen Qin | Feng Han | Jialu Liu | Simon Baumgartner | Michael Bendersky | Chao Zhang
Findings of the Association for Computational Linguistics: ACL 2024

Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings. Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models. However, traditional KD techniques face specific challenges when applied to LLMs, including restricted access to LLM outputs, significant teacher-student capacity gaps, and the inherited mis-calibration issue. In this work, we present PLaD, a novel preference-based LLM distillation framework. PLaD exploits the teacher-student capacity discrepancy to generate pseudo-preference pairs where teacher outputs are preferred over student outputs. Then, PLaD leverages a ranking loss to re-calibrate the student’s estimation of sequence likelihood, which steers the student’s focus towards understanding the relative quality of outputs instead of simply imitating the teacher. PLaD bypasses the need for access to teacher LLM’s internal states, tackles the student’s expressivity limitations, and mitigates the student mis-calibration issue. Through extensive experiments on two sequence generation tasks and with various LLMs, we demonstrate the effectiveness of our proposed PLaD framework.

2023

Creator Context for Tweet Recommendation
Spurthi Amba Hombaiah | Tao Chen | Mingyang Zhang | Michael Bendersky | Marc Najork | Matt Colen | Sergey Levi | Vladimir Ofitserov | Tanvir Amin
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

When discussing a tweet, people usually not only refer to the content it delivers, but also to the person behind the tweet. In other words, grounding the interpretation of the tweet in the context of its creator plays an important role in deciphering the true intent and the importance of the tweet. In this paper, we attempt to answer the question of how creator context should be used to advance tweet understanding. Specifically, we investigate the usefulness of different types of creator context, and examine different model structures for incorporating creator context in tweet modeling. We evaluate our tweet understanding models on a practical use case – recommending relevant tweets to news articles. This use case already exists in popular news apps, and can also serve as a useful assistive tool for journalists. We discover that creator context is essential for tweet understanding, and can improve application metrics by a large margin. However, we also observe that not all creator contexts are equal. Creator context can be time sensitive and noisy. Careful creator context selection and deliberate model structure design play an important role in creator context effectiveness.

2022

QUILL: Query Intent with Large Language Models using Retrieval Augmentation and Multi-stage Distillation
Krishna Srinivasan | Karthik Raman | Anupam Samanta | Lingrui Liao | Luca Bertelli | Michael Bendersky
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Large Language Models (LLMs) have shown impressive results on a variety of text understanding tasks. Search queries though pose a unique challenge, given their short-length and lack of nuance or context. Complicated feature engineering efforts do not always lead to downstream improvements as their performance benefits may be offset by increased complexity of knowledge distillation. Thus, in this paper we make the following contributions: (1) We demonstrate that Retrieval Augmentation of queries provides LLMs with valuable additional context enabling improved understanding. While Retrieval Augmentation typically increases latency of LMs (thus hurting distillation efficacy), (2) we provide a practical and effective way of distilling Retrieval Augmentation LLMs. Specifically, we use a novel two-stage distillation approach that allows us to carry over the gains of retrieval augmentation, without suffering the increased compute typically associated with it. (3) We demonstrate the benefits of the proposed approach (QUILL) on a billion-scale, real-world query understanding system resulting in huge gains. Via extensive experiments, including on public benchmarks, we believe this work offers a recipe for practical use of retrieval-augmented query understanding.

2020

DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling
Jiecao Chen | Liu Yang | Karthik Raman | Michael Bendersky | Jung-Jung Yeh | Yun Zhou | Marc Najork | Danyang Cai | Ehsan Emadzadeh
Findings of the Association for Computational Linguistics: EMNLP 2020

Pre-trained models like BERT ((Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation (Hinton et al., 2015), leading to faster inference. However – as we show here – existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work, we propose DiPair — a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach with speedups of over 350x and minimal quality drop relative to the cross-attention teacher BERT model.

2012

A Dictionary of Wisdom and Wit: Learning to Extract Quotable Phrases
Michael Bendersky | David Smith
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

2011

Joint Annotation of Search Queries
Michael Bendersky | W. Bruce Croft | David A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Co-authors

Mingyang Zhang 4

Honglei Zhuang 3

Simon Baumgartner 2

Kazuma Hashimoto 2

Rolf Jagerman 2

David A. Smith 2

Jing Nathan Yan 2

Arjun Reddy Akula 1

Spurthi Amba Hombaiah 1

Luca Bertelli 1

Aditi Chaudhary 1

Nicholas Crispino 1

W. Bruce Croft 1

Ehsan Emadzadeh 1

Spurthi Hombaiah 1

Charumathi Lakshmanan 1

Donald Metzler 1

Sheshera Mysore 1

Vladimir Ofitserov 1

Harrie Oosterhuis 1

Jay Pavagadhi 1

Alexander M. Rush 1

Alireza Salemi 1

Anupam Samanta 1

Krishna Srinivasan 1

Chenguang Wang 1

Jung-Jung Yeh 1

Rongzhi Zhang 1

Venues