Retrieval-based Language Models and Applications

Retrieval-based language models (LMs) have shown impressive performance on diverse NLP tasks. In this tutorial, we will provide a comprehensive and coherent overview of recent advances in retrieval-based LMs. We will start with preliminaries covering the foundations of LMs (e.g., masked LMs, autoregressive LMs) and retrieval systems (e.g., nearest-neighbor search). We will then detail recent progress in retrieval-based models, focusing on their model architectures and learning approaches. We will also show how retrieval-based LMs are adapted to downstream applications and extended to multilingual and multi-modal settings. Finally, we will use a hands-on exercise to showcase the effectiveness of retrieval-based LMs.


Description
Language models (LMs) such as GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022) have shown impressive abilities in a range of natural language processing (NLP) tasks. However, relying solely on their parameters to encode a wealth of world knowledge requires a prohibitively large number of parameters and hence massive compute, and they often struggle to learn long-tail knowledge (Roberts et al., 2020; Kandpal et al., 2022; Mallen et al., 2022). Moreover, these parametric LMs are fundamentally incapable of adapting over time (De Cao et al., 2021; Lazaridou et al., 2021; Kasai et al., 2022), often hallucinate (Shuster et al., 2021), and may leak private data from the training corpus (Carlini et al., 2021). To overcome these limitations, there has been growing interest in retrieval-based LMs (Guu et al., 2020; Khandelwal et al., 2020; Borgeaud et al., 2022; Izacard et al., 2022b; Min et al., 2022), which incorporate a non-parametric datastore (e.g., text chunks from an external corpus) with their parametric counterparts. Retrieval-based LMs can outperform LMs without retrieval by a large margin with far fewer parameters (Mallen et al., 2022), can update their knowledge by replacing their retrieval corpora (Izacard et al., 2022b), and provide citations for users to easily verify and evaluate the predictions (Bohnet et al., 2022).
Previously, retrieval and LMs were studied mostly separately; only recently have researchers integrated them into systems in which retrieval and LMs interact more organically, and a number of retrieval-based LMs have been proposed due to growing interest. They differ in their neural architectures (e.g., the granularity of retrieval units, how retrieved information is integrated), learning algorithms, and uses in downstream applications. In this tutorial, we aim to provide a comprehensive and coherent overview of recent advances in retrieval-based LMs. We will start by providing preliminaries covering the foundations of LMs (e.g., masked LMs, autoregressive LMs) and retrieval systems (e.g., nearest-neighbor search methods widely used in neural retrieval systems; Karpukhin et al. 2020). We will then focus on recent progress in architectures, learning approaches, and applications of retrieval-based LMs.
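To make the retrieval preliminaries concrete, the core nearest-neighbor search operation can be sketched as follows. This is a minimal illustration of inner-product search over a toy datastore of dense vectors; the function name and the tiny embeddings are invented for illustration and do not come from any specific system:

```python
import numpy as np

def nearest_neighbors(query_vec, datastore, k=2):
    """Return indices of the k datastore vectors most similar to the
    query by inner product, the similarity used by many dense retrievers."""
    scores = datastore @ query_vec      # one similarity score per stored vector
    return np.argsort(-scores)[:k]     # indices of the k highest scores

# Toy example: four "passage" embeddings and one "query" embedding.
datastore = np.array([[1.0, 0.0],
                      [0.9, 0.1],
                      [0.0, 1.0],
                      [-1.0, 0.0]])
query = np.array([1.0, 0.0])
top = nearest_neighbors(query, datastore, k=2)
```

In practice, systems use approximate nearest-neighbor libraries rather than this exact brute-force scan, since datastores can contain billions of vectors.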
A taxonomy of architectures We introduce a taxonomy of architectures of retrieval-based LMs along a variety of dimensions. Retrieval-based LMs can be categorized by the granularity of the retrieved units stored in the datastore: 1) a chunk of text (Borgeaud et al., 2022; Izacard et al., 2022b), 2) a token (Khandelwal et al., 2020; Min et al., 2022), or 3) an entity mention (Févry et al., 2020; de Jong et al., 2022). We also plan to cover techniques for refining datastores and improving similarity search (He et al., 2021; Alon et al., 2022). At the same time, retrieval-based LMs can be categorized by how the retrieved information is integrated with the parametric encoder: 1) the retrieved components are concatenated with the original input text (Guu et al., 2020; Izacard et al., 2022b), 2) the retrieved components are latent and integrated into the intermediate layers of Transformers (de Jong et al., 2022; Févry et al., 2020; Borgeaud et al., 2022), or 3) the token distributions from the retrieved components and the LM are interpolated (Khandelwal et al., 2020; Yogatama et al., 2021).
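As an illustration of the third category, the interpolation used by kNN-LM-style models (Khandelwal et al., 2020) can be sketched as below; the interpolation weight `lam` and the toy three-token distributions are made-up values for illustration:

```python
import numpy as np

def interpolate_next_token(p_lm, p_knn, lam=0.25):
    """Mix the parametric LM's next-token distribution with a distribution
    built from retrieved nearest-neighbor tokens:
        p(w) = lam * p_knn(w) + (1 - lam) * p_lm(w)
    as in kNN-LM-style models."""
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy vocabulary of three tokens.
p_lm  = np.array([0.7, 0.2, 0.1])   # parametric LM's distribution
p_knn = np.array([0.0, 1.0, 0.0])   # all retrieved neighbors share token 1
p = interpolate_next_token(p_lm, p_knn, lam=0.25)
```

The retrieved evidence shifts probability mass toward token 1 while the result remains a valid distribution.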
Scalable learning algorithms Next, we discuss training approaches for retrieval-based LMs. Since a retrieval datastore is typically very large, training retrieval-based LMs effectively and efficiently remains challenging. We first discuss pipelined approaches that train the retrieval components and LMs separately, either through large-scale pre-training (Izacard et al., 2022a) or multitask instruction tuning. Several other works train retrieval-based LMs with a fixed retrieval module (Borgeaud et al., 2022; Yogatama et al., 2021). We then discuss joint training under reasonable resource requirements, either through in-batch approximations to a full datastore or by updating the datastore asynchronously as the parameters change. The former uses carefully designed fractions of the full corpus during joint training (de Jong et al., 2022; Min et al., 2022). The latter, on the other hand, uses the full corpus during training, asynchronously rebuilding the index every fixed number of steps (Izacard et al., 2022b; Guu et al., 2020).
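The asynchronous index-update idea can be illustrated with a deliberately simplified sketch. Here the "encoder" is just a scalar scale factor and the refresh interval is a hypothetical hyperparameter; the point is only that between refreshes, retrieval would use embeddings computed with stale parameters:

```python
import numpy as np

def train_with_async_refresh(corpus, steps, refresh_every):
    """Toy sketch of asynchronous index updates: the datastore index
    (pre-encoded corpus embeddings) is rebuilt only every
    `refresh_every` steps, not after every parameter update."""
    scale = 1.0                      # stand-in for encoder parameters
    index = scale * corpus           # initial index build
    refreshes = 0
    for step in range(1, steps + 1):
        scale *= 1.01                # stand-in for one gradient update
        if step % refresh_every == 0:
            index = scale * corpus   # re-embed the corpus with current params
            refreshes += 1
    return refreshes

corpus = np.array([[1.0, 0.0], [0.0, 1.0]])
n_refreshes = train_with_async_refresh(corpus, steps=100, refresh_every=25)
```

The trade-off this sketch captures: fewer refreshes mean less re-encoding cost but staler retrieval during training.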
Adaptation to downstream tasks After discussing the basic building blocks of retrieval-based LMs, we show how retrieval-based LMs are adapted to downstream applications. We first briefly summarize the two approaches to adapt a model to a new task: zero-shot or few-shot prompting without any parameter updates, and fine-tuning on target task data. We then discuss methods designed to build more powerful retrieval-based LMs for certain downstream tasks, such as dialogue (Shuster et al., 2021), semantic parsing (Pasupat et al., 2021), and machine translation (Khandelwal et al., 2021; Zheng et al., 2021).
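For instance, prompting-based adaptation often amounts to prepending retrieved passages to the task input. The helper below is a hypothetical sketch; the function name and prompt format are illustrative, not a prescribed template:

```python
def build_retrieval_prompt(question, retrieved_passages):
    """Hypothetical helper: concatenate retrieved passages with the task
    input so a frozen LM can condition on them without parameter updates."""
    context = "\n".join(f"Passage: {p}" for p in retrieved_passages)
    return f"{context}\nQuestion: {question}\nAnswer:"

prompt = build_retrieval_prompt(
    "Who wrote Hamlet?",
    ["Hamlet is a tragedy written by William Shakespeare."],
)
```

The resulting string would then be fed to the LM as-is; fine-tuning-based adaptation instead updates the model on target-task data.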
Up to this point, our tutorial has mainly focused on retrieving and integrating English plain text. To this end, we will cover recent extensions of retrieval-based LMs beyond English text, including multilingual (Asai et al., 2021), multimodal (Yasunaga et al., 2022), and code (Parvez et al., 2021) retrieval. These works often extend dense retrieval models to enable retrieval between heterogeneous input spaces (e.g., cross-lingual, cross-modal) and have shown that referring to retrieved knowledge improves knowledge-intensive generation.
Finally, we will use a hands-on exercise to showcase the effectiveness of retrieval-based LMs. We conclude our tutorial by discussing several important questions and future directions, including (1) how we can further improve the scalability of retrieval-based LMs without sacrificing performance, (2) when retrieval-based LMs are particularly useful in the era of rapidly evolving LMs, and (3) what is necessary to enable applications of retrieval-based LMs in more diverse domains.
Length This is a 3-hour tutorial.
Target audience The tutorial will be accessible to anyone who has a basic knowledge of machine learning and natural language processing. We think the topic will be of interest to both NLP researchers/students in academia and NLP practitioners in the industry.
Breadth We estimate that 20% of the work covered in this tutorial will be by the presenters and the remaining 80% by others. The papers we will cover are from both academia and industry.
Diversity considerations. The speakers are from two academic institutions, with an affiliation with an industry research group, and include both a professor and Ph.D. students. Three out of four speakers are female. The methods covered by our tutorial can scale up to various languages and domains, and we also briefly cover several papers focusing on multilingual and expert-domain extensions of the core frameworks. We will reach out to academic communities such as WiNLP and Masakhane to encourage participation of diverse audiences. Since retrieval-based LMs are alternatives to LMs with a significantly larger number of parameters, we expect this tutorial to be especially useful to researchers with modest resources who do not have access to very large models.
An estimate of the audience size Given that language models are now used in a range of NLP tasks and retrieval-based approaches have been applied to diverse domains, we estimate an audience of around 150 or more.
Venues. We prefer ACL due to the growing interest in the area and the travel constraints of some of the speakers. EMNLP is our second preferred choice, and we currently do not consider EACL.
Technical equipment. We would like to have Internet access to show online demos.
Open access We plan to make all teaching material available online and agree to allow the publication of slides and video recordings in the ACL anthology.
Ethical considerations Retrieval-based LMs are often more powerful and parameter-efficient than LMs without retrieval, and do not require full re-training to update world knowledge, which makes them more energy-efficient and can reduce their carbon footprint. Prior work also shows that referring to external world knowledge can reduce harmful biases and hallucinations, although retrieval-based LMs can still produce plausible-sounding but incorrect or nonsensical outputs. We also note that retrieval-based LMs may retrieve raw data from a corpus, which can leak privacy-sensitive information, especially when they are built on top of a private corpus. We acknowledge this to caution those who aim to apply retrieval-based LMs to privacy-sensitive domains.
Pedagogical material We plan to do some short hands-on exercises to let the audience try different retrieval-based LMs with few-shot prompting using Colab.
Past tutorials.
• ACL 2020 tutorial on Open-domain QA: This tutorial provides a comprehensive review of open-domain question answering, including systems that consist of a retriever and a generative model, while we focus on the recent progress of architectures and learning algorithms of retrieval-based LMs for diverse NLP tasks, without limiting our focus to open-domain QA. Most of the papers that will be discussed in this tutorial have been published since the open-domain QA tutorial three years ago. Moreover, one of our instructors, Danqi, was also an instructor of this ACL 2020 tutorial.
• SIGIR 2022 tutorial on Recent Advances in Retrieval-Augmented Text Generation: This tutorial focuses mainly on recent retrieval-augmented text generation approaches, with a focus on two applications: dialogue and machine translation. Our tutorial puts more emphasis on the architectures and learning methods of retrieval-based LMs that are applicable to diverse NLP tasks.
in natural language processing and machine learning. Her recent research focuses on question answering, retrieval-based LMs, multilingual NLP, and entity-aware representations. She received the IBM Fellowship in 2022. She is a lead organizer of the Workshop on Multilingual Information Access (NAACL 2022) and serves as an area chair in question answering at EACL 2023.