Nearest Neighbor Zero-Shot Inference

Retrieval-augmented language models (LMs) use non-parametric memory to substantially outperform their non-retrieval counterparts on perplexity-based evaluations, but it is an open question whether they achieve similar gains in few- and zero-shot end-task accuracy. We extensively study one such model, the k-nearest neighbor LM (kNN-LM), showing that the gains marginally transfer. The main challenge is to achieve coverage of the verbalizer tokens that define the different end-task class labels. To address this challenge, we also introduce kNN-Prompt, a simple and effective kNN-LM with automatically expanded fuzzy verbalizers (e.g., expanding "terrible" to also include "silly" and other task-specific synonyms for sentiment classification). Across nine diverse end-tasks, using kNN-Prompt with GPT-2 Large yields significant performance boosts over strong zero-shot baselines (13.4% absolute improvement over the base LM on average). We also show that other advantages of non-parametric augmentation hold for end tasks; kNN-Prompt is effective for domain adaptation with no further training, and gains increase with the size of the retrieval model.


Introduction
Retrieval-augmented language models (LMs) have access to a non-parametric memory, allowing them to directly access a large external text collection during inference. Previous work has shown that these models substantially outperform their non-retrieval counterparts on language modeling tasks (Khandelwal et al., 2020; He et al., 2021; Borgeaud et al., 2021), but it is an open question whether they also achieve similar gains in few-shot and zero-shot end-task evaluations (Radford et al., 2019; Brown et al., 2020a). In this paper, we demonstrate that, with some extensions to improve coverage, the performance gains of retrieval-augmented LMs generalize well to a wide range of downstream tasks.
We study the k-nearest neighbors language model (Khandelwal et al., 2020, kNN-LM), which interpolates a neural LM's output distribution with a nearest-neighbor distribution constructed by retrieving tokens from a corpus using the LM's output embeddings. We are the first to study kNN-LM's zero-shot application to end tasks, and we find that applying the technique naïvely produces only marginal improvements (Section 4). The main challenge is that the support of the kNN distribution is sparse (covering at most k tokens, often fewer), as it only assigns probability mass to nearest neighbors. This means it often entirely misses the tokens that are used to verbalize the output label in the standard application of LMs to zero-shot classification: across the datasets we test, an output label receives nonzero probability under the kNN distribution only 45.8% of the time (see Section 6).

Figure 1: kNN-Prompt incorporates information from a large, heterogeneous corpus to facilitate zero-shot inference. The datastore contains key-value pairs where the key is an encoding of a leftward context and the value is the next token following the context.
To address this challenge, we introduce kNN-Prompt, a simple and effective method built on kNN-LM for improving zero-shot inference with no further training. Key to our approach are fuzzy verbalizers, which automatically expand the set of tokens corresponding to each output label. For example, in Figure 1, the verbalized label of the negative sentiment is "terrible". Our fuzzy verbalizer also maps "silly" to negative sentiment, allowing the model to better leverage the information available in the kNN distribution.
Extensive experiments (Section 3) show that using kNN-Prompt with a heterogeneous datastore consistently improves an LM's zero-shot abilities on eleven tasks, including sentiment analysis, topic classification, entailment, fact retrieval and question answering. These improvements hold for every model in the GPT-2 family. Furthermore, kNN-Prompt can be adapted to new domains and tasks with no further training (Section 5). With a domain-specific datastore corpus, we achieve performance comparable to or better than prompting the LM after domain-adaptive pretraining (Gururangan et al., 2020) on that corpus. To better understand these gains, we conduct a thorough analysis (Section 6), showing that fuzzy verbalizers are essential for leveraging the kNN distribution, the benefits of retrieval increase with retrieval model size, and even relatively small datastores can yield sizeable performance gains if they are tailored to the domain or task.
Overall, our results show how retrieval can benefit zero-shot inference with LMs on a wide variety of tasks, and suggest that applying retrieval with larger models may yield even greater benefits. Code and data are available at github.com/swj0419/kNN_prompt.

Method
To perform zero-shot prediction on a downstream task using a pretrained language model, we recast the task as language modeling (Radford et al., 2019) by converting each input instance into a natural language prompt (Section 2.1). We then augment the pretrained model with the k-nearest-neighbors language modeling technique from Khandelwal et al. (2020). To better benefit from the sparse kNN distribution, we introduce fuzzy verbalizers for mapping from the LM's outputs to a distribution over task-specific labels (Section 2.3). Finally, we decode the output from this label distribution using the domain-conditional PMI scoring method of Holtzman et al. (2021).

Prompting and Verbalizers
We address classification problems where an instance consists of an input sequence of tokens x = (x_1, x_2, ..., x_{|x|}) from a vocabulary V and an output label y ∈ Y. The output label set Y may be fixed for the task (text classification) or provided for each instance as a set of expressions in V* (multiple choice). For example, in the sentiment analysis example in Figure 2, the input is x = "Mr. Tsai is one of world cinema's most gifted artists." The output labels are Y = {y+, y-}, referring to positive and negative sentiment.
To cast the task as language modeling, we deterministically transform each input example x into a prompt p(x). Providing this prompt to an LM yields a probability distribution over continuation sequences z ∈ V*. To extract an output label from these continuations, we apply a verbalizer V : Y → V* (Schick and Schütze, 2021) which maps each output label y ∈ Y to a natural language expression V(y) = z. We can then compute a probability for each label:

P(y | x) = P_LM(V(y) | p(x)) / Σ_{y' ∈ Y} P_LM(V(y') | p(x)),    (1)

normalizing over all y ∈ Y. For example, our prompt transformation for sentiment analysis adds It was after the input, and uses the verbalizer V(y+) = great, V(y-) = terrible, which classifies sentiment according to the relative probabilities of It was great and It was terrible after the input sequence (see Figure 2, bottom left). In the case of multiple-choice problems, our verbalizer is just the identity function.
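The prompt-plus-verbalizer scoring above can be sketched as follows. This is an illustrative toy, not the paper's released code: the stub `toy_lm_logprob` stands in for an actual LM call returning log P(continuation | prompt), and the scores it returns are invented for the example.

```python
import math

# Stand-in for an LM scoring call; the pairs and scores below are
# illustrative only.
def toy_lm_logprob(prompt: str, continuation: str) -> float:
    scores = {
        ("It was", " great"): -1.0,
        ("It was", " terrible"): -2.5,
    }
    return scores.get((prompt[-6:], continuation), -5.0)

def classify(prompt: str, verbalizers: dict) -> str:
    """Pick the label whose verbalized continuation is most probable,
    normalizing over all labels (Equation 1)."""
    logps = {y: toy_lm_logprob(prompt, z) for y, z in verbalizers.items()}
    total = math.log(sum(math.exp(lp) for lp in logps.values()))
    probs = {y: math.exp(lp - total) for y, lp in logps.items()}
    return max(probs, key=probs.get)

verbalizers = {"positive": " great", "negative": " terrible"}
label = classify(
    "Mr. Tsai is one of world cinema's most gifted artists. It was",
    verbalizers)
```

Here the continuation "great" is more probable than "terrible" under the stub scores, so the positive label is returned.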

k-Nearest Neighbors Language Modeling
Following Khandelwal et al. (2020), we augment the LM with a datastore from which it can retrieve tokens that inform its predictions, improving performance without further training.
The datastore is a key-value store generated by running the LM over a corpus of text. Each value is a token w ∈ V from the corpus, and its key is the hidden representation at the output layer of the LM run forward on the left context c ∈ V* (call this f(c)). At inference time, when predicting the next token for an input sequence c, the kNN-LM retrieves the k nearest neighbors of c from the datastore according to the distance d(·, f(c)) of their key vectors.

Figure 2: An illustration of kNN-Prompt applied to sentiment analysis. Texts are encoded in the datastore, where each entry consists of a representation of a leftward context and its next token. During inference, a test example is mapped to a prompt and used to retrieve the k most similar contexts and their next tokens from the datastore. The kNN distribution is a multinomial computed from the distances between the test example and the retrieved contexts. The final prediction is formed by combining the kNN distribution with the language model's output distribution.
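Datastore construction amounts to one forward pass over the corpus, storing a (hidden state, next token) pair at every position. A minimal sketch, where `encode` stands in for the LM forward pass returning the hidden state f(c) for a prefix (the toy encoder below is purely illustrative):

```python
import numpy as np

def build_datastore(corpus_tokens, encode):
    """corpus_tokens: list of token ids; encode(prefix) -> (d,) vector.
    Returns keys (n, d) and values (n,), where values[i] is the token
    that followed the context encoded by keys[i]."""
    keys, values = [], []
    for i in range(1, len(corpus_tokens)):
        keys.append(encode(corpus_tokens[:i]))   # key: f(c) for left context c
        values.append(corpus_tokens[i])          # value: next token w
    return np.stack(keys), np.array(values)

# Toy encoder: normalized bag-of-tokens vector (a stand-in for the LM).
def toy_encode(prefix, vocab_size=5):
    vec = np.zeros(vocab_size)
    for tok in prefix:
        vec[tok] += 1.0
    return vec / len(prefix)

keys, values = build_datastore([0, 1, 2, 1, 3], toy_encode)
```

For a real LM the keys would be the model's output-layer hidden states, and the store would be indexed for approximate search rather than kept as a dense array.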
A softmax over the (negative) distances induces a distribution over the tokens w_i in the nearest neighbor set:

P_kNN(w | c) ∝ Σ_{(k_i, w_i) ∈ kNN(c)} 1[w = w_i] exp(−d(k_i, f(c)) / t),

where t is a temperature parameter.² We can then interpolate this with the original LM as follows:

P_kNN-LM(w | c) = λ P_kNN(w | c) + (1 − λ) P_LM(w | c).

The hyperparameters for the kNN-LM approach are the number k of nearest neighbors, the interpolation constant λ, the temperature t, and the choice of datastore.
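The retrieval, softmax-over-distances, and interpolation steps can be sketched directly (exact search with numpy; function and argument names are ours, not the paper's):

```python
import numpy as np

def knn_lm_probs(query, keys, values, p_lm, k=2, t=1.0, lam=0.3):
    """Interpolate a kNN next-token distribution with the LM's distribution.

    query:  (d,) hidden state f(c) for the current context
    keys:   (n, d) datastore key vectors
    values: (n,) next-token ids stored with each key
    p_lm:   (V,) the LM's next-token distribution
    """
    dists = np.linalg.norm(keys - query, axis=1)   # d(k_i, f(c))
    nn = np.argsort(dists)[:k]                     # k nearest neighbors
    weights = np.exp(-dists[nn] / t)               # softmax over -distance / t
    weights /= weights.sum()
    p_knn = np.zeros_like(p_lm)
    for idx, w in zip(nn, weights):
        p_knn[values[idx]] += w                    # aggregate mass by token id
    return lam * p_knn + (1 - lam) * p_lm          # interpolation
```

Note the sparsity discussed above: `p_knn` has nonzero mass on at most k token ids, so any verbalizer token outside the retrieved set receives probability only from the LM term.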

Fuzzy verbalizers
One challenge in performing zero-shot inference with LMs on downstream tasks is the choice of verbalizer. On one hand, LMs may be highly sensitive to the particular surface form in ways that are irrelevant to the classification task (Holtzman et al., 2021). On the other hand, for a kNN model, the k nearest neighbor set is sparse and may fail to cover any of the tokens in the set of verbalizers (i.e., P_kNN(V(y)) = 0 for all y ∈ Y), limiting its utility in those cases. To address these issues, we introduce fuzzy verbalizers, which associate each label y with a neighborhood of token sequences around a specific verbalization V(y) ∈ V*.

² We add the temperature adjustment in the softmax on top of Khandelwal et al.'s kNN-LM formulation.
To do this, we first associate each token v ∈ V with a neighborhood N(v) ⊆ V of similar tokens. In particular, we use v's top-5 most similar words according to the cosine similarity of their GloVe embeddings (Pennington et al., 2014), as well as any of v's synonyms in ConceptNet (Speer et al., 2017).³ Then, for the purposes of calculating the probability of a verbalized label z = V(y), we treat a prediction of any token in each z_i's neighborhood as a viable substitute for it, marginalizing over N(z_i) at each timestep:

P_FV(y | x) = Π_i Σ_{v ∈ N(z_i)} P(v | p(x), z_{<i}).    (2)

This incorporates more information from the LM into the induced distribution over labels P_FV(y | x) and, in the case of a kNN-based model, helps mitigate the effect that the sparsity of the kNN distribution has on zero-shot prediction (see Section 6).
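For a single-token verbalizer, Equation 2 reduces to summing the next-token probability over the token's neighborhood. A minimal sketch, where the neighborhoods are hand-written stand-ins for the GloVe/ConceptNet expansion described above:

```python
# Illustrative neighborhoods; in the paper these come from GloVe top-5
# cosine neighbors plus ConceptNet synonyms.
NEIGHBORHOODS = {
    "great": {"great", "good", "wonderful", "excellent"},
    "terrible": {"terrible", "awful", "silly", "bad"},
}

def fuzzy_label_score(next_token_probs, verbalizer_token):
    """Sum the model's next-token probability over the verbalizer token's
    neighborhood (single-token case of Equation 2; multi-token verbalizers
    multiply one such sum per timestep)."""
    neighborhood = NEIGHBORHOODS.get(verbalizer_token, {verbalizer_token})
    return sum(next_token_probs.get(tok, 0.0) for tok in neighborhood)

# A sparse kNN-style distribution that misses both exact verbalizer tokens:
probs = {"silly": 0.30, "good": 0.10, "film": 0.60}
neg = fuzzy_label_score(probs, "terrible")  # credited via "silly"
pos = fuzzy_label_score(probs, "great")     # credited via "good"
```

This illustrates the coverage argument: neither "great" nor "terrible" appears in the retrieved distribution, yet both labels still receive signal through their neighborhoods.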

Full model
To make a zero-shot prediction for an input x, we first transform it into a prompt p(x) and obtain a distribution over continuations z with a kNN-LM: P_kNN-LM(z | p(x)). We then convert this to a probability distribution over output labels P(y | p(x)) using a fuzzy verbalizer (Section 2.3, Equation 2). Finally, we output the best label according to the domain-conditional PMI scoring rule (Holtzman et al., 2021):

ŷ = argmax_{y ∈ Y} P_FV(y | p(x)) / P_FV(y | p₀),    (3)

where p₀ is a task-dependent string which is independent of the particular input (generally the local context at the end of the prompt; e.g., we use p₀ = "It was" for sentiment analysis, as shown in Figure 2).
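The PMI decoding step is a simple ratio-then-argmax; a sketch with invented probabilities (the label scores under the full prompt and under the domain string alone are assumed to be computed as above):

```python
def pmi_decode(p_with_prompt, p_domain_only):
    """Domain-conditional PMI decoding: rescale each label's probability
    under the full prompt by its probability under the short task-dependent
    string alone, then take the argmax. Both args map label -> probability."""
    return max(p_with_prompt,
               key=lambda y: p_with_prompt[y] / p_domain_only[y])

# Illustrative numbers: "terrible" is more likely a priori after "It was",
# so the raw scores favor negative, but the PMI ratio favors positive.
p_prompt = {"positive": 0.04, "negative": 0.06}   # P_FV(y | p(x))
p_domain = {"positive": 0.02, "negative": 0.10}   # P_FV(y | "It was")
label = pmi_decode(p_prompt, p_domain)            # ratios: 2.0 vs. 0.6
```

This captures why PMI calibration helps: it discounts labels whose verbalizations are frequent regardless of the input.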

Tasks
We experiment with 11 tasks, including fact retrieval, question answering, topic classification, sentiment analysis, entailment and partisanship classification.
Topic Classification We use the AG News (AGN) and Yahoo! Answers (Yahoo) corpora from Zhang et al. (2015) for topic classification.

Sentiment and Partisanship
We study sentiment analysis using the Rotten Tomatoes (RT) and SST-2 corpora of Socher et al. (2013), movie reviews from Pang and Lee (2005, MR), the customer review dataset from Hu and Liu (2004, CR) consisting of Amazon and Yelp reviews, and the hyperpartisan news detection dataset from Kiesel et al. (2019, HYP), which focuses on classifying whether a text exhibits extreme political views.

Fact Retrieval and Question Answering
We evaluate fact retrieval with the LAMA probe (Petroni et al., 2019), which tests an LM's recovery of factual subject-relation-object triples using a cloze format (e.g., Dante was born in [Mask]). We use test examples where the missing token is at the end of the sentence (suitable for left-to-right LMs) and report the mean accuracy across all triples. For question answering, we consider CommonsenseQA (Talmor et al., 2019, CQA), consisting of multiple-choice commonsense questions authored by crowd-workers on the basis of knowledge encoded in ConceptNet (Speer et al., 2017). Since the answers for LAMA and CommonsenseQA can be any string, we perform beam decoding from our LM to produce a set of possible outputs and proceed as in the multiple-choice case.

kNN-Prompt Model Details
Inference Model For our main experiments, we directly use GPT-2 Large from Huggingface as our base LM. We consider other model sizes in Section 6.
Retriever Model Following the inference model, we use GPT-2 Large to build the datastore. The keys are the 1280-dimensional hidden representations before the final MLP which predicts the token distribution at each timestep, produced using a single forward pass over the datastore corpus. For efficient similarity search, we create a FAISS index (Johnson et al., 2019).

Inference Procedure We retrieve k = 512 neighbors, soften the kNN distribution with a temperature of 3, and use an interpolation factor of λ = 0.3. Our primary evaluation is zero-shot. All hyperparameters were chosen on the basis of development experiments (see Section 6 for more details).

Baselines
LM is the result of prompting the base language model (GPT-2 Large), choosing the output label whose verbalizer has the highest probability under the language model P LM (V(y) | p(x)).
LM+PMI is the approach of Holtzman et al. (2021), which augments the base LM with the domain-conditional PMI scoring rule described in Section 2.4.

Experimental Results
Results for zero-shot prediction are in Table 2. kNN-Prompt outperforms all baselines in all tasks, improving over the base LM by 14.1% on average. The gains are particularly pronounced for MR and RT (sentiment analysis on movie reviews), Yahoo (topic classification), and LAMA (fact recovery). For MR and RT, the gains seem to come mostly from PMI calibration. On the other hand, large performance boosts on LAMA only come with the full kNN-Prompt model, which indicates the importance of combining retrieval, fuzzy verbalization, and PMI calibration for this task. Interestingly, the kNN-LM alone yields a fairly small improvement over the base LM (about 1-2% on average). It is not strong enough to outperform LM+PMI even on LAMA, which intuitively should benefit from retrieval. This suggests that the fuzzy verbalizer and PMI calibration methods may help kNN-Prompt better leverage the information in the k-nearest neighbors distribution. We carefully examine possible sources of kNN-Prompt's performance gains in Section 6.
Few-shot inference For a subset of tasks, we additionally compare to a few-shot setting where we prepend four examples uniformly sampled from the training data to the input (Table 3). The demonstration examples are converted to prompt and verbalizer format. We report the mean accuracy and standard deviation over 4 different random seeds. We find that kNN-Prompt consistently outperforms the baselines, demonstrating that kNN-Prompt is applicable to the few-shot setting as well. We leave further exploration of this phenomenon to future work.

kNN-Prompt for Domain Adaptation
One of the advantages of retrieval-based LMs is that they can be adapted to new domains with no further training.
To test this capability, we replace our heterogeneous datastore (Section 3.2) with domain-specific ones for several tasks. To build these domain-specific datastores, we select Amazon Reviews for CR, CC-NEWS for HYP and Wikitext-103 for LAMA, and encode them separately.
For comparison, we consider domain-adaptive pretraining (Gururangan et al., 2020, DAPT), which further trains the LM on the domain-specific corpus. We train GPT-2 Large on each domain corpus for a single pass, then apply it to downstream tasks using our prompting and verbalizer setup and domain-conditional PMI scoring.
As shown in Table 4, kNN-Prompt with a domain-specific datastore performs comparably to or better than DAPT, with no additional training.

Figure 3: Performance as a function of datastore size for the heterogeneous corpus (Table 1), domain-specific corpus, and task-specific corpus, respectively. We use IMDB as the domain-specific corpus for MR and RT, and CC-NEWS for AGN. The task-specific corpus is the unlabeled training data of each task. GPT-2 Large is used for both retriever and inference models.

Effect of datastore distribution and size To better understand kNN-Prompt's potential for domain adaptation, we experiment with varying sizes and distributions of the datastore. For each task, we consider three options for the datastore corpus: our heterogeneous corpus (Section 3.2), a domain-specific corpus, and a task-specific corpus constructed from the task's (unlabeled) training data. Each of these data sources exhibits increasing relevance to the task. Figure 3 shows how model performance varies with the choice of datastore across different datastore sizes. For a fixed number of tokens, retrieving from a task-specific datastore is best. Furthermore, token for token, adding task-specific data yields more gains than domain-specific data, which in turn is better than our heterogeneous corpus.

When a sufficient amount of task-specific data is available, using it for the datastore can outperform a vastly larger corpus. For example, for AGN, using 6M tokens of unlabeled training data outperforms using our 465M-token heterogeneous corpus. However, in cases where the task training data is smaller, using large heterogeneous data can be more effective, as increasing the number of tokens in the datastore improves task performance for all of the tasks and data distributions we test. These results suggest that while having a large datastore is beneficial, curating task-specific data can also be an effective way of improving model performance, especially if datastore size is limited (e.g., due to memory constraints).

Analysis
We perform several experiments to understand the contribution of each component of kNN-Prompt and to inform our choice of hyperparameters. Beyond the base LM, kNN-Prompt has three components: kNN interpolation (Section 2.2), fuzzy verbalizers (Section 2.3), and PMI scoring (Section 2.4). Table 5 shows the results of ablation experiments analyzing the contribution of each component.
First, we find that the gains from kNN retrieval alone are modest (+1.3%), but much greater once we add fuzzy verbalizers on top of them (+10.4%), exceeding the contribution of the two components independently (with fuzzy verbalizers alone at +5.5%). This supports the argument that fuzzy verbalizers allow the model to make better use of the sparse support of the kNN distribution. Indeed, we find that across all tasks, an output label receives nonzero probability under the kNN distribution in kNN-LM only 45.8% of the time. With fuzzy verbalizers, this increases to 78%.
Second, we find that for the base LM, fuzzy verbalizers bring gains (+5.5%) similar to PMI scoring (+5.3%), but the gains are only partially additive when combining the two techniques (+7.6%). This suggests that by incorporating more varied surface forms into the score for each label, fuzzy verbalizers may partially, but not completely, mitigate the surface form competition problem that PMI scoring was designed to tackle (Holtzman et al., 2021). The effect of PMI scoring is increased, however, when we use fuzzy verbalizers and kNN retrieval together (+14.1% for the full model versus +10.4% for kNN with fuzzy verbalizers), suggesting that the kNN distribution might suffer from more surface form competition than the base LM distribution.
kNN retrieval hyperparameters Figure 4 shows how the number of retrieved nearest neighbors (k) and softmax temperature affect model performance on three datasets. In most cases, performance monotonically improves with the number of neighbors when k is smaller than 512 and deteriorates after that. When k < 256, a temperature of 1 performs best, while flattening the distribution is useful when retrieving more neighbors. Overall, using 512 neighbors and a temperature value of 3 performs consistently well across the tasks we test.
Retrieval model size and inference model size Figure 5 shows how performance varies with the size of the retriever and inference models on three tasks. We observe substantial gains as the size of the retriever increases, which hold regardless of inference model size.
It should be noted that a larger retriever leads to a larger datastore and slower retrieval: increasing the retriever size from 125M to 1600M parameters doubles the memory footprint of the datastore, which scales with the size of the retriever model's output embeddings. These computational tradeoffs may inform the retriever size best suited for a particular application.

Related Work
Retrieval-augmented LMs Several studies propose the use of retrieval mechanisms with external datastores to improve language modeling performance (Khandelwal et al., 2020) and opendomain question answering (Izacard and Grave, 2020;Lewis et al., 2020). Other work incorporates retrieval directly into the LM pretraining process (Guu et al., 2020;Borgeaud et al., 2021). Khandelwal et al. (2021) applies nearest neighbor retrieval to conditional sequence generation to improve the quality of machine translation systems. Our work is the first to show that retrieval augmentation, introduced at test time, improves the zero-shot inference of language models on a variety of end tasks.
Zero-shot and few-shot inference Brown et al. (2020b) demonstrate that large LMs can perform zero-shot (given only a prompt) and few-shot learning (using a concatenation of training examples as demonstrations) without any finetuning. Subsequent work further improves their zero-shot and few-shot abilities with calibration (Holtzman et al., 2021; Zhao et al., 2021; Min et al., 2021a), prompt engineering (Lu et al., 2021; Shin et al., 2020) and meta-tuning (Min et al., 2021b; Wei et al., 2022; Zhong et al., 2021). Rubin et al. (2021) and Liu et al. (2021) apply retrieval methods to select in-context learning examples that are semantically similar to a test example for few-shot inference. However, these retrieval methods require access to a large set of labeled data. In contrast, kNN-Prompt only assumes the availability of a heterogeneous unlabeled corpus.

Conclusions and Future Work
We present kNN-Prompt, a new technique to augment LMs with nearest neighbor retrieval for zero-shot inference on end tasks. kNN-Prompt substantially improves zero-shot performance on a wide range of multiple-choice and classification tasks. With a domain- or task-relevant datastore, kNN-Prompt enables efficient domain adaptation with no additional training, and its benefits scale with the size of the retrieval model.
Future work may study the usefulness of kNN-Prompt with larger inference models, which, combined with larger retrieval models, may result in better zero-shot performance. Careful analysis could explore datastore curation methods to balance task-relevancy, domain generality, and size; datastores could also be compressed for efficient retrieval. Finally, future work may explore the possibility of retrieving and interpolating contexts at different levels of granularity, from tokens and spans to documents.