In-Context Demonstration Selection with Cross Entropy Difference

Large language models (LLMs) can use in-context demonstrations to improve performance on zero-shot tasks. However, selecting the best in-context examples is challenging because model performance can vary widely depending on the selected examples. We present a cross-entropy difference (CED) method for selecting in-context demonstrations. Our method is based on the observation that the effectiveness of an in-context demonstration negatively correlates with the perplexity of the test example under a language model that was finetuned on that demonstration. We utilize parameter efficient finetuning to train small models on the training data, which are used for computing the cross-entropy difference between a test example and every candidate in-context demonstration. This metric is used to rank and select in-context demonstrations independently for each test input. We evaluate our method on a mixed-domain dataset that combines 8 benchmarks, representing 4 text generation tasks, showing that CED for in-context demonstration selection can improve performance for a variety of LLMs.


Introduction
Large language models (LLMs) have been widely successful across many NLP tasks (Bommasani et al., 2022; OpenAI, 2023; Bubeck et al., 2023). The primary method for LLMs to adapt to new tasks has been in-context learning, where a few examples and labels are provided as input to the model (Agrawal et al., 2022). This simple approach has shown large improvements over zero-shot settings, and has even outperformed finetuning methods when the training dataset is small. However, the model's performance can be greatly influenced by which in-context demonstrations (ICDs) are selected into the prompt (Lu et al., 2022; Zhao et al., 2021a; Min et al., 2022).
Selecting the best in-context demonstrations can be challenging. The variance in performance across similar demonstrations can be large, and the selected examples can introduce unfavorable prior biases on the output label space (Zhao et al., 2021b). The naive approach is to randomly sample demonstrations from the same source dataset. Previous methods for selecting ICDs include simple approaches such as selecting nearest neighbors by embedding distance (Liu et al., 2022b) and retrieval-based methods that require training a retriever model (Rubin et al., 2022). This work presents a new method for selecting demonstrations that can be applied to training data of any size, requires minimal model training, and outperforms the nearest neighbor baseline on GPT-3.5.
We propose a cross entropy difference (CED) method for ICD selection. CED has been used to select in-domain data from large mixed-domain datasets for domain adaptation (Axelrod et al., 2011; Moore and Lewis, 2010; Wang et al., 2018). We borrow this idea to conduct ICD selection.
Specifically, we utilize parameter efficient finetuning to train small models on the training data, which are used for computing the CED between a test example and every candidate in-context demonstration. The CED scores are used to rank and select in-context demonstrations. We present a theoretical explanation for the effectiveness of CED: it approximates the gradient alignment between training and test examples. Our analysis builds on previous findings that demonstrations operate as "meta-gradients" and shows that demonstrations with gradients similar to those of test inputs lead to improved performance on downstream tasks (Dai et al., 2022).
We evaluate our proposed CED-ICD selection method on a mixed-domain dataset composed of 8 datasets on 4 tasks: binary classification, multiple choice question answering, extractive question answering, and abstractive question answering. We show that downstream model performance using CED-ICD outperforms nearest neighbor baselines and that the method transfers across models, allowing small models to be trained for selection while test examples are evaluated on much larger models, including GPT-3.5.
The contributions of this work are:
• We present a method for selecting in-context demonstrations based on cross entropy difference.
• We provide theoretical guidance for why selecting demonstrations based on their gradient alignment with the test example is an effective heuristic.
• We evaluate our method on 8 datasets from 4 tasks, showing improvements over common baselines.
• We evaluate our method on different sizes of GPT-3, showing that it transfers to larger models and leads to performance improvements, even on GPT-3.5.

Related Work
Our work combines ideas from three bodies of research: in-context learning (ICL), data selection for domain adaptation, and parameter efficient finetuning (PEFT). While in-context learning (Agrawal et al., 2022) has shown very strong results in few-shot settings, recent work has shown that LLMs are very sensitive to the selected examples, leading to large variance in performance (Zhao et al., 2021b), sensitivity to the order of examples (Lu et al., 2022), and even a lack of sensitivity to the actual labels (Min et al., 2022). Other work has attempted to mitigate these challenges by selecting in-context demonstrations using the nearest neighbor examples in embedding space (Liu et al., 2022b), or by training a retrieval mechanism (Rubin et al., 2022). We build on this line of work by proposing a novel selection method that combines the observation from Min et al. (2022) that domain similarity is a key characteristic of good in-context demonstrations with the observation from Gonen et al. (2022) that perplexity can be a good heuristic for prompt selection.
Previous work on domain adaptation has focused on finding in-domain examples in a large out-of-domain dataset to train a model that achieves better generalization on a target distribution (Moore and Lewis, 2010; Axelrod et al., 2011; Grangier and Iter, 2022). Data selection is intended to maximize the distributional similarity between a training dataset and a test dataset. However, cross entropy difference has not previously been used at the example granularity to rank the "in-domainness" of training data with respect to just a single target example. We propose a natural extension of this framework for selecting demonstrations that are "in-domain" for a test input, which we demonstrate is an effective metric for selecting demonstrations for in-context learning.
Parameter efficient finetuning (PEFT) comprises a class of methods for augmenting model parameters with a small number of additional parameters that can be trained and stored efficiently (Lester et al., 2021; Li and Liang, 2021; Liu et al., 2022a; Hu et al., 2022). However, PEFT is usually used independently of in-context learning. Liu et al. (2022a) report that in-context demonstrations have not been helpful in combination with PEFT. Sun et al. (2023) do report some settings where PEFT and ICL can be combined, but only under specific task conditions. We report similar findings: in-context demonstrations do not improve PEFT models when selected randomly. However, we do see improvements in PEFT performance when PEFT is combined with CED for selecting in-context demonstrations at both training and inference time. We also utilize the ability of a PEFT model, T-Few (Liu et al., 2022a), to train on very few examples, which lets us effectively compute CED scores without overfitting to the target domain; this is possible even on large datasets (Iter and Grangier, 2021).

Methodology
We propose a method for selecting in-context demonstrations (ICDs) by finding the training data that would minimize the perplexity of a test example if a language model were finetuned on that training example. This approach stems from previous findings that in-context examples may act as a type of meta-gradient on the frozen LLM (Dai et al., 2022) and the assumption that models perform better on in-domain test data. Interestingly, Min et al. (2022) found that large language models are not sensitive to the correctness of labels, showing that in-context demonstrations may be effective even when the labels are incorrect. However, they did find that having examples with labels sampled from the same label space is important. This suggests that large language models do not necessarily learn reasoning from in-context demonstrations, but do look for markers of what terms may be "in-domain" for the test example. As we show in the following sections, our method of using cross entropy difference finds the demonstrations that appear most likely to be from the same domain as the test example.
ICDs as Meta-Gradients

Dai et al. (2022) describe in-context demonstrations as "implicit finetuning", where a component of the attention mechanism can be interpreted as a "meta-gradient". This formulation suggests that training directly on an in-context demonstration would have a similar effect to in-context learning with an LLM. Under this interpretation, the best selection strategy would be to choose examples from the train set that, if the model were trained on them, would result in the lowest loss on the test example. This strategy of decreasing the loss of a test example to measure domain similarity has been shown to correlate with performance on downstream tasks in a domain adaptation setting (Grangier and Iter, 2022; Axelrod et al., 2011; Moore and Lewis, 2010). This observation has also been applied recently to selecting prompts and instructions for LLM tasks (Gonen et al., 2022). We apply the same principle to the problem of in-context demonstration selection.
Data selection for domain adaptation can be applied to data selection for in-context learning. A standard approach for domain adaptation is to use cross-entropy difference to score training data from a mixed-domain set based on how "close" each example is to the test domain (Axelrod et al., 2011; Moore and Lewis, 2010). Generally, large language models are trained to minimize the negative log likelihood of a token given some context C by training on samples from a dataset D, where the parameters of the model are represented by θ:

L(θ) = -E_{(C, y) ~ D} [ log P_θ(y | C) ]    (1)
In-context learning is the setting where the model weights θ are frozen and the only way to minimize the loss is to select an optimal context. In this work we also constrain the context to be examples from the training data. Note that the more general case is often referred to as prompt learning or prompt engineering, where the context can include any natural language, such as instructions and descriptions, in addition to examples.
We define a class of selection methods W, where each function in the class outputs a subset of the training dataset of size at most k, where k is the number of shots to include in the context. The selection may condition on the input, so it is not restricted to selecting a single in-context demonstration for all test examples. The optimal selection method W* is defined as the selection method that minimizes the loss on the test domain:

W* = argmin_{W ∈ W} E_{x_T ~ D_test} [ L(x_T; θ, W(x_T)) ]    (2)

Dai et al. (2022) show that in-context examples can be interpreted as a meta-gradient with respect to the model parameters of the LLM. They define an approximation to the standard attention head Attn(V, K, q) as linear attention that removes the softmax and scaling factor. V, K, and q are the value, key and query respectively and correspond to the attention weight matrices W_V and W_K. We omit the full derivation from Dai et al. (2022) but include the expanded form of the linear attention in the second line of Equation 3. q is the attention query vector, q = W_Q x, and the input is [X; X'], where X' is the in-context demonstration that is concatenated to the input X. Dai et al. (2022) rewrite the linear attention head weight matrix as a reparameterization of the zero-shot attention head W_ZSL, where the delta applied to the weights depends on the original weights and the in-context demonstration:

Attn(V, K, q) ≈ W_V [X; X'] (W_K [X; X'])^T q
             = W_V X (W_K X)^T q + W_V X' (W_K X')^T q
             = (W_ZSL + ΔW_ICL) q    (3)

ΔW_ICL can be seen as an update applied to the zero-shot weights of the attention mechanism, W_ZSL. Here we see that including in-context demonstrations is akin to finetuning the LLM on the selected demonstrations.
If we can only modify the loss of the large language model by selecting in-context examples, and these examples act as a meta-gradient on the language model, the optimal selection would be the training example with a gradient most similar to that of the test example. Computing similarities between gradients would be computationally expensive given the size of large language models, and we cannot compute the gradient of a test example because the model cannot access the labels of test instances. Cross entropy difference (Axelrod et al., 2011), used in data selection for domain adaptation, has been shown to be effective at selecting in-domain examples using the perplexity of the input features without the label. Grangier (2019) and Wang et al. (2020) describe cross entropy difference as an approximation of the dot product between the gradient of the target domain and the gradient of a single training example:

CED(x) = log P_{θ_T}(x) - log P_{θ_base}(x) ≈ η ∇_θ L_x(θ_base) · ∇_θ L_{D_T}(θ_base)    (4)

where θ_T is obtained from θ_base by taking a gradient step on the target domain D_T with learning rate η.
Here the cross entropy difference is approximating the gradient alignment between a single training example and a target domain. CED is simply defined as the difference between the log probabilities of a text span y evaluated on two models: a base model and a target-specific model. The base model represents the background distribution, for which we can use any pretrained language model. The target model represents the distribution of a target domain. Unlike cross entropy difference, in-context learning is input specific rather than dataset specific. To adapt the CED method to in-context demonstration selection, we need a model that is finetuned on a single example. In Section 3.2 we describe how we are able to finetune such a model with parameter efficient finetuning (PEFT) without overfitting to the single example, while limiting the space required to store independent parameters per training example.
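As a concrete illustration, the CED score is just a difference of log-probabilities under two models. The sketch below uses toy unigram distributions as stand-ins for the base and single-demonstration target models; the vocabulary and probabilities are invented purely for illustration:

```python
import math

def log_prob(tokens, unigram):
    # Sum of per-token log-probabilities under a unigram language model.
    return sum(math.log(unigram[t]) for t in tokens)

def ced_score(tokens, base_lm, target_lm):
    # CED(y) = log P_target(y) - log P_base(y). A positive score means the
    # target (single-demonstration) model fits the span better than the
    # background model, i.e. the span looks "in-domain".
    return log_prob(tokens, target_lm) - log_prob(tokens, base_lm)

# Toy distributions standing in for the base model and a model
# "finetuned" on a pet-related demonstration (values are invented).
base = {"cat": 0.25, "dog": 0.25, "law": 0.25, "tax": 0.25}
pets = {"cat": 0.40, "dog": 0.40, "law": 0.10, "tax": 0.10}

print(ced_score(["cat", "dog"], base, pets) > 0)  # in-domain span
print(ced_score(["law", "tax"], base, pets) > 0)  # out-of-domain span
```

Ranking candidate demonstrations then amounts to sorting them by the CED score their target model assigns to the test example.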
Equations 2 and 3 say that we want to find the examples that would minimize the loss if used as finetuning data. Equation 4 states that the examples whose gradients are most similar to that of the actual test data can be approximated by finding the examples that most increase the likelihood of the test example. This provides the motivation for using CED to select ICDs. In the next section we describe in depth how to train models for each single-training-example domain and how to score the training data for selecting in-context demonstrations.
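The gradient-alignment approximation behind CED can be checked numerically on a toy model. The sketch below uses a one-parameter Gaussian log-likelihood (an assumption made purely for illustration): after one gradient ascent step on a training example, the change in the test example's log-probability closely matches η times the product of the two gradients, as the first-order Taylor expansion predicts:

```python
def log_p(x, theta):
    # Unnormalised Gaussian log-likelihood: log p_theta(x) = -(x - theta)^2 / 2.
    return -0.5 * (x - theta) ** 2

def grad_log_p(x, theta):
    # d/dtheta of log p_theta(x).
    return x - theta

theta, eta = 0.0, 1e-3
x_train, x_test = 1.2, 0.9  # aligned examples: same side of theta

# One gradient ascent step on the training example (the "target model").
theta_ft = theta + eta * grad_log_p(x_train, theta)

# CED-style difference vs. its first-order approximation.
exact = log_p(x_test, theta_ft) - log_p(x_test, theta)
approx = eta * grad_log_p(x_test, theta) * grad_log_p(x_train, theta)

print(abs(exact - approx) < 1e-5)  # the approximation holds
print(exact > 0)                   # aligned gradients raise test likelihood
```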

Cross-Entropy Difference for In-Context Demonstration Selection
Given a training set D_train = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} with n examples, where x_i is the i-th input and y_i is the corresponding label, the goal of few-shot learning is to learn a function f : X → Y that can accurately predict the label y for a new input x, given only a small number of training examples. For simplicity of analysis, we focus on the case where only 1 demonstration is selected. This is especially useful for scenarios where each example is long, such as background documents. We leave multiple-demonstration selection to future investigation.
For each x_i in D_train, a separate model is trained with a language modeling objective, producing n models, M_1, ..., M_n. Given a test example x_T, we apply each M_i to compute x_T's perplexity L(M_i(x_T)). We then select the training sample associated with the language model giving the lowest perplexity as the in-context demonstration for x_T:

ICD(x_T) = argmin_{i ∈ {1, ..., n}} L(M_i(x_T))

Unlike the domain adaptation setting, rather than scoring all the training data using a single in-domain model, each training example is treated as its own domain. Each test example can be scored for "in-domain-ness" against all training examples.
To train each model on a single example, we use the (IA)^3 PEFT method with a T-Few 0.3B parameter model (Liu et al., 2022a). The effectiveness of PEFT for training a model on a small dataset without catastrophic forgetting allows us to train a model on a single example. The model is trained for multiple epochs, and a small development set is used to test for overfitting and early stopping. Also, since only a small fraction of parameters are updated, storing each model requires only 2MB on disk.
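A minimal sketch of the full selection loop follows, with smoothed unigram models standing in for the per-example PEFT-finetuned T-Few models. The actual method trains an (IA)^3 model per training example; everything below, including the toy "datasets", is a simplified stand-in:

```python
import math
from collections import Counter

def unigram(tokens, vocab, alpha=1.0):
    # Smoothed unigram LM estimated from one example's tokens: a stand-in
    # for a model PEFT-finetuned on that single training example.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def nll(tokens, lm):
    # Negative log-likelihood of the test example under a model.
    return -sum(math.log(lm[t]) for t in tokens)

train = [
    "the court ruled on the tax law".split(),
    "my dog chased the cat".split(),
    "stocks fell as tax rules changed".split(),
]
test = "the cat sat near the dog".split()

vocab = set(test)
for ex in train:
    vocab |= set(ex)

# One "model" per training example; select the demonstration whose model
# gives the test example the lowest perplexity.
models = [unigram(ex, vocab) for ex in train]
best = min(range(len(train)), key=lambda i: nll(test, models[i]))
print(train[best])  # the pet-related example wins
```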

Experiments
We evaluate CED-ICD selection on both small models and the transfer of the selection method to larger models, including GPT-3.5. We evaluate the selection method in a mixed-domain setting where random demonstrations are not trivially in-domain. We do not provide task or dataset labels as input to the selection model. As we show in Section 5.2, both CED and the nearest neighbors baseline do not exclusively select in-domain demonstrations in the mixed-domain setting; in fact, out-of-domain examples may also be strong in-context demonstrations. In practical settings, a single LLM may be used for multiple tasks and there may not be labels for the task type, especially with the now common chat interfaces. We find that this mixed-domain setting better reflects these realistic challenges.

Datasets and Models
To evaluate the CED-ICD selection method, we measure the performance of several data selection methods on 8 datasets spanning 4 tasks: binary classification (BoolQ; Clark et al., 2019), multiple choice question answering, extractive question answering, and abstractive question answering. All tasks are cast as text generation, in accordance with the evaluation used in UnifiedQA (Khashabi et al., 2020b). Binary classification and multiple choice are measured by accuracy, extractive QA is measured by F1 score over the generated tokens, and abstractive QA is measured by RougeL.
We combine these 8 datasets to create a larger mixed-domain dataset. We sample 32 examples from each dataset to create a medium-sized few-shot dataset with a total of 256 training examples. We evaluate each dataset independently, but in-context demonstrations can be selected from any dataset.
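Constructing the mixed-domain demonstration pool is straightforward; a sketch with placeholder datasets (the dataset names and sizes below are hypothetical, not the actual benchmarks):

```python
import random

random.seed(0)

# Hypothetical stand-ins for the 8 benchmark datasets; each element would
# be an (input, label) pair in practice.
datasets = {f"dataset_{i}": [f"d{i}_ex{j}" for j in range(100)] for i in range(8)}

# Sample 32 examples per dataset into one mixed-domain candidate pool.
pool = []
for name, examples in datasets.items():
    pool.extend(random.sample(examples, 32))

print(len(pool))  # 256 candidate in-context demonstrations
```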
We evaluate 2 settings: (1) small model performance combining PEFT with in-context learning, and (2) in-context learning on LLMs. Our smaller model is the T-Few 0.3B model (Liu et al., 2022a). Previous results do not report ICL performance because the authors did not find improvements from including ICL examples; however, as we show in our empirical results, T-Few can benefit from in-context learning if high quality demonstrations are selected. Further improvements are realized by finetuning T-Few with selected ICDs instead of random ones. For LLMs, we evaluate 3 sizes of GPT-3 (Babbage, Curie and Davinci (davinci-003)) (Ouyang et al., 2022).
We evaluate the following model settings, with the names corresponding to the rows in Table 1.

GPT-3 ICL randomly selects in-context examples, similar to T-Few PEFT + ICL.
GPT-3 + NN uses OpenICL (Wu et al., 2023) to retrieve the most similar example from the training set as the in-context example. GPT-3 + CED is our proposed model, which selects in-context demonstrations using CED scores.
In-context demonstrations that do not fit entirely into the T-Few context window are truncated only in the "background" section of the input, keeping the question, answer choices and answer intact. A simple prompt is used for GPT requests that labels the sections as "background", "question", "answer" and "example". We found that performance dramatically improved for binary classification by including an instruction to answer with a yes or no answer.
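A sketch of this prompt construction with background-only truncation; the exact section wording, ordering, and truncation budget are assumptions for illustration, not the paper's exact template:

```python
def build_prompt(background, question, example, max_background_chars=500):
    # Truncate only the background so the question (and, for the
    # demonstration, the answer) stay intact. Section labels follow the
    # scheme described above; precise wording is assumed.
    background = background[:max_background_chars]
    return (
        f"example: {example}\n"
        f"background: {background}\n"
        f"question: {question}\n"
        "Answer with yes or no.\n"
        "answer:"
    )

prompt = build_prompt(
    "A long passage " * 100,                      # overlong background
    "Is the sky blue?",
    "background: ... question: ... answer: yes",  # one selected ICD
)
print("question: Is the sky blue?" in prompt)
```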

Results
Our results show that selecting in-context demonstrations using cross-entropy difference (CED) both outperforms baselines on a small trainable model and transfers to larger models, even improving results on GPT3-Davinci003. Liu et al. (2022a) report that "[i]n preliminary experiments, we found that T0 was not able to perform few-shot ICL; performance actually decreased as we increased the number of in-context examples", which seems to be the case when using random in-context demonstrations. However, when incorporating stronger ICD selection methods, we show that performance does improve on NQ-BoolQ, NarrativeQA, Squad2, NewsQA and RACE. We found that T-Few does not perform well with in-context demonstrations if they are not included in the finetuning phase. When finetuning with in-context demonstrations, we evaluated both random ICD selection and CED selection. We found that on some datasets we get further improvements by using CED selection during training as well as inference, which we expected, as the training data will then contain more examples where the in-context demonstration is helpful for label prediction.
We report oracle in-context demonstration selection: the performance of an ideal example selector given the training data. In this setting, we evaluate every training example as an in-context demonstration and report the metric as the average of the best scores per test example. Evaluating generated text requires a verbalizer to map the text to the labels for some tasks. Due to this mapping, and to metrics that do not directly correlate with loss, such as Rouge and token-matching F1, we report oracle scores based both on selecting in-context demonstrations by the lowest loss and by the highest task metric. The latter is the true oracle performance, but the former suggests that there may be some limitations to the extent to which a cross-entropy based model can approximate downstream performance on a task.
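The metric-based oracle can be computed by taking, for each test example, the best score over all candidate demonstrations and averaging (the score matrix below is invented for illustration):

```python
# Hypothetical per-test-example metric scores: one row per test example,
# one column per candidate in-context demonstration.
scores = [
    [0.2, 0.9, 0.5],  # test example 0
    [0.7, 0.1, 0.6],  # test example 1
]

# Oracle: assume a perfect selector, so take the best demonstration's
# score for each test example, then average across test examples.
oracle = sum(max(row) for row in scores) / len(scores)
print(oracle)  # 0.8
```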
Oracle results show that there is still a large gap between our method and the oracle, indicating that there may be many opportunities to improve smaller model performance with better selection methods. However, Figure 1 shows that very few examples yield substantial improvements over the average performance: while strong in-context demonstrations may come from a small set, and these statistical outliers may compound into a significant improvement, a predictive method for determining which examples are outliers may not exist. Similar figures for all datasets are in the appendix.
Ultimately, in-context demonstrations are most useful for large language models that cannot be finetuned. Although our proposed selection method is based on the perplexities measured by smaller, finetuned models, we show that it transfers to large models, including GPT-Davinci003. These results are reported in Table 2. Our proposed selection method outperforms the nearest neighbor retrieval baseline on the macro average across 8 datasets and on each of the 3 GPT model sizes evaluated.

Analysis
We analyze the quality of cross entropy difference selection by computing the rank of selected demonstrations compared to an oracle. We also explore the presence of strong in-context demonstrations selected from out-of-domain data, compared to in-domain data.

Ranking Selected ICDs
Oracle experiments provide a full ranking of all training data as in-context demonstrations. Table 3 shows the average rank of the top 1 selected in-context demonstration, per dataset and on average, comparing CED selection and nearest neighbor selection. CED is better at selecting in-context demonstrations as measured by the oracle ranking, where 0 is the highest rank and 255 is the lowest: the average rank of a CED-selected demonstration is 16 out of 256. CED also tends to select more in-domain demonstrations. In-domain demonstrations are a subset of in-task demonstrations, so the in-task selection percentage is always larger than the in-domain percentage, but CED has a larger proportion of in-domain selections, indicating that CED is better at distinguishing the domain format even when other datasets have a similar format or task structure. Table 4 reports the percentage of oracle-best in-context demonstrations that appear in each subset: in-domain, in-task and out-of-domain. Different in-context demonstrations can result in the same output, and different outputs may score the same on the final metric, so the oracle-best in-context demonstration may appear both in-domain and out-of-domain. Interestingly, this table shows that an oracle selection method that only has access to out-of-domain demonstrations can still achieve the best performance from this model on 98% of examples, showing that out-of-domain selection of in-context demonstrations can be highly effective. This suggests that for datasets that have very few or no demonstrations, out-of-domain demonstrations may still improve model performance.
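The oracle rank of a selector's choice can be computed by sorting all candidates by their oracle score and locating the chosen demonstration (the scores below are hypothetical):

```python
def oracle_rank(oracle_scores, chosen):
    # Rank of the chosen demonstration under the oracle's full ordering:
    # 0 is the best possible choice, len(oracle_scores) - 1 the worst.
    order = sorted(range(len(oracle_scores)), key=lambda i: -oracle_scores[i])
    return order.index(chosen)

scores = [0.3, 0.9, 0.6, 0.1]  # hypothetical oracle metric per candidate
print(oracle_rank(scores, 1))  # 0: the selector picked the oracle-best demo
print(oracle_rank(scores, 3))  # 3: the selector picked the worst demo
```

Averaging this rank over test examples gives the numbers reported per dataset.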

Discussion
This work shows that cross entropy difference can be used as a heuristic for selecting in-context demonstrations for in-context learning. We motivate our approach by linking previous observations that in-context demonstrations operate as meta-gradients on a frozen large language model to the data selection literature, which has shown that CED selects examples whose gradients are most similar to those of in-domain examples. We present a method for effectively using parameter efficient finetuning on a single example to estimate the "in-domain" distribution of a target example.
We empirically show that we can use smaller models to compute CED scores and that this selection method effectively transfers to large models, such as the 175-billion-parameter GPT-3, improving performance over baseline selection methods. This work presents some insight into how large language models use in-context demonstrations, but much is still not understood. In particular, in-context learning is currently an emergent ability of most large language models and is not explicitly learned as part of the training process. A better understanding of how large language models utilize in-context demonstrations may motivate future work on training models for better in-context learning, or on enabling in-context learning for smaller models that can be more cheaply finetuned.

Figure 1 :
Figure 1: Losses for each in-context demonstration, including both in-domain and out-of-domain examples, for SQuAD2. Examples below the red line outperform the average in-domain performance.

Figure 2 :
Figure 2: Losses for each in-context demonstration, including both in-domain and out-of-domain examples, for all datasets. Examples below the red line outperform the average in-domain performance.

Table 1 .
Baselines. T-Few PEFT is the standard parameter efficient finetuning setting, where the model is finetuned on all the training data and inference does not include in-context demonstrations. T-Few PEFT + ICL includes randomly selected in-context demonstrations at inference. T-Few PEFT + Oracle evaluates all available ICDs for each test example and reports the highest possible score as an upper bound.

Table 1 :
T-Few results on 8 datasets. Metrics are denoted next to the dataset name.

Table 1 reports the results of different selection methods on the T-Few 300-million-parameter model. Parameter efficient finetuning (PEFT) is a strong baseline that finetunes T-Few on the full set of training data, a total of 256 training examples. PEFT achieves the best results on T-Few for both the BoolQ and NaturalQA datasets.

Table 2 :
GPT-3 results using random, nearest neighbor, and CED in-context demonstration selection.

Table 3 :
Average rank of the top 1 selected in-context demonstration for nearest neighbor selection and cross entropy difference selection. Rank is computed as the position in the full ranking against all other in-context examples, computed using an oracle evaluated on the final metric for each dataset.

Table 4 :
The percentage of selected in-context demonstrations that are in-domain to the inference task is reported on the left. On the right, we report the percentage of oracle-best in-context demonstrations that appear in each category: in-domain, in-task and out-of-domain. Different in-context demonstrations can result in the same output, and different outputs may score the same on the final metric, so the oracle-best in-context demonstration may appear both in-domain and out-of-domain.