Not All Demonstration Examples are Equally Beneficial: Reweighting Demonstration Examples for In-Context Learning

Large Language Models (LLMs) have recently gained the In-Context Learning (ICL) ability as models scale up, allowing them to quickly adapt to downstream tasks with only a few demonstration examples prepended to the input sequence. Nonetheless, the current practice of ICL treats all demonstration examples equally, which still warrants improvement, as the quality of examples is usually uneven. In this paper, we investigate how to determine approximately optimal weights for demonstration examples and how to apply them during ICL. To assess the quality of weights in the absence of additional validation data, we design a masked self-prediction (MSP) score that exhibits a strong correlation with the final ICL performance. To expedite the weight-searching process, we discretize the continuous weight space and adopt beam search. With approximately optimal weights obtained, we further propose two strategies to apply them to demonstrations at different model positions. Experimental results on 8 text classification tasks show that our approach outperforms conventional ICL by a large margin. Our code is publicly available at https://github.com/Zhe-Young/WICL.


Introduction
With the increase in model size and training corpus, Large Language Models (LLMs) have demonstrated in-context learning (ICL) capabilities (Radford et al., 2019; Brown et al., 2020; Wei et al., 2022; Dong et al., 2022). Unlike traditional fine-tuning, which requires updating model parameters, ICL allows LLMs to adapt to downstream tasks while keeping the parameters fixed. To enable ICL, a small number of examples are prepended before the query to form a prompt, which is then fed into the LLM for prediction. Numerous studies have shown the effectiveness of ICL, demonstrating that it can match, and sometimes even surpass, the performance of fine-tuned models. The performance of ICL is closely related to the quality of demonstration examples (Wu et al., 2022; Gonen et al., 2022). Therefore, various methods have been proposed to select high-quality examples. For example, Liu et al. (2022) select the nearest examples based on semantic similarity, Zhang et al. (2022) employ reinforcement learning for example selection, and Sorensen et al. (2022) maximize mutual information on an unlabeled validation set for choosing templates. However, these methods assume the availability of a large number of examples for selection, which is not the case in real few-shot settings. From another angle, previous ICL methods treat all demonstration examples equally, which leaves room for improvement, as the quality of demonstrations is uneven. As shown in Figure 1, assigning different sets of weights to demonstration examples has a significant impact on ICL performance. For a 13B GPT model, in terms of accuracy on the SST2 dataset, the best weights surpass the worst by 40 points and surpass the non-weighting strategy by 15 points. This highlights the necessity of assigning appropriate weights to demonstration examples.
In this paper, we propose the Weighted In-context Learning (WICL) method to enhance ICL by determining and applying approximately optimal weights to demonstration examples. There are two key challenges in determining weights: how to evaluate the quality of a set of weights without additional validation data (weight evaluation), and how to efficiently search for the best set of weights in a continuous space (weight search). For weight evaluation, we introduce a Masked Self-Prediction (MSP) score as a proxy metric, which can be computed solely on the demonstration examples yet is strongly correlated with the final ICL performance, as shown in Figure 2. For weight search, we discretize the weight space and employ beam search to efficiently find the weights that yield the highest MSP score. With approximately optimal weights discovered, we propose two strategies to apply them to demonstrations. Scaling Key Matrix (SKM) applies the weights to the attention key matrices of the demonstration examples, while Scaling Attention Weights (SAW) directly adjusts the attention weights. Both strategies adjust the importance of demonstrations by influencing the attention weights according to a set of weights.
We evaluate WICL on 8 text classification tasks and show that it achieves substantial accuracy improvements over the conventional ICL method that follows a non-weighting strategy. In addition, our approach exhibits robustness to varying shot numbers and templates. As for the approximately optimal weights discovered by our approach, we find that they perform close to the globally optimal weights. Furthermore, we discover that our example reweighting approach mainly works at the middle layers, and reweighting only a few middle layers even outperforms reweighting all layers.
We summarize our contributions as follows: 1. We propose the Weighted In-context Learning (WICL) method, which enhances ICL by determining and applying approximately optimal weights to demonstration examples.
2. We introduce a proxy metric called Masked Self-Prediction (MSP) score to foretell the final ICL performance without additional validation data.
3. We evaluate WICL on 8 NLP tasks and show that WICL can significantly outperform the conventional ICL method.
4. We perform an elaborate quantitative analysis to reveal the robustness of WICL and the quality of the discovered approximately optimal weights.

Problem Formulation
For a text classification task, we are given a set of k demonstration examples denoted by S = {(x_1, y_1), (x_2, y_2), ..., (x_k, y_k)}, where x_i and y_i denote the input text and its corresponding label, respectively. We define a transformation function T that maps each sample (x, y) to an ICL example T(x, y) (e.g., T(x, y) = "Sentence: x Sentiment: y" for the SST2 dataset). The concatenation of these transformed examples is then defined as the final demonstration C:

C = T(x_1, y_1) ⊕ T(x_2, y_2) ⊕ ... ⊕ T(x_k, y_k),

where ⊕ denotes concatenation. The goal of ICL is to identify a suitable demonstration that enables the LLM to generate a probability distribution P closely approximating the actual distribution, and subsequently to select the label y_predict with the highest probability as the predicted output:

y_predict = argmax_{y ∈ D} P(y | C ⊕ x; θ).

In the above equation, θ and D represent the parameters of the LLM and the candidate label set, respectively. Conventional ICL does not take the weight of each example into account, and the LLM treats each example almost equally. To address this limitation, we introduce a weight vector w = (w_1, w_2, ..., w_k) for the demonstration, where w_i denotes the weight of the i-th example T(x_i, y_i). A higher value of w_i indicates that the example T(x_i, y_i) is more important, and thus the LLM should pay more attention to it. Given a specific demonstration, the objective of WICL is to make full use of the demonstration examples and identify an appropriate weight vector w that enables the model to achieve strong performance.
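The formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; `label_log_prob` is a hypothetical stand-in for querying the LLM for the log-probability of a candidate label given the prompt.

```python
def T(x, y):
    # Transformation function T: map a sample (x, y) to an ICL example
    # using an SST2-style template.
    return f"Sentence: {x} Sentiment: {y}"

def build_demonstration(examples):
    # Demonstration C: the concatenation of the transformed examples.
    return "\n".join(T(x, y) for x, y in examples)

def icl_predict(examples, query, label_set, label_log_prob):
    # y_predict = argmax over the candidate label set D of P(y | C ⊕ x; θ),
    # where label_log_prob(prompt, y) scores each candidate label.
    prompt = build_demonstration(examples) + f"\nSentence: {query} Sentiment:"
    return max(label_set, key=lambda y: label_log_prob(prompt, y))
```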

Methodology
Demonstration example reweighting faces two challenges: (1) how to apply weights to demonstration examples given a weight vector, and (2) how to find a performant weight vector in the absence of a held-out validation set. Section 3.1 addresses (1) by reweighting at the self-attention layer, and Section 3.2 addresses (2) through weight quantization and beam search with an indicator. Our approach is illustrated in Figure 3.

Reweighting Demonstration Examples
Almost all current LLMs are based on the Transformer decoder architecture, comprising multiple repeated blocks, each of which contains a multi-head self-attention layer and a feed-forward network layer. The self-attention layer is responsible for token interaction, so we can modify the weights for each example at this layer.
Reweighting by Scaling Key Matrix  The key matrix of the demonstration can be written as the concatenation of per-example key matrices:

W_K = [W_{K,1}; W_{K,2}; ...; W_{K,k}],

where W_{K,i} ∈ R^{d×l_i} denotes the key matrix for the i-th example, and d and l_i denote the hidden dimension and the length of the i-th example, respectively. After scaling by w, the weighted key matrix can be represented as:

W̃_K = [w_1 W_{K,1}; w_2 W_{K,2}; ...; w_k W_{K,k}],

and the weighted self-attention is calculated by the following equation:

Attention = softmax(W_Q^T W̃_K / √d) W_V,

where W_V denotes the value matrix, W_Q denotes the query matrix, and √d is the scaling factor.
Reweighting by Scaling Attention Weights  After the softmax layer, attention weights (the product of query and key) are mapped into a normalized probability distribution space. Adding weights to examples can therefore be done by scaling the attention weights of the examples while keeping their sum equal to 1. If the attention weight assigned to the i-th example is denoted as att_i, we have:

att_1 + att_2 + ... + att_k + att_rest = 1,

where att_rest denotes the attention weight on the remaining (non-demonstration) tokens. We scale the original attention weights att by the weight vector w and normalize them. After scaling, the new attention weight for the i-th example is:

att'_i = (w_i · att_i) / (Σ_j w_j · att_j + att_rest).
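The attention-weight scaling can be sketched directly on the post-softmax probabilities. Again a minimal illustration under the same assumed `spans` bookkeeping, not the released implementation:

```python
import numpy as np

def reweight_attention(att, spans, w):
    # SAW: multiply the attention probabilities falling on the i-th
    # example by w[i], then renormalize each row so it sums to 1 again.
    # att: (n_queries, n_keys) post-softmax attention matrix.
    att = att.copy()
    for (s, e), wi in zip(spans, w):
        att[:, s:e] *= wi
    return att / att.sum(axis=-1, keepdims=True)
```

Unlike key scaling, this operates after the softmax, so the relative attention within each example is preserved and only the mass allocated across examples changes.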

Discovering Optimal Weights
Indicator for Weight Selection  Selecting the optimal weights is a challenging problem. One naive approach is to test the model's performance on a validation set under a range of weights and select the weights that yield the best result. However, this approach is not feasible when an extra validation set is unavailable. Fortunately, we have discovered a new metric, the average Masked Self-Prediction (MSP) score, which can be obtained without an additional validation set yet still correlates strongly with the model's performance (as shown in Figure 4). We consider the demonstration set S as a special "validation set" and predict the example labels in S conditioned on the weight vector w. This process is called self-prediction since we are predicting the demonstration itself. The label y_i in the demonstration C is masked when predicting sample (x_i, y_i) to prevent the model from copying the answers from the demonstration. MSP is defined as the average log probability of the predictions on S, and it is calculated as follows:

MSP(w) = (1/k) Σ_{i=1}^{k} log P(y_i | C_{mask(i)} ⊕ x_i; w, θ),

where C_{mask(i)} denotes the demonstration C with the label y_i masked. To find a performant weight vector, we can simply choose the weights with a high MSP score instead of testing and comparing the performance of all weights on a validation set.
Example Weight Searching  To select a proper weight vector w ∈ R^k, we employ a weight quantization strategy that restricts each dimension w_i of w to values in a candidate weight set Q = {weight_1, weight_2, ..., weight_n}. This strategy compresses the weights from a continuous infinite space to a discrete finite space, and the number of legal weight vectors drops from infinite to n^k, making the weight selection problem computationally tractable. However, brute-force enumeration of all possible weight vectors and calculation of their MSP scores still requires O(k·n^k) time, which does not scale for large k. To address this problem, we apply the beam search strategy commonly used in the language model decoding process. From example 1 to example k, each beam search step searches for a weight for one example, as illustrated in Figure 5. The pseudo-code of our weight selection approach is presented in Appendix A.
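The quantized beam search can be sketched as follows. `msp_score` is a hypothetical callable that evaluates the MSP score of a full-length weight vector; padding positions not yet assigned with a neutral weight of 1.0 is a simplifying assumption of this sketch.

```python
def beam_search_weights(k, Q, msp_score, beam_size=2):
    # Search a weight for one example per step, keeping only the
    # beam_size partial assignments with the highest MSP score.
    beams = [()]
    pad = lambda w: w + (1.0,) * (k - len(w))  # neutral weight for unassigned slots
    for _ in range(k):
        # Extend every beam with every candidate weight from Q,
        # then keep the top-scoring beam_size partial assignments.
        candidates = [b + (q,) for b in beams for q in Q]
        candidates.sort(key=lambda w: msp_score(pad(w)), reverse=True)
        beams = candidates[:beam_size]
    return max(beams, key=msp_score)
```

The number of MSP evaluations drops from O(n^k) for brute force to O(k · n · beam_size), at the cost of possibly missing the global optimum.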

Experiments
Similar to previous work (Lu et al., 2022), we use 8 classical text classification datasets, namely SST2 (Socher et al., 2013), CR (Hu and Liu, 2004), MR (Pang and Lee, 2005), Subj (Pang and Lee, 2004), TREC (Voorhees and Tice, 2000), DBPedia (Zhang et al., 2015), AGNews (Zhang et al., 2015), and RTE (Dagan et al., 2006), to evaluate the efficacy of our demonstration example reweighting approach. In our experiments, we utilize 5 different GPT-like causal language models released by fairseq (Artetxe et al., 2022), with 355M, 1.3B, 2.7B, 6.7B, and 13B parameters, respectively. As shown in Table 1, we report the average accuracy under the following three methods: (1) conventional ICL, (2) WICL by scaling the key matrix, and (3) WICL by scaling attention weights. Our approaches outperform conventional ICL on almost all of the 8 tasks across the 5 models. Besides, SKM delivers larger performance improvements than SAW. We also try to combine the two reweighting strategies, i.e., reweighting by scaling the key matrix and the attention weights simultaneously. As shown in Table 2, combining SAW and SKM does not improve performance; instead, it harms performance, and the final result is slightly weaker than SKM alone.

Analysis
We further analyze our experiments and investigate the properties of our approach. Since SKM outperforms SAW, we conduct our analysis experiments only with the SKM reweighting strategy. First, we compare different example masking/removing strategies when calculating MSP (Section 5.1) and compare MSP with a held-out validation set (Section 5.2). Next, we explore the robustness of our approach to different templates and shot numbers (Sections 5.3 and 5.4). Then, we show that our approach can obtain near-optimal weights (Section 5.5). In Section 5.6, we discover that example reweighting mainly works at the middle layers, and reweighting only a few middle layers can even outperform full-layer reweighting. Section 5.7 presents some empirical results on the relationship between example weight and example quality/position.

Label-Only Masking Outperforms Whole-Example Masking/Removing
Our MSP score calculation process is similar to Leave-One-Out Cross-Validation (LOOCV), because both test one example based on the other k − 1 examples when given k examples. However, unlike LOOCV, we only mask the label of the example to be tested rather than removing it entirely. Inspired by LOOCV, we also compare the performance of label-only masking, whole-example masking, and whole-example removing. As shown in Table 3, label-only masking outperforms the other methods. Whole-example masking/removing results in a performance drop due to the demonstration sensitivity of ICL: when the entire example is masked/removed in the MSP calculation, the corrupted demonstration differs significantly from the original one used in final testing, and this inconsistency harms ICL performance. Label-only masking strikes a balance between preserving demonstration information and preventing the model from copying the answers from the demonstration.
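A sketch of label-only masking for self-prediction; the `[MASK]` placeholder string and the SST2-style template are assumptions of this illustration, not the paper's exact tokens.

```python
MASK = "[MASK]"  # assumed placeholder for a masked label

def T(x, y):
    # SST2-style ICL template.
    return f"Sentence: {x} Sentiment: {y}"

def masked_demonstration(examples, i):
    # Keep every example's input text in context and mask only the
    # label of the i-th example, so the self-prediction prompt stays
    # close to the demonstration used at test time.
    return "\n".join(
        T(x, MASK if j == i else y) for j, (x, y) in enumerate(examples)
    )
```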

MSP is an Approximation of a Held-out Validation Set
A held-out validation set can be viewed as "strong supervision", while our MSP is more of an approximation of this supervision under the true few-shot setting.

Example Reweighting is Robust under Different Shot Settings
Our main experiments are conducted in the 8-shot and 16-shot settings; here we explore the effects of example reweighting under different shot settings, with the experimental results presented in Figure 6.
Our findings indicate that our approach does not significantly improve model performance when the number of shots is less than 4; however, when the number of shots is greater than 4, there is a significant improvement over the baseline. Moreover, we observe that a larger number of shots does not necessarily yield better ICL performance; in fact, ICL performance may gradually decline as the number of shots increases, possibly due to information distraction caused by too many examples. In contrast, our approach improves steadily as the number of shots increases, which demonstrates that example reweighting enhances the model's ability to make full use of the examples. We also observe that for SST2 (a binary classification task), the ICL performance fluctuates between odd and even shot numbers as the number of shots increases. This oscillation is much smaller for WICL, indicating that WICL is more robust to odd or even shot numbers.

Example Reweighting is Robust to Different Templates
In our main experiments, only one template is used for each task, and we further explore the robustness of our method to different templates. As shown in Table 5, we use 4 different templates on the SST2 dataset, and for each template our method achieves a significant improvement over the conventional ICL method. Details of these templates are shown in Appendix C. Moreover, conventional ICL is sensitive to the choice of template, and performance can vary significantly across templates (e.g., a 17.2% difference in accuracy between template 3 and template 4 for GPT 355M). In contrast, our approach significantly reduces the model's sensitivity to templates and performs well across all of them.

MSP and Beam Search Yield Near-optimal Weights
Finding the optimal example weights can be very expensive, as it requires enumerating all legal weight vectors and validating each of them. In contrast, our approach requires no extra data and has relatively low computational complexity. It is therefore worth exploring whether our simplifications for obtaining weights significantly impact performance and how far our example weights are from the optimal weights. We randomly sample weight vectors from the weight space under the 8-shot setting, test their accuracy on the validation set, and plot the cumulative distribution function (as shown in Figure 7). Our findings indicate that our approach outperforms approximately 80% of the sampled weights, demonstrating that the MSP indicator and beam search can yield near-optimal weights.

Example Reweighting Mainly Works at Middle Layers
In our main experiment, example reweighting is applied at all layers, but it is still unclear at which layers the reweighting strategy is most effective.

Limitations
Although our approach can significantly improve the performance of ICL by reweighting the demonstration examples, there are still many limitations: (1) Our approach only has a noticeable effect when the number of examples is greater than 4.
(2) Under our approach, performance variance still exists, and the quality of examples still matters. Our approach can improve performance but cannot fully resolve ICL's vulnerability to demonstration examples.
(3) Our approach performs reweighting at the self-attention layer, which requires access to the model's parameters; for large models whose parameters are unavailable (e.g., GPT-4 (OpenAI, 2023)), our approach can hardly be applied. (4) We simply apply the same weights at all layers or a few consecutive layers, without exploring more fine-grained reweighting strategies (e.g., different example weights for different layers), and the mechanism by which example reweighting improves ICL remains unexplored.

Figure 1
Figure 1: 4-shot ICL performance on SST2 with different sets of weights assigned to demonstration examples. Gray points represent the accuracy of different weight sets, and the red point denotes the accuracy of the non-weighting strategy. The performance varies significantly with different weights.

Figure 2 :
Figure 2: Regression line of MSP score and accuracy on the MR dataset. Each point denotes the performance of a weight vector; the Pearson correlation coefficient is 0.73, indicating a strong correlation between MSP score and accuracy.

Figure 3 :
Figure 3: An illustration of weighted in-context learning. Reweighting at the self-attention layer can be done by scaling the key matrix or scaling the attention weights. The example weights can be obtained by beam search with the masked self-prediction score as an indicator, which shows a strong correlation with the final performance.

Figure 4 :
Figure 4: Correlation of MSP and accuracy under different example weights. For each task, we randomly sample 50 legal weight vectors under the 8-shot setting and test accuracy on GPT-1.3B, showing scatter plots and regression lines.

Figure 5 :
Figure 5: An illustration of beam search for example weights. We take the 4-shot setting with beam size 2 as an example; the legal weight set for each example is {0.8, 1.0, 1.2}. In each step, we extend the beam states and preserve the 2 states with the highest MSP score.

Table 1 :
Main experiment results. Taking conventional ICL as the baseline, we compare the performance of WICL with two reweighting methods on 8 datasets with different models under 8-shot and 16-shot settings. For simplicity, DBPedia and AGNews are written as DBP and AGN, respectively.

Table 5 :
Performance of different models on SST2 under 4 different templates. WICL is robust to different templates.
To explicitly assign weights, we reweight examples in the self-attention layer by scaling the key matrix or scaling the attention weights; to determine the weight vector for the examples, we adopt weight-quantized beam search that maximizes the masked self-prediction score, which can be obtained without any held-out validation set yet shows a strong correlation with the final performance. Our approach significantly improves ICL performance in the true few-shot setting, and experiments on 8 NLP tasks demonstrate its efficacy.

Table 6 :
Templates and label mapping for different tasks.

Table 7 :
4 different templates and label mapping for SST2.