Finding Support Examples for In-Context Learning

Additionally, the strong dependency among in-context examples makes selecting them an NP-hard combinatorial optimization problem, and enumerating all permutations is infeasible. Hence we propose LENS, a fiLter-thEN-Search method that tackles this challenge in two stages: we first filter the dataset to obtain individually informative in-context examples. Specifically, we propose a novel metric, InfoScore, to evaluate an example's in-context informativeness based on the language model's feedback, and further propose a progressive filtering process to remove uninformative examples. We then propose a diversity-guided example search which iteratively refines and evaluates the selected example permutations to find examples that fully depict the task. The experimental results show that LENS significantly outperforms a wide range of baselines.


Introduction
In-Context Learning (ICL) is a new paradigm that uses a language model (LM) to perform many NLP tasks (Brown et al., 2020; Dong et al., 2022; Zhao et al., 2023). In ICL, by conditioning on a few training examples, the LM can directly output the prediction for a given test input without parameter updates.
Restricted by the LM's maximum input length, it is typical to randomly sample a small set of examples from the entire dataset for in-context learning (Brown et al., 2020; Zhang et al., 2022a). However, in-context learning is sensitive to the provided examples, and randomly sampled in-context examples show significant instability and can cause inferior performance (Lu et al., 2022; Chang and Jia, 2023). In this paper, we propose to select a small list of examples that are informative and representative of the entire dataset as in-context examples. Inspired by the traditional machine learning method Support Vector Machine (SVM) (Cortes and Vapnik, 1995), where a few support vectors lie closest to the decision boundary and provide crucial discriminative information, we name the selected examples for ICL support examples, since they provide crucial task information for the LM and their quantity is also usually limited.
There is a similar problem in traditional gradient-based deep learning like fine-tuning (Devlin et al., 2019), typically called Coreset Selection (Guo et al., 2022), which aims to select a set of representative training examples from the dataset to benefit many downstream scenarios like data-efficient learning (Adadi, 2021), active learning (Ren et al., 2022), neural architecture search (Shim et al., 2021), etc. However, it is challenging for these coreset selection methods to select important in-context examples because there is a significant discrepancy between traditional training and ICL. As shown in Figure 1, the "learning" paradigms of model training and ICL are highly different. Traditional training depends on back-propagation's gradients to update parameters, while ICL occurs in the LM's forward pass without gradients or parameter updates. Existing coreset selection methods are usually coupled with the training procedure, i.e., they depend on gradients or run alongside training. For example, Paul et al. (2021) select informative examples by their gradients' norm. Toneva et al. (2019) evaluate each example's importance by counting how many times it is forgotten, i.e., misclassified after being correctly classified in a previous epoch. Additionally, coreset selection methods mainly depend on the example's gradients as the feature for example selection (Mirzasoleiman et al., 2020; Killamsetty et al., 2021a,b; Guo et al., 2022). However, the LM performs ICL through inference, which does not rely on gradients or parameter updates. Hence, the gap between gradient-based training and ICL makes these methods struggle to effectively select informative examples for in-context learning.
Another challenge is the strong dependency among in-context examples. Previous work (Lu et al., 2022) shows that even the same example set with different orderings can result in drastically different performance, from the random-guess level to state-of-the-art. We also conduct an additional case study to shed light on the examples' combinatorial dependency in Table 1. We see that, compared with the two examples' individual performance, combining them significantly hurts performance. To cope with the examples' dependency, a straightforward method is to enumerate all possible example combinations and verify their performance. However, this leads to a combinatorial explosion and is thus infeasible.
To tackle these challenges, we propose LENS, a fiLter-thEN-Search method that finds support examples in two stages. In the first stage, we filter the dataset to obtain individually informative in-context examples. Specifically, we propose InfoScore to evaluate an example's in-context informativeness based on the LM's feedback, and further propose a progressive filtering process to filter out uninformative examples. In the second stage, we propose a diversity-guided example search method that iteratively refines and evaluates the selected examples to find support examples that can fully depict the task. We summarize our contributions as follows:
• To the best of our knowledge, we are the first to define the support example selection problem for in-context learning and to introduce a novel filter-then-search method to tackle it.
• We conduct experiments on various text classification datasets and compare our method with a wide range of baselines. Experimental results demonstrate that our method significantly outperforms the baselines, while previous coreset selection methods bring only marginal improvements over the random baseline, which shows the necessity of ICL-specific design for finding support examples.
• We conduct further analyses on support examples and find that they exhibit different trends from random examples in many aspects, which can shed light on the principles of support examples and ICL. We provide the following key takeaways: 1. Support examples are less sensitive to ordering than random examples (Lu et al., 2022). 2. Ground truth labels matter for support examples, whereas the previous study (Min et al., 2022b) finds they have limited impact for randomly sampled examples.
• We provide comprehensive empirical results of previous coreset selection methods on ICL, which have not been explored before. We release the implementation of our method and baselines to facilitate future research.

Background: In-Context Learning
In this section, we introduce the definition of in-context learning. We focus on text classification with in-context learning using a causal language model (Radford et al., 2018). Given a language model G, n examples {x_i, y_i}^n_{i=1} and a test input x_test, the prediction for x_test is generated as:

ŷ_test = argmax_{y ∈ Y} P_G(y | x_1 ⊕ y_1 ⊕ ... ⊕ x_n ⊕ y_n ⊕ x_test),   (1)

where Y is the label space and ⊕ is the concatenation operation. To deal with classification tasks, the original label is often mapped to a word or words in G's vocabulary. For example, the positive/negative labels in binary sentiment classification can be mapped to "great"/"terrible". For simplicity, we omit the verbalizer, special tokens and prompting templates in Eq (1).
As Eq. (1) shows, G receives the task's supervision only from the concatenated {x_i, y_i}^n_{i=1} and directly outputs the prediction for x_test. Typically, n is limited by the maximum input length of G, so it is common to randomly sample a small set of examples from the entire dataset D (Brown et al., 2020; Zhang et al., 2022a). However, ICL is sensitive to the provided examples, and random in-context examples show significant instability and can cause inferior performance (Lu et al., 2022; Chen et al., 2022). In this paper, we focus on selecting from the entire dataset D a small list of support examples that are informative for the task and performant for in-context learning.
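To make Eq (1) concrete, below is a minimal sketch of ICL prediction with a HuggingFace causal LM. It assumes the small GPT-2 checkpoint, single-token label words ("great"/"terrible") and the simple "It was" template from Table 1; the template, verbalizer and example texts are illustrative, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: GPT-2 as the causal LM; " great" and " terrible" are single BPE tokens.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def icl_predict(examples, x_test, label_words=(" terrible", " great")):
    """Eq (1): concatenate (x_i, y_i) pairs and x_test, then pick the label word
    with the highest next-token probability under the LM."""
    prompt = ""
    for x, y in examples:
        prompt += f"{x} It was {label_words[y].strip()}. "
    prompt += f"{x_test} It was"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]      # next-token distribution
    label_ids = [tokenizer.encode(w)[0] for w in label_words]
    return int(torch.argmax(logits[label_ids]))      # index into the label space

# Usage with hypothetical demonstrations:
demos = [("A moving and beautiful film.", 1), ("A dull, lifeless mess.", 0)]
print(icl_predict(demos, "An unexpectedly delightful ride."))
```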

Method
The strong dependency among in-context examples makes selecting support examples essentially an NP-hard combinatorial optimization problem. Enumerating all combinations and evaluating them is infeasible due to the combinatorial explosion. In this section, we propose LENS, a fiLter-thEN-Search method to find support examples: 1. we first filter the training dataset to obtain individually informative examples; 2. we then search among them for the example permutation that fully depicts the task. In this paper, we instantiate the two stages as a novel example metric with progressive filtering and a diversity-guided example search; we leave the development of more powerful components as future work. We introduce these two stages below.

Informative Examples Filtering
In the first stage, we aim to find individually informative examples. There are extensive metrics for measuring an example's importance in gradient-based learning, e.g., the gradient norm (Paul et al., 2021), the loss value in the early training stage, or the number of times an example is forgotten (Toneva et al., 2019), etc. However, these methods struggle to identify important in-context examples since ICL is based on LM inference without gradients and parameter updates. Here we propose InfoScore (Informativeness Score) to measure the individual in-context informativeness of one example e = {x, y} for ICL based on the LM's feedback:

I(e, D) = Σ_{e′ ∈ D} c(e, e′),   (2)
c(e, e′) = P_G(y′ | x ⊕ y ⊕ x′) − P_G(y′ | x′),   (3)

where e′ = {x′, y′} and D is the training dataset. Eq (3) is the gap between the probabilities of the ground truth y′ conditioned on (e, x′) and on (x′) alone, i.e., it quantifies how much e helps the LM predict e′ in context.

We filter out uninformative examples in a progressive manner. We first sample a small set of examples from D as the initial "score set" (line 2) to coarsely evaluate the InfoScore of each example, and filter the entire dataset to 1/ρ of its original size (line 5). At each following iteration, we proportionally expand the score set to ρ times its size by randomly sampling more examples from the training set (lines 13∼15) and use it to calculate the InfoScore of the remaining promising examples. As the score set is expanded, the subsequent InfoScore can be calculated in a more fine-grained way and better identify informative examples. Meanwhile, the uninformative examples filtered out in previous iterations help save computational cost. We repeat this procedure until a small set of examples is left.
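As a minimal sketch of Eq (2)-(3), the helpers below score one example's contribution to another under a causal LM; the `label_prob` helper, the "It was" template and the tuple representation of examples are illustrative assumptions, not the released implementation.

```python
import torch

def label_prob(model, tokenizer, x, y_word, context=None):
    """P_G(y | [context ⊕] x): probability of the gold label word for input x,
    optionally conditioned on one in-context example given as a prompt prefix."""
    prompt = (context + " " if context else "") + f"{x} It was"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
    return probs[tokenizer.encode(y_word)[0]].item()

def c(model, tokenizer, e, e_prime):
    """Eq (3): gain in P_G(y' | x') when example e is prepended as context.
    Examples are (text, label_word) tuples, e.g., ("A moving film.", " great")."""
    (x, y_word), (x_p, y_p_word) = e, e_prime
    ctx = f"{x} It was {y_word.strip()}."
    return (label_prob(model, tokenizer, x_p, y_p_word, context=ctx)
            - label_prob(model, tokenizer, x_p, y_p_word))

def info_score(model, tokenizer, e, score_set):
    """Eq (2): sum of c(e, e') over the current score set."""
    return sum(c(model, tokenizer, e, e_prime) for e_prime in score_set)
```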
Thus we filter examples with high in-context informativeness with a complexity of O(N · log_ρ N), where N is the size of the training set. In experiments, we set ρ to N^{1/C} to make the complexity linear in N, where C is a constant. Depending on the size of the dataset, ρ is usually set between 2 and 3.
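A rough sketch of the progressive filtering loop described above is shown below; `scorer(e, score_set)` is assumed to be the InfoScore function (e.g., `info_score` from the previous sketch with the model bound via `functools.partial`), and the exact sampling and stopping details follow the paper's description only loosely.

```python
import random

def progressive_filter(examples, rho, init_score_size, keep_m, scorer):
    """Iteratively score candidates against a growing 'score set' and keep the
    top 1/rho fraction, until only keep_m candidates remain."""
    candidates = list(examples)
    score_set = random.sample(examples, init_score_size)        # initial score set
    while len(candidates) > keep_m:
        scored = [(scorer(e, score_set), e) for e in candidates]
        scored.sort(key=lambda t: t[0], reverse=True)
        keep = max(keep_m, int(len(candidates) / rho))          # shrink to 1/rho
        candidates = [e for _, e in scored[:keep]]
        # Expand the score set to rho times its size for finer-grained scoring.
        extra = random.sample(examples,
                              min(len(examples), int(len(score_set) * (rho - 1))))
        score_set = score_set + extra
    return candidates
```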

Diversity-Guided Example Search
After filtering, we obtain individually informative examples D′. Since in-context examples have high combinatorial dependency (see Table 1), a straightforward method is to enumerate all possible combinations and evaluate them on a validation set. However, although filtering has reduced the candidate examples, it is still impossible to evaluate all combinations. For example, if there are 50 examples retained after filtering and we want to find a combination of 8 of them, there are C^8_50 (about 536 million) combinations, let alone the examples' orders. Hence we propose diversity-guided example search to iteratively refine the example selection from the filtered examples and obtain the support examples, as shown in Algorithm 2. It starts with a set of initial example permutations. At each iteration, we use the diversity of in-context examples to guide the update of the current candidate permutations. Specifically, for each candidate permutation E = [e_i]^n_{i=1}, we randomly select an example e* in E and update it with the example e*_new as:

e*_new = argmax_{e ∈ D′} s(e, E − e*),   (4)
s(e, E′) = I(e, S) + λ · div(e, E′),   (5)

where S is the final score set of the filtering stage. The second term of s(e, E′) in Eq (5), div(e, E′), corresponds to the diversity between e and E′, weighted by λ, and is computed from the example's feature vector f(·):

f(e) = [c(e, e^s_1), c(e, e^s_2), ..., c(e, e^s_{|S|})],   (6)

where f(e) describes e's contribution to each of S's examples e^s_i in ICL and thus directly encodes e's in-context behavior. If two examples' f(·) are similar, their effect on ICL can be redundant and we should avoid selecting both of them in one permutation. Note that I(e, S) and each c(e, e^s_i) in f(e) are calculated in the filtering stage and can be reused.
With s(e, E′) and f(e), the updated candidate permutations can be informative and diverse and help the LM correctly predict various examples, which better helps find the support examples that fully depict the task in ICL. In this paper, we propose and verify a simple yet effective in-context example feature; we leave the development of more powerful ones as future work.
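Since the exact form of the diversity term is not spelled out above, the sketch below instantiates div(e, E′) as the negative mean cosine similarity between f(e) and the features of the examples already in E′; this instantiation, the cached dictionaries and the use of hashable example ids are assumptions for illustration, not necessarily the paper's exact formula.

```python
import numpy as np

def feature(e, score_set, cached_c):
    """f(e): e's cached contributions c(e, e_s) on each score-set example (Eq 6).
    cached_c maps (example_id, score_example_id) pairs to c values from stage 1."""
    return np.array([cached_c[(e, e_s)] for e_s in score_set])

def s_score(e, E_prime, score_set, cached_c, cached_info, lam=1.0):
    """s(e, E'): InfoScore(e, S) plus lam * diversity; diversity is taken here as the
    negative mean cosine similarity to the current permutation's features."""
    f_e = feature(e, score_set, cached_c)
    sims = []
    for e_i in E_prime:
        f_i = feature(e_i, score_set, cached_c)
        sims.append(f_e @ f_i / (np.linalg.norm(f_e) * np.linalg.norm(f_i) + 1e-8))
    diversity = -float(np.mean(sims)) if sims else 0.0
    return cached_info[e] + lam * diversity
```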
Since the examples' order can significantly influence performance (Lu et al., 2022; Kumar and Talukdar, 2021), we also update E with different orders by random shuffling (lines 10∼13), which reduces the risk of missing performant combinations of examples due to poor ordering. To explore the example search space more comprehensively and alleviate the risk of locally optimal example permutations, we use beam search (Jurafsky and Martin, 2009) instead of greedy search. Specifically, we update each candidate example permutation by diversity-based example substitution and random shuffling for B′ and B − B′ times, respectively (lines 5, 10). Then we leverage a small validation set sampled from (D − D′) to evaluate them and keep the top-B permutations with the best performance as the next iteration's candidates. Through this, we mitigate the issue of locally optimal example permutations, iteratively refine and evaluate candidate permutations with high informativeness and diversity, and obtain examples that can fully depict the task.
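Putting the pieces together, the following is a simplified sketch of one search iteration: each of the B beam permutations produces B′ substitution updates and B − B′ random re-orderings, and the pool is re-ranked on a small validation set. The function names, the `score_fn` and `eval_fn` helpers, and the assumption that the filtered pool is much larger than a permutation are illustrative.

```python
import random

def search_step(beam, D_filtered, B, B_sub, score_fn, eval_fn):
    """One iteration of diversity-guided beam search over example permutations."""
    candidates = []
    for E in beam:
        # B_sub substitution updates: replace a random slot with the example that
        # maximizes s(e, E - e*), i.e., informativeness plus diversity (Eq 4-5).
        for _ in range(B_sub):
            E_new = list(E)
            i = random.randrange(len(E_new))
            rest = E_new[:i] + E_new[i + 1:]
            pool = [e for e in D_filtered if e not in E_new]
            E_new[i] = max(pool, key=lambda e: score_fn(e, rest))
            candidates.append(E_new)
        # B - B_sub random re-orderings to reduce the risk of poor orderings.
        for _ in range(B - B_sub):
            E_new = list(E)
            random.shuffle(E_new)
            candidates.append(E_new)
    # Keep the top-B permutations by validation accuracy as the next beam.
    candidates.sort(key=eval_fn, reverse=True)
    return candidates[:B]
```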
To initialize the example permutations E with informativeness and diversity, we formulate initialization as a discrete optimization problem that maximizes Σ_{e ∈ E} s(e, E − e), which can be solved by a discrete optimization solver such as CPLEX (Cplex, 2009).
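Solving this initialization exactly requires a solver such as CPLEX; as a hedged alternative for illustration only, a greedy approximation of the same objective might look as follows (this is not the solver used in the paper).

```python
def greedy_init(D_filtered, n, score_fn):
    """Greedily build an initial permutation of n examples that are individually
    informative and mutually diverse (approximation of the discrete objective)."""
    E = []
    pool = list(D_filtered)
    for _ in range(n):
        best = max(pool, key=lambda e: score_fn(e, E))   # s(e, current selection)
        E.append(best)
        pool.remove(best)
    return E
```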

Method Comparison
We mainly compare our proposed method with the following baselines. Random: we randomly select examples from the training set. Random & Validation: we evaluate multiple sets of random examples on the validation set and select the best one. We consider Random & Validation under two settings whose computational cost is similar to our method's: 1. the size of the validation set is the same as ours at stage 2 (100) and the number of random example sets equals the number of example permutations we search and evaluate (640); 2. the validation set is larger (1000) and the number of random example sets is 100. We also consider a wide range of coreset selection methods from gradient-based learning scenarios; according to their methodologies, they can be divided into multiple categories. Geometry-Based Methods assume that data points that are close in the feature space have similar properties, including Herding (Chen et al., 2012) and K-Center Greedy (Sener and Savarese, 2018). Uncertainty-Based Methods assume that examples with higher uncertainty have a greater impact on the model and should be contained in the coreset, including Least Confidence, Entropy, Margin (Coleman et al., 2020) and CAL (Margatina et al., 2021). Error/Loss-Based Methods assume that the example that contributes more to the error or loss during training is more important and should be included in the coreset, including Forgetting (Toneva et al., 2019) and GraNd (Paul et al., 2021). Gradient-Matching-Based Methods: since deep models are usually trained by gradient descent, they try to find a coreset whose gradients can imitate the entire dataset's gradients, including CRAIG (Mirzasoleiman et al., 2020) and GradMatch (Killamsetty et al., 2021a). Submodularity-Based Methods: submodular functions (Iyer and Bilmes, 2013) naturally measure a subset's informativeness and diversity and can thus be powerful for coreset selection, including Facility Location and Graph Cut (Iyer and Bilmes, 2013). Bilevel-Optimization-Based Methods transform coreset selection into a bilevel optimization problem whose outer and inner objectives are subset selection and model parameter optimization, respectively: Glister (Killamsetty et al., 2021b). Due to the page limit, we introduce these methods and their implementation details in Appendix A and Appendix B.1, respectively.

Implementation Details For the LM, we follow Min et al. (2022a) and use GPT2-L (Radford et al., 2018). We set the number of retained examples after filtering m, the weight of diversity λ, the beam size B and the number of diversity search iterations to 500, 1, 8 and 10, respectively. We give the overall hyper-parameters, implementation details, analysis details and complexity analysis in Appendix B. For the baselines and LENS, we run each method under 4 prompt templates over 10 random seeds (40 runs in total) and report the average performance with and without calibration (Zhao et al., 2021), unless otherwise specified. We show the overall templates and dataset statistics in Appendix C and D.
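For concreteness, the Random & Validation baseline described above amounts to the following sketch; the helper names and the exact trial counts (e.g., 640 trials with |V| = 100, or 100 trials with |V| = 1000) follow the two settings described earlier and are otherwise illustrative.

```python
import random

def random_and_validation(train_set, val_set, n_examples, n_trials, eval_fn):
    """Sample n_trials random example sets and keep the one that scores best
    on the validation set."""
    best_set, best_acc = None, -1.0
    for _ in range(n_trials):
        candidate = random.sample(train_set, n_examples)
        acc = eval_fn(candidate, val_set)
        if acc > best_acc:
            best_set, best_acc = candidate, acc
    return best_set
```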

Main Results
We show the results in Table 2. We observe that our method significantly outperforms the baselines on all datasets with or without the calibration mechanism, which shows its overall ability to find task-representative support examples across different settings and task families. In particular, our method performs better than the Random & Validation baseline, which directly demonstrates its non-triviality. Meanwhile, previous methods designed for gradient-based learning perform similarly to the Random baseline, which indicates: 1. there is a non-negligible gap between ICL and these methods; 2. it is necessary to design ICL-specific methods to find support examples. Additionally, the Random baseline slightly underperforms Zero-Shot, which shows that random examples can hardly characterize the task fully and underlines the necessity of finding support examples for ICL. In experiments, we observe that Random & Validation suffers from ICL's instability: many example permutations selected by the validation set do not consistently yield satisfactory results on the test set, which degrades its performance, whereas our method is more robust and less susceptible to this issue.

Analysis
The Sensitivity of Support Examples to Orders A recent study (Lu et al., 2022) shows that the ordering of in-context examples has a significant influence on ICL performance. Specifically, for the same set of randomly sampled examples, different orders can result in anything from near state-of-the-art to random-guess performance. In this section, we explore the effect of ordering for our support examples on SST-2, Amazon, MR and Subj. For each task, we select four sets of support examples and four sets of random examples and evaluate their performance under eight randomly sampled orders. We show the performance distribution in Figure 2. We see that random examples with different orders show highly unstable performance, where the worst drops to the random-guess level, which is consistent with the conclusion of previous work (Lu et al., 2022). In contrast, the support examples' performance is significantly more stable: most orders still lead to approximately the same performance as the searched orders, and few orders lead to random-guess performance. This phenomenon is compatible with the conclusion of recent work (Chen et al., 2022), which shows a strong negative correlation between ICL sensitivity and accuracy. Moreover, our support examples' lower sensitivity to ordering demonstrates that they more effectively depict and characterize the corresponding task.
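A sketch of the order-sensitivity protocol above is shown below; `eval_fn` is an assumed helper that returns ICL accuracy for a given ordered example list, and the set/order counts mirror the four sets and eight orders used here.

```python
import random

def order_sensitivity(example_sets, eval_fn, n_orders=8, seed=0):
    """Evaluate each example set under several random orderings and collect the
    accuracy distribution plotted in Figure 2."""
    rng = random.Random(seed)
    results = {}
    for name, examples in example_sets.items():
        accs = []
        for _ in range(n_orders):
            order = list(examples)
            rng.shuffle(order)
            accs.append(eval_fn(order))
        results[name] = accs
    return results
```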
Transferability across Different LMs In the main experiments, we obtain GPT2-L's support examples and evaluate them using the same LM.

InfoScore and Progressive Filtering In this section, we evaluate the effect of InfoScore and progressive filtering in stage 1. Specifically, we randomly sample examples from the retained examples of stage 1 and test their average performance across 4 prompt templates with 10 random orders (40 runs in total). We compare the Random baseline, our filtering method, and a filtering variant, Filtering (Uninformative), that retains the examples with low InfoScore at each iteration. We show the results in Table 5. We observe that the proposed filtering method alone already leads to better ICL performance than randomly sampled examples, which directly shows that stage 1 is effective at filtering out uninformative examples. Meanwhile, the performance of Filtering (Informative), Random and Filtering (Uninformative) presents a descending trend, which demonstrates that the proposed InfoScore can indicate an example's in-context informativeness. However, compared with our entire method, Filtering (Informative) still shows a significant gap. This indicates the necessity of considering the in-context examples' dependency and the effectiveness of the proposed diversity-guided search.

Related Work
Since we introduce a wide range of coreset selection methods in Section 4.1, we omit them here and mainly introduce previous work on example selection for ICL. Previous works mainly consider example-level retrieval for ICL. Liu et al. (2022) leverage a semantic embedder to retrieve relevant examples for a given test input. Das et al. (2021) and Hu et al. (2022) use dense retrievers trained with task-specific targets' similarities to retrieve in-context examples for question answering and dialogue state tracking, respectively. Rubin et al. (2022) and Shi et al. (2022) train a demonstration retriever based on the feedback of the language model for semantic parsing. Wu et al. (2022) use Sentence-BERT (Reimers and Gurevych, 2019) to retrieve relevant examples and introduce an information-theoretic criterion to rerank their permutations. Levy et al. (2022) and Ye et al. (2023) further consider diversity in example retrieval. Different from these methods, which aim to provide example-specific information for the test input, we focus on task-level example selection, which seeks to find examples that are representative of the task, and is complementary to them. Moreover, because large language models (Brown et al., 2020; Zhang et al., 2022a; Black et al., 2021) almost all adopt a purely causal Transformer (Vaswani et al., 2017), we can calculate the task-level in-context examples' representation in advance and reuse it for different test inputs. Since these two settings' goals are orthogonal and complementary, we regard the hybrid setting and method as future work. Another line of methods is active learning (Ren et al., 2022) for ICL, which aims to select some examples from a large pool of unlabeled data and annotate them for ICL. (See also Efrat and Levy, 2020; Gao et al., 2022; Ye et al., 2022; Meng et al., 2022; Cheng et al., 2023.)

Conclusion
In this paper, we propose a two-stage filter-then-search method to find support examples for in-context learning from an annotated dataset. First, we propose InfoScore to select individually informative examples with a progressive filtering process. Then, we propose a diversity-guided example search which iteratively refines and evaluates the selected examples to find the example permutations that fully depict the task. The experimental results show that our method significantly outperforms extensive baselines, and further analyses show that each component contributes critically to the improvements and shed light on the principles of support examples and in-context learning.

Limitations
These are the limitations of this work: • Due to computational resource limitations, we mainly conduct experiments on GPT2-L (Radford et al., 2018) and analyze the cross-LM transferability of support examples in Section 4.3. We leave the exploration of more LMs as future work.
• In this paper, the proposed filter-then-search framework explores how to find support examples of in-context learning.We see exploring and analyzing more principles of in-context learning as future work.
• Language models have exhibited various kinds of bias (Bender et al., 2021); since our filtering stage is based on the LM's feedback, the filtered examples might also exhibit these biases. We see language model debiasing as an important future research topic.

A Baselines
We introduce the compared coreset selection baselines in more detail here. Forgetting (Toneva et al., 2019) counts how many times an example is forgotten during training, i.e., misclassified after being correctly classified in previous epochs. Paul et al. (2021) propose the GraNd score to select informative examples; GraNd is the expectation of the example's gradient norm, and the larger an example's GraNd is, the more important it is. Gradient Matching Based Methods: since deep models are usually trained by gradient descent, they try to find a coreset whose gradients can imitate the entire dataset's gradients.

B.1 Baseline Details
For the previous coreset selection methods, to reduce the gap between these methods and ICL, we use the same LM (GPT2-L) with "last pooling" fine-tuned on the whole dataset for 5 epochs to obtain the relevant metrics, e.g., gradients or forgetting counts. Following Guo et al. (2022), we use the gradients of the final fully-connected layer's parameters as these methods' example features. For baselines that output a weighted subset of examples, e.g., CRAIG or GradMatch, we simply adopt the selected examples, since there are few methods for weighting in-context examples in ICL.
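A minimal sketch of how such a per-example gradient feature can be extracted is shown below; it assumes a fine-tuned sequence-classification model and passes the final linear head explicitly (the attribute name of the head varies across model classes), so the names and setup are illustrative rather than the exact baseline implementation.

```python
import torch
import torch.nn.functional as F

def final_layer_grad_feature(model, head, input_ids, attention_mask, label):
    """Per-example feature: gradient of the loss w.r.t. the final fully-connected
    (classification) head of the fine-tuned LM."""
    model.zero_grad()
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    loss = F.cross_entropy(logits, label)   # label: tensor of shape (batch,)
    loss.backward()
    grads = [p.grad.detach().flatten() for p in head.parameters()]
    return torch.cat(grads)                 # flattened gradient used as the feature
```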

B.2 Method Details
We find each prompting template's corresponding support examples separately, both for our method and for the compared methods, i.e., we select examples for each prompting template separately. For simplicity, we calculate Eq (3) without calibration for experiments both with and without calibration. For the filtering stage, we set the progressive factor and the size of the initial score set according to the dataset's size. Specifically, we set the progressive factor so that the number of filtering iterations is 4.
We run all experiments under the label-balanced setting, and the total number of in-context examples for most datasets except DBPedia is set to 8. For some datasets the number of in-context examples is not exactly 8 but close to it, because 8 is not divisible by the number of labels, e.g., 5 for SST-5. Since DBPedia has 14 labels and significantly longer input sequences, we run experiments on it under the label-unbalanced setting and set the total number of examples to 4. In the label-balanced setting, we 1. filter the same number of examples for each label, 2. initialize the example permutation of stage 2 with balanced labels, and 3. update e* with an e*_new whose label is the same as e*'s.
We list the total number of examples in Table 6. We set the size of the initial score set so that the number of LM forward passes is around 10K. We list the progressive factor ρ and the size of the initial score set |S_0| in Table 7. For the other hyper-parameters, we conduct a grid search for the number of retained examples after filtering m, the weight of diversity λ, the beam size B and the number of diversity-guided search iterations over {500, 1000}, {0.5, 1, 2}, {4, 8, 16} and {5, 10, 15}, respectively, on the SST-2 dataset, and we set them to 500, 1, 8 and 10, respectively.

B.3 Experimental Details
In section "The Sensitivity of Support Examples to Orders", since the performance is sensitive to the prompting templates, we show the performance distribution under a specific prompting template.In other analysis experiments, for simplicity, we report the average performance under four different prompting templates without calibration, unless otherwise specified.

B.3.1 The Complexity of Our Method
Progressive Filtering In the filtering stage, we need to compute the pairwise Eq (3) N * l * ρ/ρ = N * l times per iteration, where N is the size of the training set and l is the size of the initial score set: the number of remaining candidates shrinks by a factor of ρ each iteration while the score set grows by the same factor, so the per-iteration cost stays constant. Since we filter the dataset to 1/ρ of its previous size until a small set of examples is left, the number of iterations is log_ρ N. Thus the filtering stage's complexity over N is O(N * log_ρ N). In experiments, we set ρ to N^{1/C} to make it a linear complexity, where C is a constant. Depending on the size of the dataset, ρ is usually set between 2 and 3, as shown in Table 7.
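As a rough worked example (assuming N = 30,000 training examples, as for the subsampled datasets in Appendix D, m = 500 retained examples, and that filtering stops once m examples remain), choosing ρ so that filtering takes about four rounds gives ρ ≈ 2.8 and the shrinking schedule below; since the score set grows by the same factor ρ each round, the number of LM forward passes per round stays roughly constant.

```python
N, m = 30_000, 500            # assumed training-set size and retained examples
rho = (N / m) ** (1 / 4)      # progressive factor giving ~4 filtering rounds
sizes = [N]
while sizes[-1] > m:
    sizes.append(max(m, int(sizes[-1] / rho)))
print(round(rho, 2), sizes)   # ~2.78, roughly [30000, 10800, 3900, 1400, 500]
```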
Diversity-Guided Example Search At each iteration, we have B candidate permutations and separately update each of them B times. We then evaluate these updated candidate permutations on the small validation set sampled from the remaining training set, whose size is fixed. Since updating the candidate permutations reuses the intermediate results of the filtering stage and does not involve LM computation (see Eq (4) and (5)), we omit it from the complexity analysis. So the cost of diversity-guided example search is constant per iteration, i.e., B * B evaluations.

C Prompting Templates
We show the prompting verbalizers and templates in Table 8.

D Dataset Split and Statistics
We use the same dataset split as previous work (Min et al., 2022a). Due to computational resource limitations, for Amazon, AGNews and DBPedia, we conduct experiments on a randomly sampled subset of each (30,000 and 2,000 examples for the training and test sets, respectively), and we show the overall dataset statistics in Table 6.
Figure 1: In-context learning and model training learn in different ways, where the two arrow types denote the forward and backward processes, respectively, and L(·,·) is the loss function.

Figure 2: The performance distribution of support examples and random examples with multiple orders.

Table 1: The case study of examples' combinatorial dependency on SST-2, where "it was [great/terrible]" is the template. Although the two examples bring good performance separately, combining them hurts performance.

Table 4: Impact of different hyper-parameters.
Our method compares favorably with the Random baseline in general, across various hyper-parameter configurations, which indicates its robustness to hyper-parameters. Meanwhile, we observe two slight performance degradations, when m = 100 or B = 1.

Table 5: The effect of progressive filtering.

Table 6: Data statistics.