Coverage-based Example Selection for In-Context Learning

In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires these examples to be informative about the test instance. The standard approach of independently ranking and selecting the most similar examples yields redundant demonstrations while omitting important information. In this work, we show that BERTScore-Recall (BSR) selects better examples that demonstrate more of the salient aspects, e.g., reasoning patterns, of the test input. We further extend BSR and many standard metrics to easily optimizable set-level metrics, giving still better coverage of those salient aspects. On 15 datasets spanning 6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric for in-context example selection across the board, and (2) for compositional tasks, set selection using SET-BSR outperforms independent ranking by up to 17 points on average and, despite being training-free, surpasses methods that leverage task- or LLM-specific training.


Introduction
Large language models (LLMs) (Devlin et al., 2019; Brown et al., 2020) are capable of generalizing to novel tasks (Brown et al., 2020) by conditioning on textual prompts consisting of a few task examples. This training-free paradigm of few-shot inference, known as in-context learning (ICL), reduces the cost of modeling new tasks while also providing an interpretable and customizable interface (Liu et al., 2022; Wei et al., 2023) and improving generalization (Anil et al., 2022; Qiu et al., 2022b; Drozdov et al., 2023) and reasoning skills (Wei et al., 2023). However, ICL performance is critically sensitive to the choice of demonstrations (Zhao et al., 2021; Liu et al., 2022; Lu et al., 2022; Rubin et al., 2022; Schick and Schütze, 2021), as the LLM relies on them to understand and solve the test instance.
The standard approach to selecting ICL examples, or demonstrations, from a pool of candidates is to independently score them using a relevance metric and choose the top-ranked ones. However, cosine similarity and BM25, the two most commonly used metrics, are sub-optimal for selecting demonstrations due to their reliance on a single dense embedding and on unigram overlap, respectively. Moreover, since this approach selects examples independently, it ignores their utility as a set. It is particularly inadequate for complex compositional tasks like semantic parsing (Levy et al., 2022), where no single candidate might contain all the relevant reasoning patterns: independent selection would pick multiple redundant examples with the same reasoning patterns while failing to demonstrate the others. Figure 1 shows a failure case where similarity-based selection picks paraphrased examples that fail to demonstrate how to find a manager. Prior work on selecting demonstrations as a set (Ye et al., 2023; Levy et al., 2022) required task- and/or LLM-specific training, limiting its utility. For this reason, simple yet widely applicable training-free methods like BM25 and cosine similarity remain the most popular approaches for ICL example selection.
In this work, we propose a novel framework for selecting sets of demonstrations that are maximally informative about the salient aspects of the test input, e.g., reasoning patterns, entities, etc. Examples selected using this framework are informative about the test input and help the LLM understand and perform the task. We use this framework to explore different ways to characterize salient aspects, including syntactic structures such as dependency-parse subtrees and contextual token embeddings, using BM25 and BERTScore (Zhang et al., 2020), respectively, to measure their coverage. To select the demonstrations as a set, we extend these coverage metrics to measure the overall informativeness of a set of demonstrations. We show that the resulting set-level metrics are submodular and can be efficiently optimized to find demonstration sets that maximally cover the salient aspects. We evaluate our ICL example selection methods on 15 diverse datasets, including 6 semantic parsing, 2 numerical reasoning, and 7 classification datasets, and with 7 LLMs of varying sizes and pretraining. Among instance-level metrics, BSR, the recall variant of BERTScore, consistently outperforms standard retrieval metrics on all datasets and LLMs, beating cosine similarity by up to 8 points on average on the semantic parsing datasets and by 15 points on the rest. Selecting demonstrations as a set using SET-BSR, the set extension of BSR, leads to further gains in semantic parsing and is particularly effective in compositional settings, where the gains grow with LLM size. With Codex, a 175B-parameter LLM, SET-BSR outperforms cosine similarity by 17% on average, with up to 49% improvement on some splits, and, despite being training-free, outperforms even trained methods such as those of Rubin et al. (2022), Levy et al. (2022), and Ye et al. (2023) that require task- or LLM-specific training.

Related Work
In-context learning for few-shot inference facilitates the use of LLMs for novel tasks without the need for expensive supervised fine-tuning. In addition to reduced cost, it has several other advantages over supervised fine-tuning: it provides a more interpretable and customizable interface to using LLMs (Liu et al., 2022; Wei et al., 2023), and it retains linguistic understanding and knowledge from pretraining, leading to improved generalization (Anil et al., 2022; Qiu et al., 2022b; Drozdov et al., 2023) and reasoning skills (Wei et al., 2023).
However, the performance of ICL is critically sensitive to the choice of demonstrations (Zhao et al., 2021; Liu et al., 2022). This has led to a growing interest in techniques for selecting good demonstrations. Prior work can be roughly classified into (1) independently scoring and retrieving examples (Liu et al., 2022; Rubin et al., 2022), (2) selecting diverse examples to reduce redundancy among them (Su et al., 2022; Levy et al., 2022; Agrawal et al., 2022; Ye et al., 2022), and (3) selecting examples that minimize the entropy of the LLM's output distribution for the test input (Lu et al., 2022; Wu et al., 2023). Recent work has also trained RL agents (Lu et al., 2023) and used Bayesian inference (Wang et al., 2023).
The most similar studies to ours are Levy et al. (2022) and Ye et al. (2023). Levy et al. (2022) select diverse demonstrations that cover substructures of the target output predicted by task-specific classifiers, but their approach is limited in applicability to a few semantic parsing tasks. Ye et al. (2023) use Determinantal Point Processes (Kulesza, 2012) to select a diverse set of demonstrations similar to the test instance but do not optimize for coverage directly and require training with the LLM. Moreover, both methods require task- or LLM-specific training, which limits their use and effectiveness with larger LLMs.

Preliminaries
In-context learning is the ability of LLMs to solve novel tasks by merely conditioning on a few task demonstrations. Formally, given demonstrations $\{(x_i, y_i)\}_{i=1}^{k}$ and the test input $x_{\text{test}}$, it involves using textual templates to linearize instance inputs and outputs into sequences of tokens from the LLM vocabulary, $\mathbf{x} = I(x) = \langle x^1 \ldots x^{|\mathbf{x}|} \rangle$ and $\mathbf{y} = O(y) = \langle y^1 \ldots y^{|\mathbf{y}|} \rangle$. The linearizations are then concatenated to form a prompt and fed to the LLM for conditional generation of the test output:

$\hat{y}_{\text{test}} \sim P_{\text{LLM}}(\,\cdot \mid \mathbf{x}_1, \mathbf{y}_1, \ldots, \mathbf{x}_k, \mathbf{y}_k, \mathbf{x}_{\text{test}})$

The interpretable and training-free nature of ICL makes it an attractive alternative to supervised fine-tuning. However, its performance is highly sensitive to the choice and order of demonstrations.
Demonstration Selection identifies which examples to include in the prompt for any test instance. Formally, given a test input $x_{\text{test}}$ and a pool of $N$ candidates $T = \{z_i = (x_i, y_i)\}_{i=1}^{N}$, the goal is to select a subset of $k \ll N$ demonstrations that, when included in the context, make $y_{\text{test}}$ the most likely generation. A naive approach is to randomly sample $k$ instances from $T$, but this is sub-optimal since the demonstrations are often completely unrelated to the test input. Instead, the standard approach to selecting demonstrations that are informative about the test input is to independently assign each candidate $z$ a score $\mathrm{score}(x_{\text{test}}, z)$ using a relevance metric and then select the top $k$ candidates.

Relevance Metrics
The two most commonly used relevance metrics for scoring demonstrations are cosine similarity and BM25. Cosine similarity uses a representation function $R$ to independently map the textual linearizations of inputs to unit-norm embeddings $r_x = R(\mathbf{x})$ in a common vector space and then scores the candidate $z$ using the dot product, $\mathrm{cosine}(x_{\text{test}}, z) = r_{x_{\text{test}}}^{\top} r_z$. BM25, on the other hand, is a sparse information-retrieval algorithm belonging to a class of TF-IDF measures that view the test input and the candidates as bags of terms and measure relevance as a weighted recall or coverage of these terms:

$\mathrm{BM25}(x_{\text{test}}, z) = \sum_{s \in T_{x_{\text{test}}}} \mathrm{idf}(s)\, \mathrm{tf}(s, T_z)$

Here $T_x$ and $T_z$ are the sets of terms in $\mathbf{x}$ and $\mathbf{z}$ respectively, and $\mathrm{tf}(s, T_z)$ and $\mathrm{idf}(s)$ are the term-frequency and inverse-document-frequency statistics that measure the coverage of a particular term and the relative importance of terms, respectively. We use tf and idf as per the Okapi variant of BM25 (Robertson et al., 1993; Jones et al., 2000).
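For concreteness, the following is a minimal sketch (not the released implementation) of this independent top-k selection with both metrics. The embeddings are assumed to be produced elsewhere by any sentence encoder that outputs unit-norm vectors, and the BM25-style scorer uses the simplified idf(s)·tf(s, T_z) form above rather than the full Okapi weighting.

```python
import numpy as np
from collections import Counter

def cosine_score(r_test: np.ndarray, r_cand: np.ndarray) -> float:
    # Both embeddings are assumed unit-norm, so the dot product is the cosine similarity.
    return float(r_test @ r_cand)

def bm25_like_score(test_terms, cand_terms, idf) -> float:
    # Simplified weighted term recall: sum over test-input terms of idf(s) * tf(s, T_z).
    tf = Counter(cand_terms)
    return sum(idf.get(s, 0.0) * tf[s] for s in set(test_terms))

def select_top_k(test_input, candidates, score_fn, k=8):
    # Standard independent selection: score every candidate, keep the k highest scoring.
    scores = [score_fn(test_input, z) for z in candidates]
    ranked = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in ranked]
```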

Informative Demonstrations
The limitation of the standard demonstration selection approach is that, by independently scoring the demonstrations, it ignores their utility as a set. For ICL to work, the demonstrations included in the context need to be informative about how to understand and solve the test input. In this section and the next, we describe our approach to selecting informative sets of demonstrations for ICL. We begin by defining our notion of the informativeness of demonstrations in ICL and describing how to measure it. Thereafter, in §5, we discuss how to extend this notion to an algorithm for selecting optimally informative sets of demonstrations.
Informativeness Demonstrations should demonstrate the salient aspects, e.g., reasoning patterns, entities, etc., of the test input. Formally, denoting by $S_{x_{\text{test}}}$ the set of salient aspects of the test input, we measure the informativeness of a demonstration $z$ in terms of its coverage of these salient aspects:

$\mathrm{cov}(x_{\text{test}}, z) = \sum_{s \in S_{x_{\text{test}}}} c(s, z)$   (3)

where $c(s, z)$ measures the coverage (or recall) of a single salient aspect $s$ by $z$.
Salient Aspects Both cosine similarity and BM25 are special cases of Eq. 3 for different notions of salient aspects. For BM25, $S_{x_{\text{test}}} = T_{x_{\text{test}}}$, the set of unigrams in $\mathbf{x}$, and $c(s, z) = \mathrm{idf}(s)\,\mathrm{tf}(s, T_z)$.
Cosine similarity, although not explicitly a recall metric, can also be interpreted as evaluating coverage of the dimensions of the test-input embedding by taking $S_{x_{\text{test}}} = \{1, \ldots, d\}$, the dimensions of the dense embedding, as the salient aspects, i.e.,

$c(i, z) = r_{x_{\text{test}}}[i]\; r_{z}[i]$ for $i \in \{1, \ldots, d\}$, so that $\sum_{i=1}^{d} c(i, z) = r_{x_{\text{test}}}^{\top} r_z = \mathrm{cosine}(x_{\text{test}}, z)$.

The above interpretations reveal why neither cosine similarity nor BM25 is a good measure of informativeness. While cosine similarity captures some aspects of semantic similarity (depending on the embedding), it is limited to a single embedding. And unigrams, the terms commonly used with BM25, are too small to capture most salient aspects. A good measure of informativeness requires an accurate characterization of salient aspects. One option is to use larger syntactic substructures of the input as terms with BM25; we experiment with larger n-grams and with subtrees of the dependency parse tree. However, such syntactic structures are constrained to the surface form of the instance and hence may not capture meaning or aspects like reasoning patterns.
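As an illustration of what such larger terms could look like, here is a sketch of building n-gram and small dependency-subtree terms with spaCy. It assumes fragments are limited to a head together with up to three of its children (mirroring the size-4 limit of the BM25[4-GRAM] and BM25[4-DEPST] variants analyzed later); the exact extraction used in the paper may differ, and the en_core_web_sm model must be installed.

```python
import spacy
from itertools import combinations

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def ngram_terms(text: str, n: int = 4):
    # All contiguous n-grams up to length n, joined into single term strings.
    tokens = [t.text.lower() for t in nlp(text)]
    return {" ".join(tokens[i:i + j])
            for j in range(1, n + 1)
            for i in range(len(tokens) - j + 1)}

def depst_terms(text: str, max_size: int = 4):
    # Small connected fragments of the dependency parse: each head together with
    # subsets of its direct children, up to max_size tokens per fragment.
    terms = set()
    for tok in nlp(text):
        children = list(tok.children)
        for size in range(0, min(len(children), max_size - 1) + 1):
            for subset in combinations(children, size):
                frag = sorted((tok, *subset), key=lambda t: t.i)
                terms.add(" ".join(t.lemma_.lower() for t in frag))
    return terms
```

Either set of terms can then be plugged into the BM25-style coverage score in place of unigrams.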
A better way to capture salient aspects is to use contextualized token embeddings, the idea behind the BERTScore (Zhang et al., 2020) metric. BERTScore was originally proposed as a metric for evaluating the quality of machine-generated text (e.g., machine translation) by comparing it to a reference text. It leverages pre-trained contextual embeddings to match words in the candidate and reference sentences by cosine similarity and computes precision, recall, and F1 measures. Formally, given the sequences of contextual embeddings $\langle \mathbf{x}_1, \ldots, \mathbf{x}_{|\mathbf{x}|} \rangle$ and $\langle \mathbf{z}_1, \ldots, \mathbf{z}_{|\mathbf{z}|} \rangle$ of the tokens in $\mathbf{x} = \langle x_1, \ldots, x_{|\mathbf{x}|} \rangle$ and $\mathbf{z} = \langle z_1, \ldots, z_{|\mathbf{z}|} \rangle$ respectively, the recall measure, BERTScore-Recall (BSR), is defined as:

$\mathrm{BSR}(\mathbf{x}, \mathbf{z}) = \sum_{x_i \in \mathbf{x}} w(x_i) \max_{z_j \in \mathbf{z}} \mathbf{x}_i^{\top} \mathbf{z}_j$

Here, $w(x_i)$ is a weight assigned to token $x_i$; it can be set to $\frac{1}{|\mathbf{x}|}$ to treat every token as equally important, or to $\frac{\mathrm{idf}(x_i)}{\sum_{x_i \in \mathbf{x}} \mathrm{idf}(x_i)}$ to downweight common words. The precision measure is defined analogously, while the F1 measure is the harmonic mean of the two. BSR is also a special case of Eq. 3, with the contextualized tokens as salient aspects, i.e., $S_{\mathbf{x}} = \langle \mathbf{x}_1, \ldots, \mathbf{x}_{|\mathbf{x}|} \rangle$, and can be used to select examples by treating them as candidates and the test input as the reference.
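Given unit-normalized contextual token embeddings, the recall computation above is a single matrix product followed by a per-row max. Below is a minimal NumPy sketch of this definition; `x_emb`, `z_emb`, and `weights` are assumed to be supplied by the caller (e.g., from a BERT-style encoder).

```python
import numpy as np

def bertscore_recall(x_emb: np.ndarray, z_emb: np.ndarray, weights=None) -> float:
    """BSR(x, z): weighted average, over test-input tokens, of each token's best
    cosine match among candidate tokens. Embeddings are assumed unit-normalized."""
    sim = x_emb @ z_emb.T          # |x| x |z| matrix of cosine similarities
    best = sim.max(axis=1)         # greedy best match for each test-input token
    if weights is None:
        weights = np.full(len(x_emb), 1.0 / len(x_emb))  # uniform w(x_i) = 1/|x|
    return float(weights @ best)
```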

The table below summarizes the informativeness measures and the salient aspects they cover.

Metric       Salient Aspects
Cosine       embedding dimensions
BM25         unigrams, n-grams, dependency parse subtrees
BERTScore    contextual token embeddings

5 Set-level Information Coverage

So far, we have focused on measuring the informativeness of a single demonstration in order to rank candidates and independently select the most informative ones. However, as depicted in Fig. 1, when no single candidate demonstrates all salient aspects, this approach can fail to cover all of them while also selecting redundant demonstrations that provide no new information. One scenario where this can happen is when the candidate pool contains close paraphrases (or duplicates). This suggests that demonstrations should be selected as a set.
Set Metric To evaluate the informativeness of a set of examples $Z$, we extend the coverage measure in Eq. 3 to sets as follows:

$\mathrm{setcov}(x_{\text{test}}, Z) = \sum_{s \in S_{x_{\text{test}}}} \max_{z \in Z} c(s, z)$   (6)

Intuitively, this measures the coverage of each salient aspect as the best coverage it receives from any example in the set. In other words, maximizing it requires that every salient aspect appear at least once in some demonstration, without regard to which or how many demonstrations cover it. Since cosine similarity, BM25, and BSR are all special cases of Eq. 3, they can all be extended to set measures using Eq. 6.
Submodularity Given the combinatorial space of sets of demonstrations, for a measure on sets to be practical, it needs to be efficiently optimizable. Fortunately, the set-level metric defined above is submodular for any definition of $c(s, z)$; we prove this in Appendix A. Intuitively, this follows from the facts that (1) for any given test instance, $c(s, z)$ assigns a scalar weight to each demonstration $z \in Z$, (2) the maximum of weights across set elements is submodular, and (3) a sum of submodular functions is also submodular. This means that the set-level metric can be optimized using a greedy algorithm with a constant-factor approximation guarantee (Nemhauser et al., 1978).
Algorithm The greedy algorithm we use to select the optimal set is shown in Algorithm 1. In every iteration, it selects the example that maximally increases the coverage of the current set of demonstrations (lines 5-9). If no such example exists, it resets (line 11). When computing the score for candidate sets (line 5), we use the identity

$\mathrm{setcov}(x_{\text{test}}, Z \cup \{z\}) = \sum_{s \in S_{x_{\text{test}}}} \max\bigl(\max_{z' \in Z} c(s, z'),\; c(s, z)\bigr)$

so the per-aspect best coverage of the current set can be maintained incrementally. Assuming constant time for computing each $c(s, z)$, the time complexity of the algorithm is $O(kNL)$, where $L = |S_{x_{\text{test}}}|$. For BSR, the complexity of computing the coverage scores for all $z \in Z$ is $O(Td)$, where $T$ is the total number of tokens in $Z$ and $d$ is the token-embedding size. Thus, the time complexity of both instance- and set-level BSR is dominated by the computation of the coverage scores and is $O(LTd)$. While slower than cosine and BM25, we found this to be a small overhead relative to in-context learning for most datasets considered in this work. We discuss this further in App. C.
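A minimal sketch of this greedy procedure is shown below. It assumes a precomputed matrix `cov` with `cov[j, s] = c(s, z_j)` for each candidate j and each salient aspect s of the test input, and it maintains the per-aspect best coverage of the current set so that, via the identity above, each gain computation reduces to an elementwise max.

```python
import numpy as np

def greedy_set_select(cov: np.ndarray, k: int):
    """Greedily pick k candidates maximizing setcov = sum_s max_{z in Z} c(s, z).

    cov: (num_candidates, num_aspects) matrix of c(s, z) values for one test input.
    """
    n, _ = cov.shape
    selected = []
    best = np.zeros(cov.shape[1])           # best[s] = max coverage of aspect s so far
    remaining = set(range(n))
    for _ in range(min(k, n)):
        # Marginal gain of adding z: sum_s max(best[s], cov[z, s]) - sum_s best[s].
        gains = {j: float(np.maximum(best, cov[j]).sum() - best.sum()) for j in remaining}
        if max(gains.values()) <= 0:
            # No remaining candidate adds coverage: reset the covered-aspect vector
            # (cf. Algorithm 1, line 11) so additional picks can re-cover aspects.
            best = np.zeros_like(best)
            gains = {j: float(np.maximum(best, cov[j]).sum()) for j in remaining}
        j_star = max(gains, key=gains.get)
        selected.append(j_star)
        remaining.remove(j_star)
        best = np.maximum(best, cov[j_star])
    return selected
```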
In addition to the standard IID splits, we also evaluate compositional generalization using compositional splits wherever available. For GeoQuery, we use three types of compositional splits: Template (Finegan-Dollak et al., 2018), TMCD (Keysers et al., 2020), and Length. Following Levy et al. (2022), we use the compositional splits (three Template, three TMCD, and one Length) generated by Qiu et al. (2022a) and average results across the TMCD and Template splits. For ATIS and Overnight, we experiment with Template splits (Finegan-Dollak et al., 2018) generated by Gupta et al. (2022). For SMCalFlow, we experiment with the splits in SMCalFlow-CS (Yin et al., 2021): an IID split (8-S) and a compositional split (32-C).
For all splits, following prior work (Ye et al., 2023; Rubin et al., 2022), we randomly subsample 44,000 instances from the training set to use as the pool of candidate demonstrations. For evaluation, we use a random subsample of 1,000 instances of the validation set if available, and of the test set otherwise. We use Exact Match (EM) accuracy for all datasets except BREAK, for which we use LF-EM (Hasson and Berant, 2021), which better captures semantic equivalence than EM.
Additionally, we experiment with (1) a random baseline (RANDOM) that randomly selects demonstrations from the pool, and (2) the set extensions of COSINE, BM25, and BSR described in §5, referred to as SET-COSINE, SET-BM25, and SET-BSR respectively.

Trained Methods
We also compare with methods that require task- or LLM-specific training. EPR (Rubin et al., 2022) uses LLM perplexity to train a dense retriever for each dataset. CEIL (Ye et al., 2023) uses EPR and an LLM to train a Determinantal Point Process (Kulesza, 2012) for each dataset and then uses it to select examples. We use Ye et al. (2023)'s implementation of EPR and CEIL with GPT-Neo-2.7B as the LLM. We also compare with LFCOV (Levy et al., 2022), a method for semantic parsing, specifically SMCalFlow-CS and GeoQuery, that trains a classifier to predict logical-form substructures and then selects diverse examples containing them. We use the shots provided by the authors.

Prompt Construction
For k-shot ICL (we use k = 8 unless specified otherwise) with any given dataset (§6.1), demonstration selection method (§6.3), and LLM (§6.2), we construct the prompt as follows: (1) select up to k demonstrations, depending on the context window of the LLM; (2) order the demonstrations in increasing order of relevance so that the most relevant demonstrations appear closest to the test input; and (3) linearize the ordered demonstrations and the test input using the dataset's prompt template in Table 5 and concatenate them to form the prompt. For set-selection methods, the demonstrations are ordered by their corresponding instance-level score. For the trained baselines, we use the orderings recommended by the corresponding authors.
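The following sketch illustrates this prompt-construction procedure; `linearize` is a stand-in for the dataset-specific templates of Table 5, the `\n\n` separator follows Appendix B.1, and truncation to the LLM's context window is simplified here to a fixed cap on the number of demonstrations.

```python
def build_prompt(demos_with_scores, test_input, linearize, max_demos=8):
    """demos_with_scores: list of ((x, y), instance_level_score) pairs already chosen
    by the selector. Most relevant demonstrations are placed closest to the test input."""
    demos = sorted(demos_with_scores, key=lambda d: d[1])[-max_demos:]  # increasing relevance
    blocks = [linearize(x, y) for (x, y), _ in demos]
    blocks.append(linearize(test_input, None))  # test instance; output left for the LLM
    return "\n\n".join(blocks)
```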

Results
We begin by comparing the performance of our proposed methods, BSR and SET-BSR, with prior training-free and state-of-the-art trained methods in §7.1. We then analyze the different metrics for measuring the informativeness of individual demonstrations (§7.2) and the impact of coverage-based set selection using our set extension (§7.3).

Table 1 compares average performance across all semantic parsing splits for seven LLMs of varying sizes. See Table 2 for a comparison with LFCOV, which only works with GeoQuery and SMCalFlow-CS, and Table 11 for results on individual splits. While BSR consistently outperforms COSINE and BM25 for all LLMs, set selection using SET-BSR leads to further dramatic gains, with up to 17% improvement over COSINE with Codex, beating even state-of-the-art trained methods like EPR and CEIL by 12 and 6 points, respectively. Further, from Table 3, we see that, unlike SET-BSR, BSR is effective even for non-semantic-parsing datasets, outperforming COSINE by 15 points on average with GPT-Neo-2.7B (see Table 12), and often even EPR and CEIL (see Table 13). All the above improvements were statistically significant (p < 0.05) under paired permutation tests.

SET-BSR is more effective with larger LLMs
The effectiveness of SET-BSR monotonically improves as LLMs become more powerful. The trend is particularly pronounced on compositional splits, where it yields a 25% absolute improvement over COSINE on average (see Fig. 2) and a 49% improvement on the 32-C split of SMCalFlow-CS (see Table 2).

Trained methods do not leverage larger LLMs
As EPR and CEIL are trained using GPT-Neo-2.7B, they have difficulty generalizing to and taking advantage of larger, more powerful LLMs, becoming less effective on IID splits (Fig. 2) and failing on GSM8K (Table 3). The latter is likely because GPT-Neo-2.7B itself fails on GSM8K (Table 13), which requires chain-of-thought reasoning, an emergent ability of larger LLMs (Wei et al., 2022). As training with increasingly large LLMs is prohibitively expensive and impractical, these results demonstrate serious limitations of trained methods.

Measure of Informativeness
Contextual embeddings capture salient aspects From Tables 1 and 3, it is clear that BSR consistently outperforms COSINE and BM25. This is true even when using the same encoder (see App. D), holds on both IID and compositional splits (see Fig. 2), and persists with varying numbers of demonstrations (see Fig. 4). Larger syntactic substructures did not improve BM25, as seen in Table 4 (bottom). These results show that contextual embeddings are indeed better at capturing salient aspects.
Recall outperforms other measures Comparing the variants of BERTScore for Codex in Table 4 (top) and for the other LLMs in Fig. 7 in App. D, it is evident that recall is on par with, or better than, the F1 metric. This supports our hypothesis that recall, or coverage of salient aspects, is a useful measure of informativeness. We include additional ablations in App. D, analyzing the effect of importance weighting (IDF) and of using a larger LM to compute token embeddings for BSR.

Coverage-based Set Selection
Impact on performance From Fig. 3, we see that coverage-based set selection is most effective on compositional splits, where it improves the average performance of all metrics, including COSINE. This shows the importance of selecting demonstrations as a set in compositional settings, where examples demonstrating all the salient aspects of the test input are even less likely to exist. The set extension is less effective on IID splits and even hurts performance for COSINE and vanilla unigram BM25.
Overall, BSR and BM25 with larger substructures benefit the most from the set extension. We provide further analyses of the improvements from set selection and the impact of reordering in App. D.

A limitation of coverage-based set selection is that it can sacrifice the relevance of individual demonstrations in order to prioritize coverage of all aspects with the set (see Table 9 for an example from GSM8K). Additionally, even contextual token embeddings can only capture salient aspects that are explicitly expressed in the input text and thus may not be suitable for tasks where the salient aspects are more abstract and require reasoning themselves (see Table 10 for an example from QNLI). We leave it to future work to explore better measures of informativeness, including better characterizations of salient aspects.

Conclusion
This paper presents a novel framework for selecting informative sets of demonstrations that cover the salient aspects of the test input to aid the large language model (LLM) in solving it. We explore different ways to characterize these aspects and to quantify their coverage. Evaluation on a wide range of tasks and LLMs validates the effectiveness of BERTScore-Recall as a measure of the informativeness of individual demonstrations. Further, our results demonstrate the superiority of SET-BSR at selecting informative sets of demonstrations for compositional tasks like semantic parsing and highlight the ability of coverage-based demonstration selection, unlike trained methods, to leverage increasingly powerful LLMs. Our code base is available at https://github.com/Shivanshu-Gupta/icl-coverage.

A Submodularity

Definition A.1 (Submodular Function). If $\Omega$ is a finite set, a submodular function is a set function $f : 2^{\Omega} \to \mathbb{R}$, where $2^{\Omega}$ denotes the power set of $\Omega$, which satisfies one of the following equivalent conditions.
1. For every $X, Y \subseteq \Omega$ with $X \subseteq Y$ and every $x \in \Omega \setminus Y$, we have $f(X \cup \{x\}) - f(X) \geq f(Y \cup \{x\}) - f(Y)$.
2. For every $S, T \subseteq \Omega$, we have $f(S) + f(T) \geq f(S \cup T) + f(S \cap T)$.
3. For every $X \subseteq \Omega$ and every $x_1, x_2 \in \Omega \setminus X$ with $x_1 \neq x_2$, we have $f(X \cup \{x_1\}) + f(X \cup \{x_2\}) \geq f(X \cup \{x_1, x_2\}) + f(X)$.

Theorem A.1. The function $f_{\max,w}(X) = \max_{x \in X} w_x$ is submodular for any assignment of weights $w_x$ to the elements $x \in \Omega$.
Proof. Fix any $X \subseteq \Omega$ and $x_1, x_2 \in \Omega \setminus X$, and assume without loss of generality that $f_{\max,w}(X \cup \{x_1\}) \geq f_{\max,w}(X \cup \{x_2\})$. Then the following two inequalities clearly hold:

$f_{\max,w}(X \cup \{x_1\}) \geq f_{\max,w}(X \cup \{x_1, x_2\})$
$f_{\max,w}(X \cup \{x_2\}) \geq f_{\max,w}(X)$

Adding these two inequalities together, we get the third definition of submodularity, and thus $f_{\max,w}$ is submodular.
Theorem A.2. If $f_1, \ldots, f_n$ are all submodular functions, then $\sum_{i=1}^{n} f_i$ is also submodular.
Proof. We show this for $n = 2$: for any $S, T \subseteq \Omega$,

$(f_1 + f_2)(S) + (f_1 + f_2)(T) = \bigl(f_1(S) + f_1(T)\bigr) + \bigl(f_2(S) + f_2(T)\bigr) \geq \bigl(f_1(S \cup T) + f_1(S \cap T)\bigr) + \bigl(f_2(S \cup T) + f_2(S \cap T)\bigr) = (f_1 + f_2)(S \cup T) + (f_1 + f_2)(S \cap T)$

Therefore, $f_1 + f_2$ is submodular using the second definition of submodularity. By induction, this holds for any number $n$ of functions.
Theorem A.3. The set-level coverage metric $\mathrm{setcov}(x_{\text{test}}, Z)$ as defined in Eq. 6 is submodular for any definition of $c(s, z)$.
Proof. From Theorem A.1, the function $f_s(Z) = \max_{z \in Z} c(s, z)$ is submodular for any definition of $c(s, z)$. Further, since by Theorem A.2 the sum of submodular functions is also submodular, $\mathrm{setcov}(x_{\text{test}}, Z) = \sum_{s \in S_{x_{\text{test}}}} f_s(Z)$ is submodular.

B Datasets
We use 15 diverse datasets, including 6 semantic parsing, 2 numerical reasoning, and 7 classification datasets.

B.1 Semantic Parsing
We use 6 semantic parsing datasets with IID and compositional splits for our experiments. Table 5 shows sample instances from each dataset we experiment with, along with the textual template we use to linearize the instances. The ICL prompt is constructed by concatenating the templatized demonstrations and the test instance using \n\n as the separator.

GeoQuery (Zelle and Mooney, 1996): A dataset containing 880 natural language questions about US geography paired with Prolog programs. In addition to the standard (IID) split, we experiment with three types of compositional splits: (1) the Template split, where the training and test sets have disjoint program templates (Finegan-Dollak et al., 2018); (2) the TMCD split, which creates train and test sets with maximal compound divergence and minimal atom divergence (Keysers et al., 2020); and (3) the Length split, which evaluates length generalization by testing on sequences longer than those seen in training. Following Levy et al. (2022), we use the compositional splits (three Template, three TMCD, and one Length) generated by Qiu et al. (2022a) and average results across the TMCD and Template splits.

ATIS (Hemphill et al., 1990; Dahl et al., 1994): A dataset of natural language queries about aviation paired with λ-calculus programs. We experiment with an IID split and a Template split (Finegan-Dollak et al., 2018) for evaluating compositional generalization, both taken from Gupta et al. (2022).

Overnight (Wang et al., 2015): A dataset containing both synthetic and natural language utterances from 11 domains (e.g., socialnetwork, restaurants, etc.) paired with Lambda-DCS logical forms. We experiment with an IID and a Template split of the socialnetwork domain taken from Gupta et al. (2022).

SMCalFlow (Andreas et al., 2020): A dataset of task-oriented natural language dialogs about calendars, weather, places, and people paired with executable dataflow programs. SMCalFlow-CS (Yin et al., 2021) is a subset of SMCalFlow containing single-turn dialogs involving two domains (organization structure and calendar event creation), each with its own set of program symbols, and two types of test sets: a cross-domain (C) test set containing only instances where both domains appear, meant to test compositional generalization, and a single-domain (S) test set containing only single-domain instances, for in-distribution evaluation. For compositional evaluation, we use the 32-C split, a few-shot cross-domain split where the training set includes 32 cross-domain examples. For IID evaluation, following Levy et al. (2022), we use the 8-S split. Additionally, we use the programs with the simplified syntax provided by Meron (2022).

BREAK (Wolfson et al., 2020): A dataset that maps complex natural language questions into a language-based meaning representation (QDMR) comprising an ordered list of the atomic steps necessary to answer the question. Following Rubin et al. (2022), we use the low-level Break subset, where the targets are logical forms comprising lists of operators with their arguments based on the corresponding QDMR.

MTOP (Li et al., 2021): A multilingual task-oriented semantic parsing dataset spanning six languages and 11 domains. The target commands are complex queries featuring nested intent-slot prediction. We use the English subset of MTOP from Rubin et al. (2022).

B.2 Non-Semantic Parsing
We additionally experiment with the standard IID splits of 9 non-semantic-parsing datasets from the following categories: Numerical Reasoning: For this category, we experiment with GSM8K (Cobbe et al., 2021), a chain-of-thought reasoning (Wei et al., 2023) dataset of grade-school-level arithmetic reasoning problems expressed in natural language, and DROP (Dua et al., 2019), a dataset of question-answer pairs where the questions are about paragraphs containing numerical information and the answers are spans in the paragraphs.

C Selection Time
Despite their O(LTd) time complexity, we found example selection with both BSR and SET-BSR to be fast enough not to be a bottleneck for in-context learning on most datasets considered in this work. By using a GPU to compute the coverage scores, we could get both to run in the order of tens of milliseconds per test input on average, which was significantly faster than the LLM inference itself. The exceptions were DROP, PAWS, QQP, MNLI, and QNLI, for which selection took more than one second due to much longer instances and/or a larger candidate pool. We leave it to future work to explore more efficient ways to measure informativeness.
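For reference, the dominant cost is the token-similarity computation, which batches naturally on a GPU: one matrix product against all pool tokens followed by a per-candidate reduction. The sketch below is an illustration of this idea, not the released implementation, and assumes unit-normalized embeddings that already reside on the GPU.

```python
import torch

def candidate_coverages(x_emb, pool_emb, pool_offsets):
    """x_emb: (|x|, d) test-token embeddings; pool_emb: (T, d) embeddings of all tokens
    in the candidate pool, unit-normalized; pool_offsets: list of (start, end) spans
    delimiting each candidate's tokens inside pool_emb. Returns, per candidate, the
    vector of best matches max_j cos(x_i, z_j) used by BSR and SET-BSR."""
    sim = x_emb @ pool_emb.T                              # (|x|, T), one matmul on the GPU
    return [sim[:, s:e].max(dim=1).values for s, e in pool_offsets]
```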

D Additional Analyses
BM25 From Fig. 6 we can see that coverage-based selection using BM25 with larger substructures outperforms vanilla unigram BM25 in compositional splits.
BERTScore-Recall Fig. 8 compares the performance change from using importance weighting (IDF) in BSR; its effect is not consistent across LLMs. We also did not see any consistent improvement from using the larger deberta-large-mnli model to compute token embeddings for instance-level BSR (see Fig. 9); however, it did help with set-level selection using SET-BSR.
Reordering We found reordering the demonstrations according to the corresponding instance-level metric to be necessary only for smaller LLMs (see Fig. 10); it even hurt the performance of larger LLMs. We believe this is because larger and code-pretrained LLMs are more capable of composing the salient aspects in the different demonstrations and taking advantage of the full context.

BSR outperforms Cosine even with the same encoder In §7.2, we showed that BSR with deberta-large-mnli outperforms Cosine with all-mpnet-base-v2. Tables 15, 16, 17, and 18 show that the same trend holds even when using the same encoder, bert-base-uncased, for both metrics, confirming that contextual embeddings are indeed better at capturing salient aspects.

Recall of Syntactic Structures
The improvements from set-based selection may be explained by Fig. 11, where we see that the set extensions of COSINE and unigram BM25 reduce the recall of substructures of the test input, whereas recall increases with the set extensions of BM25[4-GRAM] and BM25[4-DEPST], and even of BSR, which does not explicitly consider these substructures.

E Qualitative Analysis of Prompts
Tables 7 and 8 show demonstrations selected using COSINE and SET-BSR for instances from MTOP and SMCalFlow-CS, respectively. In each case, COSINE finds demonstrations that are all very similar to the test input but fails to demonstrate some salient aspect, whereas SET-BSR selects less similar instances but ensures complete coverage of all salient aspects. Tables 9 and 10 additionally illustrate limitations of set selection and of token embeddings in capturing salient aspects.

F All Results
Table 11 contains 8-shot ICL results for our proposed methods and prior learning-free and learning-based demonstration selection methods with all the LLMs on all the semantic parsing datasets. For the numerical reasoning and classification datasets, Tables 12 and 13 compare 8-shot ICL performance with prior training-free and trained methods, respectively. Table 14 provides average performance across all datasets.
Since SET-BSR prioritizes coverage of the remaining aspect, it selects an example that has exactly three items whose total length has to be computed but overall is not very similar in reasoning.BSR on the other hand tries to find an example that demonstrates all aspects by itself and happens to find one that partially demonstrates the remaining aspect as well.
Selector: BSR. Prompt (top four demonstrations):
1. Begun in 1960 and opened to traffic in 1968, the bridge is a two-tiered road and rail design spanning 4,600 metres on the upper deck, with approximately 1,580 metres spanning the river itself. Can we know "What type of design is the bridge?"? Yes
2. The BBC also introduced Ceefax, the first teletext service, starting in 1974. Can we know "What kind of service was Ceefax?"? Yes
3. The Water, Sanitation and Hygiene (WSH) program of the Gates Foundation was launched in mid-2005 as a "Learning Initiative," and became a full-fledged program under the Global Development Division in early 2010. Can we know "What was the WSH program launched in 2005"? Yes
4. Television broadcasting in Hyderabad began in 1974 with the launch of Doordarshan, the Government of India's public service broadcaster, which transmits two free-to-air terrestrial television channels and one satellite channel. Can we know "What is Doordarshan?"? Yes

Table 10: Top four demonstrations selected by different methods for the QNLI input: Telenet was incorporated in 1973 and started operations in 1975. Can we know "What was telenet"? Since BSR does not have access to the labels and also cannot reason about the inputs themselves, it cannot account for the fact that the context in the test input does not contain the answer to the question, and it selects demonstrations that are all answered "Yes" even though the answer for the test input is "No".

Table 15: 8-shot ICL results with GPT-Neo-2.7B for all ablations of learning-free methods on semantic parsing datasets and splits.
Table 19: 8-shot ICL results with GPT-3.5-Turbo for all ablations of learning-free methods on semantic parsing datasets and splits.

Figure 1 :
Figure 1: Test input with salient aspects highlighted. (a) Independently selecting similar examples leads to redundancy and a failure to demonstrate all salient aspects, in this case the need to identify the manager. (b) Coverage-based selection using SET-BSR mitigates this by selecting a less similar example that contains the missing information. Blue indicates LLM generation.

Figure 2 :
Figure 2: Gain in average ICL accuracy compared to COSINE on IID and COMPositional splits in semantic parsing. Trained methods (EPR and CEIL) become less effective with larger LLMs on IID splits. This is unlike SET-BSR, which, on compositional splits, even becomes more effective with larger LLMs.

Figure 3 :
Figure 3: Change in average performance on different types of splits of semantic parsing datasets from set selection using our set metrics vs. the corresponding instance-level metric. Coverage-based set selection is most useful on compositional splits and when covering larger syntactic structures (BM25) or contextual embeddings (BSR).

Figure 4 :
Figure 4: Average performance on IID and COMP semantic parsing splits with Codex. SET-BSR consistently outperforms independent selection.

Figure 5 :
Figure 5: Demonstrations selected for a GeoQuery input (outputs omitted for clarity). COSINE demonstrations are redundant (repeated operations) and limited (covering only the "population" aspect). SET-BSR, instead, selects demonstrations that are similarly complex to the test instance and, together, cover all required operations.

Figure 6 :
Figure 6: Absolute improvement in average 8-shot ICL performance on different types of semantic parsing splits from using the set extension SET-BM25 with larger substructures over vanilla BM25.

Figure 7 :
Figure 7: Comparison of 8-shot ICL performance of different variants of BERTScore with token embeddings computed using deberta-base-mnli. For easier visualization, since we found BERTScore-Precision to consistently perform worst, we show the absolute improvement in average performance on different types of splits from the recall and F1 metrics over the precision metric.

Figure 8 :
Figure 8: Impact on average 8-shot ICL performance on semantic parsing splits from using importance weighting (IDF) in BSR.

Figure 10 :
Figure 10: Impact on average 8-shot ICL performance on semantic parsing splits from reordering the demonstrations selected by the different set-level metrics using the corresponding instance-level metric, shown as the absolute gain vs. the unreordered version.

Figure 11 :
Figure 11: Coverage of aspects of the test instance: change in the recall of unigrams, 4-grams, and dependency parse subtrees (size < 4) of the test input with set selection of demonstrations, compared to the corresponding non-set version, averaged across all datasets.
Algorithm 1 Greedy Optimization of Set Coverage

Table 1 :
Average 8-shot ICL performance across all splits of semantic parsing datasets using different LLMs and demonstration-selection methods, with the absolute improvement over COSINE in brackets. Both BSR and SET-BSR outperform prior training-free methods, with the latter outperforming even trained methods with larger LLMs.

Table 2 :
8-shot ICL accuracy on SMCalFlow-CS using Codex, with the absolute improvement over COSINE in brackets. SET-BSR is competitive with trained methods on the IID split while dramatically outperforming them on the compositional split.

Table 4 :
Average 8-shot ICL performance with Codex on IID, COMPositional, and ALL semantic parsing splits. Top compares different variants of BERTScore, while Bottom compares different variants of BM25.

Table 5 :
Semantic parsing datasets with corresponding sample instances and the example templates used for ICL.

Table 6 :
Non-semantic-parsing datasets with corresponding sample instances and the example templates used for ICL. Sample instances:

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? solution: Natalia sold 48/2 = «48/2=24»24 clips in May. Natalia sold 48+24 = «48+24=72»72 clips altogether in April and May. #### 72

To start the season, the Lions traveled south to Tampa, Florida to take on the Tampa Bay Buccaneers. The Lions scored first in the first quarter with a 23-yard field goal by Jason Hanson. The Buccaneers tied it up with a 38-yard field goal by Connor Barth, then took the lead when Aqib Talib intercepted a pass from Matthew Stafford and ran it in 28 yards. The Lions responded with a 28-yard field goal. In the second quarter, Detroit took the lead with a 36-yard touchdown catch by Calvin Johnson, and later added more points when Tony Scheffler caught an 11-yard TD pass. Tampa Bay responded with a 31-yard field goal just before halftime. The second half was relatively quiet, with each team only scoring one touchdown. First, Detroit's Calvin Johnson caught a 1-yard pass in the third quarter. The game's final points came when Mike Williams of Tampa Bay caught a 5-yard pass. The Lions won their regular season opener for the first time since 2007. question: How many points did the Buccaneers need to tie in the first?

Unlike the two seasons before it and most of the seasons that followed, Digimon Tamers takes a darker and more realistic approach to its story, featuring Digimon who do not reincarnate after their deaths and more complex character development in the original Japanese. question: When did the third Digimon series begin?

Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation. hypothesis: Christopher Reeve had an accident.

Table 7 :
Demonstrations selected for the MTOP input: Vegan desert options with target output [IN:GET_RECIPES [SL:RECIPES_TYPE Vegan ] [SL:RECIPES_DISH birthday cakes ] ]. COSINE's reliance on a single dense embedding means it is unable to account for the fact that "options" could mean dishes and not just recipes.

Table 8 :
Demonstrations selected for the SMCalFlow-CS input: Schedule a meeting with Elli and her manager's boss tomorrow morning. SET-BSR is able to find demonstrations covering all fragments of the test input, while COSINE fails to include anything that involves finding someone's manager.

Justin has a box that is 12 inches in height. The length of the box is 3 times its height and 4 times its width. What is the volume of the box? Question: John builds a box. The box is 26 inches by 26 inches by 14 inches. The walls are 1 inch thick on each side. How much is the internal volume in cubic feet?

BSR: Question: A window is made up of 8 glass panes. Each pane has a length of 12 inches and a width of 8 inches. What is the area of the window? Question: John builds a box. The box is 26 inches by 26 inches by 14 inches. The walls are 1 inch thick on each side. How much is the internal volume in cubic feet?

Question: Jazel has 3 sticks. One stick is 3 centimeters long. The second stick is twice as long while the third stick is 1 centimeter shorter than the second stick. What is the total length of Jazel's sticks when they are put together? Question: John builds a box. The box is 26 inches by 26 inches by 14 inches. The walls are 1 inch thick on each side. How much is the internal volume in cubic feet?

Table 9 :
Demonstrations selected by different methods for the GSM8K input: John has 3 boxes. Each box is 5 inches by 6 inches by 4 inches. The walls are 1 inch thick. What is the total inner volume of all 3 boxes? We only show the inputs for clarity. Only BSR solves this input (2-shot ICL with Codex).

Table 11 :
Comparison of various methods on 8-shot ICL for semantic parsing datasets. The right-most column shows average performance on ALL, only IID, and only COMPositional splits. BSR consistently outperforms COSINE and BM25, and SET-BSR yields further gains, surpassing even trained methods.

Table 12 :
Comparison of various methods on 8-shot ICL for reasoning and classification datasets. BSR outperforms all prior training-free methods, though SET-BSR does not yield additional improvement.

Table 13 :
Additional non-semantic-parsing 8-shot ICL results for comparison with trained methods and larger LLMs. BSR is competitive with EPR and CEIL, even outperforming them with larger LLMs. We could not get CEIL to work for DROP.

Table 14 :
Average 8-shot ICL performance across all datasets and splits.