Skill-Based Few-Shot Selection for In-Context Learning

In-context learning is the paradigm that adapts large language models to downstream tasks by providing a few examples. Few-shot selection -- selecting appropriate examples for each test instance separately -- is important for in-context learning. In this paper, we propose Skill-KNN, a skill-based few-shot selection method for in-context learning. The key advantages of Skill-KNN include: (1) it addresses the problem that existing methods based on pre-trained embeddings can be easily biased by surface natural language features that are not important for the target task; (2) it does not require training or fine-tuning of any models, making it suitable for frequently expanding or changing example banks. The key insight is to optimize the inputs fed into the embedding model, rather than tuning the model itself. Technically, Skill-KNN generates skill-based descriptions for each test case and candidate example via a pre-processing few-shot prompting step, thus eliminating unimportant surface features. Experimental results across five cross-domain semantic parsing datasets and six backbone models show that Skill-KNN significantly outperforms existing methods.


Introduction
In-context learning (Brown et al., 2020) has become a prevailing paradigm for utilizing large language models (LLMs) (Hendrycks et al., 2020; Patel and Pavlick, 2021; Rae et al., 2021; Zhang et al., 2022a; Hoffmann et al., 2022; Srivastava et al., 2022; Chowdhery et al., 2022; Smith et al., 2022; Wei et al., 2022a). It employs a frozen task-agnostic backbone model to serve various downstream tasks without requiring parameter updates for each task. Under in-context learning, the LLM generates output for an input query by conditioning on a prompt that contains input-output examples. Due to the limited context length of the language model, only a few examples can be presented in the prompt. Prior studies have found that the performance of in-context learning is sensitive to the selected in-context examples (Liu et al., 2022; Zhang et al., 2022b; Chen et al., 2023b). Therefore, one essential research question is: how to select proper examples from a large example bank?
Raw-input-based selection is one widely applied solution (Gao et al., 2021; Liu et al., 2022; Hu et al., 2022). It embeds the raw inputs of examples using an off-the-shelf embedding model and then selects the most similar examples. It can be conveniently applied in various downstream tasks. However, this method can be easily biased by surface natural language features that are not important for the target task. For instance, in semantic parsing tasks (which map a natural language utterance into a machine-understandable logical form, e.g., a SQL query), raw-input-based selection merely finds examples with similar entities (as illustrated in Figure 1a), while better in-context examples should contain the required executable operations in logical forms, which can be regarded as task-specific skills.
To overcome this limitation, we aim to make embedding-based selection better aware of the intrinsic skills behind the raw inputs. We harness the power of prompting LLMs to distill the desired skills from raw inputs, which maintains the training-free advantage during selection. Much prior work has tried to fine-tune the embedding model for each task based on the example bank (Rubin et al., 2022; Poesia et al., 2022; Hu et al., 2022; Ye et al., 2023). However, fine-tuning-based methods are difficult to apply in practical scenarios: it is laborious to train and store an embedding model for each task, and it is also inconvenient to re-train the model on a dynamic example bank that can be updated frequently.
Specifically, we introduce SKILL-KNN, a training-free, skill-based selection method (briefly illustrated in Figure 1b). Overall, SKILL-KNN first generates skill-based descriptions from raw input queries, then feeds these descriptions into an off-the-shelf embedding model to select the most similar examples. To generate skill-based descriptions, we prompt a frozen LLM with just a few human-annotated demonstrations, which requires no fine-tuning and imposes no rule-based constraints. Additionally, to alleviate sensitivity to the order of annotated demonstrations during generation, we design two variants of SKILL-KNN: we sample a set of candidate descriptions by shuffling the annotated demonstrations, then select candidates based on consistency and distinctiveness, respectively.
The experimental results show that SKILL-KNN brings a considerable boost for in-context learning compared to raw-input-based selection. We evaluate SKILL-KNN on five challenging semantic parsing datasets: Spider (Yu et al., 2018b), Dr. Spider (Chang et al., 2023), KaggleDBQA (Lee et al., 2021), BIRD (Li et al., 2023c), and COGS (Kim and Linzen, 2020). We take six models for in-context learning: text-chat-davinci-002, code-davinci-002, text-davinci-003, code-cushman-002, gpt-35-turbo, and gpt-4. Across these tasks and models, SKILL-KNN consistently performs best among non-oracle selection methods and, at times, is even comparable to oracle methods. For instance, with text-chat-davinci-002, SKILL-KNN achieves 78.3% execution accuracy on Spider, while the best raw-input-based selection method reaches 74.6% and the oracle Target-KNN attains 78.6%. Furthermore, our ablation study indicates that SKILL-KNN retains its superiority when constraints are imposed on the annotated demonstrations, including reducing the number of demonstrations, restricting the database diversity, and decreasing the operation coverage.
Our contributions are three-fold: 1) we propose a skill-based few-shot selection method SKILL-KNN, which leverages the power of prompting LLMs to generate skill-based descriptions; 2) we design two variants of SKILL-KNN based on consistency and distinctiveness, respectively; 3) our comprehensive experiments across various semantic parsing tasks and backbone models demonstrate the effectiveness of SKILL-KNN, and our analysis of annotated demonstrations provides further insights for better utilization of SKILL-KNN.

Preliminaries
In this section, we introduce embedding-based few-shot selection as the preliminary of our method.

In-Context Learning with Few-Shot Selection
Consider a downstream task T that contains a set of input-output examples {(x_i → y_i)}_{i=1}^{n} (termed the example bank B; in semantic parsing tasks, each input query contains a natural language question along with the database schema), and a pre-trained large language model with frozen parameters θ. Given a test input query x_t, the large language model with in-context learning generates an output y_t by sampling from the following distribution:

y_t ∼ p_θ( y | R(x_t, B) ⊕ x_t ; τ ),    (1)

in which τ is the sampling temperature, R(x_t, B) returns a sequence of examples selected from B according to x_t, and ⊕ means to sequentially concatenate two sequences. In the rest of the paper, we omit the frozen θ and set τ = 0 by default, which corresponds to greedy decoding.
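The following is a minimal sketch of how the prompt in Equation 1 can be assembled and decoded greedily. The helper names (`llm_complete`, `build_prompt`) and the Input/Output prompt format are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of Equation 1: concatenate the selected examples with the
# test query and decode greedily. `llm_complete` and the Input/Output prompt
# format are illustrative assumptions.
def build_prompt(selected_examples, x_t):
    """Concatenate selected (input, output) examples with the test query."""
    parts = [f"Input: {x}\nOutput: {y}" for x, y in selected_examples]
    parts.append(f"Input: {x_t}\nOutput:")
    return "\n\n".join(parts)

def in_context_predict(llm_complete, R, bank, x_t):
    """y_t ~ p(y | R(x_t, B) ⊕ x_t) with temperature 0 (greedy decoding)."""
    prompt = build_prompt(R(x_t, bank), x_t)
    return llm_complete(prompt, temperature=0.0)
```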
Few-shot selection aims to design the algorithm R(x t , B) that can work well for task T .

Embedding-Based Few-Shot Selection
A standard implementation of R(x_t, B) is to leverage an off-the-shelf embedding model Emb(·) and calculate the embedding similarity of raw inputs (Liu et al., 2022):

sim(x_t, x_i) = cos( Emb(x_t), Emb(x_i) ),    (2)
in which x_i is the input of one example (x_i, y_i) ∈ B. Based on Equation 2, we can select the k most similar examples from B. In addition, these examples are sorted in the prompt according to their similarities to the test input query: examples with higher similarity scores are placed closer to the test input query.
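A minimal sketch of this raw-input-based KNN selection is given below, assuming Sentence-BERT (sentence-transformers) as the off-the-shelf embedding model; the function names and ordering details are illustrative rather than the paper's exact code.

```python
# A sketch of raw-input-based KNN selection (Equation 2), assuming Sentence-BERT
# (sentence-transformers) as the off-the-shelf embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

def knn_select(bank, x_t, k=4, model_name="all-mpnet-base-v2"):
    model = SentenceTransformer(model_name)
    inputs = [x for x, _ in bank]
    emb = model.encode(inputs + [x_t], normalize_embeddings=True)
    bank_emb, query_emb = emb[:-1], emb[-1]
    sims = bank_emb @ query_emb                 # cosine similarity (Equation 2)
    top_k = np.argsort(-sims)[:k]               # k most similar examples
    # more similar examples are placed closer to the test query, i.e., later
    ordered = sorted(top_k, key=lambda i: sims[i])
    return [bank[i] for i in ordered]
```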
This standard implementation of R(x_t, B) is a raw-input-based selection: it simply searches for examples with similar inputs (i.e., x_t and x_i in Equation 2). Some recent studies propose to fine-tune the embedding model (from Emb(·) to Emb′(·)) (Rubin et al., 2022; Poesia et al., 2022; Hu et al., 2022; Ye et al., 2023). In this paper, we explore how to improve the effectiveness of few-shot selection without training or fine-tuning any models.

SKILL-KNN
SKILL-KNN involves a rewrite-then-retrieve process to better exploit the potential of in-context learning. Figure 2 gives a bird's-eye view of our method. To mine and utilize task-specific skills, SKILL-KNN contains a prompting-based rewriting stage and a skill-based selection stage. Prompting-based rewriting prompts LLMs to generate skill-based descriptions from the given input query. Skill-based selection returns few-shot examples based on these generated descriptions. In the following, we elaborate on the design of SKILL-KNN.

Generating Skill-Based Descriptions
We prompt a frozen large language model to rewrite each input query as a skill-based description, which does not require any fine-tuning. Specifically, we first annotate skill-based descriptions for 16 examples in B, then prompt the large language model with these annotated demonstrations to generate descriptions for the other examples in B and for each test input query.
Note that we annotate skills with natural language descriptions rather than rule-based constraints. Off-the-shelf embedding models are primarily pre-trained on natural language (NL) data and may not be well-suited to specifically designed structural constraints. Annotating skills with NL descriptions therefore aligns better with these embedding models and allows us to leverage their generalizability when encoding unannotated NL descriptions; as a result, our annotated demonstrations can better generalize to unseen data.
Formally, with a set of annotated demonstrations {x_a → s_a}, in which s_a is the annotated skill-based description for the raw input x_a, we generate s_i for each unannotated input x_i by prompting the large language model:

s_i = LLM( {x_a → s_a} ⊕ x_i ).    (3)

Then, these descriptions are fed into the off-the-shelf embedding model to select similar examples:

sim(x_t, x_i) = cos( Emb(s_t), Emb(s_i) ).    (4)

Table 1 shows part of our annotated demonstrations for text-to-SQL tasks, and all our annotations are contained in Appendix E. Note that different text-to-SQL tasks share the same set of annotated demonstrations in our experiments. Equation 4 defines the basic version of SKILL-KNN. Moreover, we notice that the generated skill-based descriptions can sometimes be sensitive to the order of annotated demonstrations. Such sensitivity has also been observed in prior work (Zhao et al., 2021; Lu et al., 2022). Therefore, we design two variants of SKILL-KNN to further address this sensitivity issue.
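The following sketch illustrates Equations 3 and 4 under stated assumptions: `llm_complete` and `embed` are hypothetical helpers for the frozen LLM and the off-the-shelf embedding model, and the Input/Skill prompt format is illustrative. In practice, the skill descriptions for the example bank would be generated once and cached.

```python
# A sketch of Equations 3-4: rewrite inputs into skill-based descriptions with a
# few-shot prompt, then select by similarity of the embedded descriptions.
# `llm_complete` and `embed` are hypothetical helpers.
import numpy as np

def rewrite_to_skill(llm_complete, annotated_demos, x):
    """Equation 3: prompt the frozen LLM with annotated (input, skill) pairs."""
    demos = "\n\n".join(f"Input: {xa}\nSkill: {sa}" for xa, sa in annotated_demos)
    return llm_complete(f"{demos}\n\nInput: {x}\nSkill:", temperature=0.0).strip()

def skill_knn_select(llm_complete, embed, annotated_demos, bank, x_t, k=4):
    """Equation 4: select by similarity of skill-based descriptions."""
    skills = [rewrite_to_skill(llm_complete, annotated_demos, x) for x, _ in bank]
    s_t = rewrite_to_skill(llm_complete, annotated_demos, x_t)
    emb = embed(skills + [s_t])                 # assumed unit-normalized vectors
    sims = emb[:-1] @ emb[-1]
    top_k = np.argsort(-sims)[:k]
    return [bank[i] for i in sorted(top_k, key=lambda i: sims[i])]
```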

Variants
To alleviate the influence of prompt-order sensitivity, we design two variants of SKILL-KNN that change the order of annotated demonstrations and perform rewriting multiple times. Specifically, for each input x_i, both variants generate a set of candidate descriptions S_i = {s_i^1, s_i^2, ..., s_i^m} according to Equation 3 by changing the order of annotated demonstrations in the prompt.

Consistency-Based Variant. We represent each set of candidate descriptions by its central embedding and measure similarity between these central embeddings:

sim(x_t, x_i) = cos( (1/m) Σ_{s ∈ S_t} Emb(s), (1/m) Σ_{s ∈ S_i} Emb(s) ),    (5)

in which S_t and S_i represent the two sets of candidate descriptions for the test input x_t and an example (x_i, y_i) ∈ B, respectively. This variant is inspired by prior work on improving the consistency of chain-of-thought reasoning (Wang et al., 2022; Li et al., 2022a). As illustrated on the left of Figure 3, Equation 5 can be regarded as an embedding-level majority vote among all candidate descriptions during selection.
Distinctiveness-Based Variant. Considering that the central embedding can sometimes be overwhelmed by trivial candidates, we want to highlight the most distinctive and informative description among all candidates. Formally, we consider the maximum similarity score between the two sets for selection:

sim(x_t, x_i) = max cos( Emb(s_t^j), Emb(s_i^k) ),    (6)

in which s_t^j ∈ S_t and s_i^k ∈ S_i. As illustrated on the right of Figure 3, Equation 6 means that we take the minimum distance between the two sets of candidates when selecting similar examples.
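Minimal sketches of the two scoring rules over candidate description sets are shown below; the normalization details are illustrative.

```python
# Sketches of the two scoring rules over candidate description sets
# (emb_S_t and emb_S_i are arrays of candidate embeddings).
import numpy as np

def consistency_score(emb_S_t, emb_S_i):
    """Equation 5: cosine similarity between the central (mean) embeddings."""
    c_t, c_i = emb_S_t.mean(axis=0), emb_S_i.mean(axis=0)
    return float(c_t @ c_i / (np.linalg.norm(c_t) * np.linalg.norm(c_i)))

def distinctiveness_score(emb_S_t, emb_S_i):
    """Equation 6: maximum cosine similarity over all candidate pairs."""
    n_t = emb_S_t / np.linalg.norm(emb_S_t, axis=1, keepdims=True)
    n_i = emb_S_i / np.linalg.norm(emb_S_i, axis=1, keepdims=True)
    return float((n_t @ n_i.T).max())
```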

Experimental Setup
In this section, we introduce the tasks, compared selection methods, backbone models, and hyper-parameters in our experiments.

Tasks
We evaluate on five challenging cross-domain semantic parsing datasets. Due to the cross-domain property, the model cannot easily solve these tasks by simply copying similar surface features from the provided in-context examples.
Spider (Yu et al., 2018b) is a large-scale text-to-SQL dataset. It contains a train set with 7,000 examples and a dev set with 1,034 examples. Moreover, the train set and dev set do not share any database. We take the train set of Spider as the example bank and evaluate on the dev set.
Dr. Spider (Chang et al., 2023) is a diagnostic evaluation benchmark constructed based on Spider. It contains 15,269 examples which can be divided into 3 sub-tasks according to the type of designed perturbation: database perturbations (DB), natural language question perturbations (NLQ), and SQL query perturbations (SQL). We take the train set of Spider as the example bank, since Dr. Spider is purely an evaluation benchmark.
KaggleDBQA (Lee et al., 2021) (KDBQA) is a small but complex dataset for realistic evaluation of text-to-SQL semantic parsers. It contains 8 real Web databases with original formatting and 275 unrestricted questions. Since there is not much data in KDBQA, we take it as a pure test set and use the train set of Spider as the example bank.

BIRD (Li et al., 2023c) is a large-scale text-to-SQL dataset with real-world database contents. It has 9,428 training examples and 1,534 test cases in the dev set. Each case in BIRD is equipped with a description of the required external knowledge, which is not provided in the above three text-to-SQL tasks. Since the database schemas in BIRD are too large, we first apply grounding to reduce the schema size (detailed in Appendix D.4).
COGS (Kim and Linzen, 2020) is a synthetic benchmark for testing compositional generalization in semantic parsing. It can also be regarded as a cross-domain setting, containing a significant distribution shift between the train set and test set. The logical form in COGS represents the thematic roles in the input query (detailed in Appendix D.3). We use the output format designed in An et al. (2023) and evaluate on two sub-tasks, primitive substitution (P.S.) and primitive structural alternation (P.A.).

Selection Methods
We mainly compare SKILL-KNN with training-free selection methods.
Random. We randomly select examples from B as in-context examples. For each test case, we take random selections 3 times and average the results.

KNN (Liu et al., 2022). We test three off-the-shelf embedding models: Sentence-BERT (SBERT) with the all-mpnet-base-v2 checkpoint (Reimers and Gurevych, 2019), the OpenAI embedding model with the text-similarity-babbage-001 checkpoint (OpenAI Babbage), and the OpenAI embedding model with the text-embedding-ada-002 checkpoint (OpenAI Ada). KNN with OpenAI embedding models serves as a strong baseline among training-free selection methods, as these large models have been well pre-trained for judging text similarity (Neelakantan et al., 2022).
MMR (Ye et al., 2022) is a dynamic selection method that enhances the diversity of the examples selected by KNN. It adds a penalty term according to the similarity to already selected examples. We take OpenAI Ada for embedding and follow the implementation details in Ye et al. (2022).
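A generic MMR-style sketch is shown below; the exact formulation and hyper-parameters in our experiments follow Ye et al. (2022), so the penalty weight `lam` and the unit-normalization assumption here are illustrative only.

```python
# A generic MMR-style selection sketch: each step rewards similarity to the test
# query and penalizes similarity to already selected examples. Embeddings are
# assumed to be unit-normalized; `lam` is an illustrative penalty weight.
import numpy as np

def mmr_select(bank_emb, query_emb, k=4, lam=0.5):
    selected, candidates = [], list(range(len(bank_emb)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = float(bank_emb[i] @ query_emb)
            redundancy = max((float(bank_emb[i] @ bank_emb[j]) for j in selected),
                             default=0.0)
            return relevance - lam * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```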

SKILL-KNN (ours).
We test SKILL-KNN with SBERT and OpenAI Ada. For the base version of SKILL-KNN (i.e., without consistency or distinctiveness), we shuffle the order of annotated demonstrations to generate m = 5 skill-based descriptions for each input query and average the results. This choice of m balances performance against computational cost; we provide more experimental analysis in Appendix B.2. For the two variants, we take all 5 generated descriptions as the candidate set. In addition, we also compare with two oracle methods, in which ground-truth output sequences are allowed to be leveraged for few-shot selection.

Target-KNN (oracle).
We select examples with similar output embeddings. We use OpenAI Babbage and OpenAI Ada to encode the ground truth, and take the best result of the two models for each task.
Target Sketch Matching (oracle). We select in-context examples with similar sketches of the ground truth. For text-to-SQL tasks, we calculate the overlap of SQL keywords (detailed in Appendix D.6). For COGS, we follow the target-side structural similarity setting in An et al. (2023).

Backbones and Hyper-parameters
We conduct experiments with six OpenAI language models as the backbones: text-chat-davinci-002, code-davinci-002, text-davinci-003, code-cushman-002, gpt-35-turbo, and gpt-4. For generating skill-based descriptions, we always use gpt-3.5-turbo, as it is cheap and fast. We select k = 4 in-context examples in all experiments. We use execution-with-values accuracy as the evaluation metric for text-to-SQL tasks and exact-match accuracy for COGS.

Main Results
Table 2, Table 3, and Table 4 report the main experimental results. We also count the number of wins, i.e., on how many tasks (and sub-tasks) each method performs best.

SKILL-KNN performs better than raw-input-based selection methods. Across all backbone models and tasks, our skill-based selections achieve the best performance among non-oracle methods. In particular, SKILL-KNN with SBERT can even outperform KNN with OpenAI embedding models. These results clearly demonstrate the necessity and effectiveness of our prompting-based rewriting. Appendix A.2 contains more experimental comparisons with existing selection methods.

SKILL-KNN performs comparably to or better than fine-tuning-based methods. Results in Table 4 show that SKILL-KNN can perform comparably to or even better than fine-tuning-based methods. This demonstrates that optimizing the input to the embedding model can effectively help downstream tasks without any fine-tuning.

Variants with consistency and distinctiveness can outperform the base version of SKILL-KNN.
As shown in the #Wins column in Table 2, the two variants outperform the base version of SKILL-KNN on more tasks.

SKILL-KNN is more robust to perturbations than raw-input-based selections. Results on Dr. Spider reflect the robustness towards perturbations in data. For instance, with text-chat-davinci-002, KNN with SBERT performs lower than the random baseline on two out of three types of perturbations, while all three versions of SKILL-KNN outperform the random baseline on all three perturbations. This indicates that SKILL-KNN leads to more robust in-context learning than raw-input-based methods.

SKILL-KNN can be effective in math reasoning tasks. Our study primarily evaluates the effectiveness of SKILL-KNN in the semantic parsing/code generation field. To further examine the generalizability of SKILL-KNN beyond semantic parsing, we have applied it to a challenging math reasoning task, GSM8K (Cobbe et al., 2021). Results in Table 5 evidence that SKILL-KNN can also be effective in tasks beyond semantic parsing.

Related Work

In-context learning has been applied to a wide range of tasks, such as code generation (Bareiß et al., 2022; Li et al., 2022b; Chen et al., 2023a; Li et al., 2023b), arithmetic reasoning (Wei et al., 2022b; Wang et al., 2022; Li et al., 2022a; Shi et al., 2023; Qin et al., 2023), and semantic parsing. From the view of leveraging skills for in-context learning, most existing work considers explicitly injecting symbolic systems into the response of the model (Cheng et al., 2023; Creswell et al., 2023; Schick et al., 2023; Shen et al., 2023; Lu et al., 2023). This work instead aims to uncover the intrinsic skills behind the raw inputs of examples.
Semantic parsing with deep learning methods has been explored in much existing work (Dong and Lapata, 2016; Yu et al., 2018a; Xu et al., 2017; Guo et al., 2019; Zhong et al., 2020; Wang et al., 2020; Lin et al., 2020; Scholak et al., 2021; Qi et al., 2022; Li et al., 2023a). Under the recent in-context learning paradigm, there have been some preliminary observations: Shin et al. (2021) showed that GPT-3 is better at generating English-like descriptions than raw logical forms; Rajkumar et al. (2022) revealed that prompt design is essential for semantic parsing with Codex; Liu et al. (2023) showed that ChatGPT has surprising zero-shot performance on Spider and its variants; and Pourreza and Rafiei (2023) demonstrated that explicitly taking multiple stages to generate SQL leads to better in-context learning performance. These observations indicate that in-context learning has great potential for solving semantic parsing tasks, and this work aims to further activate this potential from the view of improving few-shot selection.

In terms of optimizing inputs for in-context learning, Wu et al. (2023) improved classification tasks through information compression, while Ye et al. (2022) and Ye and Durrett (2023) mainly focused on explanation-based tasks. This work proposes prompting extremely large models to facilitate few-shot selection, which is a novel perspective to harness the power of large language models.

Limitations

Task type. We mainly evaluate SKILL-KNN on cross-domain semantic parsing tasks, and we believe it can also help other challenging tasks where intrinsic task-specific skills are needed. However, for tasks that require only surface-feature similarity among in-context examples, we suppose the advantage of SKILL-KNN could be diminished.
Individual variants. We design two variants of SKILL-KNN based on consistency and distinctiveness, respectively. An ideal variant should take both aspects into account; we leave this as a future direction.

Ethics Statement
Due to the use of pre-trained language models, this work can be exposed to potential ethical risks associated with general deep learning models, such as social bias and privacy breaches. We suppose this work can help alleviate potential ethical issues for in-context learning, as it better overcomes surface-form biases in the example bank.

This is the Appendix of the paper: Skill-Based Few-Shot Selection for In-Context Learning.

A More Experimental Results
We report more experimental results with SKILL-KNN.
A.1 Recall@N Performance of SKILL-KNN

Besides the performance of greedy decoding, we also evaluate the top-N recall performance of SKILL-KNN. Since we cannot use beam search with the OpenAI interfaces, we implement top-N selection with sampling and re-ranking, following the self-consistency setting (Wang et al., 2022). Specifically, we first sample 100 sequences, then select the N most frequently occurring sequences and evaluate their execution accuracy.
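A minimal sketch of this sampling-and-re-ranking procedure, assuming a hypothetical `llm_sample(prompt, temperature, n)` helper that returns a list of n sampled completions:

```python
# A sketch of top-N prediction via sampling and re-ranking; `llm_sample` is a
# hypothetical helper returning n sampled completions for a prompt.
from collections import Counter

def top_n_predictions(llm_sample, prompt, n=10, num_samples=100, temperature=0.7):
    samples = llm_sample(prompt, temperature=temperature, n=num_samples)
    counts = Counter(s.strip() for s in samples)
    # keep the N most frequently occurring sequences (self-consistency style)
    return [seq for seq, _ in counts.most_common(n)]
```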
We take text-chat-davinci-002 as the backbone model and set the sampling temperature to 0.7. We evaluate SKILL-KNN with distinctiveness on the Spider dev set. Table 6 shows that the recall rate gradually increases with N and even exceeds 90% when we set N = 10. Moreover, the top-1 result in Table 6 is 80.3%, which is higher than the greedy-search performance (78.3% in Table 2). This means that SKILL-KNN can obtain further gains from ensemble methods such as self-consistency.

B More Analysis

B.1 Motivation Behind Two Variants of SKILL-KNN
The motivation behind the two variants stems from viewing the disturbances caused by prompt-order sensitivity as additive noise on the ground-truth skills during the rewriting process. The two variants are designed to address two types of noise.

Zero-mean white noise, which frequently occurs in results and originates from a zero-mean distribution (e.g., a zero-mean Gaussian distribution). We assume its magnitude is relatively small compared to the ground truth. Zero-mean white noise can cause the loss or redundancy of partial information in the ground truth.
Spike noise, which occasionally occurs in results and has a much larger magnitude than the ground truth. It strongly influences the information in the ground truth and causes outliers.
The consistency-based variant is more effective at addressing zero-mean white noise, as the averaging operation reduces the variance of zero-mean noise. The distinctiveness-based variant is better suited to handling spike noise, as it mitigates the influence of outliers. The final results in Table 2 indicate that both types of noise occur in our LLM-based rewriting, as evidenced by the close numbers of wins of the two variants (19 vs. 15).
To further support that the two variants are better at tackling these two different types of noise, we conduct additional analysis from two perspectives.

Evaluation of selection performance. We examine the performance of each variant in selecting better examples from the example bank under the two noise patterns. We construct synthetic data for automated evaluation: we take 1,000 unique embeddings of skill-based descriptions as ground truth, then add noise to each ground-truth embedding to construct a sample set, and finally assess whether each sample set correctly selects its original ground-truth embedding. The accuracies are shown in Table 8. The results further evidence that the consistency-based variant performs better than the distinctiveness-based variant under zero-mean white noise but performs worse under spike noise.
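A sketch of this synthetic evaluation is shown below under stated assumptions: the noise magnitudes and the spike construction are illustrative, not the exact values used for Table 8, and `score_fn` can be either of the two set-scoring rules sketched earlier for the variants.

```python
# A sketch of the synthetic selection-accuracy evaluation: perturb each
# ground-truth embedding with either zero-mean white noise or occasional spike
# noise, then check whether the resulting candidate set is still most similar
# to its own ground-truth embedding. Noise magnitudes are illustrative.
import numpy as np

def selection_accuracy(gt_emb, score_fn, noise="white", m=5, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, d = gt_emb.shape
    correct = 0
    for i in range(n):
        if noise == "white":
            # small zero-mean Gaussian noise on every candidate
            samples = gt_emb[i] + rng.normal(0.0, sigma, size=(m, d))
        else:
            # spike noise: one candidate is strongly corrupted
            samples = np.tile(gt_emb[i], (m, 1))
            samples[rng.integers(m)] += rng.normal(0.0, 20 * sigma, size=d)
        # the candidate set should select its own ground-truth embedding
        scores = [score_fn(samples, gt_emb[j][None, :]) for j in range(n)]
        correct += int(np.argmax(scores) == i)
    return correct / n
```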

B.2 The Choice of Hyper-Parameter m
Table 9: Performance with m = 3/5/7/9 (dataset: Spider, LLM: code-cushman-002, variant: consistency).

The two variants of SKILL-KNN require generating m candidate skill-based descriptions. In our experiments, we set m = 5 as a trade-off between achieving optimal performance and minimizing computational costs. During our initial exploration, we experimented with m = 3/5/7/9. As shown in Table 9, the performance improves only marginally when m > 5. Therefore, we set m = 5.

B.3 Measuring Diversity of SKILL-KNN and Oracle Methods

A surprising observation in Table 2 is that, with the same backbone model for in-context learning, SKILL-KNN can sometimes outperform oracle methods. Specifically, SKILL-KNN consistently outperforms at least one oracle method on the DB sub-task in Dr. Spider and on two sub-tasks in COGS (marked with underlines). Such an observation could be caused by different diversity in the selected examples. As indicated in Levy et al. (2022) and An et al. (2023), beyond similarity to the test case, higher diversity among in-context examples can also help cross-domain generalization under in-context learning. Oracle methods directly seek higher similarity, so the selected examples may be less diverse, which can slightly hamper generalization performance. To reflect the diversity of in-context examples, we count the number of different databases among the selected examples. The statistics in Table 10 show that SKILL-KNN leads to higher diversity than the oracle methods, which is in line with our hypothesis.
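A small sketch of this diversity measure, assuming each selected example records its database identifier (a hypothetical `db_id` field):

```python
# Diversity measure: the number of distinct databases among the selected
# in-context examples ("db_id" is a hypothetical field name).
def database_diversity(selected_examples):
    return len({ex["db_id"] for ex in selected_examples})
```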

B.4 Why do skill-based descriptions perform better?
Since both SKILL-KNN and raw-input-based methods use embedding similarity for selection, we suppose that the higher performance of SKILL-KNN can be attributed to some desired properties in the embedding space of skill-based descriptions. Based on this inspiration, we visualize the embedding space of both raw input queries and skill-based descriptions with t-SNE (Van der Maaten and Hinton, 2008). More details are contained in Appendix D.7.
Under the embedding space of skill-based descriptions, the distribution of test cases is closer to that of the example bank, thus benefiting cross-domain generalization. For the raw-input-based embeddings of test cases, Figure 6a shows that these embeddings are mainly concentrated in a few local regions. On the one hand, the example bank cannot be fully utilized under this embedding space, since the top-k similar examples must lie around these local regions. On the other hand, the mismatched distributions indicate that this space does not reveal the inner similarity between test cases and the example bank, and thus does not facilitate cross-domain generalization. Under the skill-based embedding space (shown in Figure 6b), the distributions of test cases and the example bank are better matched. Therefore, the cross-domain generalization gap can be better bridged with skill-based descriptions.

C Case Study
Considering that the quality of generated skill-based descriptions can be one key factor influencing the effectiveness of SKILL-KNN, we manually check the generated skills for 100 examples. We find that 86/100 generated skills are exactly correct; 12/100 are almost correct but need partial modification (e.g., the number of joined tables); and only 2/100 are totally wrong. Moreover, while manually checking the quality of generated skills, we find, to our surprise, some novel descriptions of skills that are not presented in our annotated examples. Table 11 shows some examples. This indicates that the prompting-based rewriting can generalize to unannotated skills to a certain degree.

D Detailed Settings of Experiments
In this section, we provide more details about our experimental settings.

D.1 Select Examples for Annotation
In our default setting, we consider two principles for selecting examples to annotate with the required skills: 1) ensuring coverage of all logical operations found in the example bank, and 2) selecting examples from diverse databases. Specifically, we first find all logical operations used in the example bank and greedily cover these operations with a few examples. Then, we randomly select more examples from various databases until there are 16 examples. In our ablation study shown in Figure 5, to constrain the operation coverage, we simply remove the first selection step; to constrain the database diversity, we select examples from only two databases.

Table 11: Case study for generated skill-based descriptions in Spider. Texts in purple are novel descriptions that are not presented in our annotations.

Input Query: display the emails of the employees who have no commission percentage and salary within the range 7000 to 12000 and works in that department which number is 50.
Generated Skill: To solve this task in the database, we need to select one column, apply an interval constraint, apply a null constraint, and apply an equality constraint.
SQL: SELECT email FROM employees WHERE commission_pct = "null" AND salary BETWEEN 7000 AND 12000 AND department_id = 50

Input Query: What are the salaries and manager ids for employees who have managers?
Generated Skill: To solve this task in the database, we need to select two columns and apply a non-null constraint on the values in another column.
SQL: SELECT salary , manager_id FROM employees WHERE manager_id != "null"

Input Query: What is all the information about the Marketing department?
Generated Skill: To solve this task in the database, we need to select all columns and apply a constraint on the values in one column.
SQL: SELECT * FROM departments WHERE department_name = 'Marketing'

D.2 Inference Hyper-Parameters
During inference, we set the max decoding length to 200, and the sampling temperature to 0.

D.3 Input-Output Formats
Figure 8 shows some input-output examples to illustrate the data formats in our experiments.
Note that the output format of COGS follows the transformation in An et al. (2023), which converts the original long-chain format into a more compact function-calling format. Such a transformation is similar to the conversion from Lambda calculus to FunQL in the Geo domain (Zelle and Mooney, 1996; Kate et al., 2005; Zettlemoyer and Collins, 2012). It improves human readability by omitting two types of details in the original format: the special marker for definite descriptions and the Skolem constants. Apart from the omitted details, this transformation keeps the main semantics in the domain of COGS, such as semantic roles, modifications, and orders among clauses and modifications.

D.4 Evaluation on BIRD
Different from other text-to-SQL tasks, BIRD additionally provides "evidence" for each natural language question. Therefore, we add the evidence as part of the context for in-context learning. For evaluating raw-input-based methods on BIRD, we concatenate the natural language question and the additional evidence to compute the embedding. For SKILL-KNN, we also provide the evidence for rewriting, and we use 12 annotated demonstrations with evidence (shown in Appendix E).
Since the database schemas in BIRD are too large to be fully contained in the context of the LLM, we reduce the schema size through grounding in pre-processing. Specifically, we calculate the embedding similarity between the input question (along with the evidence) and each table name and column name. Based on this similarity, we preserve 8 tables, each with 16 columns, for each schema-question pair.
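A sketch of this grounding step under stated assumptions: `embed` is a hypothetical helper returning unit-normalized embeddings, and the schema is represented as a simple table-to-columns dictionary.

```python
# A sketch of schema grounding for BIRD: rank tables and columns by embedding
# similarity to the question plus evidence, keeping the top tables and columns.
# `embed` is an assumed helper that returns unit-normalized vectors.
import numpy as np

def ground_schema(embed, question, evidence, schema, n_tables=8, n_cols=16):
    # schema: {table_name: [column_name, ...]}
    query_emb = embed([f"{question} {evidence}"])[0]

    def sim(name):
        return float(embed([name])[0] @ query_emb)

    kept_tables = sorted(schema, key=sim, reverse=True)[:n_tables]
    return {t: sorted(schema[t], key=sim, reverse=True)[:n_cols] for t in kept_tables}
```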

D.5 Evaluation on COGS
COGS contains 24,155 examples in the train set and 21,000 examples in the gen set. To reduce the high computational cost, we sample 2,000 examples from the train set as the example bank for in-context learning, and sample 1,000 examples from the two sub-tasks, primitive substitution (P.S.) and primitive structural alternation (P.A.), defined in An et al. (2023).

D.6 Target Sketch Matching for SQL
As mentioned in Section 4.2, to select in-context examples with target sketch matching (oracle) in text-to-SQL tasks, we calculate the overlap of SQL keywords between each example from the example bank and the labeled SQL query of the test input query. We mainly consider the following SQL keywords along with several operators: SELECT, WHERE, GROUP, HAVING, ORDER, DESC, ASC, LIMIT, JOIN, INTERSECT, EXCEPT, UNION, NOT, IN, OR, AND, BETWEEN, EXISTS, LIKE, DISTINCT, COUNT, AVG, MIN, MAX, SUM, CAST, CASE, WHEN, THEN, ELSE, END, IIF, REAL, FLOAT, NULL, STRFTIME, *, /, =, >, <, !, +, -, %. Based on these keywords, the target sketch similarity between two SQL queries y_t and y_i is calculated as follows:

sim_k(y_t, y_i) = |KW(y_t) ∩ KW(y_i)|,    (7)

in which KW(·) returns the set of contained keywords.
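A minimal sketch of Equation 7 is shown below; the keyword set is abbreviated (the full list is given above) and the tokenization is illustrative.

```python
# A sketch of Equation 7: sketch similarity as the size of the keyword overlap
# between two SQL queries. The keyword set is abbreviated; see the list above.
SQL_KEYWORDS = {
    "SELECT", "WHERE", "GROUP", "HAVING", "ORDER", "DESC", "ASC", "LIMIT",
    "JOIN", "INTERSECT", "EXCEPT", "UNION", "NOT", "IN", "OR", "AND", "BETWEEN",
    "EXISTS", "LIKE", "DISTINCT", "COUNT", "AVG", "MIN", "MAX", "SUM",
}

def keyword_set(sql):
    # crude tokenization for illustration only
    tokens = sql.upper().replace("(", " ").replace(")", " ").replace(",", " ").split()
    return {t for t in tokens if t in SQL_KEYWORDS}

def sketch_similarity(y_t, y_i):
    """sim_k(y_t, y_i) = |KW(y_t) ∩ KW(y_i)|"""
    return len(keyword_set(y_t) & keyword_set(y_i))
```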

D.7 T-SNE Visualization
For the visualized embedding space in Figure 6, we use Sentence-BERT as the embedding model and take examples from both the example bank and the dev set of Spider. For SKILL-KNN, we take its consistency-based variant. To accelerate the visualization process, we only take examples of medium hardness (defined in Yu et al. (2018b)). We use the implementation of t-SNE from the sklearn library. We set the learning rate of t-SNE to "auto", the init method to "random", and the perplexity to 3.
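A sketch of this visualization with the reported t-SNE settings, assuming `bank_emb` and `test_emb` are Sentence-BERT embeddings of the example-bank and dev-set items:

```python
# A sketch of the t-SNE visualization with the settings above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_space(bank_emb, test_emb, title):
    all_emb = np.vstack([bank_emb, test_emb])
    coords = TSNE(n_components=2, learning_rate="auto",
                  init="random", perplexity=3).fit_transform(all_emb)
    n = len(bank_emb)
    plt.scatter(coords[:n, 0], coords[:n, 1], c="gray", label="example bank")
    plt.scatter(coords[n:, 0], coords[n:, 1], c="orange", label="dev set")
    plt.title(title)
    plt.legend()
    plt.show()
```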

E Annotated Demonstrations
gas_company Show minimum, maximum, and average market value for all companies.
To solve this task in the database, we need to return the minimum value, the maximum value, and the average of values in the column.
culture_company Show the years, book titles, and publishers for all books, in descending order by year.
To solve this task in the database, we need to select three columns and sort them in descending order according to the values in one column.
product_catalog Which catalog contents have a product stock number that starts from "2"? Show the catalog entry names.
To solve this task in the database, we need to select one column and apply a constraint on the format of values in this column.
bike_1 What are the dates in which the mean sea level pressure was between 30.3 and 31?
To solve this task in the database, we need to select one column and apply a constraint that the values in another column should be in a certain range.

flight_1 What is the salary and name of the employee who has the most number of certificates on aircrafts with distance more than 5000?
To solve this task in the database, we need to join three tables, apply a greater-than constraint, group the selections and calculate the number of each group, sort the selections in descending order, and select the top result.
bike_1 What is the average longitude of stations that never had bike availability more than 10?
To solve this task in the database, we need to calculate the average value in one column, apply a non-inclusion constraint with another set of selections, which need to group the selections and find which groups have a maximum value greater than the threshold.

hr_1 display all the information of employees whose salary is in the range of 8000 and 12000 and commission is not null or department number does not equal to 40.
To solve this task in the database, we need to give full information about selections, apply an interval constraint, and apply an optional constraint that at least one of two unequal judgments should be satisfied.
bike_1 What are the ids of stations that have latitude above 37.4 and never had bike availability below 7?
To solve this task in the database, we need to exclude the selections in the second set from the first set: the first set of selections need to apply a greater-than constraint, and the second set of selections need to group the selections and find which groups have a minimum value lower than the threshold.

bike_1 What are the names and ids of stations that had more than 14 bikes available on average or were installed in December?
To solve this task in the database, we need to return the union of two sets of selections: the first set of selections need to join two tables, group the selections and find which groups have an average value greater than the threshold, and the second set of selections need to apply a constraint on the format of values.
storm_record Show storm name with at least two regions and 10 cities affected.
To solve this task in the database, we need to return the intersection of two sets of selections: the first set of selections need to join two tables, group the selections and find which groups have a number of selections greater than or equal to the threshold, and the second set of selections need to join two tables, group the selections and find which groups have a sum of values larger than or equal to the threshold.

formula_1 List the forenames of all distinct drivers in alphabetical order?
To solve this task in the database, we need to select distinct values in one column and sort these selections in ascending order according to the selected values.

hr_1 display job ID for those jobs that were done by two or more for more than 300 days.
To solve this task in the database, we need to apply a greater-than constraint on the difference between two values, group the selections and find which groups have a number of selections greater than or equal to the threshold.

small_bank_1 Find the names and total checking and savings balances of accounts whose savings balance is higher than the average savings balance.
To solve this task in the database, we need to select one column and add the values in another two columns, join three tables, and apply a greater-than constraint where the threshold is the average of another set of selected values.

To solve this task in the database, we need to get two times, calculate the differences between the two times, and calculate the average of these differences. Additionally, we need to join three tables and apply an equivalent constraint.

To solve this task in the database, we need to apply an either-or constraint, sort the selected results in ascending order, and return the top one result.

retails Which ship mode has more "deliver in person" instructions, rail or mail?
ship mode refers to l_shipmode; "deliver in person" instruction refers to l_shipinstruct = 'DELIVER IN PERSON'
To solve this task in the database, we need to count the number of two values in one column and return the value with a larger count. Additionally, we need to apply an equality constraint.

cookbook Which ingredient appeared the most in recipes? Calculate its amount of appearance in percentage.
ingredient appeared the most in recipes refers to MAX(COUNT(ingredient_id)); calculation = MULTIPLY(DIVIDE(COUNT(MAX(ingredient_id)), COUNT(ingredient_id)), 100)
To solve this task in the database, we need to select one column and calculate one percentage number, join two tables, group the selected results, and sort the results in descending order according to the size of each group. Additionally, to calculate the percentage number, we need to cast the count of values into a float number, multiply this number by 100, and divide it by another count.

A sandwich was fed to a giraffe . This sentence is in the passive voice and has a prepositional phrase (i.e., 'to noun phrase') which describes the recipient of the verb.
Sophia was given a cookie by Emma . This sentence is in the passive voice, has an object and has a prepositional phrase (i.e., 'by noun phrase') which describes the agent of the verb.
Sophia liked a box on the cake . This sentence has a single object with a modification phrase.
Emma sold the drink beside a road to a zebra . This sentence has a direct object with a modification phrase and has a prepositional phrase (i.e., 'to noun phrase') which describes the recipient of the verb.
A lion ate . The verb 'ate' has no object.
Eleanor was offered the ball . This sentence is in the passive voice and has an object.
A box was helped . This sentence is in the passive voice and has no object or prepositional phrase.
The fish dreamed to walk . This sentence has an infinitive verb.
A cat lended a lawyer the cake . This sentence has an indirect object and a direct object.
The basket was handed to a cat by Emma .
This sentence is in the passive voice and has two prepositional phrases: the first one (i.e., 'to noun phrase') describes the recipient of the verb and the second one (i.e., 'by noun phrase') describes the agent of the verb.
Sofia thought that a pancake rolled . This sentence contains a clause in which the verb 'rolled' has no object.
Liam hoped that the boy wanted to dance . This sentence contains a clause that has an infinitive verb.
The cat gave Ethan a rose on the table . This sentence has an indirect object and a direct object with a modification phrase.
The duke was passed the shell on a table in the house by Emma .

Figure 1: In-context learning with different selection methods. (a) Examples from raw-input-based selection just share similar entities with the input query. (b) With the skill-based description, the selected examples contain the desired task-specific skills.

Figure 2: The bird's-eye view of SKILL-KNN, a rewrite-then-retrieve selection method to facilitate in-context learning with skill-based descriptions.

Figure 4: Performance of SKILL-KNN (base version) with different numbers of annotated demonstrations.

Figure 6: T-SNE visualization for the embedding space of (a) raw input queries and (b) skill-based descriptions. The orange points are from the dev set in Spider while the gray points are from the example bank.

Figure 7: Two perspectives that reflect the complexity of selected in-context examples from the two variants of SKILL-KNN. (a) The average number of tables contained in the database of each in-context example. (b) The average length of SQL queries (split by space).

Table 1: Part of our annotated skill-based descriptions for text-to-SQL tasks.

Schema (truncated): [..., lname, fname, age, sex, major, ...]
Skill: To solve this task in the database, we need to select distinct values in the column.

Question: Count the number of different colleges that players who play for Columbus Crew are from.
Schema (truncated): team[team id, name], country[country id, country name, capital, ...], match season[season, player, position, country, team, ...]
Skill: To solve this task in the database, we need to join two tables and count the number of distinct values in the column.

Figure 3: Consistency-based selection uses the central similarity over candidate skill-based representations, while distinctiveness-based selection uses the maximum similarity.

Table 2: Our main experimental results (%) across various LLMs and tasks. Numbers in bold are the best results across non-oracle methods, and results with underlines outperform at least one oracle method.

Table 3: Performance of gpt-35-turbo and gpt-4 on the Spider dev set.

Table 4: Comparison with fine-tuning-based selection methods on the Spider dev set.

Table 5: Performance of SKILL-KNN and two baselines on GSM8K.


Table 7: Comparison with more baselines on the Spider dev set.

Table 8: The selection accuracy of different variants under different noise patterns.

Table 10: Number of different databases among selected examples. This reflects the diversity of in-context examples selected by different methods.
Table 12 lists 16 annotated demonstrations for text-to-SQL tasks, Table 13 lists another 12 annotated demonstrations with evidence (which is required for BIRD), and Table 14 lists 16 annotated demonstrations for COGS. Appendix D.1 introduces how we select these examples.

Table 12: Annotated demonstrations for text-to-SQL tasks. All these examples are from the example bank of Spider.

Table 13: Annotated demonstrations for text-to-SQL tasks with evidence. All these examples are from the example bank of BIRD.

To solve this task in the database, we need to select one column, apply a less-than constraint, and return three results.

retail_complains Among the teenager clients who use Google account and Microsoft account, which group of client is more than the other? teenager refers to 13 < age <= 19; Google account refers to email like '%@gmail.com'; Microsoft account refers to email like '%@outlook.com'
To solve this task in the database, we need to compare the number of values in two formats, and return the value that has a higher number. Additionally, we need to apply a between-and constraint.

To solve this task in the database, we need to select distinct values from one column, join two tables, group the results, order the results in descending order according to the sum of values, and return the top three results. Additionally, during getting the sum of values, we need to multiply the values in one column with percentages in another column and divide the results by 100.

To solve this task in the database, we need to compare values in two columns and convert the comparison result into a percentage. Additionally, we need to join two tables, apply a constraint on the format of values in one column, and ensure that the values in two columns are not null. Moreover, to get the percentage number, we need to cast the sum of values into a real number, multiply this number by 100, and divide it by the total count.

retail_complains Between 1/1/2017 and 4/1/2017, what is the average server time of calls under the server DARMON? between 1/1/2017 and 4/1/2017 refers to Date received between '2017-01-01' and '2017-04-01'; average server time refers to avg(ser_time)
To solve this task in the database, we need to calculate the average of the selected values and apply a between-and constraint. Additionally, to obtain the times from values in text format, we need to extract substrings from these texts and cast them into real numbers.

soccer_2016 When did Chennai Super Kings play its first match? match date refers to Match_Date; Chennai Super Kings refers to Team_Name = 'Chennai Super Kings'; first match refers to min(Match_Date)

Table 14: Annotated demonstrations for COGS. All these examples are from the example bank of COGS.

This sentence contains a clause in which the verb 'saw' has no object.