Self-ICL: Zero-Shot In-Context Learning with Self-Generated Demonstrations

Large language models (LLMs) have exhibited striking in-context learning (ICL) ability to adapt to target tasks with a few input-output demonstrations. For better ICL, different methods have been proposed to select representative demonstrations from existing training corpora. However, such settings are not aligned with real-world practices, as end-users usually query LMs without access to demonstration pools. In this work, we introduce Self-ICL -- a simple framework which bootstraps LMs' intrinsic capabilities to perform zero-shot ICL. Given a test input, Self-ICL first prompts the model to generate pseudo-inputs. Next, the model predicts pseudo-labels for the pseudo-inputs via zero-shot prompting. Finally, we perform ICL for the test input with the pseudo-input-label pairs as demonstrations. Evaluation on 23 BIG-Bench Hard tasks shows Self-ICL outperforms zero-shot baselines on both average accuracy and head-to-head comparison. Moreover, with zero-shot chain-of-thought, Self-ICL achieves results comparable to using real demonstrations. Additionally, we conduct a range of analyses to validate Self-ICL's effectiveness and provide insights into its behavior under different settings.


Introduction
Large language models (LMs) have shown a striking ability to adapt to new tasks at test time when prompted with a few input-output exemplars, i.e., demonstrations (Brown et al., 2020; Wei et al., 2022a; Chowdhery et al., 2022; Wei et al., 2023). This ability is referred to as in-context learning (ICL) (Brown et al., 2020). Towards better ICL performance, approaches for selecting representative demonstrations have been investigated extensively (Sorensen et al., 2022; Levy et al., 2022; Zhang et al., 2022a; Gonen et al., 2022). Most techniques assume access to large-scale external
sources (e.g., a training dataset or relevant text corpus) from which demonstrations can be selected with methods such as nearest-neighbor search or other pre-defined, sophisticated similarity metrics (Liu et al., 2022; Rubin et al., 2022; Wu et al., 2022). However, in most real-world settings, users query large LMs (e.g., through APIs or a web interface) without access to an existing corpus for their target tasks, and spending additional effort to handcraft demonstrations may disrupt their workflows.
Recently, a series of studies has shed light on the inner workings of ICL (Xie et al., 2021; Reynolds and McDonell, 2021; Min et al., 2022b). Their evidence suggests that instead of contributing explicit signals for learning new tasks, demonstrations mainly expose large LMs' intrinsic functionalities and guide models towards target domains (Razeghi et al., 2022; Lyu et al., 2022). Similar clues are also partly observed in chain-of-thought (CoT) prompting (Madaan and Yazdanbakhsh, 2022) and instruction-augmented ICL (Webson and Pavlick, 2022). These findings indicate, to some degree, that large LMs carry underestimated zero-shot abilities and are already equipped to fulfill various target tasks.
Inspired by the above-mentioned literature, we propose SELF-ICL, a simple prompting framework for zero-shot in-context learning. SELF-ICL bootstraps a large LM's intrinsic capabilities with self-generated demonstrations that inform the input and label space for performing ICL. Given a query, i.e., a test input, the procedure of SELF-ICL involves three steps:
1. The model is prompted to generate pseudo-inputs conditioned on the given query and the corresponding task description.
2. The model predicts pseudo-labels for the pseudo-inputs via zero-shot prompting.
3. The pseudo-input-label pairs form pseudo-demonstrations, which are then prepended to the query and proceed with standard ICL.
All steps adopt the same frozen large LM. Without requiring a candidate pool for demonstration selection, SELF-ICL bridges the gap to end-users' practical needs.
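The three steps can be sketched as a single pipeline around one frozen model. This is a minimal illustration, not the paper's implementation: the `llm` callable stands in for any prompt-in/text-out model API, and the prompt wording and "Q:/A:" layout are assumptions.

```python
from typing import Callable, List

def self_icl(llm: Callable[[str], str], task_desc: str,
             test_input: str, k: int = 3) -> str:
    """Run the three Self-ICL steps with the same frozen model `llm`.

    `llm` is any prompt-in/text-out callable; the prompt wording below
    is an illustrative assumption, not the paper's exact template.
    """
    # Step 1: generate k pseudo-inputs conditioned on the query and
    # task description, asking for new, diverse, creative instances.
    gen_prompt = (
        f"{task_desc}\n"
        f"Example instance: {test_input}\n"
        f"Please come up with {k} new, diverse, and creative instances, "
        f"one per line:"
    )
    pseudo_inputs: List[str] = [
        line.strip() for line in llm(gen_prompt).splitlines() if line.strip()
    ][:k]

    # Step 2: predict a pseudo-label for each pseudo-input via
    # zero-shot prompting (one inference pass per pseudo-input).
    pseudo_labels = [llm(f"{task_desc}\nQ: {x}\nA:").strip()
                     for x in pseudo_inputs]

    # Step 3: prepend the pseudo-demonstrations to the test input
    # and run standard ICL.
    demos = "\n\n".join(f"Q: {x}\nA: {y}"
                        for x, y in zip(pseudo_inputs, pseudo_labels))
    final_prompt = f"{task_desc}\n\n{demos}\n\nQ: {test_input}\nA:"
    return llm(final_prompt)
```

Note that the same `llm` is called in every step, so no second model or external data source is involved.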
To assess SELF-ICL's effectiveness on challenging, unexpected tasks for which existing demonstrations are hard to come by, we perform evaluation on a suite of 23 tasks from BIG-Bench Hard (BBH) (Suzgun et al., 2022). In a head-to-head comparison, experimental results show 16-1-6 (win-tie-lose) for SELF-ICL versus standard zero-shot on the 23 tasks. For the chain-of-thought (CoT) (Wei et al., 2022b) setting, the results are 12-3-8 for SELF-ICL with zero-shot CoT versus standard zero-shot CoT. In addition, the results for SELF-ICL without zero-shot CoT versus zero-shot CoT are 12-4-7. These results suggest a potential alternative for eliciting LMs' reasoning ability in zero-shot settings, without the risk of exposing unfaithful CoT explanations in the output content (Turpin et al., 2023). To the best of our knowledge, we present the first attempt at true zero-shot ICL that does not require any external data from the real distribution or pre-defined label sets.

Pseudo-Demonstration Construction
In the following, we discuss the design of SELF-ICL for constructing ideal pseudo-demonstrations, i.e., pseudo-inputs and pseudo-labels.

Pseudo-Input Construction.
Generating pseudo-inputs can be easily achieved by zero-shot prompting large LMs with the simple prompt shown in Figure 1 (Step 1). The given query q (from the real distribution) provides an outline of ground-truth inputs, and the corresponding task description T guides the model to generate relevant information associated with the task domain. From q and T, the model infers the underlying format and creates a new query, i.e., a pseudo-input. By specifying a number k (the number of shots) in the instruction, this process can generate multiple pseudo-inputs in one inference pass.
Pseudo-Label Construction.

After obtaining the pseudo-inputs, we predict their labels, i.e., pseudo-labels, for constructing pseudo-demonstrations via zero-shot prompting the same large LM. Specifically, we experiment with two methods: zero-shot prompting and zero-shot CoT (Kojima et al., 2022) prompting. More details are described in Section 4.

Inference
With the pseudo-input-label pairs, the pseudo-demonstrations are constructed. We next proceed with the typical ICL workflow, but with pseudo-shots. Namely, the pseudo-demonstrations (with instructions) are prepended to the test input as the context for prompting the large LM (Figure 1, Step 3).

Preliminaries
Consider a set of k-shot demonstrations denoted as {(x_1, y_1), ..., (x_k, y_k)}, where x_i is the input text and y_i is the label. Following the analysis framework of Min et al. (2022b), four aspects are considered for the construction of demonstrations: the input-label mapping (whether x_i is paired with a correct y_i), the input space (the underlying distribution behind x_1, ..., x_k), the label space (the possible label set inferred from y_1, ..., y_k), and the pairing format (the format representing the x_i-y_i pair). Min et al. (2022b) inspect the role of demonstrations along these four aspects and present a surprising finding: the input-label mapping is not a necessary criterion for successful ICL. Empirically, they find that randomly swapping the ground-truth labels of demonstrations barely degrades end-task performance. On the contrary, the other three aspects all have great impact. With these four aspects in mind, we now analyze the construction of pseudo-demonstrations for SELF-ICL.

The entanglement of input space and input-label mapping
Among the four aspects, the label space is usually specified in the input (e.g., the options presented for a multiple-choice task) or described within the task description. For example, the label space {"True", "False"} of the boolean expressions task can be easily inferred from its description "Evaluate the result of a random Boolean expression." The pairing format is of least concern, as all pseudo-input-label pairs are well formatted as "Q: input text, A: label text".
The potentially problematic aspects are the input space and the input-label mapping. Naturally, one may think the input-label mapping is not an issue, as described in Section 3.1: the input does not need to be paired with the correct label. However, this intriguing discovery by Min et al. (2022b) is established under the setting of standard ICL, where the inputs are randomly sampled from the training set. As the pseudo-inputs created by SELF-ICL are based on only one reference, i.e., the given test input, the generated pseudo-inputs are likely to be of great semantic similarity to that test input and may fail to capture the correct input space distribution. In such cases, the input-label mapping becomes problematic, since it has been shown that models tend to copy the labels paired with inputs that are very similar to the test input, a phenomenon known as the copying effect (Lyu et al., 2022). With no guarantee of the correctness of SELF-ICL's pseudo-labels, the copying effect could potentially hurt ICL performance.

Mitigation through diversification
To mitigate the possible impact of the copying effect, increasing the pseudo-inputs' diversity is essential. Typically, this can be resolved by sampling demonstration inputs from different clusters of training set inputs (Zhang et al., 2022b). In our SELF-ICL framework, no real data is available. Instead, we apply a simple method: prompting large LMs to be diverse. The model is instructed to come up with "new", "diverse", and "creative" instances (see Step 1 in Figure 1 for the prompt). In the following, we attempt to quantitatively verify whether our generated pseudo-inputs are diverse enough in comparison with real inputs randomly sampled from the training data, by measuring the similarity gap of the query-input distance between pseudo- and real-inputs. Consider a query q, i.e., a test input, a set of k inputs {x_1, ..., x_k} randomly selected from the training set, and a set of k pseudo-inputs {x̂_1, ..., x̂_k} generated by SELF-ICL conditioned on q. We first define the query-input similarity difference d(·) between using pseudo-inputs and real-inputs as

d(q) = (1/k) Σ_{i=1}^{k} sim(q, x̂_i) − (1/k) Σ_{i=1}^{k} sim(q, x_i),

where sim(·, ·) is the cosine similarity and the query and inputs are encoded by sentence transformers (Reimers and Gurevych, 2019), following Liu et al. (2022) and Zhang et al. (2022b). Next, we compute the similarity gap G(·) as

G(Q) = (1/n) Σ_{j=1}^{n} d(q_j),

where Q is a set of n queries {q_1, ..., q_n} for a task. The similarity gaps for the 23 tasks from BBH are presented in Figure 2. The results are averaged across five different random seeds, and we provide standard deviation error bars. A larger gap indicates that the queries are closer to the pseudo-inputs than to real inputs sampled from the training set, and thus more likely to suffer from the copying effect. As observed, most of the tasks fall inside the ±5% similarity range (dotted lines), suggesting that our designed prompt encourages the generation of diverse pseudo-inputs that sufficiently resemble inputs sampled from real distributions, thus mitigating the potential risk of the copying effect.
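This measurement can be sketched with any sentence encoder. In the sketch below, the `encode` callable is an assumption standing in for the sentence-transformer encoder, and the cosine similarity is computed directly:

```python
import math
from typing import Callable, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_gap(encode: Callable[[str], Sequence[float]],
                   queries: List[str],
                   pseudo_inputs: List[List[str]],
                   real_inputs: List[List[str]]) -> float:
    """Average over queries of d(q): mean query/pseudo-input similarity
    minus mean query/real-input similarity. Positive values mean the
    pseudo-inputs sit closer to the query than sampled real inputs,
    i.e., a higher risk of the copying effect."""
    gaps = []
    for q, pseudo, real in zip(queries, pseudo_inputs, real_inputs):
        q_vec = encode(q)
        sim_pseudo = sum(cosine(q_vec, encode(x)) for x in pseudo) / len(pseudo)
        sim_real = sum(cosine(q_vec, encode(x)) for x in real) / len(real)
        gaps.append(sim_pseudo - sim_real)
    return sum(gaps) / len(gaps)
```

A gap near zero indicates the pseudo-inputs are about as far from the query as real sampled inputs are.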

Experimental Setups
To recap, the SELF-ICL framework consists of three steps: (1) construction of pseudo-inputs, (2) construction of pseudo-labels, and (3) ICL with pseudo-demonstrations, i.e., pseudo-input-label pairs. We investigate two settings, zero-shot prompting and zero-shot CoT prompting, in our experiments. The two settings share the same procedure for pseudo-input construction (Step 1) and differ in Steps 2 and 3. In the following, we discuss their details, the baselines, and the implementation configurations. Full prompts are provided in Figures 4, 5, and 6 in the Appendix.

Zero-Shot
Standard. The baseline for the zero-shot setting is the standard zero-shot prompting scheme. Namely, we prompt the large LM with only the task description and the current test input for the prediction.
SELF-ICL. For our proposed SELF-ICL, given the pseudo-inputs from Step 1, we construct pseudo-labels in Step 2 via typical zero-shot prompting, similar to the standard zero-shot prompting described above but with the test input replaced by a generated pseudo-input. We construct pseudo-labels one by one, i.e., for a k-shot demonstration, k inference passes are required for the k pseudo-inputs. Next, in Step 3, we construct pseudo-demonstrations by pairing the pseudo-inputs with their corresponding pseudo-labels from the previous steps, and predict the final answer for the test input by typical ICL as described in Section 2.2.

Zero-Shot CoT
Standard. The baseline for the zero-shot CoT setting is the standard zero-shot CoT prompting proposed by Kojima et al. (2022). Specifically, we prompt the large LM with the task description, the current test input, and a trigger phrase, "Let's think step by step.", for performing CoT reasoning. The trigger phrase is appended at the very end of the prompt, guiding the model to generate intermediate reasoning steps that lead to a more accurate final answer.

SELF-ICL.
In the zero-shot CoT setting, SELF-ICL generates pseudo-labels by zero-shot CoT prompting. The procedure follows the above-mentioned standard zero-shot CoT, with the pseudo-inputs as context instead of the test input. We then take the generated reasoning chain containing the answer for each pseudo-input as the pseudo-label for constructing pseudo-demonstrations, and proceed with Step 3 as illustrated in Section 2.2, with the additional "Let's think step by step." trigger phrase added at the end of the prompt to perform pseudo-few-shot CoT prompting. For evaluation on the test input, only the final answers are extracted.

Table 1: The main results for our proposed SELF-ICL on a suite of BBH tasks. SELF-ICL steadily outperforms the baselines in both the zero-shot and zero-shot CoT settings. In addition, SELF-ICL without zero-shot CoT is comparable to, and even slightly surpasses, zero-shot CoT.
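The CoT variant of Steps 2 and 3 can be sketched as follows. The `llm` callable and the "Q:/A:" layout are illustrative assumptions; the key point is that the whole generated reasoning chain serves as the pseudo-label, and the trigger phrase appears again at the very end for the test input.

```python
from typing import Callable, List

TRIGGER = "Let's think step by step."

def cot_pseudo_demos(llm: Callable[[str], str], task_desc: str,
                     pseudo_inputs: List[str]) -> str:
    # Step 2 (CoT setting): the full reasoning chain returned by
    # zero-shot CoT prompting is kept as the pseudo-label.
    demos = []
    for x in pseudo_inputs:
        chain = llm(f"{task_desc}\nQ: {x}\nA: {TRIGGER}").strip()
        demos.append(f"Q: {x}\nA: {TRIGGER} {chain}")
    return "\n\n".join(demos)

def cot_final_prompt(task_desc: str, demos: str, test_input: str) -> str:
    # Step 3: pseudo-few-shot CoT prompting, with the trigger phrase
    # appended again at the very end for the test input.
    return f"{task_desc}\n\n{demos}\n\nQ: {test_input}\nA: {TRIGGER}"
```

Only the final answer would then be extracted from the model's completion of this prompt.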

Configurations
Language model. For our experiments, we use InstructGPT (text-davinci-003) (Ouyang et al., 2022) via the OpenAI GPT-3.5 API. Though details on the specific model are not well documented by OpenAI, we choose InstructGPT as it is one of the most capable large LMs available, with relatively stable controllability. For hyperparameters, we set the temperature to 0 and the maximum number of tokens to 1024. Other arguments are kept at their default values.
Dataset. We adopt the BIG-Bench Hard (BBH) benchmark for our evaluation. BBH consists of a suite of tasks from the BIG-Bench benchmark (Srivastava et al., 2022) on which existing LMs have difficulty reaching the average human-rater performance and which are considered beyond current models' capabilities. BBH contains a total of 27 tasks, from which we select the 23 multiple-choice tasks as our evaluation testbed for SELF-ICL. Each BBH task has around 150∼250 examples; due to budget constraints, we sample 100 instances per task for our experiments.

Results
We present our experimental results in Table 1. As observed, our proposed SELF-ICL consistently surpasses the baselines in both the zero-shot and zero-shot CoT settings on the all-task average performance. We present head-to-head comparisons on the 23 tasks in Figure 3. The results in the zero-shot setting are 16-1-6 (win-tie-lose) for SELF-ICL versus the standard zero-shot baseline; for the zero-shot CoT setting, the results are 12-3-8 for SELF-ICL versus the standard zero-shot CoT baseline.
Interestingly, the results are 12-4-7 for SELF-ICL without zero-shot CoT, i.e., SELF-ICL in the zero-shot setting, versus standard zero-shot CoT. In addition, superior performance is exhibited in the all-task average results as well. SELF-ICL without zero-shot CoT sheds light on an alternative for eliciting LMs' reasoning ability in zero-shot settings, without generating potentially misleading or biased reasoning chains which could undermine human-AI interactions.
Figure 3: The head-to-head comparison on the 23 tasks from BBH. The accuracy delta indicates the accuracy difference between SELF-ICL and the baseline method (blue/orange indicates our method wins/loses). The results are 16-1-6 (win-tie-lose) for the zero-shot setting; 12-3-8 for the zero-shot CoT setting; and 12-4-7 for SELF-ICL without zero-shot CoT versus standard zero-shot CoT.
Related Works

Understanding ICL

With the popularization of various large LMs, ICL has emerged as a new paradigm for the field of natural language processing (NLP). Despite its striking capability, the mechanics behind ICL are still an open question in the research community. To develop a deeper understanding of ICL, Chan et al. (2022) investigate the training data distribution of large LMs, and find that specific distributional properties and the transformer-based architecture (Vaswani et al., 2017) could drive ICL behaviors. Recent studies also provide explanations viewing LMs as meta-optimizers with meta-gradients applied in the forward passes, and show evidence of resemblances between ICL and the explicit fine-tuning process (Dai et al., 2022; von Oswald et al., 2022; Akyürek et al., 2022).

Improving ICL
Towards better empirical performance of ICL, approaches for designing ideal prompts and demonstrations have been vastly investigated (Min et al., 2022a; Lu et al., 2022b; Su et al., 2022; Zhou et al., 2022; Lu et al., 2022a; Fu et al., 2022). A recent work by Zhang et al. (2022b) addresses the need for human-annotated few-shot CoT by utilizing zero-shot CoT to construct demonstrations. Their method differs from ours, as it requires an existing training set from which shots are sampled as inputs to zero-shot CoT. Lyu et al. (2022) attempt to remove the need for a pre-given demonstration candidate set by selecting semantically relevant sentences from a raw text corpus (which is not from the task datasets) as pseudo-inputs, and pairing the selected pseudo-inputs with randomly assigned labels as demonstrations for ICL. Though more similar to our setting, they still need access to external sources for constructing pseudo-inputs. Moreover, they are limited to classification tasks where a fixed set of labels is shared among all inputs. In contrast, SELF-ICL generates different input-dependent options for multiple-choice tasks, and can easily extend to other generation tasks.
The most similar work to ours is by Kim et al. (2022), who explore the possibility of generating pseudo-inputs by the large LM itself, without any external data source. However, their framework requires access to the label set. They generate each pseudo-input by conditioning the LM on a label given in the prompt. Such a design does not align with practical usage, as it greatly restricts the scenario to fixed classification tasks. As a result, their evaluation is limited to text classification (sentiment classification and natural language inference), which is relatively simple and well-studied compared to BBH in our evaluation.

Conclusions
In this work, we introduce SELF-ICL, a simple prompting framework for zero-shot in-context learning. Given a test input and the task description, SELF-ICL prompts the LM to generate pseudo-inputs, and predicts their corresponding pseudo-labels via zero-shot (CoT) prompting. The pseudo-input-label pairs form pseudo-demonstrations, which are then prepended to the test input for ICL. Experimental results on an array of challenging BBH tasks show that SELF-ICL steadily outperforms the zero-shot and zero-shot CoT baselines in head-to-head and all-task average accuracy. To the best of our knowledge, we present the first true zero-shot approach for ICL, and demonstrate the potential of bootstrapping LMs' inner capabilities to improve zero-shot performance.

Limitations
Reliance on instruction-following models. A key driver of our SELF-ICL framework is a powerful instruction-following LM that can follow instructions, understand unseen target tasks, and generate pseudo-inputs and labels via zero-shot prompting. If the model is not equipped with such zero-shot generalization capability, the results of SELF-ICL would be inferior.
Better diversification approaches. To mitigate the potential risk of suffering from the copying effect, we simply construct heuristic prompts telling the LM to generate diverse pseudo-inputs. Under a limited budget, we do not perform comprehensive prompt searching or experiment with temperature adjustments. Future work could explore methods along the line of one-shot data augmentation for constructing optimal pseudo-demonstrations.

Figure 1 :
Figure 1: Our proposed SELF-ICL framework for zero-shot in-context learning. SELF-ICL involves three steps: (1) Given a query and a corresponding task description, the large LM is prompted to generate k (e.g., k=3) pseudo-inputs. (2) Collect the pseudo-inputs and predict their pseudo-labels via zero-shot prompting. (3) Perform ICL with pseudo-demonstrations constructed from the generated pseudo-input-label pairs. The same large LM is used in all steps.

Figure 2 :
Figure 2: The similarity gap of the query-input distance between pseudo- and real-inputs. Most tasks fall into a small ±5% range (the dotted lines), indicating the pseudo-inputs are close to the real inputs and are likely robust against the copying effect if their pseudo-labels are incorrect.