Two Examples are Better than One: Context Regularization for Gradient-based Prompt Tuning

Prompting has gained tremendous attention as an efficient method for adapting large-scale language models. However, prompts often act against human intuition and exhibit unstable performance, which has motivated methods that automatically find effective prompts. One popular approach is gradient-based search, which iteratively updates a (randomly) initialized prompt towards the optimal one under the guidance of gradients. We propose a novel regularization method, CoRe, for gradient-based prompt tuning techniques, which guides a prompt to properly produce a task context. CoRe realizes two regularization effects, context attuning and context filtering, that improve prediction performance in a zero-shot in-context learning setting, where a model makes inferences only with the prompt tuned by CoRe, without any demonstration examples for in-context learning. Context attuning guides the context generated by the input and the tuned prompt toward embedding the appropriate context for the task. In our theoretical analysis, regularizing the context extends to improving zero-shot in-context learning performance. Context filtering steers the prompt to select only the task-related context so that context attuning can focus solely on creating and sending the right task context. We evaluate CoRe on natural language understanding datasets with two large language models, GPT2-XL and GPT-J. Our training scheme yields performance improvements of up to 11.9% on GPT2-XL and up to 6.3% on GPT-J.


Introduction
A prompt is a carefully composed input used to adapt a pretrained language model (LM) through in-context learning, and has gained attention as a new method for the adaptation of LMs. Radford et al. (2019) and Brown et al. (2020) showed that pretrained LMs can be transferred to various downstream NLP tasks by designing prompts aligned with both the pretraining objectives and the downstream tasks. Initially, a prompt was manually crafted by a developer using human intuition. However, because of the unstable performance and unpredictable behavior of manual prompts (Liu et al., 2021), automatic methods for prompt tuning have been proposed. One of the most popular paradigms is optimizing prompts using stochastic gradient descent (SGD) (Lester et al., 2021; Shin et al., 2020; Liu et al., 2021; Zhong et al., 2021), namely gradient-based prompt tuning, which updates prompt tokens or parameterized embeddings under the guidance of gradients.
To further boost the performance of gradient-based prompt tuning methods, this paper proposes a new context regularization scheme, CoRe. CoRe introduces two regularizers: context attuning and context filtering. Context attuning guides a prompt ω and an example {x, y} to generate an appropriate task context without being biased towards the example. This is realized by concatenating multiple examples into a single sequence, as with the three examples in Figure 1. Context attuning then optimizes the first prompt (the ω that comes before x_0) towards minimizing the losses of the succeeding examples (y_1, y_2). Therefore, the prompt does not become biased to {x_0, y_0}, but generalizes to the appended examples as well and generates less biased task contexts.
However, context attuning alone does not produce the regularization effect, because the succeeding examples receive overly diverse contexts, covering styles, topics, relationships between entities, and task information. Such diversity hinders the optimization of the regularizer towards the task context. To resolve this problem, we add another regularizer, context filtering, which adequately extracts the task context from the preceding examples. Specifically, following the example in Figure 1, context filtering (blue line) optimizes prompts given preceding examples, e.g., the optimization of ω_1 with respect to y_1 when {x_0, y_0, x_1} is given.
We evaluate the effect of CoRe in a zero-shot in-context learning (ICL) setting, as most prompt tuning methods have focused on this setting. In zero-shot ICL, only a tuned prompt and a test input are given, without any additional demonstration examples. While ICL often assumes that demonstration examples are given, for simplicity we call this setting zero-shot ICL. It can also be seen as typical zero-shot instruction following.
Context attuning and context filtering improve the model's performance in a zero-shot in-context learning (ICL) setting even though the prompts have been trained in a few-shot ICL setting (i.e., on a concatenation of examples). We conjecture that this successful transfer from the few-shot setting to the zero-shot setting is possible because prompts tuned under the guidance of CoRe generate a more generalized task context. Such a generalized task context is effective regardless of the concatenated examples, thus improving performance in the zero-shot setting. We highlight that this transferability from few-shot to zero-shot settings is noteworthy, and we verify it within a theoretical framework based on the hidden Markov model and Bayesian inference (Xie et al., 2022).
To the best of our knowledge, this work is the first to leverage interactions between examples in a sequence for the purpose of regularization. Several works train on multiple examples in a sequence (Min et al., 2021; Chen et al., 2022; Gao et al., 2021), focusing on improving few-shot in-context learning by maximizing the likelihood of an example given demonstration examples. Our work is distinct from these works in two aspects. First, we target zero-shot settings. Second, we introduce a novel regularization technique, context attuning, to improve zero-shot performance by leveraging the influence between multiple examples within a sequence.

Background
In-context learning (ICL) is a relatively new paradigm in the adaptation of pretrained language models (LMs). ICL has been used to adapt pretrained LMs to numerous downstream tasks such as natural language inference (Schick and Schütze, 2020), text classification (Gao et al., 2021; Puri and Catanzaro, 2019; Schick and Schütze, 2020), question answering (Jiang et al., 2021; Khashabi et al., 2020; Raffel et al., 2020), and many more.
In ICL, a prompt is a task specification given as part of the input that aims to guide the language model into generating the desired output for the given task. A prompt usually consists of a task description, a template of the input with placeholders for data, and optionally demonstration examples. For example, when given the input "Translate English to Spanish: I love you →", the LM is expected to complete the input by generating the answer "te amo". This is different from fine-tuning, where the parameters of the LM are updated to optimize translation performance.
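For illustration, the following is a minimal sketch of this kind of zero-shot prompting with an autoregressive LM via the HuggingFace Transformers API; the model choice and decoding settings here are our own illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of zero-shot prompting with an autoregressive LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

prompt = "Translate English to Spanish: I love you ->"
inputs = tokenizer(prompt, return_tensors="pt")

# The LM completes the input; no parameters are updated.
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:]))
```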
Compared to fine-tuning, the representative LM adaptation technique, ICL is more memory-efficient when we want the LM to handle various different tasks. In ICL, the LM parameters are typically frozen, and developers only have to keep at most dozens of tokens per task, namely the task descriptions and input templates. For fine-tuning, in contrast, developers have to keep a different version of the parameters for each task, which can be as large as 540 billion parameters (Chowdhery et al., 2022) at the time of writing.

Manual prompt tuning
In the early stages of prompting, prompts were often handcrafted with heuristics. Radford et al. (2019) and Brown et al. (2020) demonstrated the transferability of pretrained LMs using manual prompts, showing that LMs can perform well on diverse NLP tasks when provided with adequate prompts.
In particular, Brown et al. (2020) highlight in-context learning, a new adaptation scheme in which several demonstration examples are prepended to the target input. LMs are expected to learn the task by referring only to the prepended demonstrations. Like other GPTs, GPT-3 is known to require carefully crafted prompts to perform the task adequately.
In subsequent works, manual prompts have been applied to a number of research areas such as factual probing (Petroni et al., 2019; Jiang et al., 2020a), knowledge mining (Davison et al., 2019), question answering (Khashabi et al., 2020; Raffel et al., 2020), and text classification (Puri and Catanzaro, 2019; Schick and Schütze, 2020). Despite their remarkable adaptation ability, manually crafted prompts have exhibited unstable performance. Moreover, prompts that perform well are often counterintuitive, making it difficult to understand why specific prompts work well (Liu et al., 2021).

Automatic prompt tuning
Beyond the unstable performance and unaccountable behavior of manual prompts, creating them involves extensive human effort in devising and evaluating numerous handcrafted templates of different patterns. Therefore, recent studies suggest techniques to automatically search for an optimal prompt.
One of the most popular families is gradient-based methods, which iteratively update prompt tokens or parameterized prompt embeddings using gradients. For example, Lester et al. (2021), OptiPrompt (Zhong et al., 2021), and P-tuning (Liu et al., 2021) first concatenate initialized prompt embeddings with the embeddings of the input data in a pre-defined order, and then directly optimize the prompt embeddings on a given task dataset.
The optimized prompt embeddings are used at inference time without any modification.
Beyond gradient-based methods, Jiang et al. (2020b) propose a mining-based method that extracts prompts from the Wikipedia text corpus. Gao et al. (2021) use the pretrained T5 model (Raffel et al., 2020) to fill in the missing spans of a given input sentence to construct a prompt. Guo et al. (2021) propose an RL method for controlled generation with LMs and apply it to prompt generation.

Approach
Our key approach to improving SGD-based prompt tuning is to concatenate multiple examples and regularize the context they generate, which transfers to improvements in zero-shot prediction. Our training scheme first concatenates multiple examples from the same task. Taking the example of concatenating three examples (x_0, y_0, x_1, y_1, x_2, y_2) as depicted in Figure 1, a model is given an input S = {ω, x_0, y_0, ω, x_1, y_1, ω, x_2, y_2}, where ω is the tunable prompt. CoRe then introduces two regularizers: context attuning and context filtering. The context attuning regularizer steers an example to send the correct task context to the succeeding examples, while the context filtering regularizer guides the succeeding examples to receive only the task context among the diverse contexts coming from preceding examples.
We target zero-shot ICL, where only a prompt and a test input are provided (i.e., no demonstration examples). During inference, a model is given ω, x_i and expected to predict y_i, where ω has been trained with CoRe. We call this setting zero-shot in-context learning to convey that the model learns from the context provided by the tuned prompt, but without any demonstration examples (zero demonstration examples for in-context learning). Please note that this differs from the typical zero-shot setting: although the model does not see any training example when making predictions, training data has been used to optimize the prompt with CoRe.
We would like to highlight our finding that prompts tuned with CoRe in a few-shot ICL setting show improved performance in the zero-shot setting. CoRe's two regularizers, context attuning and context filtering, are applied on a concatenation of multiple examples (i.e., few-shot ICL), and they steer the prompt during training to generate a more generalized task context for the following examples (x_1, x_2) in a sequence. Such a well-generated task context (generated by ω) is propagated not only to the following examples (x_1, x_2) but also to the example (x_0) paired with the prompt. Therefore, a prompt tuned with CoRe helps the model's prediction when given an input of a single example (S = {ω, x_i}), which is equivalent to the zero-shot ICL setting.
Throughout this section, we use a simple form of prompt for concise explanation, but without loss of generality: {ω, x_i, y_i}, where ω is a tunable prompt. Our work focuses on autoregressive language models (LMs), so our method is not explored or analyzed on autoencoding LMs such as BERT (Devlin et al., 2019).

Context attuning
Context attuning prevents a prompt from being biased to a single example, using regularization on a context, for better generalization in the zero-shot ICL setting. In detail, this regularizer guides a prompt ω and an example {x, y} to create an appropriate task context without being biased towards the example. The prompt of the first example plays a key role in context attuning: it minimizes not only the loss of the first example (black line in Figure 1) but also the losses of the succeeding examples (red lines in Figure 1). We hypothesize that such optimization inherently regularizes the task context produced by the prompt and prevents the prompt from being too biased to a single example, ultimately achieving better generalization.

To realize this goal, we first put multiple examples in an input sequence so that examples can exchange the task context with one another. This differs from typical SGD-based prompt tuning, where each input sequence has only one training example and gradients cannot flow between examples. In the sequence, the appended examples attend to the preceding examples and are influenced by their context during prediction.

Mechanism
We give a formal definition of the context attuning regularizer. In autoregressive LMs, as depicted in Figure 1 with the red lines, our training method optimizes the following:

$$\max_{\omega_0} \sum_i \sum_{k=0}^{s-1} \log p_\theta(y_{i,k} \mid S_{<k}, \omega_k, x_{i,k}) \qquad (1)$$

where S_k = {ω_k, x_{i,k}, y_{i,k}} is the k-th example in the sequence, S_{<k} = {S_j | j < k}, ω_k is a trainable parameterized prompt, s is the number of concatenated examples, and θ denotes the parameters of the LM. ω_k (k ≠ 0) is identical to ω_0 but is regarded as a constant and not optimized. We introduce the index k to represent the positions of the prompts and to indicate which prompts are optimized. In our method, the prediction of each example S_k is conditioned on the prompt ω_0 paired with the first example ("The gorgeously.." in Figure 1, for example), and the prompt is optimized for multiple examples.
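A minimal PyTorch sketch of how Equation 1 could be implemented, assuming a soft prompt of shape (n, h) and pre-embedded inputs; the function and variable names are hypothetical, and detaching the later prompt copies mirrors the "regarded as a constant" condition above.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the context attuning objective (Equation 1).
# `soft_prompt` plays the role of omega_0; the copies at positions k > 0
# are detached so that they are treated as constants, and gradients flow
# to omega_0 through the losses of all answers in the sequence.
def context_attuning_loss(model, soft_prompt, examples):
    # examples: list of (x_embeds, y_ids); x_embeds are already embedded
    embed = model.get_input_embeddings()
    segments, answer_spans, cursor = [], [], 0
    for k, (x_embeds, y_ids) in enumerate(examples):
        prompt_k = soft_prompt if k == 0 else soft_prompt.detach()
        segments += [prompt_k, x_embeds, embed(y_ids)]
        cursor += prompt_k.size(0) + x_embeds.size(0)
        answer_spans.append((cursor, cursor + y_ids.size(0), y_ids))
        cursor += y_ids.size(0)

    seq = torch.cat(segments, dim=0).unsqueeze(0)    # (1, T, h)
    logits = model(inputs_embeds=seq).logits[0]      # (T, vocab)

    # In an autoregressive LM, the token at position t is predicted from
    # the logits at position t - 1, hence the shifted slices below.
    return sum(F.cross_entropy(logits[s - 1:e - 1], y)
               for s, e, y in answer_spans)
```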
We were inspired by few-shot ICL (Radford et al., 2019; Brown et al., 2020) when designing the context attuning regularizer. When predicting the answer for a new input, few-shot ICL prepends a few demonstrations, i.e., input-answer pairs, to the new input and conditions on those demonstrations to understand the task. The success of few-shot in-context learning has shown that prepended examples provide meaningful information (e.g., task context) to succeeding data. Few-shot ICL leverages the interaction among examples during inference; our method can be seen as leveraging the same advantage during training (prompt tuning) to achieve better generalization.

Theoretical analysis
We analyze the effect of the context attuning regularizer using the theoretical framework of Xie et al. (2022) and verify that context attuning improves zero-shot inference. Xie et al. (2022) designed this framework to explain what enables in-context learning in LMs. They view an LM as performing Bayesian inference with a hidden Markov model (HMM) whose transitions are parameterized by a latent concept c representing a task. In their framework, our regularizer can be expanded as follows:

$$p(y_k \mid S_{<k}, \omega_k, x_k) = \int_c \sum_{h^s_k} p(y_k \mid h^s_k, c)\, p(h^s_k \mid S_{<k}, \omega_k, x_k, c)\, \frac{p(S_{<k}, \omega_k, x_k \mid c)\, p(c)}{p(S_{<k}, \omega_k, x_k)}\, dc \qquad (2)$$

where h^s_k is the hidden state corresponding to x_k. Note that we omit the sequence index i, as our analysis is carried out on a single sequence.
We are interested in how the regularizer attunes the task context and generalizes to maximize ∏_i p(y_{i,0} | ω_0, x_{i,0}), i.e., zero-shot inference. In this framework, our regularizer affects the terms that involve ω_0: p(h^s_k | S_{<k}, ω_k, x_k, c) and p(S_{<k}, ω_k, x_k | c). We analyze these two terms to identify the effect of the cross-data regularizer on zero-shot inference.
First, optimizing p(h^s_k | S_{<k}, ω_k, x_k, c) prevents the prompt from being biased towards a single data instance by optimizing the task context passed to the following examples. The term we optimize can be expanded as follows:

$$p(h^s_k \mid S_{<k}, \omega_k, x_k, c) = \sum_{h^s_0} p(h^s_k \mid h^s_0, S^{-\omega_0}_{<k-1}, \omega_k, x_k, c)\, p(h^s_0 \mid \omega_0, x_0, c) \qquad (3)$$

where S^{-ω_0}_{<k-1} = S_{<k-1} − {ω_0}. Optimizing only the zero-shot inference p(y_0 | ω_0, x_0) may cause the hidden state h^s_0 to be biased to the data instance. With our regularization, however, the prompt ω_0 is tuned to pass the proper task context to the following examples via the hidden state h^s_k. To increase the probability of such a hidden state, the prompt should be tuned to increase the probability p(h^s_0 | ω_0, x_0, c) for h^s_0 that is likely to transition to the proper hidden state h^s_k, i.e., for which the first term on the RHS of Equation 3, p(h^s_k | h^s_0, S^{-ω_0}_{<k-1}, ω_k, x_k, c), is high. Since h^s_0 is unlikely to transition to hidden states that embed the proper task context when it has already been biased to a specific data instance, the bias is naturally avoided. Finally, the unbiased hidden state h^s_0 improves the average zero-shot inference performance, which is the average of Equation 2 over the dataset with k = 0.
Moreover, the optimization also calibrates p(S_{<k}, ω_k, x_k | c) to align the input data with the optimal concept. Whereas the original method optimizes only p(ω_0, x_{i,0} | c), S_{<k} contains y_0, so the input can be aligned to the task while additionally taking the answer into account.

Context filtering
The context filtering regularizer guides prompts to filter out everything but the task-related context (task context). With the context attuning regularizer, CoRe helps LMs obtain unbiased hidden states for zero-shot inference, but the context attuning regularizer alone cannot realize the regularization effect if the hidden states of succeeding examples are too noisy.
Hidden states deliver various contexts, including not only task contexts but also contexts related to the style, topics, or content of preceding examples.
The predictions of the following examples depend on these mixed contexts. Such a phenomenon has been empirically reported by Liu et al. (2022), where each example has a unique set of optimal demonstrations.
We conjecture that contexts not directly related to the task add noise to the optimization of hidden states, undermining the regularization effect of the context attuning regularizer. Thus, there should be an auxiliary mechanism that selects only the hidden states conveying the task-related signal, and the context filtering regularizer serves that role.
The context filtering regularizer steers a parameterized prompt to filter only the set of hidden states that convey task contexts. We maximize the likelihood of an example given the prepended demonstration examples; this is the same loss as MetaICL (Min et al., 2021) and ICT (Chen et al., 2022), but without meta-learning. Specifically, we use the following SGD step, depicted in Figure 1 with the blue lines:

$$\max_{\omega_{k>0}} \sum_i \sum_{k=1}^{s-1} \log p_\theta(y_{i,k} \mid S_{<k}, \omega_k, x_{i,k}) \qquad (4)$$

where, for each k, the prompts of the preceding examples are regarded as constants. The only context required for inference with demonstrations is the task context extracted from the preceding demonstrations, because the task is the only correlation between multiple examples. Therefore, this regularizer intrinsically leads the prompt to extract a proper task context from preceding examples.
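A hedged sketch of this SGD step for sequence size 2, under the same assumptions as the earlier sketch; here ω_1 receives gradients while the preceding demonstration is treated as fixed context.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the context filtering step (Equation 4) for
# sequence size 2: omega_1 is optimized with respect to y_1 while the
# preceding demonstration {omega_0, x_0, y_0} is treated as context.
def context_filtering_loss(model, soft_prompt, demo_embeds, x1_embeds, y1_ids):
    # demo_embeds: embeddings of {omega_0, x_0, y_0}, detached beforehand
    y1_embeds = model.get_input_embeddings()(y1_ids)
    seq = torch.cat([demo_embeds, soft_prompt, x1_embeds, y1_embeds], dim=0)
    logits = model(inputs_embeds=seq.unsqueeze(0)).logits[0]

    start = demo_embeds.size(0) + soft_prompt.size(0) + x1_embeds.size(0)
    end = start + y1_ids.size(0)
    return F.cross_entropy(logits[start - 1:end - 1], y1_ids)
```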

Practical objectives
Our SGD step consists of the original likelihood maximization and the two regularizers. When we place two examples in a sequence, the overall gradient is

$$\nabla_{\omega_0} \log p_\theta(y_0 \mid \omega_0, x_0) + \nabla_{\omega_0} \log p_\theta(y_1 \mid S_0, \omega_1, x_1) + \nabla_{\omega_1} \log p_\theta(y_1 \mid S_0, \omega_1, x_1) \qquad (5)$$

where the three terms correspond to the baseline loss, context attuning, and context filtering, respectively. This can be implemented by simply concatenating examples and minimizing the cross entropy of all of the answers. However, when we concatenate more than two examples, we need a more complex implementation and multiple iterations per sequence, because some prompts should be regarded as constants. For efficient training, we modify the losses by allowing ω_{k>0} to be optimized for the cross-data regularizer, so that all required gradients can be computed in a single iteration. We describe the modified losses and the complexity of the original loss in detail in Appendix A.
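A compact sketch of this efficient variant, assuming all prompt copies share one trainable parameter; a single forward and backward pass over the concatenation then yields the baseline, context attuning, and context filtering gradients together.

```python
import torch
import torch.nn.functional as F

# Sketch of the efficient combined objective: every copy of the soft
# prompt shares the same trainable parameter (nothing is detached), so a
# single forward/backward pass over the concatenation produces all three
# gradient terms of Equation 5 at once.
def core_loss(model, soft_prompt, examples):
    embed = model.get_input_embeddings()
    segments, answer_spans, cursor = [], [], 0
    for x_embeds, y_ids in examples:
        segments += [soft_prompt, x_embeds, embed(y_ids)]
        cursor += soft_prompt.size(0) + x_embeds.size(0)
        answer_spans.append((cursor, cursor + y_ids.size(0), y_ids))
        cursor += y_ids.size(0)

    logits = model(inputs_embeds=torch.cat(segments, 0).unsqueeze(0)).logits[0]
    return sum(F.cross_entropy(logits[s - 1:e - 1], y)
               for s, e, y in answer_spans)
```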

Experimental Setup
In this section, we briefly describe the setup of our experiments: models, datasets, and hyperparameters. For more details, please see Appendix B.
Input Construction for CoRe In our experiments, the input is formed by simply adding several trainable prompt embeddings in front of each element of a data instance. This way of constructing an input is highly convenient because it requires neither manual human effort nor additional tuning for each task. For example, a single data instance of a natural language inference task consists of three elements: premise, hypothesis, and label. We transform the data instance into (e_0, premise, e_1, hypothesis, e_2, label), where e_i ∈ R^{n×h} is a parameterized prompt that we optimize and h is the hidden state size. We use n = 1 for P-tuning and Softprompt. For Prefix-tuning, we use a different template, since Prefix-tuning prepends prompts only at the front of a data instance: we transform the data instance into (e_0, premise, hypothesis, label), where e_0 is a parameterized prompt of size 5.
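A sketch of this input construction for an NLI instance, assuming a GPT2-XL-sized backbone (hidden size 1600); the helper names are ours.

```python
import torch
import torch.nn as nn

# Illustrative template for one NLI instance, following
# (e_0, premise, e_1, hypothesis, e_2, label) with n = 1 prompt vector
# per slot; hidden size 1600 assumes a GPT2-XL backbone.
n, h = 1, 1600
prompts = nn.ParameterList([nn.Parameter(torch.randn(n, h) * 0.02)
                            for _ in range(3)])

def build_input(embed, premise_ids, hypothesis_ids, label_ids):
    # embed: the (frozen) input embedding layer of the LM
    return torch.cat([
        prompts[0], embed(premise_ids),
        prompts[1], embed(hypothesis_ids),
        prompts[2], embed(label_ids),
    ], dim=0)
```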
Sequence Size and Batch Size We introduce a special hyperparameter for CoRe, named sequence size. This refers to the number of data instances concatenated into a single training example when applying CoRe. For example, an input for CoRe with sequence size 2 or 3 becomes {ω_0, x_0, y_0, ω_1, x_1, y_1} or {ω_0, x_0, y_0, ω_1, x_1, y_1, ω_2, x_2, y_2}, respectively. Please note that CoRe with sequence size 1 is equivalent to standard SGD training. For a fair comparison, we keep the batch size, defined as (number of sequences in a batch) × (sequence size), constant. This way, a model sees an equal amount of data in every iteration of CoRe across all sequence sizes.
Evaluation We report accuracy averaged over different random seeds: ten seeds for GPT2-XL and five for GPT-J, except for MultiRC, where we use five and three seeds, respectively.

Experimental Results
We evaluate CoRe on the setup presented in Section 4. We first show the performance gain of our method on the three baselines. Then, we present how performance changes with the sequence size, and further analyze the effect of the two regularizers of CoRe through ablation studies. In addition, we examine how the similarity of concatenated examples affects performance. Finally, we present the performance of CoRe in few-shot inference settings.

Main result
We first evaluate our method on the three baselines, which use parameterized continuous prompts. Table 1 compares the zero-shot performance of prompts trained only with the baseline tuning methods against prompts trained with CoRe on top of these methods. On GPT2-XL, CoRe improves accuracy over P-tuning, Prefix-tuning, and Softprompt by up to 11.9%. On GPT-J, CoRe also shows consistent improvements of up to 6.3% across the three methods. The absolute gains are smaller on GPT-J than on GPT2-XL; this is expected, as it is typically harder to obtain large gains when the baseline performance is higher. Table 7, Table 8, and Table 9 in Appendix C present the results of all of the baselines on both models with the exact numbers.
Interestingly, we find that our method shows a higher accuracy gain for NLI tasks, SuperGLUE CB and RTE, on GPT2-XL across all three baselines. We hypothesize that there are correlations between NLI tasks and the pretraining objective, or between the tasks and the pretraining dataset of GPT2-XL. Analyzing the relationship between pretraining and downstream tasks to further improve the gains on other datasets is worthwhile future work.
We observe that there are mainly two cases where CoRe does not show a performance gain. First, CoRe performs worse than the baseline on SuperGLUE MultiRC. We conjecture that the task context does not get properly propagated, since training samples of MultiRC are far longer than samples of other SuperGLUE subsets.
Another observation is that CoRe generally does not work well for Prefix-tuning on GPT-J. We hypothesize that this may be related to the Rotary Position Embedding (RoPE) of GPT-J. How the mechanism of Prefix-tuning (appending prompts to all the keys and values of each layer) interacts with RoPE has not yet been explored, and this interaction may undermine the proper working of CoRe.

How many examples to concat?
To see how the sequence size of CoRe affects performance, we train prompts on GPT2-XL and GPT-J with sequence sizes varying from one to four. Figure 2 shows the results of P-tuning on GPT-J with zero-shot evaluation. In most cases, a sequence size of 3 or 4 yields the best performance on GPT-J, whereas CoRe performs better with smaller sequence sizes, 2 or 3, on GPT2-XL. This indicates that context attuning produces a less biased context when given more data in a sequence, providing effective regularization. Increasing the sequence size beyond a certain level, however, results in accuracy drops. From this, we infer that too many examples in a single sequence cause noise to become part of the context.

The impact of each regularizer
To see how context attuning and context filtering each contributes to the performance gain of CoRe, we conduct ablation studies on the gradients involved in CoRe. We factorize the gradients as in Equation 5 with a sequence size of 2; the three gradient terms correspond to the baseline method, context attuning, and context filtering, respectively. We then test the effect of each type of gradient.
As presented in Table 2, CoRe generally exhibits the best performance when all three gradients (baseline, context attuning, and context filtering) are used together.

Measurement of bias caused by prompts
To showcase the regularization effect of CoRe, we compare how strongly prompts are biased to training data samples before and after applying CoRe. To quantify the amount of bias, we use the one-step generalization ratio (OSGR) suggested by Liu et al. (2020). The measured OSGR is consistent with the performance gain shown in Table 1.
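A rough sketch of how OSGR could be estimated, assuming the definition of Liu et al. (2020) as the ratio of the one-step decrease in held-out loss to the one-step decrease in training loss; all names here are illustrative.

```python
import torch

# Rough sketch of a one-step generalization ratio, under the assumption
# that it is the ratio of the one-step decrease in held-out loss to the
# one-step decrease in training loss after a single prompt update.
def one_step_generalization_ratio(loss_fn, optimizer, train_batch, test_batch):
    with torch.no_grad():
        train_before = loss_fn(train_batch).item()
        test_before = loss_fn(test_batch).item()

    optimizer.zero_grad()
    loss_fn(train_batch).backward()
    optimizer.step()  # one SGD step on the prompt parameters only

    with torch.no_grad():
        train_after = loss_fn(train_batch).item()
        test_after = loss_fn(test_batch).item()

    return (test_before - test_after) / (train_before - train_after)
```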

Which examples to concat?
Previous studies show that concatenating semantically similar examples improves downstream task performance. Liu et al. (2022) empirically show that demonstration examples semantically similar to the target example improve in-context learning performance. Other studies show that placing semantically similar examples in an input sequence during fine-tuning (Gao et al., 2021) or pretraining (Levine et al., 2021) improves downstream task performance. We examine how the semantic similarity between examples in an input sequence affects the performance of CoRe, considering CoRe with sequence size 2.
For the experiment, we first sample batch size/2 examples for each iteration during prompt tuning. For each sampled example, we select the most similar example (max) or the least similar example (min) among the examples unseen in the epoch, and append the selected example after the sampled example. We also experiment with less extreme similarity by sampling the example to be appended from the 10% most similar (max 10%) or 10% least similar (min 10%) of the unseen examples in the epoch. We measure the semantic similarity between examples using the stsb-roberta-large model from Sentence Transformers (Reimers and Gurevych, 2019).
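A sketch of this similarity-based pairing with the Sentence Transformers library; the pool of "unseen examples in the epoch" is simplified here to a flat list of candidate texts.

```python
from sentence_transformers import SentenceTransformer, util

# Sketch of selecting the most (max) or least (min) similar example to
# append after an anchor example, using stsb-roberta-large embeddings.
encoder = SentenceTransformer("stsb-roberta-large")

def pick_pair(anchor_text, candidate_texts, mode="max"):
    anchor = encoder.encode(anchor_text, convert_to_tensor=True)
    candidates = encoder.encode(candidate_texts, convert_to_tensor=True)
    sims = util.cos_sim(anchor, candidates)[0]   # (num_candidates,)
    idx = int(sims.argmax() if mode == "max" else sims.argmin())
    return candidate_texts[idx]
```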
As presented in Table 4, sampling based on semantic similarity degrades the performance of CoRe compared to standard sampling, where demonstrations are sampled at random. At the same time, CoRe does benefit from some semantic similarity: standard sampling reports better results than min and min 10%. However, always concatenating the most similar example (max and max 10%) introduces a bias to the prompt, and this bias suppresses the regularization effects of CoRe.

Few-shot in-context learning
We also evaluate our method in a few-shot in-context learning setting, where we prepend multiple examples as demonstrations to a target example during evaluation. Although the few-shot setting is not our original target, the objective of the two regularizers, optimizing a prompt towards sending and receiving a proper task context, naturally contributes to improving predictions made with prepended demonstrations, that is, few-shot in-context learning. Figure 3 shows the results of few-shot evaluation on three different datasets with prompts from P-tuning and from CoRe on top of P-tuning.
We observe that both P-tuning and CoRe on P-tuning generally degrade as the number of demonstrations increases. However, CoRe on P-tuning exhibits a far smaller accuracy drop across different numbers of demonstrations, and even a performance gain on the CB dataset. We attribute this smaller degradation compared to P-tuning to the two regularizers of CoRe.
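A sketch of how the few-shot evaluation input could be assembled under our earlier assumptions, with each demonstration preceded by the tuned prompt and the target input coming last.

```python
import torch

# Sketch of the few-shot evaluation input: k demonstrations (a multiple
# of the number of classes) are prepended to the target example, and the
# model scores candidate answers as continuations.
def build_fewshot_input(soft_prompt, embed, demos, target_x_ids):
    segments = []
    for x_ids, y_ids in demos:
        segments += [soft_prompt, embed(x_ids), embed(y_ids)]
    segments += [soft_prompt, embed(target_x_ids)]
    return torch.cat(segments, dim=0)
```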

Conclusion
CoRe is a novel gradient-based prompt tuning method that regularizes task contexts among multiple examples and thereby regularizes a prompt to improve zero-shot performance. The two regularizers of CoRe, context attuning and context filtering, regularize the prompt to create a proper task context and guide the prompt to convey only the task-related context to succeeding examples in a concatenation of multiple examples. We provide a theoretical analysis of the effect of context attuning and context filtering, and our experimental results back it up. In line with the theoretical analysis, CoRe achieves performance gains over three different baseline prompt tuning methods in the zero-shot setting, up to 11.9% on GPT2-XL and 6.3% on GPT-J. This implies that CoRe can serve as an effective and memory-efficient adaptation method for hyper-scale LMs.

Limitations
Our study has three limitations:

• As reported in Section 5.1, NLI tasks benefit significantly from CoRe, while other tasks benefit only marginally or not at all. We suppose that this difference comes from the characteristics of a task. However, we have not yet thoroughly explored which characteristics of a task contribute to the performance gain.
To address this question, we would need a novel deep learning interpretation method to probe the latent contexts of LMs, or a thorough analysis of the relationship between the pretraining objective and downstream tasks and of how prompting bridges the two distinct phases. We leave these research questions as future work.
• CoRe does not work for cases where the training dataset contains very long texts. CoRe requires multiple examples to be concatenated, so developers cannot benefit from CoRe if the majority of concatenated examples from their dataset exceed the maximum sequence length of the LM.
• We have not yet analyzed whether CoRe is applicable to natural language generation (NLG) tasks. NLG is undoubtedly an important pillar of natural language processing research, along with NLU, with many interesting applications. We believe the concepts of context attuning and context filtering can help with major challenges in NLG, for example, controlled NLG. We plan to explore CoRe on NLG tasks in future work.

Figure 1 :
Figure 1: Abstracted illustration of the two regularizers of CoRe, context attuning (red lines) and context filtering (blue lines), with sentiment analysis data. The black line indicates regular gradient-based prompt tuning, where an input sequence consists of one data example (x_0 and y_0) and the prompt ω_0 is optimized with respect to y_0 only.

Figure 2 :
Figure 2: Zero-shot evaluation with varying sequence sizes. Accuracy for each dataset is normalized with respect to its accuracy at sequence size 1.

Figure 3 :
Figure 3: Comparison of few-shot in-context learning performance between P-tuning and P-tuning with CoRe, normalized by the zero-shot performance of P-tuning. The number of demonstrations used is a multiple (x axis) of the number of classes in each dataset.

Table 1 :
Table 1: Comparison of zero-shot evaluation results between the three baselines and the baselines with CoRe across various NLU datasets. The evaluation metric is averaged accuracy. The better of each baseline and its CoRe counterpart is highlighted.


Table 2 :
Table 2: Ablation studies on the effect of the context attuning regularizer (A) and the context filtering regularizer (F), with sequence size 2.