Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations

Despite the recent explosion of interest in in-context learning, the underlying mechanism and the precise impact of the quality of demonstrations remain elusive. Intuitively, ground-truth labels should have as much impact in in-context learning (ICL) as in supervised learning, but recent work reported that the input-label correspondence is significantly less important than previously thought. Intrigued by this counter-intuitive observation, we re-examine the importance of ground-truth labels in in-context learning. With the introduction of two novel metrics, namely Label-Correctness Sensitivity and Ground-truth Label Effect Ratio (GLER), we conduct a quantifiable analysis of the impact of ground-truth label demonstrations. Through extensive analyses, we find that correct input-label mappings can have varying impacts on downstream in-context learning performance, depending on the experimental configuration. Through additional studies, we identify key components, such as the verbosity of prompt templates and the language model size, as controlling factors for achieving more noise-resilient ICL.


Introduction
Large-scale language models (Rae et al., 2021; Chowdhery et al., 2022; Smith et al., 2022; Thoppilan et al., 2022) have shaped the NLP scene by introducing in-context learning (ICL) (Brown et al., 2020) as a novel approach to adapting language models to downstream tasks without explicit finetuning. ICL enables language models to learn and predict from task-specific prompts that contain demonstrations in natural language format, even though the models were only trained to predict the next word token.

Figure 1: A demonstration of cases where the effect of the ground-truth label in in-context learning is much more significant than the aggregated results reported by Min et al. (2022b).

Inspired by this discovery, a flurry of recent work has investigated ways to explain and exploit the ICL mechanism (Schick and Schütze (2021a); Lu et al. (2022); inter alia), but it remains elusive. Min et al. (2022b) have recently re-evaluated the role of input-label correspondence in demonstrations for ICL. Specifically, the authors showed that the correct mapping between an input and its label contributes less to the final performance than previously thought, compared to other aspects such as the format of demonstrations and the awareness of the input and label spaces. This finding is intriguing and has been sensational, as it is counter-intuitive to how statistical learning typically works in supervised settings, and it therefore shows the potential of exploiting (few-shot) in-context learning given no real training data. For example, prior work established the strong impact of example ordering (Zhao et al., 2021), so in-context learning being less sensitive to the correctness of label demonstrations, which forms the basis of supervised learning, seems contradictory. However, we encountered cases where our observations are inconsistent with this recent finding (Figure 1). Specifically, we found that the difference between the performance with ground-truth label demonstrations and that with entirely incorrect labels was as large as 80% (accuracy) for the hate speech dataset (de Gibert et al., 2018) on GPT-J (Wang and Komatsuzaki, 2021). Similar observations were made with the larger GPT-3 (Brown et al., 2020) model and on other datasets (TREC (Li and Roth, 2002)). These cases illustrate how sensitive in-context learning can be to label demonstrations depending on the ICL settings. Thus, we cast doubt on whether the trend generalizes across diverse configurations, calling for an in-depth analysis of the phenomenon.
In this paper, we revisit the findings of Min et al. (2022b) and take a closer look at the importance of ground-truth labels for in-context learning. First, we point out limitations of the existing work. Then, we introduce novel metrics, namely Label-Correctness Sensitivity and Ground-Truth Label Effect Ratio (GLER), to reveal that the input-label correspondence plays a more vital role in contextual demonstration than previously considered. Furthermore, we show that a trend contradicting the previous discovery becomes salient if we vary the experimental settings (e.g., datasets, metrics, and templates) from the previous work. We observe the same trend in various language models, such as GPT-J and GPT-3 (Brown et al., 2020).
In addition, this paper uses statistics to provide a systematic and complementary perspective on the existing findings regarding the label-demonstration impact. Specifically, we combine linear regression and auxiliary metrics to conduct an all-around, deeper analysis of how the ICL classification performance changes under label-demonstration corruption. To do so, we define the notion of sensitivity to quantify the degree to which the downstream classification performance changes when a model is subject to a fixed amount of label corruption. As a result, we demonstrate several noticeable patterns that support the claim that there is a considerable relationship between performance and label correctness. It is worth noting that this trend was not clearly visible in the previous work, where the results of each dataset are macro-averaged rather than analyzed individually.
However, insensitivity, or robustness, towards incorrect label demonstrations is a useful property in many situations. For example, when augmenting an extremely small number of examples (e.g., fewer than four) using data augmentation techniques, performance resilience towards prompt templates that contain noisy synthetic examples as demonstrations is desirable. We further analyze how different factors of ICL, such as the inference method, the underlying language model, and the adoption of advanced ICL strategies, affect the performance sensitivity towards noise in input-label demonstrations, paving the way for a new approach to exploiting demonstration insensitivity.
In summary, our contributions are as follows.
• We re-examine the recent findings on the phenomenon that the ICL performance is insensitive to the correctness of input-label demonstrations.
• We propose two new quantifiable metrics, sensitivity and GLER, to measure the impact of ground-truth label demonstrations on ICL.
• We conduct a thorough examination of how different components of ICL could impact the model's insensitivity towards label noise, allowing future work to exploit this property.

Looking Deeper into Ground-Truth Labels
Demonstrations of ground-truth labels, correctly paired with inputs, have long been known to be a crucial factor in supervised learning, but a recent work by Min et al. (2022b) purportedly revealed the possibly counter-intuitive nature of label demonstrations in in-context learning (ICL). Specifically, the findings implied that the correctness of input-label correspondence in in-context demonstrations is not as important as we have thought. We name this phenomenon input-label insensitivity. Although the finding was supported by reasonably large-scale experiments covering various experimental variables such as datasets, language models, and in-context learning types, we found, through deeper analysis of the experiments, that input-label insensitivity is not consistent across all experimental settings.

Figure 2: A counter-example of slightly varied but equally valid experimental settings is shown on the right, while the results from the prior experimental settings (Min et al., 2022b) are shown on the left. "No Demo" refers to the result without demonstrations and "Random Label" refers to the result with label demonstrations replaced with a random label uniformly sampled from the label space. Minor variations in the experimental settings can result in a large difference in the degree to which the ICL performance responds to label corruption. More details on the experiment are described in Appendix A.
This section highlights the limitations of the existing work, proposes new metrics to quantify the impact of input-label correspondence, and finally presents deeper analyses of the ICL experiments utilizing the newly proposed metrics.

Min et al. (2022b) showed that replacing ground-truth labels in prompt demonstrations with incorrect labels marginally affects the mean-aggregated overall performance on selected datasets. Although the input-label insensitivity phenomenon was less prominent on GPT-J with the direct ICL method, ICL still performed better when entirely incorrect labels were given than in the absence of demonstrations (the zero-shot baseline), allegedly supporting the input-label insensitivity idea (Min et al., 2022b). However, we argue that there are two main limitations to the existing claim.
Over-generalization The existing claim suffers from over-generalization in two regards: (1) the mean-aggregated results fail to capture the insensitivity behavior on individual tasks, and (2) the proposed experimental settings in the existing work are not general enough to fully support the claim. Mean-aggregation does not paint the full picture without information on the variance. Furthermore, individual analyses on large-scale tasks are needed to obtain precise insights into input-label sensitivity. Our deeper analyses of the ICL experiments (§2.4) provide more evidence for this claim.
The second over-generalization is supported by the existence of a counter-example: higher input-label sensitivity observed from slightly varied but equally valid experimental settings (Figure 2). The subfigure on the left corresponds to the results of an existing set of experimental settings, where the Noisy Channel method (Min et al., 2022a) was used for ICL, the macro-F1 score as the evaluation metric, and the five classification datasets listed in the existing work. The subfigure on the right was obtained using the Direct method and the accuracy score as the metric, with results aggregated from all 17 datasets listed in the existing work (see Appendix A).
Lack of Quantification The existing work relies on human judgement to determine input-label sensitivity, which can be subjective. Furthermore, we are not only interested in whether the input-label insensitivity phenomenon exists but also in how insensitive ICL is towards the demonstrations, which would enable us to exploit the phenomenon. Hence, a set of systematic quantification methods is needed to perform the deeper analyses.

Key Concepts
This subsection establishes key concepts and notations related to our analysis of the impact of input-label demonstrations on the downstream ICL performance. x and c denote the input and the label respectively. They exist in the respective input (X) or label (C) space associated with the dataset or task. A language model P predicts the next token given the preceding tokens: P(x_t | x_{<t}). In ICL, a prompt P is designed to elicit particular behaviors from the language model. For example, to utilize the language model as a text classifier, a prompt template T takes a set of examples D_ex = {(x_1, c_1), ..., (x_k, c_k)} and a test input x to produce the prompt P. The prompt is then fed into the language model to produce the most plausible continuation: argmax_{x'} P(x' | P). A task-specific verbalizer V is designed to interpret the generated output x' into the label space C. We measure the performance y of the language model P and the prompt template T on a test set D_test.
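To make the notation concrete, the following is a minimal sketch of prompt construction; the template string, example texts, and verbalizer here are illustrative choices of ours, not the paper's exact templates.

```python
def build_prompt(examples, test_input, template="{x}\n{c}\n\n"):
    """Render each (input, label) pair with the template T, concatenate
    the k demonstrations, then append the test input x."""
    body = "".join(template.format(x=x, c=c) for x, c in examples)
    return body + test_input + "\n"

# Hypothetical two-shot sentiment demonstration set D_ex.
demos = [("the movie was great", "positive"),
         ("a total waste of time", "negative")]
prompt = build_prompt(demos, "an instant classic")
# The language model generates a continuation from this prompt, and a
# verbalizer V maps the generated token back into the label space C.
```

The model's continuation after the final newline is then interpreted as the predicted label.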
Our analyses mainly involve manipulating T and the example set D_ex to set up baselines and conduct ablation studies. Key experimental setups include the following. No Demo, also denoted "zero-shot", represents zero-shot prediction, where the prompt template T ignores D_ex and only uses the test input x: P(c|x). The example set D_ex in α%-Correct consists of k × α/100 correct input-label pairs and k × (1 − α/100) incorrect pairs, where 0 ≤ α ≤ 100. For Random Label, the labels c in D_ex are replaced by uniform samples from the label space C; this is one of the key baselines of our studies. Additional details on the setup variations are presented in Appendix A.
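The α%-Correct setup can be sketched as a simple corruption routine; the function name and the uniform sampling of incorrect labels from the remaining classes are our own illustrative choices.

```python
import random

def make_alpha_correct(examples, label_space, alpha, rng=random.Random(0)):
    """Keep alpha% of the k demonstrations correct; replace the labels of
    the rest with a label sampled uniformly from the other classes."""
    k = len(examples)
    n_correct = round(k * alpha / 100)
    idx = list(range(k))
    rng.shuffle(idx)
    corrupted = list(examples)
    for i in idx[n_correct:]:
        x, c = corrupted[i]
        wrong = rng.choice([l for l in label_space if l != c])
        corrupted[i] = (x, wrong)
    return corrupted
```

Setting alpha to 100 recovers the ground-truth demonstrations; alpha of 0 yields entirely incorrect labels.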

Metrics for Measuring the Impact of Input-Label Demonstrations
This section proposes two new metrics to quantify the impact of input-label demonstrations in ICL.

Label-Correctness Sensitivity
We define label-correctness sensitivity, or sensitivity for short, as the degree to which the downstream classification performance changes when the model is subject to a fixed amount of label corruption. Sensitivity in the context of in-context learning demonstrations can be computed by conducting a single-scalar linear regression analysis of a performance metric (e.g., accuracy or F1-score) y against the percentage of correctly labelled examples s:

y = β_0 + β_1 s,

where β_0 is the bias and β_1 is the coefficient of label correctness. The scalar value of the weight parameter β_1 is interpreted as the sensitivity measure. The data points for linear regression were obtained by following the experimental protocol proposed by Min et al. (2022b). The sensitivity measure can be interpreted as a linearly interpolated measure of performance degradation for each unit decrease in label correctness.
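A minimal sketch of the sensitivity computation via ordinary least squares, assuming the (s, y) data points have already been collected from the α%-Correct runs:

```python
def sensitivity(ss, ys):
    """Fit y = b0 + b1 * s by ordinary least squares and return b1,
    the label-correctness sensitivity."""
    n = len(ss)
    ms, my = sum(ss) / n, sum(ys) / n
    b1 = (sum((s - ms) * (y - my) for s, y in zip(ss, ys))
          / sum((s - ms) ** 2 for s in ss))
    return b1
```

For example, accuracies of 0.4, 0.55, and 0.7 at label-correctness levels 0, 0.5, and 1 yield a sensitivity of 0.3.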
Ground-Truth Label Effect Ratio (GLER) Another way to understand the impact of labels, namely correct or ground-truth labels, is to quantify how much the ground-truth labels improve the ICL performance compared to the random-label baseline. The larger the gap, the bigger the impact the ground-truth labels have on the performance. The gap is then normalized by the performance difference between ground-truth labels and the absence-of-demonstration baseline (zero-shot):

GLER = (y_GT − y_RL) / (y_GT − y_∅),    (1)

where y_GT is the ground-truth label performance, y_RL the random-label baseline (Random Label), and y_∅ the zero-shot performance. The denominator in Equation 1 is intended to allow the GLER metric to be compared across different tasks. Additionally, we clip GLER to be bounded between 0 and 1.
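The GLER computation, including the clipping to [0, 1], can be sketched directly from Equation 1:

```python
def gler(y_gt, y_rl, y_zero):
    """Ground-Truth Label Effect Ratio: the ground-truth vs. random-label
    gap, normalized by the ground-truth vs. zero-shot gap, clipped to [0, 1]."""
    ratio = (y_gt - y_rl) / (y_gt - y_zero)
    return max(0.0, min(1.0, ratio))
```

For instance, if ground-truth labels reach 0.8 accuracy, random labels 0.6, and zero-shot 0.4, the GLER is 0.5: half of the gain over zero-shot is attributable to label correctness.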

Deeper Analyses
This subsection performs deeper analyses using the aforementioned metrics to reveal additional insights into input-label insensitivity.

Experimental Setup
All of our experiments in the rest of the paper generally follow the experimental settings of Min et al. (2022b), where α%-Correct is mainly utilized to conduct sensitivity analysis. However, there are key differences: (1) we do not employ label-length normalization (in our experiments, length normalization does not always increase performance), and there are minor template T design differences, including how the separator token interacts with the model and the dataset-specific implementation of the data preprocessor; (2) we use accuracy, instead of the F1-score, as the primary evaluation metric for ICL performance. We report the full results in Appendix A, along with the full details of the setup.

Label Correctness Does Affect Performance
To analyze the overall sensitivity of performance under variation of label correctness, we aggregate sensitivities across all 17 classification datasets; the results are shown in Table 1. The results show that the aggregated sensitivity is significantly high, with a good fit (in the range of 0.81-0.86) for all configurations.

Table 1: Aggregated linear regression analysis of performance against the percentage of correct labels. "Ours" indicates that the data points for the linear regression analysis were obtained using our proposed experimental settings (Appendix A). GPT-NeoX (Direct): 0.300, 0.327, 0.810; GPT-J (Direct): 0.309, 0.291, 0.861.

When tested on our specific setup, the sensitivity was as high as 0.309, implying that, on average, there was a 0.309% drop in accuracy for each percentage-point drop in label correctness. The trend of sensitivity, which is more apparent in our quantitative analysis, may have been overlooked due to the relative dwarfing effect of the zero-shot (or "no demo") results in prior studies. The results also show that the sensitivity is lower with the Channel method, suggesting that sensitivity can be significantly lowered with the employment of more advanced ICL methods.

Label Demonstration Impact is Highly Varied Across Tasks and Settings

Although the aggregated analysis shows a general trend of sensitivity towards demonstration correctness, individual analyses shed deeper insight into the distribution of task sensitivities. Individual sensitivity plots are illustrated in Figure 3. Sensitivity can vary from small negative values (indicating increasing performance under increasing label corruption) to values as high as 0.815 (for the hate speech dataset), suggesting that summarizing the trend for all tasks and datasets may be difficult and that certain datasets may possess distributional properties that allow models to more easily exploit label demonstrations. This high-variance observation also holds for the other metrics (GLER and the ground-truth label performance). Further analyses are available in §3.

Sensitivity and Task Difficulty
Tasks where the model struggles to exploit in-context demonstrations may exhibit low sensitivity towards them, since understanding patterns in demonstrations is inherently linked with the ability to absorb demonstrative label supervision. To confirm our theory, we conduct an analysis of the sensitivities of the 17 datasets against task difficulty. We define task difficulty as the relative performance of ground-truth label demonstrations compared to a baseline. Specifically, the relative performance y_rel is computed by y_rel = y_GT − y_baseline. We consider the random baseline.
Our analysis (Figure 11) shows that the model's performance sensitivity is strongly related to the difficulty of the task. On tasks where the model exhibits low sensitivity (i.e., < 0.1), it struggles to achieve meaningful classification performance. This suggests that designing experiments with datasets that can be meaningfully solved by in-context learners may be more important than previously understood. Hence, the sensitivity measure by itself is insufficient for benchmarking the impact of input-label demonstrations.
3 When Do the Ground-Truth Labels Actually (Not) Matter?
As revealed in our deeper analyses ( §2.4), many factors including datasets and the choice of the ICL method can significantly affect the label-sensitivity.
Gaining more understanding of the mechanism by which input-label correspondence impacts downstream ICL performance could enable us to systematically exploit the label-insensitivity phenomenon. For example, few-shot ICL models can be improved to tolerate label noise from synthetic data samples generated in the joint input and label space (Yoo et al., 2021).
To understand the conditions that reduce label sensitivity, we conduct a series of experiments investigating how different factors contribute to the phenomenon, quantified using the metrics proposed in §2.3. Namely, we consider the particular technical choices in carrying out ICL (e.g., whether to employ the noisy channel method (Min et al., 2022a)), the prompt templates, and the model sizes.

Sensitivity and GLER Recall that the sensitivity measure is the nominal coefficient of the line fitted on the performance-versus-label-corruption data points. Since baselines can vary depending on the experimental setting, hyperparameters, and the dataset, comparing the nominal sensitivity alone can be inconclusive, as the same degree of absolute improvement has different implications depending on the baseline level. To account for variations in the characteristics of the task and the model, we consider GLER and the ground-truth label performance as auxiliary measures in the following studies.

Techniques for In-context Learning
In-context learning, as first proposed by Brown et al. (2020), is a straightforward parameter-free approach, where the downstream task of interest is expressed as natural text demonstrations used to conditionally generate from a language model. Recently, Min et al. (2022a) proposed Noisy Channel (denoted Channel), which exploits the language generation capability of language models for discriminative tasks via Bayes' theorem. We compare the two ICL methods on all three measures (sensitivity, GLER, and the ground-truth label ICL accuracy).
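The contrast between the two scoring rules can be sketched as follows; `lm_logprob` is a hypothetical black box returning the log-probability of a continuation given a context, and the prompt layout is illustrative rather than the paper's exact format.

```python
def direct_score(lm_logprob, prompt, x, label):
    # Direct ICL: score P(label | demonstrations, x).
    return lm_logprob(context=prompt + x + "\n", continuation=label)

def channel_score(lm_logprob, prompt, x, label, log_prior=0.0):
    # Noisy channel ICL: score P(x | demonstrations, label) * P(label),
    # i.e. condition on the label and generate the input (Bayes' rule).
    return lm_logprob(context=prompt + label + "\n", continuation=x) + log_prior

def predict(score_fn, lm_logprob, prompt, x, labels):
    # Pick the label with the highest score under the chosen method.
    return max(labels, key=lambda c: score_fn(lm_logprob, prompt, x, c))
```

Swapping `direct_score` for `channel_score` in `predict` is the only change needed to move between the two methods.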
Results (Figure 5) show that Channel reduces the label sensitivity on average compared to the original Direct method while maintaining accuracy at similar levels. The label insensitivity effect is observed in both GPT-NeoX and GPT-J.
Another recent advance in ICL, namely Calibrate Before Use (CBU), involves calibrating the output likelihoods of the word tokens that correspond to the labels (Zhao et al., 2021). We conduct the same set of experiments with CBU applied and report all three metrics. As shown in Figure 6, the calibration technique reduces the label sensitivity while generally improving the ICL performance on both GPT-J and GPT-NeoX. Applying CBU can be an effective way to reduce label sensitivity without sacrificing performance.
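A minimal sketch of the CBU calibration step, assuming label probabilities have already been obtained for the test input and for a content-free input such as "N/A"; this follows the diagonal-matrix variant of contextual calibration from Zhao et al. (2021), up to renormalization.

```python
def calibrate(label_probs, content_free_probs):
    """Divide each label probability by the probability the model assigns
    to that label on a content-free input, then renormalize. This cancels
    out the model's per-label bias under the given prompt."""
    scores = [p / cf for p, cf in zip(label_probs, content_free_probs)]
    z = sum(scores)
    return [s / z for s in scores]
```

For example, if a prompt biases the model towards the first label on a content-free input (0.75 vs. 0.25), a raw prediction of [0.6, 0.4] is calibrated to roughly [0.33, 0.67], flipping the predicted class.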

Prompt Templates
Various design choices in in-context prompt templates have significant impact on the downstream ICL performance (Reynolds and McDonell, 2021).
A well-designed and verbose prompt template (e.g., a prompt with detailed description of the task) could allow in-context label demonstrations to have relatively less impact on ICL, thereby reducing the label-demonstration sensitivity.
This section mainly explores (1) the number of in-context examples and (2) the level of task-description detail. To quantify the impact of the number of in-context examples, we conduct the same set of experiments with the number of in-context examples varying from 1 to 16. Results (Figure 7a) unsurprisingly show that the number of prompt examples is positively linked to all three metrics. Although sensitivity rises with the number of examples, this is because the final ICL performance and the impact of ground-truth labels improve with more demonstration examples.
We also hypothesize that the level of task details contained in the prompt template also serves to relatively weaken the label demonstration impact.Results in Figure 7b confirm our hypothesis.

Model Sizes
The scale of the language model could influence how susceptible the model is to label noise within input-label demonstrations. The larger the model, the more prior knowledge it could leverage to reduce label sensitivity. To study whether this is the case, we analyze five different sizes of GPT-style language models, ranging from GPT-2 XL to GPT-3. The choice of models and the corresponding numbers of parameters are listed in Figure 8. Results show that sensitivity is generally correlated with model size, but we also observe a plateauing phenomenon after the GPT-J 6B scale. However, the results on ICL performance with ground-truth label demonstrations show that performance scales well beyond the 6B mark.

Discussion
This section provides additional evidence that the demonstration of ground-truth labels can be more important than the previous finding suggests, and that existing interpretations of the experimental results may have been obfuscated by the entanglement of various aspects of demonstrations.

The Complementary Relationship between Input-label Correspondence and Label-space Demonstrations
Input-label correspondence is just one aspect of possible in-context label demonstrations; others include label-space demonstration. However, it is unclear whether label-space demonstrations and input-label correspondence can complement each other in the absence of explicit demonstration of the other. For example, pretrained language models may be able to deduce sentence-sentiment mappings from mentions of sentiment labels alone through inductive bias.
Prior work (Min et al., 2022b) showed significant performance degradation in the absence of both aspects of label demonstration, but the results beg the question: could the significant degradation have been caused by a complete lack of label demonstration? To find out, we conduct additional ablation studies on the performance under the demonstration of input-label pairings but not of the explicit label space, which we call prior-free label experiments. Specifically, we study the case where class labels are replaced with prior-free labels while maintaining the correspondence between the inputs and the labels. For example, the "positive" and "negative" labels in sentiment analysis can be replaced with "0" and "1" respectively, which do not reveal information about the labels themselves. However, language models can still capture mild label associations in abstract symbols through inductive bias (Ouyang et al., 2022). To diversify the "prior-free" choices, we consider (1) random tokens from the language model's word space, (2) alphabet labels, and (3) numerical labels.
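The prior-free relabeling can be sketched as follows; the symbol pools, and in particular the small random-token vocabulary, are illustrative stand-ins of ours rather than the paper's actual token sets.

```python
import random
import string

def priorfree_relabel(examples, label_space, scheme="alphabet",
                      rng=random.Random(0)):
    """Replace class names with prior-free symbols while preserving the
    input-label mapping. '0' is excluded from the numeric labels."""
    if scheme == "alphabet":
        symbols = list(string.ascii_uppercase)
    elif scheme == "numeric":
        symbols = [str(i) for i in range(1, len(label_space) + 1)]
    else:  # random word tokens from a hypothetical vocabulary
        symbols = rng.sample(["apple", "river", "chair", "cloud"],
                             len(label_space))
    mapping = dict(zip(label_space, symbols))
    return [(x, mapping[c]) for x, c in examples], mapping
```

Because the mapping is a bijection, inputs of the same class still share a label symbol, so the input-label correspondence survives the relabeling.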
As shown in Figure 9, results with prior-free labels outperform those with random labels (with random input-label mappings), indicating that language models are capable of capturing the input-label correspondence even in the absence of label-space demonstrations. Among the prior-free results, we note that the alphabetical and numerical labels outperform random-token labels. (We exclude "0" from the numerical labels since it is often associated with the state of nil.) This could be explained by the fact that random word tokens may introduce unintended biases through misleading associations with unrelated word semantics, while abstract labels provide a better prior-free environment.
4.2 Change in label distribution may result in higher sensitivity
The distribution of labels in demonstrations is one of the critical factors for prediction (Zhao et al., 2021). When data imbalance exists, corrupting the labels causes a distributional shift, which may lead to performance changes regardless of the input-label mappings. High sensitivity on imbalanced datasets may be due to this unintentional distributional shift. To analyze the impact of distributional shift, we conducted additional experiments using label-balanced demonstrations for the imbalanced datasets (hate_speech18, ethos-race, ethos-national_origin, ethos-religion).
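A sketch of the label-balanced demonstration sampling, assuming k is divisible by the number of classes and each class has at least k divided by the class count examples available:

```python
import random
from collections import defaultdict

def balanced_sample(dataset, k, rng=random.Random(0)):
    """Sample k demonstrations with equal counts per class, so that label
    corruption does not also shift the demonstration label distribution."""
    by_label = defaultdict(list)
    for x, c in dataset:
        by_label[c].append((x, c))
    labels = sorted(by_label)
    per_class = k // len(labels)
    demos = []
    for c in labels:
        demos.extend(rng.sample(by_label[c], per_class))
    rng.shuffle(demos)  # avoid grouping demonstrations by class
    return demos
```

Sampling uniformly from the raw training data, by contrast, inherits whatever class imbalance the dataset has.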
As shown in Figure 10, using balanced demonstrations degrades the performance and the sensitivity when compared to demonstrations sampled from the data distribution, which supports our suspicion. On the other hand, the average sensitivities are 0.189 and 0.308 (for GPT-NeoX and GPT-J respectively) even in the balanced-demonstration setting, which supports the importance of input-label demonstrations.

Figure 9: Here, the labels are replaced with tokens that are unrelated to the label semantics while still maintaining the input-label mappings. The replacement tokens include alphabet tokens, numeric tokens, and random word tokens from the language model's word space ("rand token"). The baselines obtained from the ground-truth labels and random labels are denoted "GT" and "rand label" respectively. Results strongly suggest that language models are still able to utilize input-label demonstrations without access to label priors.

Related Work
As the scale of language models becomes larger (Rae et al., 2021; Chowdhery et al., 2022; Smith et al., 2022; Thoppilan et al., 2022), fine-tuning becomes prohibitively expensive due to its space and time complexities. As an alternative, in-context learning (ICL) (Brown et al., 2020) has been shown to be an effective parameter-free learning strategy that prompts language models with task-specific prompt templates. Since then, a plethora of work has investigated both the properties and the applications of the learning mechanism (Schick and Schütze, 2021b; Reynolds and McDonell, 2021; Kim et al., 2021; Zhao et al., 2021; Lu et al., 2022; Min et al., 2022b). Although numerous efficient fine-tuning strategies have been proposed (Li and Liang, 2021; Hu et al., 2022; Lester et al., 2021), the absence of an explicit training step has enabled ICL to remain a distinct class of approaches for adapting large-scale language models.

Figure 10: The effect of using label-balanced demonstrations on 5 imbalanced datasets. Employing the balanced demonstrations degrades all metrics due to the distributional shift in label demonstrations. However, sensitivity remains significant, which supports the importance of input-label demonstrations.

Conclusion and Future Work
In this work, we took a closer look at how input-label relationships affect in-context learning performance. To quantitatively analyze the impact of input-label mappings in in-context learning, we proposed novel metrics: GLER and label-correctness sensitivity. Through extensive experiments, we found that the integrity of the input-label mapping is a crucial factor in performing ICL. We also conducted ablation studies to reveal various conditions that allow ICL to improve its insensitivity towards label corruption (while still maintaining healthy performance). For future work, based on the current findings, we will investigate whether data augmentation can be exploited for extremely low-resource ICL situations.

Limitations
PLMs are highly sensitive to the choice of prompts.
It is widely known that the performance of PLMs is highly sensitive to the choice of prompts (Brown et al., 2020; Lu et al., 2022; Zhao et al., 2021). Prompt engineering to find the optimal prompt was not feasible considering the number of datasets and settings we experimented with. The findings from this work may differ depending on the choice of prompts. However, to minimize this limitation, the templates and prompts were adopted from well-studied previous works as much as possible.
Ground-truth label demonstrations are just one piece of the puzzle. According to the full analysis from Min et al. (2022b), other components of demonstrations not covered in this paper (e.g., input-space demonstrations) exhibit even stronger impacts on ICL. Although our experiments were designed to analyze solely the impact of input-label correspondence, disentangling the diverse aspects of demonstrations is highly difficult, as mentioned in Section 4. Other factors, such as the label distribution, may have unexpectedly influenced the results.
Huggingface Implementation. We use the Huggingface implementation of GPT-NeoX. To our knowledge, the current version of GPT-NeoX in Huggingface underperforms compared to the original implementation from Black et al. (2022).

A.2 Full Dataset
We evaluate on 17 text classification datasets covering diverse tasks, including sentiment analysis, paraphrase detection, natural language inference, and hate speech detection, and diverse domains, including science, social media, finance, and more. All datasets are from Huggingface Datasets (Lhoest et al., 2021). The full list and details about the datasets are provided in Table 2.
As mentioned in Section 2.4.4, sensitivity highly depends on relative performance. In order to effectively capture the correlation between sensitivity and diverse factors, we evaluate on a subset of 8 datasets with high relative performance in Section 3. The 8 datasets are glue-sst2, glue-rte, super_glue-cb, trec, financial_phrasebank, medical_questions_pairs, sick, and tweet_eval-hate. Due to limited resources, we only run experiments on 6 datasets in Section 3.3.

A.3 Metric
We use accuracy as our primary metric. Accuracy is a commonly used metric in multi-class classification that intuitively shows how well the model performs. The F1 score takes into account how the data is distributed, so it is useful for data with imbalanced classes. However, F1 is less intuitive since it measures the trade-off between precision and recall. Moreover, the F1 score can vary depending on the averaging method in multi-class classification.

A.4 Template
We use 3 types of templates with varying engineering cost and verbosity. First, as a baseline, we use a minimal template following Ye et al. (2021) and Min et al. (2022b). We use the minimal template throughout the paper. For the ablation in Section 3.2, we also evaluate manual templates and a verbose template. Templates are adopted from prior works (Brown et al., 2020; Zhao et al., 2021; Min et al., 2022b; Bach et al., 2022) where possible. Details and examples of the templates are in Table 3. Additionally, for the CBU experiment in Section 3.1, we use the manual template as the baseline since, in our preliminary experiments, applying CBU to the minimal template degraded performance in some cases.
Even though we use the same Minimal template as Min et al. (2022b), there are minor differences in the dataset-specific implementation of the data preprocessor (e.g., input sentences of the glue_mrpc dataset used in Min et al. (2022b) have the prefix "sentence1: "). Therefore, LMs may behave slightly differently on the same dataset.

A.5 Other details
Unless otherwise specified, we use k = 16 examples as demonstrations, sampled uniformly at random from the training data. We run all experiments 5 times with different seeds. Due to limited resources, we only run experiments once for GPT-3. For all models except GPT-3, we use the implementations and models from the Huggingface Transformers library (Wolf et al., 2020). For GPT-3, we use the OpenAI API, assuming that the "davinci" model is GPT-3 175B. When calculating the probability of label tokens, we do not normalize the score by the number of tokens, unlike Min et al. (2022b). Our implementation is available at https://github.com/juny116/ICL-DeepSpeed.
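The choice of whether to length-normalize label-token scores can change which label wins. The sketch below (hypothetical log-probability values, not taken from any model) illustrates the difference between summing token log-probabilities, as done here, and dividing by token count, as in Min et al. (2022b):

```python
def label_score(token_logprobs, length_normalize=False):
    """Score a verbalized label from its per-token log-probabilities.

    Without normalization (our setting), longer labels accumulate more
    negative log-probability; with normalization, the mean per-token
    log-probability is used instead.
    """
    total = sum(token_logprobs)
    return total / len(token_logprobs) if length_normalize else total

# Hypothetical per-token log-probs for a 1-token and a 2-token label.
scores = {"positive": [-0.9], "not negative": [-0.4, -0.8]}

def pick(normalize):
    return max(scores, key=lambda lbl: label_score(scores[lbl], normalize))

print(pick(False))  # unnormalized: "positive"     (-0.9 > -1.2)
print(pick(True))   # normalized:   "not negative" (-0.6 > -0.9)
```

Because multi-token verbalizers are penalized under the unnormalized scheme, results obtained with the two scoring rules are not directly comparable.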

A.6 Corrupting input-label mapping
To examine the detailed impact of the ground-truth input-label mapping, we revisit the experiments from Min et al. (2022b). Specifically, we replace a fixed number of correct labels in the demonstrations with incorrect labels and compare the end-task performance.
• No demonstrations is a zero-shot prediction made via argmax y∈C P (y|x), where x is the test input and C is a small discrete set of possible labels. Verbalizers are used to map tokens to classes.
• Demonstrations w/ a% correct labels consist of k×a/100 correct pairs and k×(1−a/100) incorrect pairs, where 0 ≤ a ≤ 100 and the incorrect labels are sampled uniformly at random from C. Since the labels are sampled uniformly from C, the distribution of labels in the demonstrations may differ from that of the sampled inputs. A concatenation of the k input-label pairs, a% of whose labels are correct, is used to make a prediction via argmax y∈C P (y|x 1 , y 1 , ..., x k , y k , x).
• Demonstrations w/ shuffled labels are formed by randomly shuffling the correct labels among the sampled k inputs, so each input may receive another input's label. The distribution of labels in the demonstrations does not change from that of the sampled inputs.
• Majority class baseline is the ratio of the majority class within the test data. Since some datasets are distributionally imbalanced, this is a useful indicator of how well in-context learning is working.
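The corruption procedures above can be sketched as follows. This is an illustrative reimplementation, not the paper's code; in particular, we sample replacement labels from C excluding the gold label so the replaced pairs are guaranteed incorrect, which is one possible reading of the setup:

```python
import random

def corrupt_labels(pairs, label_set, a, seed=0):
    """Return demonstrations in which a% of the k labels are correct and
    the rest are replaced by a different label drawn from label_set."""
    rng = random.Random(seed)
    k = len(pairs)
    n_correct = round(k * a / 100)
    idx = list(range(k))
    rng.shuffle(idx)
    keep = set(idx[:n_correct])  # positions that keep their gold label
    out = []
    for i, (x, y) in enumerate(pairs):
        if i in keep:
            out.append((x, y))
        else:
            # Replace with an incorrect label (uniform over C \ {y}).
            out.append((x, rng.choice([l for l in label_set if l != y])))
    return out

def shuffle_labels(pairs, seed=0):
    """Shuffled-label variant: permute the gold labels among the k inputs,
    so the label distribution in the demonstrations is preserved."""
    rng = random.Random(seed)
    labels = [y for _, y in pairs]
    rng.shuffle(labels)
    return [(x, y) for (x, _), y in zip(pairs, labels)]
```

For example, `corrupt_labels(pairs, labels, 50)` keeps exactly half of a k=4 demonstration set correct, while `shuffle_labels` changes input-label alignment without changing the label multiset.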

B Full Results
Full experimental results on all 17 datasets are in Table 4 for GPT-NeoX and in Table 5 for GPT-J.

C More Results on the Sensitivity vs Task Difficulty Plot
Figure 11 shows scatter plots of the sensitivities of the 17 datasets against the corresponding task difficulties, measured using the relative performance with respect to accuracy and F1 scores. The Direct approach is colored orange and the Channel approach blue. The dashed vertical line indicates a neutral performance level at which there is no difference from the random baselines. The best-fit linear lines show a general trend of increasing sensitivity with decreasing task difficulty: low sensitivity is strongly related to high task difficulty. The Channel approach also helps alleviate hypersensitivity to task difficulty.

D Label-Correctness Correlation
The first step in understanding the interaction between performance and input-label demonstrations is quantifying the correlation between the two variables. Although we considered this metric as one of the foundational quantifying measures, we omit the full analysis due to space constraints. The Pearson correlation analysis on GPT-J with the Direct approach (Figure 12) shows that the label-correctness correlation is strong (i.e., larger than 0.9) for most tasks on all performance measures. The macro-average correlation across 18 tasks is 0.895 with a p-value of 0.057, strongly supporting the linkage.
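The correlation in question can be computed as a standard Pearson coefficient between the fraction of correct labels in the demonstrations and the resulting end-task performance. A minimal sketch (the per-task numbers below are hypothetical, chosen only to illustrate the computation):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical measurements for one task: % of correct labels in the
# demonstrations vs. the observed end-task accuracy.
pct_correct = [0, 25, 50, 75, 100]
task_acc = [0.42, 0.55, 0.61, 0.70, 0.81]
print(round(pearson_r(pct_correct, task_acc), 3))  # ≈ 0.994
```

A value near 1 for a given task, as in this toy example, is what a "label-correctness correlation larger than 0.9" corresponds to.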

Figure 3: Individual scatter plots of the proposed metrics, sensitivity and GLER, across two models (GPT-NeoX and GPT-J) and 17 datasets. We also report the nominal ground-truth label accuracy values to further showcase the highly varied nature of the tasks.

Figure 4: A scatter plot of the sensitivities of 17 datasets against the corresponding task difficulties measured using the relative performance. The Direct approach is colored orange and the Channel approach blue. The dashed vertical line indicates a neutral performance level where there is no difference from the random baselines. More details are found in Appendix C.
and the likelihood calibration (Zhao et al., 2021)), various properties of the prompt template (the number of in-context examples and the verbosity), and the model size.

Figure 5: The effect of switching the ICL inference method from Direct to Channel. Employing the Noisy Channel method improves insensitivity while improving the overall ICL performance.

Figures 7 and 8: Results for varying prompt sizes and prompt verbosity. The sensitivity, impact ratio, and final ground-truth label performance are all positively correlated with the number of prompt examples. For template verbosity, the sensitivity and the impact ratio decrease with increasing verbosity, but the performance does not deteriorate. Results for GPT-NeoX (20B) are colored blue, while GPT-J (6B) is colored red.

Figure 9: The results of "label prior-free" experiments (on 8 text classification datasets), where we control the prior information of the class labels. Here, the labels are replaced with tokens that are unrelated to the label semantics while still maintaining the input-label mappings. The replacement tokens include alphabet tokens, numeric tokens, and random word tokens from the language model's word space ("rand token"). The baselines obtained from the ground-truth labels and random labels are denoted as "GT" and "rand label", respectively. The results strongly suggest that language models are still able to utilize input-label demonstrations without access to label priors.

Figure 11: Scatter plots of the sensitivities of 17 datasets against the corresponding task difficulties measured using the relative performance with respect to each metric.

sick
Question: Does the first sentence entail the second sentence? True, False, or Not sure? The young boys are playing outdoors and the man is smiling nearby. The kids are playing outdoors near a man with a smile. Answer: (True, Not sure, False)

tweet_eval-hate
Question: Does the tweet convey the author's hatred towards something or someone? True or False? Hundreds of Syrian refugees return home from Lebanon - ABC News. Answer: (True, False)

Table 3: Examples of Manual and Verbose templates. Texts in blue are Manual templates.

Figure 12: Pearson correlation analysis on all 18 tasks. A strong positive correlation is observed for all tasks and metrics, except for outliers.
CCA then appealed to the state Supreme Court. The question is: The DVD CCA appealed that decision to the U.S. Supreme Court. True or False? answer:

Oil prices fall back as Yukos oil threat lifted. The question is: Oil prices rise. True or False? answer:

That was then, and then's gone. It's now now. I don't mean I've done a sudden transformation. The question is: she has done a sudden transformation. True or False? answer:

The young boys are playing outdoors and the man is smiling nearby. The question is: The kids are playing outdoors near a man with a smile. True or False? answer:

Does the following two sentences mean the similar thing? True or False? The DVD-CCA then appealed to the state Supreme Court. The DVD CCA appealed that decision to the U.S. Supreme Court.

Does the first sentence entail the second sentence? True or False? Oil prices fall back as Yukos oil threat lifted. Oil prices rise.

Does the first sentence entail the second sentence? True, False, or Neither? That was then, and then's gone. It's now now. I don't mean I've done a sudden transformation. she has done a sudden transformation. Answer: