On the Relation between Sensitivity and Accuracy in In-context Learning

In-context learning (ICL) suffers from oversensitivity to the prompt, making it unreliable in real-world scenarios. We study the sensitivity of ICL with respect to multiple perturbation types. First, we find that label bias obscures the true sensitivity, and therefore prior work may have significantly underestimated ICL sensitivity. Second, we observe a strong negative correlation between ICL sensitivity and accuracy: predictions sensitive to perturbations are less likely to be correct. Motivated by these findings, we propose \textsc{SenSel}, a few-shot selective prediction method that abstains from sensitive predictions. Experiments on ten classification datasets show that \textsc{SenSel} consistently outperforms two commonly used confidence-based and entropy-based baselines on abstention decisions.


Introduction
Few-shot learning (FSL) refers to a system's ability to quickly learn a new task based on a few labeled examples. Recently, in-context learning (ICL) has made significant progress in FSL: a language model (LM) is prompted with a few demonstration examples that enable it to make predictions for new examples without any gradient update. However, a known issue of ICL is that it is oversensitive to the prompt (Zhao et al., 2021; Perez et al., 2021), making it less reliable in practice. Despite near-universal acknowledgment of this issue, when and how predictions are sensitive remains unclear (Min et al., 2022b; Kim et al., 2022). This paper fills these gaps.
We conduct a systematic study of ICL sensitivity to prompt perturbations. Specifically, we perturb the task instruction (by paraphrasing and noise injection) and the order of the in-context examples. We then measure prediction sensitivity by the magnitude of model output changes due to the prompt perturbation.
Our first observation is that the extent of sensitivity is significantly underestimated due to label bias in ICL: LMs tend to assign a higher probability to a specific label regardless of the prompt (Zhao et al., 2021), and thus appear to make stable predictions. Our study shows that the adjusted sensitivity after mitigating label bias is up to 2.8x the raw sensitivity.
After mitigating label bias, we observe a negative correlation between the adjusted sensitivity and the accuracy of ICL: if a prediction is sensitive to prompt perturbations, then it is likely to be incorrect (Figure 1 left). This finding aligns with our intuition that if a prediction is sensitive to the prompt that elicits the LM concept (e.g., sentiment) (Xie et al., 2022), then the example is likely not typical of that concept and is thus more challenging. Our experiments show a significant negative correlation of up to −0.40 (Pearson) between ICL sensitivity and accuracy.
Given the above findings, a natural idea is to use sensitivity as a signal to abstain from making predictions on error-prone examples, an important mechanism for increasing user trust when deploying ICL models in high-stakes domains such as healthcare (Korngiebel and Mooney, 2021; Sezgin et al., 2022) and legal systems (Eliot and Lance, 2021). Our proposed method, Sensitivity-based Selective prediction (SENSEL), uses sensitivity to make abstention decisions: the LM abstains on examples where its prediction is sensitive to prompt perturbations (Figure 1 right). Compared to the common approach of training a separate model to make abstention decisions (Platt et al., 1999; Geifman and El-Yaniv, 2019; Kamath et al., 2020), our approach does not require large amounts of labeled data and is thus more suitable for the few-shot setting.
Our experiments show that sensitivity is a stronger signal than output probabilities for abstention. SENSEL consistently outperforms two baselines based on model probabilities (MAXPROB and ENTROPY) by up to +4.1 AUC points. Further analysis shows that the two approaches are complementary: MAXPROB falters on high-sensitivity tasks because it relies on oversensitive model probabilities for abstention, whereas SENSEL capitalizes on ICL sensitivity and hence works better on such tasks.

ICL Sensitivity Study
In this section, we study the interplay between label bias and prediction sensitivity in ICL, as well as the relation between sensitivity and accuracy.

ICL Sensitivity
Background In-context learning is a FSL method using LMs. Given a test example x, we concatenate the task instruction I, a few (K) labeled examples S = [(x_{σ(1)}, y_{σ(1)}), ..., (x_{σ(K)}, y_{σ(K)})] ordered by a permutation σ, and the test input x. The probability of each label is then given by the LM's next-word probabilities. We use p_LM(y | x, I, S, σ) to denote the prediction probabilities, and f(x, I, S, σ) = argmax_y p_LM(y | x, I, S, σ) to denote the predicted (most likely) label.
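As a concrete illustration, this prediction rule can be sketched in a few lines of Python. The prompt template and the stand-in label-scoring function `lm_label_probs` are our own illustrative assumptions, not the exact format or models used in this paper.

```python
# Sketch of ICL prediction: concatenate the instruction I, the K labeled
# demonstrations in the order given by sigma, and the test input x, then
# return argmax_y p_LM(y | x, I, S, sigma).

def build_prompt(instruction, examples, order, x):
    """Concatenate I, the demonstrations (in the order sigma), and x.

    The "Input:/Label:" template is a hypothetical choice for illustration.
    """
    lines = [instruction]
    for i in order:
        ex, label = examples[i]
        lines.append(f"Input: {ex}\nLabel: {label}")
    lines.append(f"Input: {x}\nLabel:")
    return "\n".join(lines)

def icl_predict(lm_label_probs, instruction, examples, order, x, labels):
    """f(x, I, S, sigma): the label with the highest next-word probability.

    `lm_label_probs(prompt, labels)` stands in for the LM and should return
    a dict mapping each candidate label to its probability.
    """
    probs = lm_label_probs(build_prompt(instruction, examples, order, x), labels)
    return max(labels, key=probs.get)
```

Any callable that scores candidate labels under an LM can be plugged in as `lm_label_probs`, so the same sketch covers both GPT-J and GPT-NEO style backends.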
Despite its success, ICL is known to be highly sensitive, and several methods have been proposed to address this issue, e.g., searching for high-performance prompts that lead to less sensitive predictions (Zhou et al., 2022). In contrast to these works, we connect ICL sensitivity to label bias and prediction accuracy, and propose a new few-shot selective prediction approach based on sensitivity.
Measuring Sensitivity We measure prediction sensitivity by the magnitude of the changes in the predicted label when the prompt is perturbed. We perturb the task instruction and the order of the in-context examples, respectively. Formally, we measure the sensitivity of a prediction f(x, I, S, σ) with respect to a perturbation set P as the fraction of perturbations in P under which the predicted label changes. We use three perturbation sets: Human Instruction Perturbation (INSTH) replaces the task instruction with a different human-written instruction for the same task; Automatic Instruction Perturbation (INSTA) perturbs the instruction automatically, e.g., by token dropout or paraphrasing; Example Ordering Perturbation (EXORD) permutes the ordering of the in-context examples.
Figure 2: We compare the raw sensitivity with the adjusted sensitivity (label bias mitigated with PC). We observe that the adjusted sensitivity is consistently higher than the raw sensitivity for all three perturbation sets for both GPT-J and GPT-NEO. Error bars represent 95% confidence intervals.
Table 1: We report the Pearson correlation coefficient (and its standard deviation in parentheses) between ICL sensitivity and accuracy across five randomly sampled sets of few-shot examples (label bias mitigated with PC). We observe a strong negative correlation between ICL sensitivity and accuracy for all perturbation sets and both models.
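One natural instantiation of this sensitivity measure is the fraction of perturbed prompts that flip the predicted label; the exact formulation below is our sketch, and the `predict` callable is a hypothetical stand-in for f(x, I, S, σ).

```python
def sensitivity(predict, x, original_prompt, perturbed_prompts):
    """Fraction of perturbations in P under which the predicted label changes.

    `predict(x, prompt)` stands in for f(x, I, S, sigma); each element of
    `perturbed_prompts` is one perturbation of the original prompt (e.g., a
    paraphrased instruction or a reordered set of demonstrations).
    """
    base = predict(x, original_prompt)
    flips = sum(predict(x, p) != base for p in perturbed_prompts)
    return flips / len(perturbed_prompts)
```

A sensitivity of 0 means every perturbation leaves the prediction unchanged; a sensitivity near 1 means almost every perturbation flips it.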
Confounding with Label Bias One known issue of ICL is label bias, where LMs assign a higher probability to a specific label regardless of the prompt, and hence appear to make stable predictions when the prompt is perturbed. Prior work mitigates label bias by adjusting the decision boundary. For example, contextual calibration (CC) renormalizes the predicted label distribution such that it is uniform given null examples (Zhao et al., 2021). Prototypical calibration (PC) clusters the LM's predictions, maps each cluster to a label, and makes predictions for new examples by their most likely cluster assignments (Han et al., 2022).
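As a rough sketch of the CC idea (our simplified rendering of Zhao et al., 2021, not their exact procedure), one divides each label's probability by its probability under a content-free ("null") input and renormalizes, so that the null input itself maps to a uniform label distribution:

```python
def contextual_calibrate(probs, null_probs):
    """Simplified contextual calibration: downweight labels that the LM
    favors even for a content-free input, then renormalize.

    probs:      label -> p_LM(label | x, prompt) for the real input x
    null_probs: label -> p_LM(label | null input, prompt)
    """
    adjusted = {y: probs[y] / null_probs[y] for y in probs}
    z = sum(adjusted.values())
    return {y: v / z for y, v in adjusted.items()}
```

By construction, feeding in the null input's own distribution yields the uniform distribution, which is the calibration target described above.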

Experimental Setup
We first compare the raw sensitivity with the adjusted sensitivity. We then compute the Pearson correlation coefficient (Freedman et al., 2007) between the adjusted sensitivity and accuracy.
We run experiments on ten classification datasets covering sentiment classification, emotion classification, topic classification, and question answering. See Appendix A for dataset details. We use GPT-J-6B (Wang and Komatsuzaki, 2021) and GPT-NEO-2.7B (Black et al., 2021; Gao et al., 2020) as our models. We describe additional implementation details in Appendix B. For label bias mitigation, because the same observations hold for PC and CC, we report PC results in the main paper and CC results in Appendix C.1.

Findings
Sensitivity is underestimated due to label bias.
We report raw and adjusted sensitivity with respect to each perturbation set in Figure 2. We observe on both models and all three perturbation sets that ICL becomes more sensitive when label bias is mitigated. After prototypical calibration, the adjusted sensitivity is 99.0% higher. Therefore, we argue that the true sensitivity may have been significantly underestimated if label bias is not mitigated. Among the three perturbation sets, ICL is most sensitive to human instruction perturbations: the perturbations cause the predicted label to change 43% of the time on GPT-J-6B and 50% of the time on GPT-NEO-2.7B (after mitigating label bias). This may be caused by the semantic difference between various human instructions for the same task, such as changing "Is this product review positive?" to "Based on this review, would the user recommend this product?".
Sensitivity is negatively correlated with accuracy. After mitigating label bias, we measure the Pearson correlation coefficient between sensitivity and accuracy (Table 1). We observe a significant negative correlation between sensitivity (with respect to all perturbation sets) and accuracy across datasets. The correlation is strongest for human instruction perturbations (−0.39 on both models).

Sensitivity-based Selective Few-shot Prediction
Motivated by the correlation between the sensitivity and accuracy of ICL, we propose SENSEL, a selective few-shot prediction method based on sensitivity.

Problem Statement
The goal of selective prediction is to abstain on examples that the model is not confident about, to avoid presenting wrong predictions to users (Chow, 1957; El-Yaniv and Wiener, 2010). Selective prediction methods score model confidence C on each example, and abstain on examples with low prediction confidence (C < γ), where γ is a threshold that controls the trade-off between accuracy and coverage.
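The abstention rule C < γ can be sketched as follows; this is an illustrative helper of our own, not code from the paper.

```python
def selective_predict(confidences, predictions, gamma):
    """Return the predictions, replacing those with confidence C < gamma
    by None (abstain). Raising gamma lowers coverage but typically raises
    accuracy on the retained examples."""
    return [p if c >= gamma else None
            for c, p in zip(confidences, predictions)]
```

The same helper works for any confidence score, including the negative-sensitivity score used by SENSEL below.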
Sensitivity-based Selective Prediction SENSEL scores ICL prediction confidence as the negative value of the prediction's sensitivity to prompt perturbations, and then abstains on examples whose confidence scores (i.e., negative sensitivity scores) are below a certain threshold γ.
Experiment Setup For SENSEL, we always use the adjusted sensitivity computed after mitigating label bias. As writing good task instructions can be hard (Gao et al., 2021), we experiment with two settings: INST (a task instruction is available) and NO INST (no task instruction is available). We perturb the task instruction in the INST setting (SENSEL-INSTH, SENSEL-INSTA), and perturb the example ordering in the NO INST setting (SENSEL-EXORD). We compare SENSEL to two simple yet strong baselines: MAXPROB, which uses the maximum output probability over the labels as the confidence score (Hendrycks and Gimpel, 2017; Lakshminarayanan et al., 2017), and ENTROPY, which uses the negative entropy of the output probabilities over the labels as the confidence score (Wan, 1990). We evaluate the effectiveness of selective prediction methods with the area under the F1-coverage curve (AUC), which measures the average F1-score at different coverage rates (Kamath et al., 2020). For label bias mitigation, since the same conclusion holds for PC and CC, we report the results for PC in the main paper and the results for CC in Appendix C.2.
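The two baseline confidence scores and the area under a metric-coverage curve can be sketched as follows. For brevity the sketch averages per-prefix accuracy rather than F1 (the paper's AUC uses the F1-coverage curve), and the function names are ours.

```python
import math

def maxprob_confidence(probs):
    """MAXPROB: the maximum output probability over the labels."""
    return max(probs)

def entropy_confidence(probs):
    """ENTROPY: the negative entropy of the label distribution
    (peaked distributions score higher than uniform ones)."""
    return sum(p * math.log(p) for p in probs if p > 0)

def auc_coverage(confidences, correct):
    """Average the per-prefix metric over coverage levels k/n, ranking
    examples by confidence so the most confident predictions are retained
    first. Uses accuracy as the per-prefix metric for brevity."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    hits, scores = 0, []
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        scores.append(hits / k)
    return sum(scores) / len(scores)
```

A good confidence score ranks correct predictions above incorrect ones, so the prefix metric stays high as coverage grows and the area under the curve increases.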
Results According to Figure 3, SENSEL consistently outperforms MAXPROB and ENTROPY. Among the three perturbation sets, SENSEL with human-written instruction perturbations performs best (outperforming MAXPROB by an average margin of +4.1 AUC points on GPT-J-6B and +5.0 AUC points on GPT-NEO-2.7B), which is consistent with our sensitivity study showing that sensitivity to human-written instructions has the strongest correlation with accuracy. Even when instructions are not available, SENSEL-EXORD outperforms MAXPROB and ENTROPY consistently, by an average margin of +3.0 AUC points on GPT-J-6B and +1.4 AUC points on GPT-NEO-2.7B.
To understand how well SENSEL and MAXPROB perform on different tasks, we analyze the two methods on tasks with different prediction sensitivity. Specifically, we measure the correlation between task sensitivity and task abstention performance (measured by the AUC of each abstention method minus that of a random abstention baseline). Results show that MAXPROB works better on tasks with low prediction sensitivity (Pearson correlation −0.17), while SENSEL works better on tasks with high prediction sensitivity (correlation +0.28) (Figure 2, Figure 3). Hence, SENSEL and MAXPROB are complementary: MAXPROB falters on high-sensitivity tasks (e.g., DBP) because it relies on oversensitive model probabilities for abstention, while SENSEL capitalizes on ICL sensitivity for abstention and hence works even better on high-sensitivity tasks.

Conclusion
While ICL sensitivity is a widely known issue, its relation to other variables has not been studied. This work first conducts a comprehensive study, and finds that ICL sensitivity is negatively correlated with accuracy when label bias is mitigated. Based on this observation, we develop a few-shot selective prediction method that abstains on highly sensitive predictions. Our results show that ICL sensitivity exhibits a useful pattern: it reflects how confidently an LM understands the task.
There are many open questions for future work. First, our study of the sensitivity-accuracy relation is correlational but not causal. Future work should explore causal experiments to study whether ICL predictions are sensitive because they are uncertain. Second, it remains unclear why sensitivity is negatively correlated with accuracy in ICL, which requires a better understanding of the mechanism of ICL. Third, our work mainly focuses on text classification tasks. Future work can further explore other tasks such as text generation and question answering with structured output.

Limitations
First, our study of the sensitivity-accuracy relation is correlational but not causal. Future work should explore causal experiments to study whether ICL predictions are sensitive because they are incorrect. Second, it remains unclear why sensitivity is negatively correlated with accuracy in ICL, which requires a better understanding of the mechanism of ICL. Third, our work mainly focuses on text classification tasks. Future work can further explore other tasks such as text generation.

A Datasets
We study ICL sensitivity and few-shot selective prediction on ten classification datasets, including AG News (Zhang et al., 2015).
Perturbation Set For human instruction perturbation, we use task instructions from PromptSource (Bach et al., 2022), which provides on average 7 task instructions for each task. For automatic instruction perturbation, we generate 10 perturbed instructions by randomly dropping out 20% of the tokens in the instruction, and another 10 perturbed instructions by using a neural paraphrase model.
We use a T5 model fine-tuned on the Google PAWS dataset (Zhang et al., 2019) as the paraphrase model and decode with nucleus sampling of top-p = 0.9.

C.1 ICL Sensitivity Study
Confounding Label Bias We report raw and adjusted sensitivity (label bias mitigated by CC) in Figure 4. Similar to our observations on PC, ICL becomes more sensitive when label bias is mitigated with CC. We also report the sensitivity scores for raw, CC, and PC in Table 2.

Sensitivity-Accuracy Correlation
We report the correlation between prediction sensitivity and accuracy for raw and CC in Table 3. Similar to our observations on PC, there is a significant negative correlation between sensitivity and accuracy across datasets for both raw and CC.

C.2 Sensitivity-Based Selective Few-shot Prediction
Similar to the results on PC, all three variants of SENSEL consistently outperform both MAXPROB and ENTROPY when CC is used to mitigate label bias (Figure 5). Among the three perturbation sets, SENSEL with human-written instruction perturbations performs best (outperforming MAXPROB and ENTROPY by +3.9 AUC points on GPT-J-6B and +0.8 AUC points on GPT-NEO-2.7B). Similar to the results on PC, SENSEL-EXORD outperforms MAXPROB and ENTROPY consistently even when instructions are not available. We also report the AUC scores in Tables 4 and 5.
We also plot the coverage-F1 curves, which show coverage rates at different F1 thresholds (Figure 6). The curves for SENSEL-INSTH and MAXPROB further verify that SENSEL consistently outperforms MAXPROB at different thresholds.

Figure 5: We compare our SENSEL method (confounding label bias mitigated by CC) to the MAXPROB baseline. SENSEL consistently outperforms MAXPROB under both the INST setting and the NO INST setting.
Figure 1: ICL sensitivity-accuracy correlation (left): We plot the prediction sensitivity against the prediction accuracy averaged over examples with that sensitivity. Different colors represent different perturbation sets (Section 2.1), and color bands represent 95% confidence intervals. We observe a significant negative correlation between the prediction sensitivity and accuracy of ICL. SENSEL (right): SENSEL measures the sensitivity of model predictions to prompt perturbations, and abstains from making predictions on examples with high sensitivity.
Figure 3: We compare our SENSEL method (label bias mitigated with PC) to the MAXPROB baseline on abstention, measured by AUC score. SENSEL consistently outperforms MAXPROB in both the INST and NO INST settings.
Zhixiong Han, Yaru Hao, Li Dong, and Furu Wei. 2022. Prototypical calibration for few-shot learning of language models. ArXiv.
Dan Hendrycks and Kevin Gimpel. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations.
Yonatan Geifman and Ran El-Yaniv. 2019. SelectiveNet: A deep neural network with an integrated reject option. In Proceedings of the International Conference on Machine Learning.
Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics.
We set the number of shots K to four because the performance flattens out beyond four examples in our setting.All results are averaged over five randomly sampled sets of few-shot examples.
Figure 4: We compare the raw sensitivity with the adjusted sensitivity (label bias mitigated with CC). We observe that the adjusted sensitivity is consistently higher than the raw sensitivity for all three perturbation sets (INSTH: Human Instruction Perturbation, INSTA: Automatic Instruction Perturbation, and EXORD: Example Ordering Perturbation). Error bars represent 95% confidence intervals.

Table 4: We compare our SENSEL method to the MAXPROB baseline and the ENTROPY baseline on the GPT-J-6B model. SENSEL consistently outperforms both baselines under both the INST setting and the NO INST setting. The standard deviation across five randomly sampled sets of few-shot examples is reported in parentheses.