Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

In-context learning (ICL) emerges as a promising capability of large language models (LLMs) by providing them with demonstration examples to perform diverse tasks. However, the underlying mechanism of how LLMs learn from the provided context remains under-explored. In this paper, we investigate the working mechanism of ICL through an information flow lens. Our findings reveal that label words in the demonstration examples function as anchors: (1) semantic information aggregates into label word representations during the shallow computation layers' processing; (2) the consolidated information in label words serves as a reference for LLMs' final predictions. Based on these insights, we introduce an anchor re-weighting method to improve ICL performance, a demonstration compression technique to expedite inference, and an analysis framework for diagnosing ICL errors in GPT2-XL. The promising applications of our findings again validate the uncovered ICL working mechanism and pave the way for future studies.


Introduction
In-context learning (ICL) has emerged as a powerful capability alongside the development of scaled-up large language models (LLMs) (Brown et al., 2020). By instructing LLMs using few-shot demonstration examples, ICL enables them to perform a wide range of tasks, such as text classification (Min et al., 2022a) and mathematical reasoning (Wei et al., 2022). Since ICL does not require updates to millions or trillions of model parameters and relies on human-understandable natural language instructions (Dong et al., 2023), it has become a promising approach for harnessing the full potential of LLMs. Despite its significance, the inner working mechanism of ICL remains an open question, garnering considerable interest from research communities (Xie et al., 2022; Dai et al., 2022; Akyürek et al., 2022; Li et al., 2023b).
In this paper, we find that the label words serve as anchors that aggregate and distribute information in ICL. We first visualize the attention interaction pattern between tokens with a GPT model (Brown et al., 2020) on sentiment analysis (Figure 1). Initial observations suggest that label words aggregate information in shallow layers and distribute it in deep layers. To draw a clearer picture of this phenomenon, we design two metrics based on saliency scores to portray the information flow in ICL and further propose the following hypothesis:

Information Flow with Labels as Anchors
$H_1$: In shallow layers, label words gather the information of demonstrations to form semantic representations for deeper layers.
$H_2$: In deep layers, the model extracts the information from label words to form the final prediction.

Two experiments are designed to validate the hypothesis using GPT2-XL (Radford et al., 2019) and GPT-J (Wang and Komatsuzaki, 2021) across several text classification benchmarks. (1) By blocking the information aggregation path to label words in certain layers, we find that such isolation in shallow layers significantly impairs model performance. This indicates that label words collect useful information during forward propagation in shallow layers. (2) We investigate the relationship between the attention distributions on the label words of the target position and the model's final prediction. Our results illustrate a strong positive correlation, where a candidate label's probability increases with more attention weight on its corresponding label token. In summary, these experimental findings suggest that our hypothesis holds well with large language models on real-world datasets.
Drawing on insights from the information flow perspective, we explore three approaches to enhance ICL's effectiveness, efficiency, and interpretability. (1) An anchor re-weighting method is introduced, which employs a learnable vector to adjust the significance of different label words in demonstrations, leading to a 16.7% average accuracy boost compared to standard ICL baselines. (2) For quicker ICL inference, inputs are compressed into pre-calculated anchor representations since model predictions primarily rely on label word activations. Testing shows a 1.8× speedup in inference with only a minimal performance trade-off. (3) An error analysis of ICL on GPT2-XL demonstrates that the label confusion matrix aligns closely with the distance distribution of anchor key vectors, implying that errors might result from similar anchor representations. These promising applications further validate our hypothesis and shed light on future ICL studies for better transparency of LLMs.

Label Words are Anchors
This section confirms the intuitive findings using two saliency score-based metrics as discussed in § 2.1. The quantitative results lead to a proposed hypothesis for the ICL working mechanism:
$H_1$: In shallow layers, label words aggregate information from demonstration examples to form semantic representations for later computations.
$H_2$: In deep layers, the model makes predictions by extracting information from label words.
The validation for these hypotheses is presented in § 2.2 and § 2.3, respectively.

Hypothesis Motivated by Saliency Scores
This section aims to discover the inherent patterns in the attention interaction between tokens for a GPT model. The saliency technique (Simonyan et al., 2013), a common interpretation tool, is employed for highlighting critical token interactions. Following common practice, we use the Taylor expansion (Michel et al., 2019) to calculate the saliency score for each element of the attention matrix:

$$I_l = \left| \sum_h A_{h,l} \odot \frac{\partial L(x)}{\partial A_{h,l}} \right| \quad (1)$$

Here, $A_{h,l}$ is the value of the attention matrix of the $h$-th attention head in the $l$-th layer, $x$ is the input, and $L(x)$ is the loss function of the task, e.g., the cross-entropy objective for a classification problem. Aggregating over all attention heads yields the saliency matrix $I_l$ for the $l$-th layer. $I_l(i, j)$ represents the significance of the information flow from the $j$-th word to the $i$-th word for ICL. By observing $I_l$, we can get an intuitive impression that as the layer goes deeper, demonstration label words become more dominant for the prediction, as depicted in Figure 1.
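The saliency score described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the per-head attention matrices and the gradients of the loss with respect to them have already been obtained (for example from an autodiff framework), and that heads are combined by summing the element-wise products before taking the absolute value. The function name `saliency_matrix` and the toy numbers are hypothetical.

```python
def saliency_matrix(attn_heads, grad_heads):
    """Combine per-head attention and loss gradients into one saliency matrix.

    attn_heads[h][i][j]: attention from token i to token j in head h.
    grad_heads[h][i][j]: gradient of the loss w.r.t. that attention entry.
    Assumption: heads are summed, then the absolute value is taken.
    """
    n = len(attn_heads[0])
    total = [[0.0] * n for _ in range(n)]
    for attn, grad in zip(attn_heads, grad_heads):
        for i in range(n):
            for j in range(n):
                total[i][j] += attn[i][j] * grad[i][j]
    return [[abs(v) for v in row] for row in total]


# Toy input: two heads over a two-token sequence (made-up values).
attn_heads = [[[1.0, 0.0], [0.4, 0.6]], [[1.0, 0.0], [0.7, 0.3]]]
grad_heads = [[[0.1, 0.0], [-0.5, 0.2]], [[-0.2, 0.0], [0.1, -0.4]]]
print(saliency_matrix(attn_heads, grad_heads))
```

Entry `[i][j]` of the result plays the role of the flow significance from token `j` to token `i` for one layer.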
To draw a clearer picture of this phenomenon, we propose three quantitative metrics based on $I_l$. Our focus lies in three components: (i) the label words, such as "Negative" and "Positive" in Figure 2, denoted as $p_1, ..., p_C$, where $C$ represents the total number of label words; (ii) the target position, where the model generates prediction labels (i.e., the final token in the input), which we denote as $q$; and (iii) the text part, i.e., the tokens before label words in the demonstration.
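The definitions of the three quantitative metrics follow below.

$S_{wp}$, the mean significance of information flow from the text part to label words:

$$S_{wp} = \frac{\sum_{(i,j) \in C_{wp}} I_l(i,j)}{|C_{wp}|}, \quad C_{wp} = \{(p_k, j) : k \in [1, C],\ j < p_k\}. \quad (2)$$

$S_{pq}$, the mean significance of information flow from label words to the target position:

$$S_{pq} = \frac{\sum_{(i,j) \in C_{pq}} I_l(i,j)}{|C_{pq}|}, \quad C_{pq} = \{(q, p_k) : k \in [1, C]\}. \quad (3)$$

$S_{ww}$, the mean significance of the information flow amongst all words, excluding influences represented by $S_{wp}$ and $S_{pq}$:

$$S_{ww} = \frac{\sum_{(i,j) \in C_{ww}} I_l(i,j)}{|C_{ww}|}, \quad C_{ww} = \{(i, j) : j < i\} - C_{wp} - C_{pq}. \quad (4)$$

$S_{wp}$, $S_{pq}$, and $S_{ww}$ help assess different information flows in the model. $S_{wp}$ indicates the intensity of information aggregation onto label words. A high $S_{pq}$ demonstrates a strong information extraction from label words for final decision-making. $S_{ww}$ assesses average information flow among words, serving as a benchmark to gauge the intensity of the patterns identified by $S_{wp}$ and $S_{pq}$.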
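To make the three quantities concrete, the sketch below computes them from a single layer's saliency matrix. It assumes causal attention (only pairs with j < i are counted), 0-based token positions, and a nested-list matrix. The name `flow_metrics` and the toy values are illustrative and not taken from the paper.

```python
def flow_metrics(saliency, label_positions, target):
    """Return (S_wp, S_pq, S_ww) for one layer.

    saliency[i][j] is the flow significance from token j to token i.
    label_positions are the indices of the label words, target is q.
    """
    n = len(saliency)
    # Text part -> label words: every earlier token feeding a label word.
    c_wp = {(p, j) for p in label_positions for j in range(p)}
    # Label words -> target position.
    c_pq = {(target, p) for p in label_positions}
    # All remaining causal pairs.
    c_ww = {(i, j) for i in range(n) for j in range(i)} - c_wp - c_pq

    def mean(pairs):
        return sum(saliency[i][j] for i, j in pairs) / len(pairs)

    return mean(c_wp), mean(c_pq), mean(c_ww)


# Toy sequence: text, label, text, label, target (made-up saliency values).
saliency = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.0, 0.0, 0.0],
    [0.2, 0.2, 0.8, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.7, 0.0],
]
s_wp, s_pq, s_ww = flow_metrics(saliency, label_positions=[1, 3], target=4)
print(s_wp, s_pq, s_ww)
```

In this toy case the flows involving label words are larger than the background flow, which is the pattern the metrics are meant to expose.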
Experimental Settings We choose GPT2-XL from the GPT series (Radford et al., 2019) as our primary model for investigation, due to its moderate model size (1.5B parameters), which suits our hardware resources, and its decent ICL performance (Dai et al., 2022). For datasets, we use Stanford Sentiment Treebank Binary (SST-2) (Socher et al., 2013) for sentiment analysis, Text REtrieval Conference Question Classification (TREC) (Li and Roth, 2002; Hovy et al., 2001) for question type classification, AG's news topic classification dataset (AGNews) (Zhang et al., 2015) for topic classification, and EmoContext (EmoC) (Chatterjee et al., 2019) for emotion classification.
Results and Analysis Figure 3 reveals that: (1) in shallow layers, $S_{pq}$, the significance of the information flow from label words to targeted positions, is low, while $S_{wp}$, the information flow from the text part to label words, is high; (2) in deep layers, $S_{pq}$, the importance of information flow from label words to the targeted position, becomes the dominant one. Notably, $S_{pq}$ and $S_{wp}$ usually surpass $S_{ww}$, suggesting that interactions involving label words outweigh others.
Proposed Hypothesis Based on this, we propose the hypothesis that label words function as anchors in the ICL information flow.In shallow layers, label words gather information from demonstration examples to form semantic representations for deeper layers, while in deep layers, the model extracts the information from label words to form the final prediction. Figure 2 gives an illustration for our hypothesis.

Shallow Layers: Information Aggregation
In this part, we validate our hypothesis' first component. We assume that the information aggregation in ICL relies on the information flow from the text part to label tokens, which is facilitated by the transformer's attention mechanism. By manipulating the attention layer in the model to block this flow and examining the model behavior change, we validate the existence of the information aggregation process and its contribution to the final prediction.
Experimental Settings We retain the same test sample size of 1000 inputs as § 2.1. We use the same demonstration for a single random seed. To further validate our findings on larger models, we incorporate GPT-J (6B) (Wang and Komatsuzaki, 2021) in experiments, which exceeds GPT2-XL in model size and capacity.
Implementation Details To block the information flow to label words, we isolate label words by manipulating the attention matrix $A$. Specifically, we set $A_l(p, i)$ $(i < p)$ to 0 in the attention matrix $A_l$ of the $l$-th layer, where $p$ represents label words and $i$ represents preceding words. Consequently, in the $l$-th layer, label words cannot access information from the prior demonstration text.
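The isolation step above amounts to zeroing part of each label word's attention row. The following is a minimal sketch on a plain nested-list attention matrix. It assumes 0-based positions and does not renormalize the row afterwards, since the text only states that the entries are set to 0. `isolate_label_words` is a hypothetical helper name.

```python
def isolate_label_words(attn, label_positions):
    """Return a copy of attn in which label words ignore all earlier tokens.

    attn[i][j] is the attention weight from token i to token j.
    For each label position p, entries attn[p][i] with i < p are set to 0.
    """
    blocked = [row[:] for row in attn]  # copy so the input stays untouched
    for p in label_positions:
        for i in range(p):
            blocked[p][i] = 0.0
    return blocked


# Toy causal attention over four tokens; token 2 is the label word.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.1, 0.2, 0.3, 0.4],
]
print(isolate_label_words(attn, label_positions=[2]))
```

Applying the same function to randomly chosen non-label positions would give the kind of control condition used for comparison.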
Metrics We use the following metrics to assess the impact of blocking information flow from the text part to label tokens:
(1) Label Loyalty: measures the consistency of output labels with and without isolation.
(2) Word Loyalty: employs the Jaccard similarity to compare the top-5 predicted words with and without isolation, capturing more subtle model output alterations (see Appendix C for details).
Low loyalty indicates a profound impact of isolation on model predictions.
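Both loyalty metrics are simple to compute once the model outputs with and without isolation are available. The sketch below is one plausible reading: label loyalty as the fraction of unchanged labels, and word loyalty as the Jaccard similarity of the top-5 word sets averaged over samples. The averaging step and the function names are assumptions, since the exact definition is deferred to the appendix.

```python
def label_loyalty(labels_original, labels_isolated):
    """Fraction of samples whose predicted label is unchanged by isolation."""
    same = sum(a == b for a, b in zip(labels_original, labels_isolated))
    return same / len(labels_original)


def word_loyalty(top5_original, top5_isolated):
    """Mean Jaccard similarity between top-5 predicted word sets."""
    scores = []
    for words_a, words_b in zip(top5_original, top5_isolated):
        set_a, set_b = set(words_a), set(words_b)
        scores.append(len(set_a & set_b) / len(set_a | set_b))
    return sum(scores) / len(scores)


# Made-up predictions for four samples, and top-5 words for one sample.
print(label_loyalty(["pos", "neg", "pos", "neg"], ["pos", "pos", "pos", "neg"]))
print(word_loyalty([["a", "b", "c", "d", "e"]], [["a", "b", "c", "x", "y"]]))
```

Word loyalty can drop even when the predicted label is unchanged, which is why it captures subtler shifts.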

Results and Analysis
Figure 4 illustrates a notable influence on the model's behavior when label words are isolated within the first 5 layers. Yet, this influence becomes inconsequential within the last 5 layers, or when random non-label words are used. This observation underlines the fundamental importance of shallow-layer information aggregation via label words in ICL. It also emphasizes the superiority of label words over non-label words. Further tests with variable numbers of layers reaffirm these findings (Appendix D). Moreover, similar results were obtained when testing ICL with semantically unrelated labels (refer to Appendix F.2).

Deep Layers: Information Extraction
We proceed to validate the latter part of our hypothesis that the model extracts information from label words to form the final prediction. We denote the sum of the attention matrices in the $l$-th layer as $A_l$. In deeper layers, we find a strong correlation between the attention distributions on the label words of the target position, represented as $(A_l(q, p_1), ..., A_l(q, p_C))$, and the model's final prediction, affirming our hypothesis. The experimental setup mirrors that discussed in § 2.2.

Experiments
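We utilize the AUC-ROC score to quantify the correlation between $A_l(q, p_i)$ and the model prediction, which we denote as $\mathrm{AUCROC}_l$ for the $l$-th layer.
We prefer the AUC-ROC metric for two primary reasons: (1) $A_l(q, p_i)$ might differ from the probability of the model outputting label $i$ by a constant factor, which the AUC-ROC metric can implicitly account for; (2) the proportion of different labels output by the model may be unbalanced, and the AUC-ROC metric reduces the disturbance caused by class imbalance. To quantify the accumulated contribution of the first $l$ layers to the model prediction, we introduce $R_l$ (Eq. 5), the sum of $(\mathrm{AUCROC}_i - 0.5)$ over the first $l$ layers divided by the same sum over all $N$ layers. This measure tracks the positive contribution above a baseline AUC-ROC threshold of 0.5. The value of $R_l$ signifies the proportional contribution of the first $l$ layers to the model prediction.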
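The cumulative measure $R_l$ is a running sum of each layer's AUC-ROC excess over 0.5, divided by the total excess over all layers. A minimal sketch follows, assuming the per-layer AUC-ROC values are already computed. The function name and the example values are made up for illustration.

```python
from itertools import accumulate


def cumulative_contribution(auc_per_layer):
    """Return R_l for every layer l, given per-layer AUC-ROC scores.

    Each layer contributes (AUCROC_i - 0.5); R_l is the running sum up to
    layer l divided by the sum over all layers.
    """
    excess = [auc - 0.5 for auc in auc_per_layer]
    total = sum(excess)
    if total == 0:
        raise ValueError("total excess over 0.5 is zero; R_l is undefined")
    return [partial / total for partial in accumulate(excess)]


# Four hypothetical layers: chance level first, stronger correlation later.
print(cumulative_contribution([0.5, 0.55, 0.7, 0.75]))
```

With these toy values the early layers account for a small share of the total, and the curve reaches 1 at the last layer by construction.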

Results and Analysis
Figures 5a and 5b delineate correlation metrics for GPT2-XL and GPT-J, averaged across four datasets.
The $\mathrm{AUCROC}_l$ for deep layers approaches 0.8, illustrating a strong correlation between the attention distributions on label words of the target position and the model's final prediction. Moreover, shallow layers show negligible cumulative contributions ($R_l$), with a significant increase in middle and deep layers. These results signify the crucial role of deep layers for final prediction, validating that the model extracts information from label words in deep layers to form the final prediction.

Discussion of Our Hypothesis
In § 2.2, we have affirmed that the model's shallow layers assemble information from demonstrations via label words to form semantic representations. In § 2.3, we verify that the aforementioned aggregated information on label words is then extracted to form the final prediction in the deep layers. Recognizing the crucial function of label words in this process, we have introduced the term "Anchors" to denote them. Given the considerable role these "anchors" fulfill, we find it intuitive to design ICL improvements based on them, as elaborated in § 3.

Applications of Our Anchor-Based Understanding
With insights from the validated hypothesis, we propose strategies to boost ICL's accuracy and inference speed. We propose an anchor re-weighting method in § 3.1 to adjust the demonstrations' contributions and improve accuracy. In § 3.2, we explore a context compression technique that reduces original demonstrations to anchor hidden states to speed up ICL inference. In addition, in § 3.3, we utilize anchor distances to analyze the errors ICL makes in real-world scenarios. These approaches corroborate our hypothesis, pointing to potential paths for future ICL enhancements.

Anchor Re-weighting
Based on our analysis in § 2, we draw parallels between ICL and logistic regression and propose an approach to improve ICL's accuracy by reweighting label anchors.
Method
§ 2.3 illustrates a strong correlation between the model's output category and the attention distribution $(A(q, p_1), ..., A(q, p_C))$ on label words $p_1, ..., p_C$ of the target position $q$ in deep layers. We can view the attention module as a classifier $f$,

$$\Pr_f(Y = i \mid X = x) \approx A(q, p_i) = \frac{\exp(q_q k_{p_i}^T / \sqrt{d})}{\sum_{j} \exp(q_q k_j^T / \sqrt{d})}. \quad (6)$$

By setting $q_q / \sqrt{d} = x$ and $k_{p_i} - k_{p_C} = \beta_i$, we deduce:

$$\log \frac{\Pr_f(Y = i \mid X = x)}{\Pr_f(Y = C \mid X = x)} = \beta_i^T x. \quad (7)$$

This approximates a logistic regression model where:

$$\log \frac{\Pr_f(Y = i \mid X = x)}{\Pr_f(Y = C \mid X = x)} = \beta_0^i + \beta_i^T x. \quad (8)$$

In this equation, $\beta_0^i$ and $\beta_i^T$ are parameters that can be learned, while $x$ is the input feature.
Inspired by the similarity between ICL and logistic regression, we've incorporated a learnable $\beta_0^i$ into Eq. (7), which is equivalent to adjusting the attention weights $A(q, p_i)$:

$$\hat{A}(q, p_i) = \exp(\beta_0^i) A(q, p_i). \quad (9)$$

Each $\beta_0^i$ is a learnable parameter, set uniquely for different attention heads and layers. Refer to Appendix G for more details.
To train the re-weighting vector $\beta = \{\beta_0^i\}$, we utilize an auxiliary training set $(X_{train}, Y_{train})$. Here, we perform ICL with normal demonstrations and optimize $\beta$ with respect to the classification loss $L$ on $(X_{train}, Y_{train})$:

$$\beta^{\star} = \arg\min_{\beta} L(X_{train}, Y_{train}). \quad (10)$$

This approach can be metaphorically described as "re-weighting the anchors," leading us to term it Anchor Re-weighting. It can also be viewed as a modification of the demonstration contributions, since demonstration information has been incorporated into the anchors as suggested by our prior analysis in § 2.2. Additionally, it can be interpreted as a unique adapter variant, introducing minimal parameters while preserving most of the original model. However, it is specifically designed based on our anchor hypothesis and requires fewer parameters than traditional adapters.
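The link between an intercept in the logistic view and a multiplicative change to attention weights can be checked numerically. The sketch below is a toy, single-head illustration with made-up numbers. It assumes each label-word attention weight is scaled by the exponential of its learnable offset, and it omits training of the offsets. The names `reweight_anchors` and `log_odds_vs_last` are hypothetical.

```python
import math


def reweight_anchors(attn_to_labels, beta0):
    """Scale each label-word attention weight A(q, p_i) by exp(beta0_i)."""
    return [math.exp(b) * a for a, b in zip(attn_to_labels, beta0)]


def log_odds_vs_last(weights):
    """Log-ratio of each weight against the last class (the reference class)."""
    return [math.log(w / weights[-1]) for w in weights]


attn = [0.20, 0.10, 0.10]   # attention from the target position to C = 3 label words
beta0 = [0.5, -0.3, 0.0]    # offsets; the reference class is left at 0

adjusted = reweight_anchors(attn, beta0)
shift = [new - old for new, old in
         zip(log_odds_vs_last(adjusted), log_odds_vs_last(attn))]
# Each log-odds value moves by beta0_i minus the reference offset.
print(shift)
```

Because the scaling is applied per label word, only one scalar per anchor (per head and layer in the full method) needs to be learned.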

Experiments
We choose one sample per class as normal demonstrations and choose four extra samples per class to form the auxiliary training set $(X_{train}, Y_{train})$. The setup follows § 2.2, with results averaged over five random seeds. Owing to computational constraints, we employ GPT2-XL for evaluation, excluding GPT-J. The parameters $\beta_0^i$ are trained using gradient descent. More details can be found in Appendix H.
We compare Anchor Re-weighting with two baselines: (1) Vanilla ICL with the same demonstration (1-shot per class); (2) Vanilla ICL, where the auxiliary training set of $\beta$ is included as demonstrations (5-shot per class) for a fair comparison.

Results
As Table 1 shows, the proposed anchor re-weighting significantly enhances ICL performance, particularly on the SST-2 and EmoC datasets. Besides, adding more demonstrations for vanilla ICL may not bring a stable accuracy boost due to the potential noise introduced, as discussed in Zhao et al. (2021). Different from vanilla ICL, which utilizes the extra examples to form a demonstration, we train a re-weighting vector $\beta$ to modulate label anchor contributions. This shortens the input context and thus brings (almost) no extra cost to the inference speed. The consistent improvements of our method suggest that the re-weighting mechanism could be a better alternative to utilize demonstration examples. Furthermore, it reiterates the crucial role that anchors play in ICL.

Anchor-Only Context Compression
We further explore a context compression technique that reduces the full demonstration to anchor hidden states for accelerating ICL inference.

Method
In § 2.3, we find that the model output heavily relies on the label words, which collect information from the demonstrations. Given the auto-regressive nature of GPT-like models, where hidden states of tokens depend solely on preceding ones, label words' information aggregation process is independent of subsequent words. This allows for the calculation and caching of the label word hidden states $H = \{\{h_l^i\}_{i=1}^{C}\}_{l=1}^{N}$ ($h_l^i$ is the $l$-th layer's hidden state of the $i$-th label word in the demonstration). By concatenating $h_l^1, ..., h_l^C$ at the front in each layer during inference, instead of using the full demonstration, we can speed up inference.
In our preliminary experiments, concatenating hidden states of label words alone was inadequate for completing the ICL task. This might be due to the critical role of formatting information in helping the model to determine the output space at the target position, as highlighted in Min et al. (2022b). As a solution, we amalgamate the hidden states of both the formatting and the label words, a method we've termed Hidden$_{\text{anchor}}$.

Table 1: The effect after adding parameter $\beta_0^i$. For AGNews, due to the length limit, we only use three demonstrations per class. Our Anchor Re-weighting method achieves the best performance over all tasks.

Experiments
We follow the same experimental settings as § 2.2. We compare our Hidden$_{\text{anchor}}$ input compression method with two equally efficient baselines.
Text$_{\text{anchor}}$: This method concatenates the formatting and label text with the input, as opposed to concatenating the hidden states at each layer.
Hidden$_{\text{random}}$: This approach concatenates the hidden states of formatting and randomly selected non-label words (equal in number to Hidden$_{\text{anchor}}$).
Hidden$_{\text{random-top}}$: To establish a stronger baseline, we randomly select 20 sets of non-label words in Hidden$_{\text{random}}$ and report the one with the highest label loyalty.
The Text$_{\text{anchor}}$ method is included to demonstrate that the effectiveness of Hidden$_{\text{anchor}}$ is attributed to the aggregation of information in label words, rather than the mere text of label words. If we find that Hidden$_{\text{anchor}}$ surpasses Text$_{\text{anchor}}$ in performance, it solidifies the notion that the aggregated information within label words carries significant importance. The Hidden$_{\text{random}}$ method is introduced to illustrate that anchor hidden states encapsulate most of the demonstration information among all hidden states. We assess all compression methods using the label loyalty and word loyalty introduced in § 2.2, in addition to classification accuracy.

Results
We can see from Table 2 that the proposed compression method Hidden$_{\text{anchor}}$ achieves the best results among all three compression methods on all metrics and for both models. For example, with the GPT-J model, the compression method with anchor states only leads to a 1.5 accuracy drop compared to the uncompressed situation, indicating that the compression introduces negligible information loss. Further, we estimate the efficiency improvements over the original ICL. As shown in Table 3, the speed-up ratio ranges from 1.1× to 2.9×, as the efficiency gain is influenced by the length of the demonstrations. We refer readers to Appendix I for a more detailed analysis of the speed-up ratios.
Besides, we observe that the acceleration effect is more pronounced in the GPT-J model compared to GPT2-XL, demonstrating its great potential to apply to larger language models.

Anchor Distances for Error Diagnosis
Lastly, we perform an error analysis for ICL by examining the distances between the key vectors in the attention module that correspond to the label words.

Method
Our previous analysis in § 2.3 shows a strong correlation between the model output and $A(q, p_i)$, which is determined by $q_q k_{p_i}^T$ as per Eq. 7. Should the key vectors $k$ for label words $p_i$ and $p_k$ be similar, $A(q, p_i)$ and $A(q, p_k)$ will also likely be similar, leading to potential label confusion. Furthermore, considering the distribution of query vectors $q_q$, we employ a PCA-like method to extract the components of the key vectors along the directions with significant variations in $q_q$, denoted as $\hat{k}$ (see Appendix J for details). We anticipate that the distances between these $\hat{k}$s can correspond to the category confusion of the model, thus revealing one possible origin of ICL errors. Here, we normalize the distances to a scale of 0-1, with 0 indicating the highest degree of category confusion:

$$\mathrm{Confusion}_{ij}^{\mathrm{pred}} = \frac{\|\hat{k}_{p_i} - \hat{k}_{p_j}\|}{\max_{s \neq t} \|\hat{k}_{p_s} - \hat{k}_{p_t}\|}$$
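The predicted confusion score can be illustrated with a short sketch: compute pairwise Euclidean distances between the projected anchor key vectors and divide by the largest one. Normalizing by the maximum pairwise distance is an assumption about how the 0-1 scaling is done. The label names, vectors, and the function `predicted_confusion` are hypothetical.

```python
import math
from itertools import combinations


def predicted_confusion(anchor_keys):
    """Map each label pair to its key-vector distance scaled into [0, 1].

    anchor_keys: dict from label name to its (projected) key vector.
    Smaller values mean closer anchors, i.e. more expected confusion.
    Assumes at least two distinct vectors, so the maximum distance is > 0.
    """
    distances = {
        (a, b): math.dist(anchor_keys[a], anchor_keys[b])
        for a, b in combinations(anchor_keys, 2)
    }
    largest = max(distances.values())
    return {pair: d / largest for pair, d in distances.items()}


# Three made-up anchors in a 2-D feature space.
keys = {"A": [0.0, 0.0], "B": [3.0, 4.0], "C": [6.0, 8.0]}
print(predicted_confusion(keys))
```

Pairs with the smallest scores are the ones this analysis would flag as most likely to be confused.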

Experiments
We utilize the GPT2-XL model and TREC dataset, as the model displays varying confusion levels between categories on this dataset. We use all 500 samples of the TREC test set and use 1 demonstration per class for convenience of analysis.
We calculate the actual model confusion score, $\mathrm{Confusion}_{ij}$, between category $i$ and category $j$ using the AUC-ROC metric (detailed in Appendix K). We then compare the predicted confusion score, $\mathrm{Confusion}_{ij}^{\mathrm{pred}}$, and the actual confusion score, $\mathrm{Confusion}_{ij}$, via heatmaps. We set undefined diagonals to 1 for better visualization.
The heatmaps display similarity in confusing category pairs, particularly in lighter-colored blocks.

Results
Figure 6 shows that the proposed approximation metric, $\mathrm{Confusion}_{ij}^{\mathrm{pred}}$, can identify the most confusing case (Description-Entity) and performs reasonably well for highly confusing categories (Entity-Abbreviation, Description-Abbreviation). This high correlation indicates that ICL makes errors in categories with similar label anchors. Overall, this result demonstrates that our anchor-based analysis framework could serve as an interpretation tool for better understanding ICL's errors.

Related Work
The existing literature on in-context learning analysis can be broadly divided into two streams, each focusing on different aspects. The first stream explores the influencing factors of ICL based on input perturbation, such as the order (Min et al., 2022b), the formatting (Yoo et al., 2022; Wei et al., 2022), and the selection of the demonstration (Liu et al., 2022). Designing proper demonstration construction strategies (Ye et al., 2023; Li et al., 2023a) and calibration techniques (Zhao et al., 2021; Min et al., 2022a) could bring clear boosts to the ICL performance. The second stream investigates the inner working mechanism of ICL through different conceptual lenses, such as making an analogy of ICL to gradient descent (von Oswald et al., 2022; Dai et al., 2022) and viewing the process of ICL as a Bayesian inference (Xie et al., 2022).
In this paper, we provide a novel perspective by examining the information flow in language models to gain an understanding of ICL.Our approach offers new insights and demonstrates the potential for leveraging this understanding to improve the effectiveness, efficiency, and interpretability of ICL.

Conclusion
In this paper, we propose a hypothesis that label words serve as anchors in in-context learning for aggregating and distributing the task-relevant information flow. Experimental results from attention manipulation and prediction correlation analysis confirm that the hypothesis holds well for the GPT2-XL and GPT-J models. Inspired by this new perspective, we propose three practical applications. First, an anchor re-weighting method is proposed to improve ICL accuracy. Second, we explore a demonstration compression technique to accelerate ICL inference. Lastly, we showcase an analysis framework to diagnose ICL errors on a real-world dataset. These promising applications again verify the hypothesis and open up new directions for future investigations on ICL.

Limitations
Our study, while providing valuable insights into in-context learning (ICL), has several limitations. Firstly, our research scope was limited to classification tasks and did not delve into the realm of generative tasks. Additionally, our hypothesis was only examined within conventional ICL paradigms, leaving other ICL paradigms such as chain-of-thought prompting (CoT) (Wei et al., 2022) unexplored. Secondly, due to hardware constraints, we mainly investigated models up to a scale of 6 billion parameters. Further research that replicates our study using larger-scale models would be beneficial in corroborating our findings and refining the hypotheses set forth in our investigation.

B Results of $S_{wp}$, $S_{pq}$, and $S_{ww}$ on TREC and EmoC
Figure 7 illustrates the relative sizes of $S_{wp}$, $S_{pq}$, and $S_{ww}$ on TREC and EmoC, mirroring results on SST-2 and AGNews. In shallow layers, $S_{wp}$ (the information flow from the text part to label words) is prominent, while $S_{pq}$ (the information flow from label words to targeted positions) is less significant. However, in deeper layers, $S_{pq}$ dominates. Importantly, $S_{wp}$ and $S_{pq}$ generally exceed $S_{ww}$, indicating that interactions involving label words are predominant.

C Reason for Using Word Loyalty Besides Label Loyalty
Label loyalty alone may not capture changes in the probability distribution of non-label words or the relative ratio of the probability of the label words within the entire vocabulary.Word loyalty helps address this limitation, which is shown in Table 5.

D Isolating Different Numbers of Layers
We study the impact of the number of isolated layers, as shown in Figures 8a and 8b. Isolating shallow layers causes a significant impact, whereas isolating deep layers has a negligible impact on the model, even when the number of isolated layers increases. This further illustrates the importance of shallow-layer information aggregation via label words.

H Training Settings of Anchor Re-weighting
For each random seed, we fix the demonstration and sample 1000 test samples from the test datasets as described in § 2.2. The optimization of parameter vector $\beta$ is carried out using gradient descent, specifically with the Adam optimizer (Kingma and Ba, 2015). The learning rate is set at 0.01, with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Due to memory constraints, we use a batch size of 1. This optimization process is repeated for 10 epochs. Owing to limitations in computational resources, we restrict our evaluation to the GPT2-XL model and exclude the GPT-J model from our assessment.

From Table 6, we observe a correlation between the acceleration ratios and the ratio of the total demonstration length ($L_{demo}$) to the length of the text predicted ($L_x$). It suggests that a greater ratio of total length to predicted text length may yield a higher acceleration ratio.
In addition, the table illustrates that datasets with longer demonstration lengths tend to exhibit higher acceleration ratios. For instance, the AGNews dataset, which has the longest $L_{demo}$, presents the highest acceleration ratio among the datasets analyzed. These findings could indicate an increased efficiency of the Hidden$_{\text{anchor}}$ method in contexts involving longer demonstration lengths.

J Calculation of $\hat{k}$
For the sampled sequence $x_1, ..., x_T$ to be predicted, we denote the query vectors of the target positions as $q_1, ..., q_T$. We then compute the matrix $\hat{Q} = (q_1 - \bar{q}, ..., q_T - \bar{q})$ by subtracting the mean vector, $\bar{q}$, from each query vector. Subsequently, we determine the $M$ directions, $v_1, ..., v_M$, that correspond to the $M$ largest variation directions for the centralized query vectors $\hat{q}_1, ..., \hat{q}_T$. The $i$-th direction, $v_i$, is chosen to maximize the variance of the projection of the centralized query vectors onto it, while also being orthogonal to the previously chosen directions, $v_1, ..., v_{i-1}$. This process can be formalized as follows:

$$v_1 = \arg\max_{\|v\| = 1} \mathrm{Var}\{v^\top \hat{Q}\},$$
$$v_2 = \arg\max_{\|v\| = 1,\ v \perp v_1} \mathrm{Var}\{v^\top \hat{Q}\},$$
$$\ldots$$
$$v_M = \arg\max_{\|v\| = 1,\ v \perp v_1, ..., v \perp v_{M-1}} \mathrm{Var}\{v^\top \hat{Q}\}.$$

We define $\sigma_i$ as the square root of the variance of the projection of $\hat{Q}$ onto the $i$-th direction, i.e., $\sigma_i = \sqrt{\mathrm{Var}\{v_i^\top \hat{Q}\}}$.
To derive features $\hat{k}$s, we project the key vector $k$ onto the directions $v_1, ..., v_M$ and scale the projections by the corresponding standard deviations $\sigma_1, ..., \sigma_M$. Each feature, $\hat{k}_i$, is thus calculated as $\sigma_i v_i^T k$.
We further examine the influence of $M$ on the prediction confusion matrix, $\mathrm{Confusion}_{ij}^{\mathrm{pred}}$, as depicted in Figure 14. Given the similarity in outcomes for various $M$, we settle on a value of $M = 10$ for computation of $\mathrm{Confusion}_{ij}^{\mathrm{pred}}$.

K Calculation of $\mathrm{Confusion}_{ij}$
To gauge the true degree of confusion between categories $i$ and $j$ for a given model, we suggest utilizing the $\mathrm{Confusion}_{ij}$ metric. First, we collect all test samples $x_t$ bearing true labels $i$ or $j$. We then obtain the probabilities $p_i^t$ and $p_j^t$ yielded by the model for categories $i$ and $j$, respectively, on these samples. These probabilities are normalized to a total of 1. Essentially, we derive a classifier $f$ that delivers the probabilities $p_i^t$ and $p_j^t$ for the categories $i$ and $j$, respectively, on the test samples $x_t$. By calculating the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) value of this classifier $f$, we get the degree of confusion between categories $i$ and $j$, termed $\mathrm{Confusion}_{ij}$.
The computed $\mathrm{Confusion}_{ij}$ is a value that never exceeds 1. The closer $\mathrm{Confusion}_{ij}$ is to 1, the less pronounced the confusion, and vice versa.
We use the above metric instead of directly analyzing the output labels of the model because previous work has indicated the issue of insufficient output probability calibration in ICL (Zhao et al., 2021), which is greatly affected by factors such as sample ordering and model preferences for specific label words. By leveraging our defined degree of confusion, $\mathrm{Confusion}_{ij}$, we can implicitly alleviate the disturbances arising from insufficient probability calibration on the output labels. This allows for a more accurate representation of the model's degree of confusion for different categories, mitigating the impact of randomness.
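The pairwise confusion score described in this appendix can be sketched with a small pairwise-comparison AUC. The code assumes each sample is given as a triple of the two class probabilities and a flag saying whether the true label is the first class. It uses the rank-based definition of AUC-ROC with ties counted as one half. The function names and sample values are made up for illustration.

```python
def auc_roc(scores, is_positive):
    """AUC-ROC as the probability that a positive outranks a negative."""
    positives = [s for s, y in zip(scores, is_positive) if y]
    negatives = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def confusion_score(samples):
    """Confusion between two categories from per-sample probabilities.

    samples: list of (p_i, p_j, true_label_is_i) for samples whose true
    label is one of the two categories. Probabilities are renormalized so
    that p_i + p_j = 1 before scoring.
    """
    scores = [p_i / (p_i + p_j) for p_i, p_j, _ in samples]
    labels = [is_i for _, _, is_i in samples]
    return auc_roc(scores, labels)


samples = [(0.6, 0.2, True), (0.5, 0.3, True), (0.3, 0.4, False), (0.4, 0.2, False)]
print(confusion_score(samples))
```

A score near 1 means the two categories are well separated, and a score near 0.5 means the model can barely tell them apart.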

L Reproducibility
In the supplementary material, we have provided code that allows for the faithful replication of our experiments and subsequent result analysis. To ensure consistency and reproducibility across different devices, we have fixed the five random seeds to the values of 42, 43, 44, 45, and 46. We invite readers to consult the code for additional implementation details.

Figure 1: Visualization of the information flow in a GPT model performing ICL. The line depth reflects the significance of the information flow from the right word to the left. The flows involving label words are highlighted. Label words gather information from demonstrations in shallow layers, which is then extracted in deep layers for final prediction.

Figure 2: Illustration of our hypothesis. In shallow layers, label words gather information from demonstrations to form semantic representations for deeper processing, while deep layers extract and utilize this information from label words to formulate the final prediction.

Figure 3: Relative sizes of $S_{wp}$, $S_{pq}$, and $S_{ww}$ in different layers on SST-2 and AGNews. Results of other datasets can be found in Appendix B. Initially, $S_{wp}$ occupies a significant proportion, but it gradually decays over layers, while $S_{pq}$ becomes the dominant one.

Figure 4: The impact of isolating label words versus randomly isolating non-label words within the first or last 5 layers. Isolating label words within the first 5 layers exerts the most substantial impact, highlighting the importance of shallow-layer information aggregation via label words.
As Kobayashi et al. (2020) point out, attention should be multiplied by the norm of the key vector to yield 'more interpretable attention'. The AUC-ROC metric can implicitly account for these factors, thus allowing us to uncover the correlation more effectively. (2) The proportion of different labels output by the model may be unbalanced. Using the AUC-ROC metric can help mitigate this issue, reducing disturbances caused by class imbalance.
Considering the residual mechanism of transformers, we can view each layer's hidden state as the cumulative effect of all prior layer calculations. To quantify the accumulated contribution of the first $l$ layers to model prediction, we introduce $R_l$:

$$R_l = \frac{\sum_{i=1}^{l} (\mathrm{AUCROC}_i - 0.5)}{\sum_{i=1}^{N} (\mathrm{AUCROC}_i - 0.5)}. \quad (5)$$

Figure 5: $\mathrm{AUCROC}_l$ and $R_l$ of each layer in GPT models. The result is averaged over SST-2, TREC, AGNews, and EmoC. $\mathrm{AUCROC}_l$ reaches 0.8 in deep layers, and $R_l$ increases mainly in the middle and later layers.

Figure 6: Predicted and real confusion matrix on TREC. (a) Confusion matrix of $\mathrm{Confusion}_{ij}^{\mathrm{pred}}$. (b) Confusion matrix of $\mathrm{Confusion}_{ij}$. We set undefined diagonals to 1 for better visualization. The heatmaps display similarity in confusing category pairs, particularly in lighter-colored blocks.

Figure 7: Relative sizes of $S_{wp}$, $S_{pq}$, and $S_{ww}$ on TREC and EmoC, which are similar to those on SST-2 and AGNews.

Figure 11: $\mathrm{AUCROC}_l$ and $R_l$ of each layer in GPT models when more demonstrations are employed.

Figure 13: $\mathrm{AUCROC}_l$ and $R_l$ of each layer of LLaMA-33B on SST-2. Still, deep layers display higher relevance to model prediction, reinforcing the idea that the model extracts information from deep-layer anchors for classification.

Table 2: Results of different compression methods on GPT2-XL and GPT-J (averaged over SST-2, TREC, AGNews, and EmoC). Acc. denotes accuracy. The best results are shown in bold. Our method achieves the best compression performance.

Table 3: Acceleration ratios of the Hidden$_{\text{anchor}}$ method.

Table 4: Demonstration templates and label words. Here <S1> represents the demonstration, <S> represents the input to be predicted, and <L> represents the label word corresponding to the demonstration. To save space, we only show one demonstration for each task.

Table 6: Acceleration ratios, $L_{demo}$ and $L_x$.