Hidden State Variability of Pretrained Language Models Can Guide Computation Reduction for Transfer Learning

While transferring a pretrained language model, common approaches conventionally attach their task-specific classifiers to the top layer and adapt all the pretrained layers. We investigate whether one could make a task-specific selection of which subset of the layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g., fine-tuning or adapter-tuning) without sacrificing their performance. We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already "well-specialized" in a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute and doesn't need any training or hyperparameter tuning. It is robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric can yield significantly stronger performance than using the same number of top layers and often match the performance of fine-tuning or adapter-tuning the entire language model.


Introduction
Transfer learning from a pretrained language model (PLM) is now the de-facto paradigm in natural language processing (NLP). The conventional approaches to leveraging PLMs include fine-tuning all the parameters in the language model (LM) and some lightweight alternatives that decrease the number of tuned parameters, such as adapter-tuning (Houlsby et al., 2019; Hu et al., 2022; He et al., 2022) and prefix-tuning (Li and Liang, 2021). These methods have one thing in common: they all involve the entire PLM and attach a classifier to its top layer. However, PLMs were optimized via the language modeling objective, and thus their top layers have been specialized in producing representations that facilitate optimizing that objective. Such a mismatch between the pretraining and fine-tuning objectives poses the following questions: ① Given a pretrained language model and a downstream task, can we measure how "well-specialized" each layer already is in that task, without any task-specific tuning?

* Work done during internship at TTI-Chicago.
② If the answer to ① is yes, can we use the layer-wise "task-specialty" as a guide in improving the computation efficiency of the transfer learning methods such as fine-tuning and adapter-tuning?
In this paper, we take a technically principled approach to investigating the research questions ① and ②. First, we define a metric in section 3.1 to measure the "task-specialty" of each layer in a given PLM. Our task-specialty score is inspired by the neural collapse (NC) phenomenon which has been widely observed in the computer vision community (Papyan et al., 2020): as training converges, the top-layer representations of the images with the same label form an extremely tight cluster. In our setting, we examine the variability of the representations of the linguistic sequences given by each layer of the PLM, and define our layer-wise task-specialty to be the within-class variability normalized by the between-class variability. Computing our metric does not require any training or hyperparameter tuning. Experiments on the GLUE benchmark demonstrate that it is highly correlated with layer-wise probing performance, thus giving a clear "yes" to question ① above.
We propose several layer-selecting strategies in section 3.2 based on our proposed task-specialty metric. Our strategies are complementary to all the major paradigms of transfer learning (such as fine-tuning and adapter-tuning) and thus can take advantage of the state of the art at the time: only the selected layers are tuned (e.g., via fine-tuning or using adapters) such that the computation cost of the tuning methods can be further reduced. Experiments on the GLUE benchmark demonstrate that our proposed strategies are highly effective: under a comparable computation budget, fine-tuning or adapter-tuning the layers selected by our strategies can achieve significantly higher performance than using the layers selected by the widely adopted baseline strategies; it can even often match the performance of fine-tuning or adapter-tuning the entire PLM, which takes 500% more computation cost.
Through extensive ablation studies, we demonstrate the advantages of our proposed task-specialty metric over potential alternatives (such as CCA and mutual information) as well as its robustness to data scarcity and data imbalance.

Technical Background
In this paper, we focus on classification tasks. Technically, each classification task has a corpus of training data {(x_n, y_n)}_{n=1}^N where each x_n = (x_{n,1}, . . ., x_{n,T}) is a sequence of linguistic tokens and each y_n ∈ Y is a discrete class label. Such tasks include:
• Sentiment analysis. Each x is a single sequence of words such as "This movie is fantastic" and y is a sentiment label from {positive, negative}. Thus, sentiment analysis can be cast as a binary classification problem.
• Natural language inference. Each x is of the form "premise [SEP] hypothesis" such as "Fun for adults and children.
[SEP] Fun for only children." where "[SEP]" is a special separator token. The label y ∈ {yes, neutral, no} indicates whether the premise entails the hypothesis. It is a three-class classification problem.

A PLM performs a classification task as follows:
1. It reads each given sequence x_n and embeds it into a series of hidden state vectors h^(ℓ)_{n,t} for layers ℓ = 0, . . ., L, where h^(ℓ)_{n,t} denotes the hidden state of token x_{n,t} given by layer ℓ and x_{n,0} = CLS is a special classification (CLS) token.
2. The top-layer hidden state h^(L)_{n,0} of the CLS token is read by a neural network f followed by a softmax layer, which gives the probability distribution over the target label y ∈ Y:

    p(y | x_n) = softmax(f(h^(L)_{n,0}))_y.

The net f is also called the "classification head".

[Figure 1: An illustration of our variability-based task-specialty metric with hypothetical data. Each dot denotes a two-dimensional hidden state vector and its color denotes its target label. Each colored star denotes the mean vector of its class.]
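Step 2 above can be sketched in a few lines of numpy. This is a minimal illustration; the exact architecture of the head f (here a one-hidden-layer tanh MLP) is our assumption, not specified by the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classification_head(h_cls, W1, b1, W2, b2):
    """Map CLS hidden states h^(L)_{n,0} to p(y | x).

    h_cls: (batch, d) hidden states; returns (batch, |Y|) probabilities.
    The one-hidden-layer tanh MLP is an illustrative choice for f.
    """
    z = np.tanh(h_cls @ W1 + b1)
    return softmax(z @ W2 + b2)

rng = np.random.default_rng(0)
d, num_classes = 8, 3
probs = classification_head(
    rng.normal(size=(4, d)),
    rng.normal(size=(d, d)), np.zeros(d),
    rng.normal(size=(d, num_classes)), np.zeros(num_classes),
)
print(probs.shape)  # (4, 3); each row is a distribution over the 3 labels
```

During transfer learning, the parameters W1, b1, W2, b2 of this head are learned jointly with whichever PLM parameters the chosen method updates.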
Transfer learning maximizes the log probability of the ground-truth label y_n, i.e., log p(y_n | x_n), by learning the parameters of the classification head f as well as certain method-specific parameters:
• Fine-tuning updates all the PLM parameters (Peters et al., 2018; Devlin et al., 2018).
• Adapter-tuning inserts adapters (i.e., small neural networks) into the LM layers and updates the new adapter parameters (Houlsby et al., 2019; Hu et al., 2022; He et al., 2022).
• Prefix-tuning augments the input x with trainable tokens and tunes the new token embeddings (Li and Liang, 2021; Qin and Eisner, 2021; Hambardzumyan et al., 2021).
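For concreteness, a Houlsby-style adapter is a small bottleneck network inserted into each LM layer. The sketch below shows the core computation (down-projection, nonlinearity, up-projection, residual connection); we use ReLU for simplicity, and the variable names are ours.

```python
import numpy as np

def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: project d -> m (m << d), apply a nonlinearity,
    project back m -> d, and add a residual connection so that the adapter
    is near-identity at initialization."""
    z = np.maximum(0.0, h @ W_down + b_down)  # ReLU stand-in for the nonlinearity
    return h + z @ W_up + b_up

rng = np.random.default_rng(0)
d, m = 16, 4                          # hidden size, bottleneck size
h = rng.normal(size=(2, d))           # hidden states of one layer
W_down = rng.normal(size=(d, m)) * 0.01
W_up = np.zeros((m, d))               # zero init => adapter(h) == h exactly
out = adapter(h, W_down, np.zeros(m), W_up, np.zeros(d))
print(np.allclose(out, h))  # True: the frozen PLM is unchanged at init
```

Only the adapter parameters (W_down, b_down, W_up, b_up) are trained; the pretrained weights stay frozen, which is what makes the method parameter-efficient.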

The Method
Our goal is to answer the research questions ① and ② introduced in section 1. That involves finding a layer-specific metric ν^(1), . . ., ν^(ℓ), . . ., ν^(L) where each ν^(ℓ) measures the task-specialty of layer ℓ. Suppose that we use s^(ℓ) to denote the task score that we can achieve by letting the classification head read the layer-ℓ hidden state h^(ℓ)_{n,0} of the CLS token. If ν^(ℓ) is highly (positively or negatively) correlated with s^(ℓ), then the answer to question ① is yes. Answering question ② involves designing ν-based strategies that select a subset of layers to use in transfer learning approaches.
In this section, we introduce our task-specialty metric ν^(ℓ) (section 3.1) along with a few strategies for selecting layers (section 3.2). In section 4, we empirically demonstrate the effectiveness of our proposed metric and strategies.

Hidden State Variability Ratio
For a given task, we define our task-specialty metric ν^(1), . . ., ν^(L) based on the variability of the hidden state vectors that the PLM produces by embedding the training input sequences {x_n}_{n=1}^N. We use hypothetical data to illustrate our intuition in Figure 1: after grouping based on the target labels y_n, the variability of the hidden states within the same group (dots of the same color) measures the difficulty of separating them, while the variability of the mean states (stars of different colors) quantifies how easy it is to tell the different groups apart.
Technically, for each layer ℓ, we first define the sequence-level hidden state h̄^(ℓ)_n for each input x_n to be the average of the hidden states of all the (non-CLS) tokens:

    h̄^(ℓ)_n = (1/T) Σ_{t=1}^{T} h^(ℓ)_{n,t}.

We then group the sequence-level states by their target labels: G_y = {n : y_n = y}. The mean vector of each group is defined as

    h̄^(ℓ)_y = (1/|G_y|) Σ_{n ∈ G_y} h̄^(ℓ)_n,

and these correspond to the stars in Figure 1. The mean vector between classes is defined as

    h̄^(ℓ) = (1/|Y|) Σ_{y ∈ Y} h̄^(ℓ)_y.

Then the within-group variability Σ^(ℓ)_w and between-group variability Σ^(ℓ)_b are defined using the sequence-level states and mean states:

    Σ^(ℓ)_w = (1/N) Σ_{y ∈ Y} Σ_{n ∈ G_y} (h̄^(ℓ)_n − h̄^(ℓ)_y)(h̄^(ℓ)_n − h̄^(ℓ)_y)^⊤,   (1)
    Σ^(ℓ)_b = (1/|Y|) Σ_{y ∈ Y} (h̄^(ℓ)_y − h̄^(ℓ))(h̄^(ℓ)_y − h̄^(ℓ))^⊤.   (2)

Σ^(ℓ)_w and Σ^(ℓ)_b are a lot like covariance matrices since they measure the deviation from the mean vectors. Finally, we define our task-specialty metric to be the within-group variability Σ^(ℓ)_w scaled and rotated by the pseudo-inverse of the between-class variability Σ^(ℓ)_b:

    ν^(ℓ) = trace(Σ^(ℓ)_w (Σ^(ℓ)_b)^†).   (3)

The pseudo-inverse in equation (3) is why we use the average state as our sequence-level representation: averaging reduces the noise in the state vectors and thus leads to stable computation of ν^(ℓ). We believe that the layers with small ν^(ℓ) are likely to do better than those with large ν^(ℓ) when transferred to the downstream task. Our belief stems from the following key insights.
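The variability ratio above can be computed in a few lines of numpy. The sketch below is our own rendering of the metric, not the authors' released code.

```python
import numpy as np

def variability_ratio(H, y):
    """Task-specialty nu = trace(Sigma_w @ pinv(Sigma_b)) for one layer.

    H: (N, d) sequence-level hidden states (token-averaged); y: (N,) labels.
    The global mean is the mean of class means, which is robust to imbalance.
    """
    classes = np.unique(y)
    class_means = np.stack([H[y == c].mean(axis=0) for c in classes])
    global_mean = class_means.mean(axis=0)

    # Within-class scatter: deviations of states from their class mean.
    Sw = sum((H[y == c] - mu).T @ (H[y == c] - mu)
             for c, mu in zip(classes, class_means)) / len(H)
    # Between-class scatter: deviations of class means from the global mean.
    D = class_means - global_mean
    Sb = D.T @ D / len(classes)

    return np.trace(Sw @ np.linalg.pinv(Sb))

rng = np.random.default_rng(0)
# Two tight, well-separated clusters => small nu; noisier clusters => larger nu.
tight = np.vstack([rng.normal(0, 0.1, (50, 4)), rng.normal(5, 0.1, (50, 4))])
loose = np.vstack([rng.normal(0, 2.0, (50, 4)), rng.normal(5, 2.0, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
print(variability_ratio(tight, labels) < variability_ratio(loose, labels))  # True
```

In practice, H would hold the token-averaged hidden states of one PLM layer over the task's training corpus, and the computation is repeated for every layer ℓ.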
Remark-I: neural collapse. Our proposed metric is mainly inspired by the neural collapse (NC) phenomenon: when training a deep neural model to classify images, one can see that the top-layer representations of the images with the same label form an extremely tight cluster as training converges. Extensive theoretical and empirical studies show that a lower within-class variability can indicate better generalization (Papyan et al., 2020; Hui et al., 2022; Galanti et al., 2022). Thus we examine the variability of the layer-wise representations of the linguistic sequences and hope that it can measure the task-specialty of each layer of the given PLM. Our metric is slightly different from the widely accepted neural collapse metric; please see Appendix A.1 for a detailed discussion.

Remark-II: signal-to-noise ratio. In multivariate statistics (Anderson, 1973), trace(Σ_w Σ_b^†) measures the inverse signal-to-noise ratio for classification problems, and thus a lower value indicates a lower chance of misclassification. Intuitively, the between-class variability Σ_b is the signal which one can use to tell different clusters apart, while the within-class variability Σ_w is the noise that makes the clusters overlap and thus the separation difficult; see Figure 1 for examples.

Remark-III: linear discriminant analysis. A low ν implies that it is easy to correctly classify the data with linear discriminant analysis (LDA) (Hastie et al., 2009). Technically, LDA assumes that the data of each class is Gaussian-distributed, and it classifies a new data point h by checking how close it is to each mean vector h̄_y, scaled by the covariance matrix Σ which is typically shared across classes. Though our metric does not make the Gaussian assumption, a low ν suggests that the class means h̄_y are far from each other relative to the within-class variation Σ_w, meaning that the decision boundary of LDA would tend to be sharp. In fact, our Σ_w is an estimate of the Gaussian covariance matrix Σ of LDA.

Layer-Selecting Strategies
[Figure 2: We present our strategies with a toy model of L = 5 layers and ℓ* = 3. The green layers are tuned (e.g., fine-tuned or adapter-tuned) during the task-specific training while the grey layers are not. The white layers are dropped from the tuning and inference procedures, thus further reducing the computation and memory cost.]

Suppose that our metric ν can indeed measure the task-specialty of each layer. Then it is natural to investigate how the knowledge of layer-wise task-specialty can be leveraged to improve transfer learning methods; that is what question ② in section 1 is concerned with. Recall from section 2 that the major paradigms of transfer learning use all the layers of the given PLM by default. We propose to select a subset of the layers based on
their task-specialty, which will benefit all the paradigms of transfer learning: they may be able to use only the selected layers yet still achieve strong performance. Only using the selected layers results in a significant reduction of computation cost:
• In fine-tuning, only the parameters of the selected layers are updated.
• In adapter-tuning, adapters are added only to the selected layers, not all the layers.
• In prefix-tuning, we "deep-perturb" fewer layers.
A smaller number of task-specific parameters means not only less training cost but also less storage cost and less inference cost.

Strategy-I: ℓ*-down. We use ℓ* to denote the layer which achieves the best task-specialty: i.e., ℓ* def= argmin_ℓ ν^(ℓ). Our first strategy is motivated by the following intuition: if layer ℓ* has already been well-specialized in the given task, then it may suffice to just mildly tune it along with a few layers below it. Meanwhile, we may keep the classification head on the top layer L or move it to the best-specialized layer ℓ*: the former still utilizes the higher layers in training and inference; the latter does not and thus results in even less computation and memory cost.
Technically, we use (ℓ_bottom, ℓ_top, ℓ_head) to denote the strategy of selecting the layers ℓ_bottom, ℓ_bottom + 1, . . ., ℓ_top and connecting the classification head to layer ℓ_head. Then all the instances of our first strategy can be denoted as (ℓ_bottom, ℓ*, L) or (ℓ_bottom, ℓ*, ℓ*) with appropriately chosen ℓ_bottom. Figures 2a-2c illustrate a few specific instances of our ℓ*-down strategy.

Strategy-II: ℓ*-up. As an alternative to the ℓ*-down strategy, our second strategy is to select the layers above the best-specialized layer ℓ*; we call it the ℓ*-up strategy. Intuitively, if layer ℓ* is already well-specialized in the given task, then what we need is perhaps just a powerful classification head. That is, we can regard the higher layers ℓ* + 1, . . ., L along with the original classification head f as a new "deep" classification head and then tune it to better utilize the layer-ℓ* representations.
In principle, all the instances of our second strategy can be denoted as (ℓ* + 1, ℓ_top, ℓ_top) or (ℓ* + 1, ℓ_top, L), since we may select the layers up through ℓ_top ≤ L and move the classification head f to layer ℓ_top. Figure 2d shows an instance of our ℓ*-up strategy.
Note that our (ℓ_bottom, ℓ_top, ℓ_head) notation can apply to the conventional layer-selecting strategies as well. For example, (1, L, L) denotes the naive option of tuning all the layers of the given PLM; (L − 2, L, L) denotes a baseline method of only selecting the top three layers.
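The (ℓ_bottom, ℓ_top, ℓ_head) notation can be made concrete with a small helper; the function and strategy names below are ours, for illustration only.

```python
def select_layers(L, l_star, strategy, l_bottom=1):
    """Return (l_bottom, l_top, l_head) for the strategies in section 3.2.

    L: number of layers; l_star: argmin_l nu(l). Layers above l_head are
    dropped entirely when the head moves down, saving compute and memory.
    """
    if strategy == "down-keep-head":   # tune l_bottom..l*, head stays on top
        return (l_bottom, l_star, L)
    if strategy == "down-move-head":   # tune l_bottom..l*, head moves to l*
        return (l_bottom, l_star, l_star)
    if strategy == "up":               # layers above l* act as a deep head
        return (l_star + 1, L, L)
    if strategy == "full":             # conventional baseline: tune everything
        return (1, L, L)
    raise ValueError(f"unknown strategy: {strategy}")

# Toy model from Figure 2: L = 5 layers, l* = 3.
print(select_layers(5, 3, "down-move-head", l_bottom=2))  # (2, 3, 3)
print(select_layers(5, 3, "up"))                          # (4, 5, 5)
```

Whichever transfer-learning method is used, only the layers in the returned range are tuned, and the classification head reads layer l_head.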

Experiments
We evaluated the effectiveness of our task-specialty metric along with our layer-selecting strategies through extensive experiments on six classification tasks of the GLUE benchmark (Wang et al., 2019). The tasks are: CoLA, MNLI, MRPC, QNLI, QQP, and SST-2. All of them are sequence-level classification tasks related to natural language understanding, thus being very different from how language models are pretrained.
We chose the widely adopted RoBERTa model (Liu et al., 2019b) as our PLM and used the pretrained roberta-large instance (355M parameters) downloaded from HuggingFace (Wolf et al., 2020). Our experiments are mainly conducted with this model. We also experimented with DeBERTa (He et al., 2020) to investigate whether our methods generalize across models: those results are in Appendix C.2 and are similar to the RoBERTa results. Prior work (Mosbach et al., 2020a) found that fine-tuning RoBERTa on GLUE could be unstable, so we ran each of our experiments with five random seeds and report the means and standard errors. Experiment details (e.g., hyperparameters) can be found in Appendix B.
Our code is implemented in PyTorch (Paszke et al., 2017) and relies heavily on HuggingFace. It will be released after the paper is published. Implementation details can be found in Appendix B.2.

Answer to Question ①: Hidden State Variability Ratio Measures Task-Specialty

For each task, we computed the task-specialty ν^(1), . . ., ν^(24) (defined in section 3.1) for all 24 layers. They are plotted as blue curves in Figure 3: as we can see, the middle layers (10 ≤ ℓ ≤ 15) tend to have the lowest ν on most tasks.
Then we probed the pretrained RoBERTa: for each layer ℓ, we trained a classification head (see section 2) that reads the hidden state h̄^(ℓ)_n and evaluated it on the held-out development set. The probing performance is plotted as red curves in Figure 3: as we can see, the middle layers (10 ≤ ℓ ≤ 15) tend to achieve the highest scores.
How strongly is the probing performance correlated with the task-specialty metric ν? To answer that question, we regressed the probing performance on ν and found that they are highly correlated. All the slope coefficients are negative, meaning that a low ν predicts a high score. All the R² are high, meaning that a large fraction of the probing performance variation can be explained by the variation in the task-specialty score. Remarkably, R² = 0.97 on QQP. Detailed results (e.g., fitted lines) are in the appendix.

This set of experiments answers our question ① (section 1): yes, we can measure the task-specialty of each layer of a given PLM on a given task, and our metric ν doesn't require task-specific tuning.
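The regression analysis can be reproduced with ordinary least squares. Below is a generic sketch; the layer-wise numbers are made up for illustration and are not the paper's actual ν values or probing scores.

```python
import numpy as np

def regress(nu, score):
    """Fit score ~ a*nu + b; return the slope a and the coefficient R^2."""
    nu, score = np.asarray(nu, float), np.asarray(score, float)
    a, b = np.polyfit(nu, score, 1)
    residual = score - (a * nu + b)
    r2 = 1.0 - residual.var() / score.var()
    return a, r2

# Hypothetical layer-wise values: lower nu goes with higher probing score.
nu = np.array([9.0, 7.5, 4.0, 2.0, 2.5, 5.0, 8.0])
score = np.array([0.62, 0.66, 0.78, 0.86, 0.84, 0.74, 0.65])
a, r2 = regress(nu, score)
print(a < 0, r2 > 0.9)  # negative slope and high R^2, as reported in the paper
```

A negative slope with high R² is exactly the pattern that licenses the "yes" answer to question ①.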
Furthermore, we fully fine-tuned RoBERTa on each task and obtained the ν and probing performance of the fine-tuned models. The results are presented in Figures 12 and 13 of Appendix B.4. After full fine-tuning, a strong correlation between the probing performance and the task-specialty ν is still observed, but now the higher layers have lower ν and stronger probing performance. That is because the parameters of higher layers have received stronger training signals back-propagated from the classification heads, which aim to specialize the full model on the tasks.

Answer to Question ②: Task-Specialty Helps Layer Selection

As discussed in section 3.2, our task-specialty-based layer-selecting strategies are supposed to help improve the computational efficiency of transfer learning, and they are compatible with all the major paradigms of transfer learning. We evaluated our strategies by pairing them with the widely adopted fine-tuning and adapter-tuning methods. In this section, we show and discuss our empirical results.

Layer selection for fine-tuning. For each task, we experimented with our ℓ*-down and ℓ*-up strategies. The best-specialized layers ℓ* are those with the lowest task-specialty ν^(ℓ). They are:

Task: CoLA MNLI MRPC QNLI QQP SST-2
ℓ*:     18   14   14   13   14    16

Our experiment results are shown in Figure 4. For the ℓ*-down strategy, we experimented with (ℓ_bottom, ℓ*, ℓ_head) where ℓ_bottom ∈ {1, ℓ* − 2, ℓ* − 1, ℓ*} and ℓ_head ∈ {ℓ*, L}; they are plotted as blue dots. For the ℓ*-up strategy, we experimented with (ℓ* + 1, L, L); they are shown as green dots. For comparison, we also experimented with the conventionally adopted baseline strategies (1, L, L) and (ℓ_bottom, L, L) with ℓ_bottom ∈ {L − 2, L − 1, L}; they are red dots. The actual performance scores of the dots along with standard errors are in Table 6 of Appendix C.1.
As shown in Figure 4, when we are constrained by a budget of tuning only ≤ 3 layers, our strategies almost always lead to significantly higher performance than the baseline strategies. Moreover, the (ℓ_bottom, ℓ*, L) strategies consistently outperform the (ℓ_bottom, ℓ*, ℓ*) strategies across all the tasks. It means that the higher layers ℓ* + 1, . . ., L are still very useful for performing well on the given task even if their parameters are not updated: they are perhaps able to help shape the training signals back-propagated through the tuned layers.
Furthermore, the performance of only tuning the selected layers is often close to that of full fine-tuning. Remarkably, on QQP, our (1, ℓ*, ℓ*) strategy even matches the performance of full fine-tuning yet only uses the bottom 14 layers; that is, it reduces the computation and storage cost by more than 40%. A detailed discussion of wall-clock time savings can be found in Appendix C.1.
Because most ℓ* are in the middle of the PLM, we experimented with another baseline that directly uses the middle layer ℓ_mid = 13 of the 24-layer RoBERTa-large. This baseline is only implemented on CoLA and SST-2 because the ℓ* of the other tasks is already the same as or very close to ℓ_mid. We found that tuning around layer ℓ* always outperforms tuning around ℓ_mid, although the improvement is not large. Result details are in Table 8 of Appendix C.1.
Layer selection for adapter-tuning. For adapter-tuning, we implemented the earliest adapter architecture, designed by Houlsby et al. (2019). We experimented with the same set of our strategies and baseline strategies as in the fine-tuning experiments. The results are plotted in Figure 5. Detailed numbers including standard errors are listed in Table 7 of Appendix C.1. As with fine-tuning, in most cases, our strategies are significantly better than the baseline strategies under the budget of tuning only ≤ 3 layers. The (ℓ_bottom, ℓ*, L) strategies also tend to outperform the (ℓ_bottom, ℓ*, ℓ*) strategies.
What's impressive is that the performance of only adapter-tuning the selected layers is often comparable to that of full adapter-tuning. Surprisingly, on MNLI and QNLI, our (ℓ* − 2, ℓ*, L) and (ℓ* − 1, ℓ*, L) strategies match full adapter-tuning while needing only 12% and 8%, respectively, of the trainable parameters of full adapter-tuning.
We also implemented the middle-layer baseline for adapter-tuning on CoLA and SST-2. The results are listed in Table 9 of Appendix C.1. In most cases, ℓ* outperforms ℓ_mid with the same number of adapted layers.
Interestingly, Houlsby et al. (2019) also found that not every layer in the PLM is equally important for adapter-tuning. For example, they experimented with a 12-layer BERT model and found that, on the CoLA task, ablating the layer-7 adapters leads to a much larger performance drop than ablating those of the top three layers.

Is Our Metric the Only Option?
In this section, we discuss a few potential alternatives to our proposed task-specialty metric ν. The introduction of the alternatives is brief and focused only on their definitions and intuitions; more details can be found in Appendix C.7.

Canonical correlation analysis. A potential alternative to our proposed metric is the canonical correlation between the hidden states h̄_n and the class labels y_n. Canonical correlation analysis (CCA) involves learning a series of projection parameters (v_1, w_1), . . ., (v_J, w_J) such that each (v_j, w_j) maximizes the correlation between v_j^⊤ h̄_n and w_j^⊤ y_n under certain constraints; here y_n is a one-hot vector with its y_n-th entry being one.
Numerical rank. Another potential alternative is the rank-based metric ϱ^(ℓ) proposed by Zhou et al. (2022). It is also inspired by the neural collapse phenomenon. The intuition is: if the sequence-level representations of a class have collapsed toward their class mean h̄_y, then the matrix of those representations has a low numerical rank. The correlation between ϱ^(ℓ) and the probing performance is presented in Table 1.

Analysis. As we can see in Table 1, our variability-based metric ν and the CCA score both exhibit a high correlation (above 0.85) with the probing performance. The correlation of CCA is moderately better than ours, and it is higher than 0.9 on all the tasks. However, CCA optimization depends on the regularization terms (see Appendix C.7) and thus requires cross-validation to choose the right set of hyperparameters. This makes it more involved to compute than our proposed variability metric. Technically, CCA requires a "training procedure" to fit a set of parameters (v_j and w_j), though the computation of that part is as light as calculating our metric. Both CCA and our metric are scale-invariant. Overall, the rank-based metric ϱ has a lower correlation. That is because it only considers the within-group properties but ignores the between-group properties.
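For intuition, one simple proxy for the numerical rank of a representation matrix is its stable rank. Note that this is our illustrative choice; the exact rank measure used by Zhou et al. (2022) may differ.

```python
import numpy as np

def stable_rank(H):
    """Stable (soft numerical) rank ||H||_F^2 / ||H||_2^2 of centered features:
    close to 1 when the representations collapse onto a single direction."""
    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
    return (s ** 2).sum() / (s[0] ** 2)

rng = np.random.default_rng(0)
collapsed = np.outer(rng.normal(size=200), np.ones(10))  # rank-1 after centering
spread = rng.normal(size=(200, 10))                      # close to full rank
print(stable_rank(collapsed))      # numerically close to 1
print(stable_rank(spread) > 5)     # True
```

A purely within-group quantity like this says nothing about how far apart the class means are, which is why a rank-only metric can correlate worse with probing performance than the ratio ν.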

Ablation Studies and Analysis
Averaged state vs. CLS state. The ν that we have reported so far are all computed using the average hidden state of each x_n as defined in section 3.1. We also experimented with the CLS token hidden states, i.e., h^(ℓ)_{n,0}, and found that the computation of ν became much less stable: e.g., on SST-2, the metrics ν^(ℓ) of the pretrained RoBERTa are below 50 for most of the layers but can jump above 500 for a couple of layers. Intuitively, one would like to trust a ν^(ℓ) curve that is at least locally smooth. Technically, local smoothness means a small local second-order derivative, which can be estimated by finite differences. Therefore, we measure the local smoothness of a ν^(ℓ) curve by its average absolute second-order finite difference ζ. As the results show, the ζ computed using the CLS token states is dramatically larger than that of using the averaged states on all the tasks except CoLA. It means that using the averaged states gives us a more trustworthy ν^(ℓ) curve. Thus, we used the average hidden states throughout our experiments; we made this choice before seeing any task performance.
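Under our reading, ζ is the average absolute second-order finite difference of the ν^(ℓ) curve; the sketch below reflects that reading, and the exact normalization is our assumption.

```python
import numpy as np

def local_smoothness(nu):
    """zeta: mean |nu(l-1) - 2*nu(l) + nu(l+1)| over interior layers.
    Small zeta means the curve is locally smooth, hence more trustworthy."""
    nu = np.asarray(nu, dtype=float)
    return np.abs(nu[:-2] - 2 * nu[1:-1] + nu[2:]).mean()

smooth = np.array([10.0, 9.0, 8.0, 7.0, 6.0, 5.0])    # linear curve: zeta == 0
spiky = np.array([10.0, 9.0, 500.0, 7.0, 6.0, 5.0])   # one jump: zeta large
print(local_smoothness(smooth))       # 0.0
print(local_smoothness(spiky) > 100)  # True
```

A jumpy curve like the CLS-based one produces a large ζ, matching the instability the text describes.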
The actual ν^(ℓ) curves computed using the CLS states are presented in Figure 14 of Appendix C.5.

Robustness to data imbalance. The datasets of the GLUE tasks are almost perfectly balanced. So it remains a question whether our proposed task-specialty metric ν is still effective when the data is imbalanced. To answer this question, we synthesized a series of data-imbalanced experiments on SST-2. In each experiment, we randomly sampled N = 20000 training examples with a portion p from the negative class, where p ∈ {0.5, 0.25, 0.1, 0.05}.
We did the same analysis on MNLI; see results in Appendix C.6.
For each experiment, we plot the ν^(ℓ) curve in Figure 6a. We also computed the local smoothness score ζ and the absolute value of the correlation ρ(ν, s) (defined in section 4.3) and plot them in Figure 6b. Interestingly, the ζ in the case of p = 0.25 is as low as that of p = 0.5, though it becomes much larger when the data is extremely imbalanced (e.g., p < 0.1). Moreover, the ρ(ν, s) is still above 0.85 even when p = 0.1. These findings mean that our task-specialty metric is reasonably robust to data imbalance. More details are in Appendix C.6.

Robustness to data scarcity. We also examined the robustness of our metric to data scarcity. On SST-2, we conducted a series of experiments with a varying number of training samples: N ∈ {40000, 20000, 5000, 200}. For each experiment, we made the sampled dataset label-balanced. See the results on MNLI in Appendix C.6.
As in the data-imbalance experiments, we plot ν, ζ, and |ρ| in Figure 7. As we can see, for N ≥ 5000, the ν curves almost perfectly overlap, and ζ and |ρ| change only slightly with N. It means that the ν values are trustworthy as long as they are computed using thousands of examples. In the extremely data-scarce case of N = 200, the ν curve becomes untrustworthy: after all, it is extremely difficult to estimate Σ_w and Σ_b with so few samples. However, even in the case of N = 200, the middle layers tend to have the lowest ν^(ℓ), which agrees with the data-adequate cases. More details are in Appendix C.6.

Related Work
Analysis of PLMs. Representations from PLMs have been widely studied using probing methods. There is evidence showing that pretrained features from intermediate layers are more transferable (Tenney et al., 2019; Rogers et al., 2020). Additionally, Voita et al. (2019a) show that the masked language modeling objective introduces an autoencoder-like structure in the PLM, meaning that the behaviour of the topmost layers is similar to that of the bottom layers. Studies have also shown that the effects of fine-tuning are non-uniformly spread across different layers (Peters et al., 2019; Merchant et al., 2020; Liu et al., 2019a; Phang et al., 2021). Those findings challenge the default choice of tuning the entire PLM when adapting it to downstream tasks. While probing tools have been used to study task-specific layer importance (Tamkin et al., 2020; Mosbach et al., 2020b), the probing paradigm is parametric, hard to interpret, and known to be unreliable (Ravichander et al., 2020; Belinkov, 2022; Hewitt and Liang, 2019; Voita and Titov, 2020; Pimentel et al., 2020). Instead, we propose to measure the layer-wise task-specialty of PLMs using a non-parametric tool that quantifies the task-specific variability of the hidden features.

PLM-based transfer learning. PLMs are widely used in the transfer learning setup to improve performance on a variety of downstream tasks. Typically, a task-specific classification layer is added on top of the network and the entire network is trained to minimize the supervised task loss. Recently there has been growing interest in parameter-efficient alternatives to this approach. A subset of these methods add a few new trainable parameters to the PLM while the pretrained weights are frozen and thus kept consistent across tasks (Houlsby et al., 2019; Guo et al., 2020; Li and Liang, 2021). Another set of methods either reparameterizes PLMs (Hu et al., 2022) or chooses only a subset of the PLM parameters (Voita et al., 2019b; Sajjad et al., 2020; Gordon et al., 2020; Zaken et al., 2021), thus reducing the number of trainable parameters for transfer learning. In particular, the "early exit" methods (Xin et al., 2020, 2021; Zhou et al., 2020) allow samples to pass through only part of the PLM if the prediction from a middle layer is trusted by the off-ramp following that layer. This approach can reduce inference cost but increases training cost because it adds a classification head to each hidden layer. Our technique can reduce both training and inference cost by tuning fewer layers and moving the classification head to an intermediate layer.
In this work we leverage our proposed metric of layer-wise task-specificity to make an informed decision about which layers to retain or drop and tune or freeze on the downstream task. There has been work studying the effect of dropping PLM layers (Sajjad et al., 2020; Phang et al., 2021; Tamkin et al., 2020), but their decisions are driven by the performance on the downstream task itself, and thus every new task requires a slew of ablation studies to find the applicability of each layer. In contrast, our task-specificity measure is completely parameter-free, is agnostic to the specific transfer learning approach (fine-tuning, adapter-tuning, prefix-tuning), and thus complements the existing parameter-efficient approaches (He et al., 2022).
Neural collapse. Our task-specialty metric ν is inspired by the neural collapse phenomenon. A surging line of work has been demystifying the training, generalization, and transferability of deep networks through NC (Kothapalli et al., 2022). For training, recent works showed that NC happens for a variety of loss functions such as cross-entropy (Papyan et al., 2020; Zhu et al., 2021; Fang et al., 2021; Ji et al., 2022), mean-squared error (Mixon et al., 2020; Han et al., 2022; Zhou et al., 2022; Tirer and Bruna, 2022), and supervised contrastive loss (Graf et al., 2021). For generalization, Galanti et al. (2022) and Galanti (2022) show that NC also happens asymptotically on test data drawn from the same distribution, but not for finite samples (Hui et al., 2022); Hui et al. (2022) and Papyan (2020) showed that the variability collapse happens progressively from shallow to deep layers; Ben-Shaul and Dekel (2022) showed that test performance can be improved by enforcing variability collapse on features of intermediate layers; Xie et al. (2022) and Yang et al. (2022) showed that fixing the classifier as simplex ETFs improves test performance on imbalanced training data and long-tailed classification problems. For transferability, Kornblith et al. (2021) showed that there is an inherent tradeoff between variability collapse and transfer accuracy.

Conclusion
In this paper, we presented a comprehensive study on how to measure the task-specialty of each layer of a pretrained language model, and how to leverage that knowledge to improve transfer learning. Our proposed layer-wise task-specialty metric is based on the variability of the hidden states of each layer given a task-specific corpus. The metric is highly correlated with layer-wise probing performance, yet it is cheap to compute and requires no training or hyperparameter tuning. Based on the metric, we proposed several strategies for selecting a subset of the layers to use in PLM-based transfer learning methods. Extensive experiments demonstrate that our strategies help fine-tuning and adapter-tuning achieve strong performance under a greatly reduced computation budget. Our strategies are complementary to all the major paradigms of PLM-based transfer learning and thus can also benefit other methods.

Limitations
Our main technical limitation is that the proposed metric only measures task specificity from the perspective of variability. Thus, it might underestimate the task specificity of a layer whose features have other kinds of useful spatial structure despite large within-class variability. For example, concentric rings are separable but not linearly separable; see Figure 3 in Hofmann (2006). Although our proposed ν has been a good predictor of final performance in all our experiments (section 4), it is still possible that, for some tasks and models, layers with high ν can actually achieve good performance. Fortunately, such clustering structure is rare in the hidden space of deep neural networks.
Another technical limitation is that our proposed hidden state variability ratio only works for classification tasks.An open research question is how to generalize it to regression or generation tasks.

A Metric Details
A.1 Compared to the Neural Collapse Metric
Our task-specialty metric ν defined in section 3.1 closely resembles the neural collapse metric proposed by Papyan et al. (2020), except that they assume a balanced dataset. First, they define $\bar{h}^{(\ell)}$ to be the global mean vector, i.e., $\bar{h}^{(\ell)} = \frac{1}{N}\sum_{n} h^{(\ell)}_n$, whereas we define it to be the mean of the within-group mean vectors. In data-balanced cases, the two definitions are equivalent. But when the data is imbalanced, our version is preferable since it prevents $\bar{h}^{(\ell)}$ from being dominated by the group with the largest number of samples.
If we strictly followed Papyan et al. (2020), our within-class variability $\Sigma^{(\ell)}_w$ and between-class variability $\Sigma^{(\ell)}_b$ would be defined as
$$\Sigma^{(\ell)}_w = \frac{1}{N}\sum_{y}\sum_{h \in G^{(\ell)}_y}\big(h - \bar{h}^{(\ell)}_y\big)\big(h - \bar{h}^{(\ell)}_y\big)^\top, \qquad \Sigma^{(\ell)}_b = \frac{1}{N}\sum_{y}\big|G^{(\ell)}_y\big|\,\big(\bar{h}^{(\ell)}_y - \bar{h}^{(\ell)}\big)\big(\bar{h}^{(\ell)}_y - \bar{h}^{(\ell)}\big)^\top.$$
Then the within-class variability $\Sigma^{(\ell)}_w$ would also be dominated by the group with the largest number of samples. In contrast, our definition in section 3.1 scales each $\sum_{h \in G^{(\ell)}_y}(h - \bar{h}^{(\ell)}_y)(h - \bar{h}^{(\ell)}_y)^\top$ term by $1/|G^{(\ell)}_y|$ before taking the outer sum over $y$, thus being more robust to data imbalance.
To verify this intuition, we conducted a series of data-imbalanced experiments on SST-2 as in section 4.4. In each experiment, we randomly sampled N = 20000 training examples with a proportion p drawn from the negative class, where p ∈ {0.5, 0.1, 0.05}. We then constructed the ν curves and plotted them in Figure 8: solid curves use our formulas from section 3.1 while dashed curves use the formulas that strictly follow Papyan et al. (2020). When the data is balanced, the solid and dashed curves coincide exactly. When the data is imbalanced, the solid curves stay close to each other while the dashed curves move apart. This illustrates that our formulas are more robust to data imbalance.
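To make the computation concrete, here is a minimal Python sketch of the variability ratio. It is an illustrative simplification (not our released code): it works with the traces of the scatter matrices, so everything reduces to squared Euclidean distances, and the function names are our own.

```python
# Sketch of the hidden-state variability ratio from section 3.1 (trace form).
# `hidden_states` is a list of same-length feature vectors for one layer;
# `labels` gives each example's class.

def mean_vec(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def variability_ratio(hidden_states, labels):
    """Within-class variability relative to between-class variability.

    Each class's scatter is normalized by its own size |G_y| before
    averaging over classes, which keeps the metric robust to imbalance.
    """
    groups = {}
    for h, y in zip(hidden_states, labels):
        groups.setdefault(y, []).append(h)
    class_means = {y: mean_vec(g) for y, g in groups.items()}
    # Global mean = mean of the per-class means (not of all samples).
    global_mean = mean_vec(list(class_means.values()))
    within = sum(
        sum(sq_dist(h, class_means[y]) for h in g) / len(g)
        for y, g in groups.items()
    ) / len(groups)
    between = sum(
        sq_dist(m, global_mean) for m in class_means.values()
    ) / len(class_means)
    return within / between
```

A layer with tightly clustered classes yields a small ratio (well-specialized); overlapping classes yield a large one.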

B Experiment Details
For the pretrained language model, we use the implementation and pretrained weights (roberta-large with 24 layers and 355M parameters) from Huggingface (Wolf et al., 2020). We follow previous work (Wang et al., 2019) and use different evaluation metrics for different tasks:
• On CoLA, we use the Matthews correlation coefficient.
• On MRPC and QQP, we use F1 score.
• On the others, we use classification accuracy.

B.1 Training Details
We tune only the learning rate, separately for each task, strategy, and tuning method. The number of epochs is 10 for full fine-tuning on MNLI and QQP, and 20 for all other experiments. All other hyperparameters are shared across experiments. We use the AdamW optimizer with a linear learning rate scheduler and 6% warm-up steps. We set the dropout rate to 0.1, the weight decay to 0, and the batch size to 8. We evaluate on the validation set once per epoch and report the best result. We run the experiments on Nvidia RTX A4000 and GeForce RTX 2080 Ti GPUs. We use the standard splits; the datasets can be downloaded at https://huggingface.co/datasets/glue.
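As a concrete illustration, the schedule above (linear warm-up over the first 6% of steps, then linear decay to zero) can be written as a plain function; `linear_warmup_decay` below is a hypothetical stand-in for the scheduler we instantiate via HuggingFace, not our training code.

```python
# Sketch of a linear warm-up + linear decay learning-rate schedule.
# Returns the multiplier applied to the base learning rate at `step`.

def linear_warmup_decay(step, total_steps, warmup_frac=0.06):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp linearly from 0 to 1 over the warm-up phase.
        return step / max(1, warmup_steps)
    # Decay linearly from 1.0 at the end of warm-up to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

In practice one would pass such a multiplier to an optimizer wrapper (e.g. a LambdaLR-style scheduler) around AdamW.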

B.2 Implementation Details
Our code is implemented in PyTorch (Paszke et al., 2017) and relies heavily on HuggingFace. It will be released after the paper is published.
For all experiments that require a new task-specific classification head, we rely on the original RoBERTa implementation in Huggingface (Wolf et al., 2020).
For adapter-tuning, we implemented the earliest adapter architecture, designed by Houlsby et al. (2019), and relied on the public implementation provided by He et al. (2022). The bottleneck dimension is 256 and the adapter uses the same initialization as BERT (Devlin et al., 2018). The trainable parameters are those of the inserted adapters, the layer normalization parameters in the selected layers, and the parameters of the classification head.
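For reference, a minimal PyTorch sketch of a Houlsby-style bottleneck adapter is shown below. This is an illustrative re-implementation, not the code of He et al. (2022); in particular, we zero-initialize the up-projection here so the module starts as an identity map (a common choice for adapters), whereas the experiments above use BERT's initialization.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Houlsby-style adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_dim, bottleneck_dim=256):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        # Near-identity initialization: with the up-projection at zero,
        # the adapter initially passes its input through unchanged.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))
```

In adapter-tuning, two such modules are typically inserted per transformer layer (after attention and after the feed-forward block), and only they, the layer norms, and the classifier are trained.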

B.3 Probing Experiments
In section 4, we probed both the pretrained and fine-tuned models. On each GLUE task, we used a fixed learning rate (listed in Table 3) to train a classification head for each layer.

B.4 Fine-Tuning Experiments
For the fine-tuning experiments in section 4.2, we tuned only the learning rate. Ideally, we would have swept a large range of learning rates for all the strategies and found the best learning rate for each; but that would require more computation than we could afford. Our preliminary experiments showed that small learning rates tend to work better when the number of trainable parameters is large, and large learning rates tend to work better when it is small. Therefore, we set a different range of learning rates for each strategy based on its number of trainable parameters. The ranges that we used are in Table 4. For each strategy, on each task, we chose the best learning rate based on performance on the held-out validation set. The (ℓ_bottom, ℓ*, ℓ*) and (ℓ_bottom, ℓ*, L) strategies use the same learning rate as the conventional (ℓ_bottom, L, L) strategy.

B.5 Adapter-Tuning Experiments
For the adapter-tuning experiments in section 4.2, we tuned only the learning rate. For the same reason as discussed in Appendix B.4, we set a different range of learning rates for each strategy based on its number of trainable parameters. The ranges that we used are in Table 5. Again, the (ℓ_bottom, ℓ*, ℓ*) and (ℓ_bottom, ℓ*, L) strategies use the same learning rate as the conventional (ℓ_bottom, L, L) strategy.

C.1 Detailed Numbers of RoBERTa Experiments
As mentioned in section 4.1 and section 4.2, the mean values and standard errors of fine-tuning and adapter-tuning RoBERTa with different strategies are listed in Table 6 and Table 7. The standard errors of our strategies are not significantly higher or lower than those of the baselines, so our strategies do not help solve the stability issue in fine-tuning PLMs.
As discussed in section 4.1 and section 4.2, we also fine-tuned and adapter-tuned the PLM with the middle layer baseline on CoLA and SST-2 and list the results in Table 8 and Table 9. As discussed in section 4.1, we compare the computation cost and storage cost of several strategies on MNLI in Table 10. Keeping only ℓ* = 14 layers in the PLM reduces the inference cost and the number of parameters by 40%. In general, using our method reduces computation, though the actual saving depends on the implementation and the devices; see Table 10 for details of our experiments. For example, when using the ℓ*-up strategies, the most optimized implementation would cache and reuse the outputs of the bottom layers, which would further reduce training and inference cost. We have not implemented this yet, so there is still plenty of room to improve the efficiency over the numbers in Table 10, guided by our experimental insights.

C.2 Experiments on DeBERTa
As discussed in section 4, we also conducted experiments on DeBERTa-base.
We computed the task-specialty metric ν for all 12 layers and plotted it with each layer's probing performance in Figure 9, as we did in Figure 3. For all tasks except SST-2, we observe the same pattern as in RoBERTa: layers with low ν tend to have high probing performance. On SST-2, the task-specialty metric is not negatively correlated with the probing performance. This might be an instance of the case mentioned in section 6, where layers with high ν can also achieve good performance. Because the best layer ℓ* selected by the metric is already the last layer for SST-2, we applied our strategies (ℓ*, ℓ*, ℓ*) and (ℓ*, ℓ*, L) on all tasks except SST-2. We compared them with the baseline (L, L, L) and with full fine-tuning to see whether the metric can make fine-tuning more efficient. The results are plotted in Figure 10 and listed in Table 11. When fine-tuning only 1 layer, our strategy (ℓ*, ℓ*, L) always achieves the best performance and the baseline (L, L, L) is always the worst. On MRPC, QNLI, and QQP, the performance of (ℓ*, ℓ*, L) is even close to that of full fine-tuning with fewer than 10% of the tuning parameters. We also compared with the middle layer baseline on MNLI; the results are listed in Table 12.

… values of N. Some of the previous work, although using CCA for representation learning, follows the same scoring procedure to choose the regularization parameters and the corresponding projection matrices (Wang et al., 2015).
CCA has been previously used to measure the similarity of layer-wise representations within and across neural network models (Raghu et al., 2017), and to measure the similarity of layer-wise word-level representations with off-the-shelf embedding maps (Pasad et al., 2021). Williams et al. (2019) use CCA to correlate grammatical gender and lexical semantics by representing the discrete gender class as a one-hot vector.
Numerical rank. As discussed in section 4.3, the rank-based metric is ϱ^(ℓ), defined below.
Mutual information. Discrete mutual information (MI) measures the mutual dependence between two discrete random variables. We use the MI between the sentence representations and the corresponding GLUE task labels as a measure of layer-wise task specificity. To discretize the continuous-valued sentence representations, we run k-means clustering to obtain discrete clusters, as in Voita et al. (2019a). This measure has previously been used to measure the phone and word content in layer-wise representations of a pretrained speech model (Pasad et al., 2021).
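Once the representations have been discretized into cluster IDs, the MI computation itself is straightforward. Below is a self-contained sketch; the k-means step is omitted, and `discrete_mi` is an illustrative helper rather than part of our codebase.

```python
import math
from collections import Counter

def discrete_mi(cluster_ids, labels):
    """Empirical mutual information (in nats) between two discrete lists."""
    n = len(cluster_ids)
    joint = Counter(zip(cluster_ids, labels))   # joint counts n(c, y)
    count_c = Counter(cluster_ids)              # marginal counts n(c)
    count_y = Counter(labels)                   # marginal counts n(y)
    mi = 0.0
    for (c, y), n_cy in joint.items():
        # p(c,y) * log( p(c,y) / (p(c) * p(y)) ), with counts simplified.
        mi += (n_cy / n) * math.log(n_cy * n / (count_c[c] * count_y[y]))
    return mi
```

When the cluster IDs perfectly track the labels, this equals the label entropy; when they are independent, it is zero, which matches the near-zero numbers we report below.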
In our experiments, we sampled N class-balanced data points. We held out a tenth of these samples and ran the k-means clustering algorithm with C clusters on the remaining data. The categorical ID of each held-out sample is then defined to be the ID of the learned cluster it was assigned to. However, after extensive tuning, we still could not obtain any mutual information numbers that looked reasonably high: all the numbers were close to zero and did not differ much. We believe the difficulty stems from the fact that learning the clusters is unsupervised, and unsupervised learning is known to be difficult. Indeed, if we simply use the class labels y_n as the cluster IDs, we observe a neat clustering; that is why our proposed metric ν is effective. However, it seems extremely difficult to learn that clustering in an unsupervised fashion.

Figure 1: (a) High within-class variability and low between-class variability. (b) Low within-class variability and high between-class variability.

These sequence-level states correspond to the dots in Figure 1. Then we group all the h^(ℓ)_n based on the target labels y_n: G^(ℓ)_y = {h^(ℓ)_n : y_n = y}.

Figure 3: The task-specialty metric (blue) and probing performance (red) of each layer of a pretrained RoBERTa model. Each panel is a GLUE task.

We regressed the probing performance on ν; the regression results are in Figure 11 of Appendix C.3. This set of experiments answers our question ① (section 1): yes, we can measure the task-specialty of each layer of a given PLM on a given task, and our metric ν does not require task-specific tuning. Furthermore, we fully fine-tuned a RoBERTa on each task and obtained the ν and probing performance of the fine-tuned models. The results are presented in Figures 12 and 13 of Appendix C.4. After full fine-tuning, a strong correlation between the probing performance and the task-specialty ν is still observed, but now higher layers have lower ν and stronger probing performance. That is because the parameters of higher layers receive stronger training signals backpropagated from the classification heads, which aim to specialize the full model on the tasks.

Figure 4: Task performance vs. the number of selected layers for fine-tuning. The annotation of each dot is its strategy identifier (ℓ_bottom, ℓ_top, ℓ_head).

Figure 5: Task performance vs. the number of selected layers for adapter-tuning. The annotation of each dot is its strategy identifier (ℓ_bottom, ℓ_top, ℓ_head).

The matrix H^(ℓ)_y, whose columns are the hidden states in group G^(ℓ)_y, will have a low rank since its columns will be similar. The rank of H^(ℓ)_y can be measured by comparing its nuclear norm ∥·∥_* (i.e., the sum of singular values) with its Frobenius norm ∥·∥_F. The rank-based metric ϱ^(ℓ) is the average of this quantity over y.

Figure 6: Results of data-imbalanced experiments on SST-2.
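One standard way to turn these two norms into a rank estimate is the ratio $(\|H\|_*/\|H\|_F)^2$, which lower-bounds the rank by the Cauchy-Schwarz inequality. The sketch below uses this ratio as an assumption on our part; it is not necessarily the exact formula used in the experiments.

```python
import numpy as np

def numerical_rank(H):
    """Nuclear-to-Frobenius rank estimate: (sum s_i)^2 / (sum s_i^2)."""
    s = np.linalg.svd(H, compute_uv=False)  # singular values of H
    nuclear = s.sum()
    frobenius = np.sqrt((s ** 2).sum())
    return (nuclear / frobenius) ** 2
```

For a well-clustered class (columns nearly parallel), the estimate stays near 1; for isotropic columns it approaches the full rank.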

Figure 9: The task-specialty metric (blue) and probing performance (red) of each layer of a pretrained DeBERTa model. Each panel is a GLUE task.

Figure 10: Task performance vs. the number of selected layers for fine-tuning DeBERTa-base. The annotation of each dot is its strategy identifier (ℓ_bottom, ℓ_top, ℓ_head).

C.3 Task-Specialty vs. Probing Performance for Pretrained Models
As discussed in section 4.1, we regressed the probing performance on ν. The regression results are in Figure 11.

C.4 Task-Specialty vs. Probing Performance After Full Fine-Tuning
As discussed in section 4.1, we fully fine-tuned a RoBERTa on each task and obtained the ν and probing performance of the fine-tuned models. The results are presented in Figures 12 and 13.

C.5 About Computing Task-Specialty Using the CLS Token States
As discussed in section 4.4, we computed ν^(ℓ) using the CLS token hidden states and found that


Table 3: Learning rate for probing experiments.

Table 4: Learning rate for fine-tuning experiments.

Table 5: Learning rate for adapter-tuning experiments.

Table 7: Results of adapter-tuning RoBERTa with different strategies.

Table 9: Comparison of adapter-tuning with the middle layer baseline on CoLA and SST-2.

Table 10: Computation cost per epoch for RoBERTa-large fine-tuning experiments on MNLI.

Table 11: Results of fine-tuning DeBERTa with different strategies.

As shown in Table 12, ℓ_mid works better than ℓ* this time.

Table 12: Comparison of fine-tuning DeBERTa-base with the middle layer baseline on MNLI.