An Empirical Study on the Transferability of Transformer Modules in Parameter-efficient Fine-tuning

Parameter-efficient fine-tuning has attracted considerable attention in recent studies. On this subject, we investigate the capability of different transformer modules in transferring knowledge from a pre-trained model to a downstream task. Our empirical results suggest that every transformer module is a winning ticket, in the sense that fine-tuning that specific module while the rest of the network is frozen achieves performance comparable to full fine-tuning. Among the different modules in LMs, LayerNorms exhibit a significant capacity for transfer learning, to the extent that, with only 0.003% updatable parameters in the layer-wise analysis, they can reach acceptable performance on various target tasks. We argue that the performance of LayerNorms could be attributed to their high-magnitude weights compared to other components in a pre-trained model.


Introduction
Fine-tuning is widely used as a procedure to employ the knowledge learned during pre-training of language models for specific tasks (Howard and Ruder, 2018; Peters et al., 2019; Merchant et al., 2020; Zhou and Srikumar, 2022). However, fine-tuning can be a computationally expensive process, given that it usually involves updating all the parameters of transformer-based models, which are often massive in size. Parameter-efficient fine-tuning methods try to ameliorate this by reducing the number of updatable parameters during fine-tuning.
Adapters (Houlsby et al., 2019; Pfeiffer et al., 2020; Wang et al., 2021; Rücklé et al., 2021; Karimi Mahabadi et al., 2021; Hu et al., 2021) try to circumvent this issue by inserting lightweight modules into the transformer blocks; tuning these modules usually results in performance comparable to full fine-tuning, while the number of updatable parameters is significantly lower. Nevertheless, introducing new parameters to an already-large model can be considered a drawback. Another category of parameter-efficient fine-tuning methods is based on the Lottery Ticket Hypothesis (Prasanna et al., 2020), where the goal is to find a small subset of parameters that can compete with the full fine-tuning setting. Various subsets of network parameters have been suggested as the winning ticket, including the connections with high magnitudes (Han et al., 2015), identity mappings (Lin et al., 2020), and dominant dimensions (Guo et al., 2021).
In this paper, we study the ability of different modules of a transformer block to transfer knowledge. Our experiments provide a more comprehensive analysis than the existing work, which usually suggests specific modules as the winning ticket, such as the bias terms (Ben Zaken et al., 2022). Through module-wise fine-tuning, we check whether the winning ticket is a property that can be associated only with some specific modules in the transformer block. Our results suggest that all individual modules possess this property to some extent. Among these, LayerNorms prove to be highly reliable for knowledge transfer: fine-tuning only 37k LayerNorm weights (out of 110M parameters in BERT-base) is often on par with full fine-tuning on various downstream tasks. Extending this analysis, we show that tuning even a single LayerNorm can yield comparable performance and that the middle layers are the best in terms of transferability. We also investigate the reasons behind the effectiveness of LayerNorm tuning. Our experiments suggest that this could be due to the relatively high-magnitude weights in these modules. In fact, we show that tuning just a tiny fraction of high-magnitude dimensions (usually referred to as outliers) can lead to competitive performance on various tasks.

Winning Modules
According to the Lottery Ticket Hypothesis, there are small sub-networks whose performance is comparable to that of the over-parameterized model on different tasks (Frankle and Carbin, 2019). Several studies have been carried out to identify sub-networks across the model that can provide the best transferability (Gale et al., 2019; Evci et al., 2020; Lee et al., 2021; Guo et al., 2021; Hu et al., 2021). Nonetheless, finding the winning sub-network usually requires extra computation, which is costly in terms of time and memory. In this section, we take another look at the transformer block of BERT and focus on the ability of its different modules to transfer knowledge to various downstream tasks. More specifically, we aim to find the winning module among the different modules in the transformer-based architecture of the pre-trained BERT.
Models. We opt for bert-base-uncased, as implemented in the HuggingFace Transformers library with TensorFlow (Wolf et al., 2020; Abadi et al., 2015). The maximum sequence length is set to 128. Except for the fully fine-tuned model (Full-FT), where we train the models for five epochs, the number of epochs is chosen based on the size of the tasks: 10 epochs for SST-2, QQP, MNLI, and QNLI, and 20 epochs otherwise. We use the Adam optimizer with epsilon set to 1e-6, a warmup ratio of 10%, and a batch size of 16. The only hyperparameter tuning we perform is choosing the learning rate from {1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2} to draw a fair comparison with previous work. We report the average and standard deviation of the results of three models trained with different random seeds. All models are trained on four NVIDIA Tesla V100S-32G GPUs.
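As a reference point, the following is a minimal sketch of the training configuration described above, written against the PyTorch side of HuggingFace Transformers (the experiments in the paper used the TensorFlow implementation); the function name make_optimizer and the two-label head are illustrative assumptions.

```python
# Minimal sketch of the training configuration described above (PyTorch side of
# HuggingFace Transformers; the paper itself used the TensorFlow implementation).
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

LEARNING_RATES = [1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2]  # learning-rate search grid
BATCH_SIZE, MAX_LEN, EPOCHS = 16, 128, 20                    # 10 epochs for the larger tasks

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def make_optimizer(model, lr, train_steps, warmup_ratio=0.1):
    # Adam with epsilon 1e-6 and a linear schedule with 10% warmup, as in the paper.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, eps=1e-6)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * train_steps),
        num_training_steps=train_steps,
    )
    return optimizer, scheduler
```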
Module Settings. To find out the potential of transformer modules in transfer learning, we pick similar modules across all layers and fine-tune them while keeping the rest of the network frozen. The aim of this setup is to broaden our insights into the distribution of knowledge across the model and the adaptability of different modules to target tasks.
In every transformer block, we examine the role played by the multi-head attention, the feedforward layer, and the LayerNorms in knowledge transfer. Since every transformer block has two LayerNorms (attention and feedforward), we also consider fine-tuning them separately (LayerNormsA and LayerNormsF). We also compare against the replicated results of BitFit (Ben Zaken et al., 2022), in which consistent bias terms across the transformer blocks are employed for fine-tuning. To verify whether consistency in selecting parameters matters, we also show the results of fine-tuning only a small, randomly selected subset of all the parameters with the same size as LayerNorms (Random). In the experiments, the full fine-tuning (Full-FT) and Frozen modes are considered as the upper and lower bounds, respectively.
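To make the setup concrete, below is a minimal sketch of module-wise freezing for bert-base-uncased with the PyTorch version of HuggingFace Transformers (the paper used the TensorFlow implementation). The mapping from module settings to parameter-name substrings, the freeze_all_but helper, and the decision to keep the classifier head trainable are our assumptions, not details specified in the paper.

```python
# Sketch of module-wise freezing: only one module type (plus the classifier head)
# stays trainable. Parameter-name substrings follow the standard BERT naming.
from transformers import BertForSequenceClassification

MODULE_PATTERNS = {
    "multihead":    ["attention.self", "attention.output.dense"],
    "feedforward":  ["intermediate.dense", "output.dense"],
    "layernorms":   ["LayerNorm"],
    "layernorms_a": ["attention.output.LayerNorm"],
    "layernorms_f": ["output.LayerNorm"],   # attention LN excluded by the check below
    "bitfit":       ["bias"],
}

def freeze_all_but(model, setting):
    """Keep only the chosen module (and the classifier head) trainable."""
    patterns = MODULE_PATTERNS[setting]
    for name, param in model.named_parameters():
        keep = any(p in name for p in patterns) and "encoder" in name
        if setting in ("layernorms_f", "feedforward") and "attention" in name:
            keep = False  # restrict to the feedforward side of the block
        param.requires_grad = keep or name.startswith("classifier")

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
freeze_all_but(model, "layernorms")
# Roughly 37k LayerNorm weights (12 layers x 2 LNs x 2 x 768) plus the classifier head.
print(sum(p.numel() for p in model.parameters() if p.requires_grad))
```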

Results
Table 1 shows our experimental results on eight tasks from the GLUE benchmark. For each setting, we also report the corresponding ratio of updatable parameters (compared to full fine-tuning). As can be observed, individual modules of BERT can be considered winning tickets, because they achieve performance comparable to the Full-FT setting despite involving significantly smaller numbers of trainable parameters. In particular, LayerNorms prove to have a high potential for transferability and adaptability to various downstream tasks with a very limited set of trainable parameters (0.034%). The performance is mostly preserved even when only one of the two LayerNorms is set to be trainable, reducing the number of effective parameters to 0.017% of that in full fine-tuning. Moreover, our results also reveal that selecting consistent weights (similar modules across layers) plays a key role in fine-tuning quality, given that a random subset with a comparable number of parameters does not lead to the same performance levels.

Token-level Classification
In addition to the sentence-level tasks of the GLUE benchmark, we also conduct experiments on two different token-level datasets to broaden our insights into the capacity of individual modules: Penn Treebank part-of-speech tagging (Marcus et al., 1993) and CoNLL-2003 named entity recognition (Tjong Kim Sang and De Meulder, 2003). For part-of-speech tagging, we use the subset of the Wall Street Journal (WSJ) portion of PTB that is freely available in the Natural Language Toolkit (Bird et al., 2009, NLTK). In this experiment, we adhere to the convention of using the cased version of BERT, given the case-sensitive nature of these token-level tasks. Table 2 summarizes the results. Similarly to what is observed on the sentence-level tasks, LayerNorms attain competitive performance on the two token-level tasks, despite involving just a small fraction of all the model parameters. Moreover, in comparison with an equal number of randomly selected weights, they demonstrate remarkably better performance.
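The same freezing scheme carries over to the token-level setting; a short sketch under the same assumptions as above (the cased checkpoint and the nine-label CoNLL-2003 NER tag set are the only changes):

```python
# Token-level variant: cased BERT with a per-token classification head, with only
# the encoder LayerNorms (and the classifier) left trainable.
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)  # e.g. CoNLL-2003 NER tags
for name, param in model.named_parameters():
    param.requires_grad = ("encoder" in name and "LayerNorm" in name) or name.startswith("classifier")
```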

Single Norms Tuning
Previous studies have reported that different layers do not contribute equally to the ultimate performance in transfer learning (Zhou and Srikumar, 2021; Rogers et al., 2020; Kovaleva et al., 2019; Mehrafarin et al., 2022). We are interested in studying the extent to which individual LayerNorms in different transformer blocks are adaptable to downstream tasks. To this end, we perform a layer-wise analysis in which the only trainable parameters are the two LayerNorms of a single block and the final classifier. Therefore, the total number of fine-tuned parameters is less than 5K (3,072 and 1,538 for the LayerNorms and the classifier, respectively; for STS-B, the number of classifier parameters is 769), which is about 0.003% of all the parameters. Due to our limited computational resources, we restrict our experiments to CoLA, MRPC, STS-B, and RTE.

Table 3: The performance of layer-wise fine-tuning of LayerNorms on the selected downstream tasks for BERT. The LayerNorms in the middle layers tend to have the highest transferability.

Table 3 presents the results for the layer-wise analysis. According to the fine-tuning results, tuning a single LayerNorm may be sufficient to achieve performance comparable to fine-tuning all LayerNorms. Furthermore, the middle-layer LayerNorms exhibit the best results across all layers, which can be attributed to the high transferability of the middle layers in BERT, corroborating previous findings on the concentration of task-specific features in these parts of the network (Liu et al., 2019).
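As a sanity check on the parameter arithmetic above, a small self-contained sketch (the helper name and the choice of layer 6 are illustrative):

```python
# Layer-wise LayerNorm tuning: only the two LayerNorms of one encoder block
# (plus the classifier head) stay trainable.
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def freeze_all_but_layer_norms_of(model, layer_idx):
    for name, param in model.named_parameters():
        keep = f"encoder.layer.{layer_idx}." in name and "LayerNorm" in name
        param.requires_grad = keep or name.startswith("classifier")

freeze_all_but_layer_norms_of(model, layer_idx=6)   # layer index is illustrative
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
# 2 LayerNorms x (768 weights + 768 biases) = 3,072, plus 768 x 2 + 2 = 1,538
# classifier parameters for a two-way task (769 for STS-B regression).
print(n_trainable)
```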

Analysis
In the previous section, we have shown that different modules of a transformer block can act as winning tickets, since they all have the potential for transferring knowledge to the selected downstream tasks. Among the different modules, LayerNorms have proven to be the most reliable in fine-tuning. In this section, we search for the reasons behind the effectiveness of these modules. To this end, we focus on the magnitude of every weight and how the weights change during full fine-tuning across all layers.
As a first step, in Figure 1, we visualize the distribution of weights for different BERT modules on RTE and STS-B (more tasks can be found in the Appendix). In general, the distribution of weights is similar across the Feed-Forward and Multi-Head modules. Nevertheless, LayerNorms tend to have a bimodal distribution, with one of the modes having significantly higher magnitudes. The pattern is consistent across LayerNormsA and LayerNormsF. We hypothesize that these high-magnitude weights are the reason behind the effectiveness of LayerNorms and, in what follows, we check our hypothesis by restricting our experiments to only the high-magnitude dimensions of LayerNorms.
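A sketch of how such per-module weight histograms could be produced; the grouping of parameters by name substrings and the subsample size are our assumptions, and the same code can be pointed at a fine-tuned checkpoint instead of the pre-trained one.

```python
# Sketch: histogram of weight values grouped by module type (illustrative grouping).
import matplotlib.pyplot as plt
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
groups = {"Multi-Head": [], "Feed-Forward": [], "LayerNorm": []}
for name, param in model.named_parameters():
    if "encoder" not in name:
        continue
    values = param.detach().flatten()
    if "LayerNorm" in name:
        groups["LayerNorm"].append(values)
    elif "attention" in name:
        groups["Multi-Head"].append(values)
    else:
        groups["Feed-Forward"].append(values)

for label, chunks in groups.items():
    weights = torch.cat(chunks)
    idx = torch.randperm(weights.numel())[:50_000]  # random subset, as in Figure 1
    plt.hist(weights[idx].numpy(), bins=200, alpha=0.5, label=label, density=True)
plt.legend()
plt.xlabel("weight value")
plt.show()
```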

Outlier Tuning
Outliers are high-magnitude weights in LayerNorms that appear early in the pre-training process (Kovaleva et al., 2021). Transformer-based models perform significantly worse on downstream tasks when their outliers are disabled after the fine-tuning process (Kovaleva et al., 2021).
In this experiment, we define the outliers as the set of n weights whose values are farthest from the mean. Except for the outliers, all the parameters are frozen during fine-tuning. Note that the specific dimensions where the outliers appear are not necessarily the same across different layers.
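A minimal sketch of outlier tuning via gradient masking (the hook-based masking, the per-module mean, and keeping the classifier head trainable are our assumptions; the paper does not specify the implementation):

```python
# Sketch: tune only the n LayerNorm weights farthest from each module's mean,
# using gradient hooks to zero out updates for all other dimensions.
import torch
from transformers import BertForSequenceClassification

n = 16  # outlier dimensions per LayerNorm (4, 16, 64, or 256 in the paper)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for name, param in model.named_parameters():
    if "encoder" in name and "LayerNorm" in name and name.endswith("weight"):
        # Outlier dimensions: the n values farthest from this LayerNorm's mean.
        idx = (param.detach() - param.detach().mean()).abs().topk(n).indices
        mask = torch.zeros_like(param)
        mask[idx] = 1.0
        param.requires_grad = True
        param.register_hook(lambda grad, m=mask: grad * m)  # zero gradients off the outliers
    else:
        # Freeze everything else; the classifier head stays trainable (our assumption).
        param.requires_grad = not name.startswith("bert.")
```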
Table 4 presents the performance of fine-tuned BERT in two different settings and for four different values of n: 4, 16, 64, and 256. We also report the results for the corresponding sets of n randomly selected weights. As can be observed, outlier tuning leads to competitive performance on most target tasks, despite using less than 0.0056% of all the model parameters. Interestingly, tuning in the extremely constrained setting of n = 4 still outperforms the frozen model, sometimes by significant margins (e.g., on STS-B). Setting n to higher values gives the model more capacity, bringing about higher performance.
Overall, we can conclude that the high-magnitude weights in LayerNorms play an important role in the effectiveness of these modules in parameter-efficient fine-tuning.

Conclusions
In this work, we study the efficiency of different modules in the transformer block of BERT in transferring knowledge from the pre-trained model to various downstream tasks. Our experimental results demonstrate that, contrary to what was suggested by previous work, every module can be a winning ticket, achieving performance comparable to the full fine-tuning scenario. Among all modules, LayerNorms prove to be the most reliable for transferability with a limited number of trainable weights, such that tuning them in only one layer can be sufficient for attaining performance on a par with that of full fine-tuning. We find that the weights in these modules have notably high magnitudes compared to other modules, which could be the reason for their effectiveness. We examine this hypothesis through outlier tuning (tuning only the n weights in a LayerNorm whose values are farthest from the mean), limiting the number of tunable parameters to a significantly small fraction.
Our results pave the way for better parameter-efficient fine-tuning of large language models without the need for costly algorithms to determine the optimal sub-network or for introducing additional parameters for knowledge transfer.

Acknowledgment
We thank the anonymous reviewers for their constructive comments and suggestions that helped improve the paper. Sara Rajaee is funded in part by the Netherlands Organization for Scientific Research (NWO) under project number VI.C.192.080.

Limitations
We were subject to the constraints of computational resources; as a consequence, we reported results only for bert-base and chose the four smallest tasks of the GLUE benchmark for single-norm tuning (Section 2.4) as well as outlier tuning (Section 3.1). In general, the more trainable parameters a model has, the higher the accuracy it can reach. Since our outlier tuning technique fine-tunes just a tiny portion of the parameters, less than 0.006% of the model weights, there is an upper bound on its learning capability.

Figure 1 :
Figure 1: The empirical distribution of a random subset of weights in different modules of BERT. For better visualization, we have discarded outliers. The weights of LayerNorms appear to have a bimodal distribution with significantly higher overall averages and standard deviations.

Figure 2
Figure 2 demonstrates the distribution of weights for different BERT modules on MRPC and CoLA.

Figure 2 :
Figure 2: The empirical distribution of a random subset of weights in different modules of BERT on MRPC and CoLA.

Table 1 :
The performance of BERT on the GLUE benchmark with different fine-tuning strategies. We report Matthews correlation for CoLA, F1 score for MRPC and QQP, Spearman's correlation for STS-B, and accuracy for the rest. LayerNormsA (LayerNormsF) stands for the scenario in which only the LayerNorms of the attention (feedforward) modules are set to be trainable. The best and the second-best results are highlighted for each task.

Table 4 :
The performance of the fine-tuned BERT with n trainable parameters in every LayerNorm module on four different target tasks. Selecting the n parameters from the outliers leads to better performance in most cases, compared to random selection. For n = 256, the results of outlier tuning are comparable with the Full-FT scenario.
Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. Association for Computational Linguistics.

Yichu Zhou and Vivek Srikumar. 2021. DirectProbe: Studying representations without classifiers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5070-5083, Online. Association for Computational Linguistics.

Yichu Zhou and Vivek Srikumar. 2022. A closer look at how fine-tuning changes BERT. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1046-1061, Dublin, Ireland. Association for Computational Linguistics.