Exploring the Impact of Model Scaling on Parameter-Efficient Tuning

Parameter-efficient tuning (PET) methods can effectively drive extremely large pre-trained language models (PLMs) by training only minimal parameters. Different PET methods utilize different manually designed tunable modules. In small PLMs, there are usually noticeable performance differences among PET methods. Nevertheless, as the model scale increases, the performance differences become marginal. Hence, we hypothesize that model scaling mitigates the impact of design differences on PET methods. To investigate this hypothesis, we introduce a more flexible PET method called the Arbitrary PET (APET) method. The APET method is compatible with a tunable module that consists of any number of parameters distributed in arbitrary positions. We then use it to conduct experiments on 11 NLP tasks across 3 representative PLMs. Our investigations reveal that model scaling (1) mitigates the effects of the positions of tunable parameters on performance, and (2) enables tuning methods to achieve performance comparable to full-parameter fine-tuning by optimizing fewer tunable parameters. Intriguingly, we also observe that tuning methods optimize a similar number of tunable parameters to exceed random guess performance on different tasks. We collectively discuss this phenomenon and the two aforementioned findings from an optimization perspective to understand the underlying mechanisms. These conclusions enhance our understanding of the impact of model scaling on PET and assist in designing more effective and efficient PET methods for PLMs of different scales. The source code can be obtained from this GitHub repository: \url{https://github.com/yushengsu-thu/PET_Scaling}.

† Corresponding authors: Z. Liu and M. Sun.

Introduction

Large-scale pre-trained language models (PLMs), such as BERT, BLOOM, and T5 (Raffel et al., 2020), have achieved great success on various natural language processing (NLP) tasks. Despite their effectiveness, fine-tuning (FT) these large-scale PLMs with full parameters incurs unaffordable computational and storage costs. To solve this problem, researchers have proposed a series of parameter-efficient tuning (PET) methods (Houlsby et al., 2019a; Li and Liang, 2021; Mahabadi et al., 2021a; Lester et al., 2021; Mahabadi et al., 2021b; Hu et al., 2022a; Ben Zaken et al., 2022; He et al., 2022b), which update only an assigned tunable module consisting of minimal parameters while freezing the rest of the parameters in a PLM during model adaptation.
Although these representative PET methods can reduce computational and storage costs, there are usually noticeable performance differences among them on downstream tasks. Intriguingly, as the scale of a PLM increases, the performance differences among PET methods become narrower, as illustrated in Figure 1. These findings are interesting and worth exploring because the existing representative PET methods are designed with disparate philosophies, e.g., tunable modules composed of different numbers of tunable parameters distributed in different positions. Hence, we hypothesize that model scaling mitigates the effects of these design differences among PET methods on performance. To validate this hypothesis, we conduct two lines of ablation analyses: (A1) whether the model scale mitigates the performance differences resulting from the position of tunable parameters; (A2) whether the model scale mitigates the performance differences resulting from the number of tunable parameters.
However, solely investigating the four representative PET methods (see Figure 1) might be insufficient to encompass an adequate range of parameter positions for ablation analysis (A1). Additionally, the tunable modules of these four PET methods are constrained to be composed of layer-level tensors or matrices, making it challenging to precisely control the number of tunable parameters at the fine-grained (parameter) level in ablation analysis (A2). To facilitate the ablation analyses, we develop a more flexible Arbitrary Parameter-Efficient Tuning (APET) method (§ 5.1), which is compatible with any number of tunable parameters distributed in arbitrary positions.
In analysis (A1), we compare the performance of APET methods with an equal number of tunable parameters distributed in different positions. Based on the experimental results, we observe smaller performance differences among these APET methods on larger models. This finding suggests that scaling the model mitigates the effect of the position of tunable parameters on performance.
In analysis (A2), we compare the performance of the same APET method with varying numbers of tunable parameters. Based on the experimental results, we observe that model scaling does not mitigate the effects caused by the number of tunable parameters on performance. Furthermore, we observe two interesting phenomena when the number of tunable parameters reaches two thresholds: a high threshold and a low threshold. When the number of tunable parameters equals the high threshold, APET methods can achieve the full-parameter fine-tuning performance of the corresponding backbone model, and the high threshold tends to be lower on larger models. Namely, PET methods can optimize fewer tunable parameters to achieve full-parameter fine-tuning performance on larger models. On the other hand, when the number of tunable parameters exceeds the low threshold, all APET methods outperform random guess performance. We find that the low thresholds are nearly identical on the same model, even across different tasks. This suggests that, across different tasks, PET methods can optimize a similar number of tunable parameters on the same PLM to surpass random guess performance.
In summary, we introduce a more flexible PET method, the APET method, to conduct extensive ablation analyses and reveal the impact of model scaling on PET design, e.g., (1) the position of tunable parameters (§ 5.2) and (2) the number of tunable parameters (§ 5.3). (3) Furthermore, we discuss the findings of the ablation analyses from the perspective of optimization (§ 6). We hope these conclusions not only encourage more researchers to explore the impact of model scaling on tuning methods from a theoretical perspective, but also provide guidance for designing tuning methods for models of different scales.

Related Work
Parameter-Efficient Tuning (PET) Methods With larger PLMs continuously being developed, fine-tuning all of the parameters and storing the adapted weights become increasingly cumbersome. To address this issue, researchers propose PET methods, which keep most of the parameters of PLMs frozen and optimize only a tunable module consisting of a few parameters during downstream adaptation. Over recent years, many different designs of PET methods have emerged. For instance, some PET methods insert external tunable modules after the feed-forward and attention layers in a PLM (Houlsby et al., 2019a; Pfeiffer et al., 2021; Mahabadi et al., 2021c); others prepend tunable modules to attention layers (Li and Liang, 2021; Hu et al., 2022a) or the embedding layer (Lester et al., 2021). Another line of PET methods selects existing parameters in a PLM (Ben Zaken et al., 2022; Guo et al., 2021) as the tunable module to optimize. To further enhance the performance of PET methods, some works propose automatic selection strategies (Hu et al., 2022c; Chen et al., 2023; Lawton et al., 2023; Zhou et al., 2023) for tunable parameters.

Unified View of PET Methods

PET method | Position of tunable weights W
Prompt (Lester et al., 2021) | W is concatenated to the input hidden states
Adapter (Houlsby et al., 2019a) | W is plugged between SelfAttn./FFN layers
LoRA (Hu et al., 2022a) | W is plugged into SelfAttn layers
BitFit (Ben Zaken et al., 2022) | W is added to the bias terms

Table 1: We uniformly re-frame the transformations of PET methods as modifications ∆h of specific hidden states in the corresponding PLM layer (f), where W is introduced in computing ∆h, as suggested by He et al. (2022a) and Hu et al. (2022c). Each PET method has p tunable weights W in designed positions. Hence, we represent each PET tunable module as θ = {W_1, W_2, ..., W_p}.
Although these PET methods have distinct tunable modules, they can be unified into a similar form. He et al. (2022a) formalize PET methods as a unified framework to study the connections among PET methods. Yi et al. (2022) conduct a similar study and further indicate that the optimization of different PET methods can be unified in a similar subspace. In this paper, we leverage these unified perspectives to explain the impact of model scaling on PET in the final discussion (§ 6).

The Power of Model Scaling
As model size scales up, numerous capabilities emerge in PLMs, including reasoning ability (Wei et al., 2022b,a), and PLMs can achieve state-of-the-art results on various understanding and generation tasks (Du et al., 2022; Chowdhery et al., 2022).
From the adaptation perspective, some researchers find that performing certain PET methods (Lester et al., 2021; Ding et al., 2023; Su et al., 2022) on large-scale models can almost achieve full-parameter fine-tuning performance. In this paper, we further find that as the model scale increases, the performance differences among distinct PET methods become smaller (§ 4). Hence, we study the impact of model scaling on PET methods (§ 5) to fathom this phenomenon and explain it from the optimization perspective (§ 6).

Preliminary
In this section, we first introduce the Transformer framework (§ 3.1) and then the most representative PET methods (§ 3.2).

Transformer Framework
The Transformer model (Vaswani et al., 2017) is the mainstream architecture for most powerful PLMs. The model is a stack of L blocks, each of which consists of a sequence of layers, including self-attention and a feed-forward network. During the forward pass through each block, the sequence of layers is applied to the input hidden state. For simplicity, we formalize the transformation of each layer as

h_out = f(h_in), (1)

where the layer acts as an operator f that transforms the input hidden state h_in ∈ R^{s×d_in} into the output hidden state h_out ∈ R^{s×d_out}, with s the input length and d_in, d_out the dimensions.
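As a concrete illustration of the layer-as-operator view in Equation (1), the following minimal PyTorch sketch implements one Transformer block whose sub-layers each map h_in to h_out. It is our illustration with assumed default dimensions, not the paper's code.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: self-attention and feed-forward sub-layers, each an operator f."""
    def __init__(self, d: int = 768, d_ff: int = 3072, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h_out = f(h_in) for the self-attention sub-layer ...
        h = self.ln1(h + self.attn(h, h, h, need_weights=False)[0])
        # ... and again for the feed-forward sub-layer
        return self.ln2(h + self.ffn(h))
```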

Parameter-Efficient Tuning (PET)
Different PET methods are equipped with diverse modules θ, as shown in Figure 1. These modules are composed of tunable parameters W that modify the original layers and the corresponding transformations in PLMs. To make comparisons, we follow the unified view (He et al., 2022a; Hu et al., 2022c) and re-frame the transformations of all PET methods as modifications ∆h of specific hidden states in the corresponding PLM layers:

h_out = f(h_in) + ∆h, (2)

where ∆h is computed with the tunable weights W. In the training process, given a downstream task D = {X, Y}, we optimize only the tunable parameters of the module θ of each PET method to generate the desired outputs Y, while freezing the rest of the parameters Φ in a PLM M, as shown in Figure 1. Formally, the training objective is to minimize the loss L as follows:

min_θ L(M_(Φ,θ)(X), Y). (3)
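To make the objective concrete, here is a hedged PyTorch sketch of PET training under Equation (3): the backbone parameters Φ are frozen and only the module θ receives gradient updates. The helper names are ours, not from the paper's released code.

```python
import torch

def make_pet_optimizer(model: torch.nn.Module, theta, lr: float = 3e-4):
    # Freeze the backbone parameters Phi ...
    for p in model.parameters():
        p.requires_grad = False
    # ... then re-enable gradients only for the tunable module theta
    for p in theta:
        p.requires_grad = True
    return torch.optim.AdamW(theta, lr=lr)

def train_step(model, optimizer, loss_fn, x, y):
    # One step of min_theta L(M_{(Phi, theta)}(X), Y): only theta is updated
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```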

Main Experiments
To explore the impact of model scaling on these PET methods, we first introduce the investigated tasks, PLMs, and settings of the existing representative PET methods in the experiments (§ 4.1), and then report the main experimental results (§ 4.2).
To ensure the consistency of the PET methods' performance, we maintain the original design of each method, including the positions of tunable parameters and the number of trainable parameters, as reported in the respective original papers. Additionally, we train each PET method on 11 tasks using 3 different random seeds and report the average performance. Further details regarding the training configurations can be found in Appendix B.

Model Scaling Impact on PET Methods
To investigate the impact of model scaling on PET methods, we arrange the PLMs in ascending order of model scale and report the performance of the PET methods on each PLM.
Results are reported in Figure 2. First, we observe that the PET methods exhibit noticeable performance differences (in terms of standard deviation) on the smaller models. This phenomenon is intuitive and demonstrates the critical impact of design differences (the position and quantity of parameters in the tunable module) on the performance of PET methods; it has been consistently found in numerous prior works (Ding et al., 2023; Hu et al., 2022c). However, we find that as the model scale increases (from BERT_SMALL to BERT_LARGE in sub-figure [a]; from BLOOM_560M to BLOOM_7.1B in sub-figure [b]; from T5_SMALL to T5_XXL in sub-figure [c]), the performance discrepancies among PET methods diminish across all types of models, as evidenced by the decreasing standard deviation (S.D.) (from 5.08 to 2.65 on [a] BERT; from 3.46 to 2.50 on [b] BLOOM; from 2.75 to 1.72 on [c] T5). This finding implies that larger model scale can mitigate the impact of the design differences among PET methods on performance.

Ablation Analyses
The design differences among the PET methods mainly lie in the positions and the quantity of the tunable module's parameters. To further verify whether model scaling removes the effects of these differences on PET methods, we conduct two ablations to investigate whether model scaling can mitigate (1) the impact of tunable parameter position and (2) the impact of tunable parameter quantity.
However, investigating only the above four representative PET methods is insufficient to cover enough variations of parameter position for ablation study (1), and it makes it hard to precisely control the number of tunable parameters at the fine-grained (parameter) level in ablation study (2). Hence, we develop a more flexible PET method, the Arbitrary Parameter-Efficient Tuning (APET) method. Its tunable module can have an arbitrary structure (§ 5.1), which allows us to explore various parameter positions in the first ablation study (§ 5.2) and to more easily control the number of tunable parameters in the second (§ 5.3).

Arbitrary Parameter-Efficient Tuning (APET)
Similar to PET methods, the APET method is equipped with an arbitrary module θ composed of L tunable weights W distributed in any positions of a model. APET has three operations to insert a tunable weight W into any position of the PLM, thereby modifying specific layers and their corresponding transformations as follows:

ADD The tunable weight W is added into a PLM layer. The corresponding transformation of the layer can be denoted as:

h_out = (f + W)(h_in). (4)

CONCAT The tunable weight W is concatenated with the hidden state or the layer in the PLM. The corresponding transformation of the layer can be denoted as:

h_out = f([W; h_in]) or [f; W](h_in). (5)

PLUG The tunable weight W is plugged between PLM layers. The corresponding transformation of the layer can be denoted as:

h_out = W(f(h_in)). (6)

Note that the inserted tunable weights W are not limited to the structures shown in Figure 3; they can be arbitrary structures. According to the inserted tunable weights and the corresponding modifications, the transformation of a PLM layer under the APET method can be expressed as:

h_out = f(h_in) + ∆h, (7)

where ∆h is computed with the inserted weights W. Comparing Equation (7) with the previously introduced Equation (2), it is obvious that the PET methods are special cases of the APET method. The module θ of APET is composed of arbitrarily inserted weights W, which can be expressed as θ = {W_1, W_2, ..., W_L}. In the training process, we follow Equation (3) to optimize only θ while freezing the rest of the parameters Φ in the PLM.
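To illustrate how the three operations could be realized, the sketch below wraps a frozen linear layer f with a single tunable weight W. The class name, the zero/identity initializations, and the exact placements are our assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class APETLayer(nn.Module):
    """Wrap a frozen layer f with one tunable weight W via ADD, CONCAT, or PLUG."""
    def __init__(self, f: nn.Linear, op: str):
        super().__init__()
        self.f, self.op = f, op
        d_in, d_out = f.in_features, f.out_features
        if op == "add":       # Eq. (4): W is added into the layer's weights
            self.W = nn.Parameter(torch.zeros(d_out, d_in))
        elif op == "concat":  # Eq. (5): W is concatenated as one soft "token"
            self.W = nn.Parameter(torch.zeros(1, 1, d_in))
        elif op == "plug":    # Eq. (6): W is plugged in after the layer
            self.W = nn.Parameter(torch.eye(d_out))

    def forward(self, h_in: torch.Tensor) -> torch.Tensor:  # h_in: (batch, s, d_in)
        if self.op == "add":
            return self.f(h_in) + h_in @ self.W.T            # (f + W)(h_in)
        if self.op == "concat":
            W = self.W.expand(h_in.size(0), -1, -1)
            return self.f(torch.cat([W, h_in], dim=1))       # f([W; h_in])
        return self.f(h_in) @ self.W.T                       # W(f(h_in))
```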

The Impact of Differences in Parameter Position on Performance
To investigate whether model scaling can mitigate the impact of parameter position in PET, we first control other significant factors that could potentially affect performance, i.e., the number of tunable parameters. Given that the tunable parameters of the four aforementioned PET methods are fixed in the same positions, it is challenging to precisely assess the impact of position with these methods alone. Under this limitation, we employ the APET method to arbitrarily select tunable parameters with different random seeds, each random seed representing a different parameter distribution, and train the resulting methods on the tasks.
In the experiments, we set the number of tunable parameters for the APET methods in four groups. The parameter quantity in each group corresponds to that of one of the aforementioned four PET methods (Prompt, BitFit, LoRA, Adapter). We denote these APET methods with the corresponding numbers of parameters as APET_Prompt, APET_BitFit, APET_LoRA, and APET_Adapter, respectively. Besides, we conduct the ablation study on three series of models (BERT, BLOOM, and T5) and report the average performance on the tasks (SST2, RTE, and MRPC), as sketched below.
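Conceptually, the seed-controlled arbitrary selection can be pictured as drawing a random binary mask over all model weights under a fixed parameter budget and letting only the selected entries receive gradient updates. The following sketch is our hedged illustration; the function names are hypothetical.

```python
import torch

def sample_apet_mask(model: torch.nn.Module, budget: int, seed: int) -> dict:
    """Select `budget` scalar parameters uniformly at random, controlled by `seed`."""
    gen = torch.Generator().manual_seed(seed)
    # give every scalar parameter an i.i.d. random key ...
    scores = {n: torch.rand(p.shape, generator=gen) for n, p in model.named_parameters()}
    flat = torch.cat([s.flatten() for s in scores.values()])
    # ... and keep (approximately, up to ties) the `budget` smallest keys
    threshold = flat.kthvalue(budget).values
    return {n: (s <= threshold).float() for n, s in scores.items()}

def mask_gradients(model: torch.nn.Module, masks: dict) -> None:
    """Zero the gradients of all non-selected positions after loss.backward()."""
    for n, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[n])
```

Calling mask_gradients between loss.backward() and optimizer.step() keeps every non-selected weight at its pre-trained value, so each seed yields a reproducible parameter distribution (assuming an optimizer without weight decay, which would otherwise move frozen entries).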
Performance Comparison As shown in Figure 4, there are four groups of comparisons in each sub-graph. We observe that as the PLM scales up (BERT: from BERT_SMALL to BERT_LARGE; BLOOM: from BLOOM_560M to BLOOM_7.1B; T5: from T5_SMALL to T5_XXL), the performance differences (standard deviation (S.D.)) of the APET methods within each group decrease. Based on these findings, we argue that larger models are more effective at mitigating the impact of differences in parameter position on performance.
In addition, we observe that despite the different numbers of tunable parameters across the four groups of APET methods, they show smaller performance differences on the larger models. We delve into this finding further and provide an explanation for this phenomenon in § 5.3.

The Impact of Differences in the Number of Tunable Parameters on Performance
In this section, we observe the performance of the APET method under different numbers of tunable parameters as an ablation study.
From the results reported in Figure 5, we find that (1) on the smaller models, e.g., BERT_SMALL, BLOOM_560M, and T5_SMALL, when the tunable parameters of tuning methods are fewer than a certain number, the performance drops to random guess performance; (2) similarly, this phenomenon still holds on the larger models, BERT_LARGE, BLOOM_7.1B, and T5_XXL. Based on these findings, we argue that model scaling cannot adequately eliminate the impact of the number of tunable parameters on the performance of PET methods.

Figure 5: Given different numbers of tunable parameters, we observe APET performance on three series of models and tasks. We find that (1) model scaling enables tuning methods to optimize fewer necessarily tuned parameters to reach full-parameter fine-tuning performance; (2) APET methods require a similar number of tunable parameters to exceed random guess performance on the same models (the low parameter thresholds lie in a similar range).
Interestingly, we find two parameter thresholds for tunable parameters in all models, which we name the low parameter threshold and the high parameter threshold of necessarily tuned parameters, as marked in Figure 5. When the number of tunable parameters exceeds the low parameter threshold, the APET method can exceed random guess performance (e.g., (1/number of label types)×100% on BERT, 0% on BLOOM, and 0% on T5); when the number of tunable parameters exceeds the high parameter threshold, the APET method can almost achieve full-parameter fine-tuning (FT) performance. Furthermore, we find that model scaling affects the two parameter thresholds. Hence, we explore this phenomenon in the following paragraphs.

High Threshold of Necessarily Tuned Parameters
Based on the experimental results in sub-graph [c.1] (SST2) of Figure 5, we find that the high threshold of the larger model is consistently lower than that of the smaller model. This phenomenon holds across all tasks (SST2, RTE, MRPC) and all series of models, as depicted in all sub-graphs. Therefore, we conclude that model scaling enables tuning methods to train fewer necessary parameters while achieving performance similar to full-parameter fine-tuning.
This conclusion intuitively explains why APET methods achieve relatively similar performance on larger models, especially on T5_XXL, as illustrated in the aforementioned sub-figure [c.2] of Figure 2: the number of tunable parameters in each group of APET methods surpasses the high parameter threshold on T5_XXL, so they all achieve performance similar to full-parameter fine-tuning.

Low Threshold of Necessarily Tuned Parameters
From the above results, we find that APET methods exceed random guess performance (0% on T5; 0% on BLOOM; 50% on BERT) and quickly reach 80~90% of full-parameter fine-tuning performance once the tunable parameters exceed the low thresholds. However, the low thresholds are relatively higher on the larger models (BERT_LARGE, BLOOM_7.1B, T5_XXL); namely, APET methods require more tunable parameters on them to exceed random guess performance. This phenomenon is consistent over all tasks and all series of models. Hence, we infer that model scaling cannot reduce the number of necessarily tuned parameters required to drive PLMs to perform downstream tasks.
Furthermore, it is worth noting that the low parameter thresholds of the APET methods lie almost in the same range on the same model. Specifically, the ranges of low thresholds are [8.0e+2, 3.2e+3] on BERT_LARGE and [4.0e+2, 1.6e+3] on BERT_SMALL; [8.2e+3, 4.1e+4] on BLOOM_7.1B and [8.2e+3, 4.1e+3] on BLOOM_560M; [7.9e+3, 8.5e+3] on T5_XXL and [8.0e+2, 7.9e+3] on T5_SMALL. We explain this phenomenon from the optimization perspective in § 6.

Discussing the Ablation Results from the Optimization Perspective
The objectives of all parameter-efficient tuning methods (PET, APET) can be expressed as min_θ L(M_(Φ,θ)(X), Y), as introduced in Equation (3), where θ is a tunable module. The module θ of different PET methods consists of different structures and varying numbers of tunable parameters. In this paper, we investigate the impact of model scaling on different modules, which possess varying numbers of tunable parameters distributed across multiple positions. We find that larger model scale can (1) mitigate the effects caused by the different positions of tunable parameters (§ 5.2) and (2) enable PET methods to optimize fewer tunable parameters to achieve full-parameter fine-tuning performance (§ 5.3). To further fathom these phenomena, we investigate the underlying reasons from an optimization perspective.
(3) Besides, we also observe that PET methods can optimize almost the same number of necessarily tuned parameters to exceed random guess performance on the same backbone models (§ 5.3). Although phenomenon (3) is not caused by model scaling, we can also explain it from the optimization perspective. Next, we discuss it together with the above two findings (1) and (2) in the following paragraphs.
Why does model scaling mitigate the effects caused by the differences in positions of tunable parameters on PET performance? From the optimal control perspective, a tunable module θ of a tuning method can be seen as a controller (Yang and Liu, 2022; Ding et al., 2023) that drives a PLM towards downstream tasks. As the model scale increases, the larger model has higher parameter redundancy (Aghajanyan et al., 2021), allowing arbitrary selection of tunable parameters for tuning without greatly degrading performance (Desai et al., 2019; Chen et al., 2020; Prasanna et al., 2020; Evci et al., 2020); thus, controllers (modules) might have higher degrees of freedom. This might explain why the arbitrary positions of the tunable parameters have less impact, such that all PET methods can achieve similar performance on larger models. It is worth noting that even though the distribution of tunable parameters has less impact on performance, it still affects convergence speed. Thus, finding a better parameter distribution to improve the convergence speed of PET methods is a direction worth exploring.
Why does model scaling allow fewer tunable parameters to achieve full-parameter fine-tuning performance? Tuning θ to steer a PLM towards downstream NLP tasks can be seen as adaptation. From the perspective of representation space, the adaptations of PET methods can be re-parameterized into a unified low-dimensional subspace (Qin et al., 2021; Aghajanyan et al., 2021; Yi et al., 2022). Aghajanyan et al. (2021) further demonstrate that adaptation of a larger PLM can be re-parameterized into a lower-dimensional space; this implicitly explains why PET methods can optimize fewer parameters on larger models, e.g., T5_XXL, to match full-parameter fine-tuning performance on tasks.
Why can PET methods optimize similar numbers of tunable parameters to exceed random guess performance? As stated above, the adaptations of PET methods can be re-parameterized into a unified subspace. Qin et al. (2021) show that this low-dimensional subspace is shared among all NLP tasks for the same PET method. Yi et al. (2022) further suggest that this subspace is also shared among various PET methods. This might implicitly explain why all PET methods can tune similar numbers of necessarily tuned parameters to exceed random guess performance on the same models, even for different tasks (§ 5.3).

Conclusion
The realm of model scaling for LLMs presents important and intriguing directions for the LLM community. Increasing model scale unveils numerous emerging capabilities and advantages. In this work, our primary emphasis is on the impact of model scaling as it pertains to PET methods.
Through our comprehensive observation studies and in-depth discussions from optimization perspectives, we gain deeper insights into the effects of model scaling on PET and the reasons behind the observed phenomena.We believe that our findings will serve as a catalyst, inspiring further meticulous research and exploration in this area.

Limitations
This paper might have some possible limitations as follows: (1) we only explore the effects of model scaling on performance; there might be other research points worth exploring, such as the effect of model scale on convergence speed; (2) we study the power of model scale with comprehensive empirical experiments and explain the findings from the optimization perspective; there might be more theoretical proofs to explain these findings.

A Task and Dataset
We use various NLP tasks to evaluate the APET methods, which can be divided into the following 5 categories:

Sentiment Analysis (SA) SA tasks evaluate if a model can correctly predict the sentiment labels of an input sentence. In this paper, we choose SST-2 (Socher et al., 2013), IMDB (Maas et al., 2011), and Rotten Tomatoes (Pang and Lee, 2005).
Natural Language Inference (NLI) NLI tasks evaluate a model's ability to correctly classify whether a hypothesis is entailed by a given premise. In this paper, we choose MNLI (Williams et al., 2018), QNLI (Wang et al., 2019), and RTE (Bos and Markert, 2005).
Paraphrase Identification (PI) PI tasks evaluate if a model can correctly identify paraphrases, i.e., whether two sentences are identical in semantic meaning. In this paper, we choose MRPC (Dolan and Brockett, 2005) and QQP (Sharma et al., 2019).
Question Answering (QA) QA tasks evaluate a model's ability to answer questions; context may be present. In this paper, we choose NQ-Open (Lee et al., 2019), an open-world QA dataset without context.

Summarization (SUM) SUM tasks evaluate a model's ability to summarize a long paragraph into a shorter abstract without losing the semantics of the original text. In this paper, we choose SAMSum (Gliwa et al., 2019) and Multi-News (Fabbri et al., 2019) in our experiments.

B Parameter-efficient Tuning (PET) Methods
Here, we first recap the PLM (Transformer) layer. Then, we describe the details and training configurations of the PET methods shown in Figure 1.

B.1 Transformer Architecture
A PLM is generally a stack of multiple Transformer layers, each composed of a multi-headed attention and a feed-forward network. The multi-headed attention contains h attention heads working in parallel. Specifically, given an input X ∈ R^{n×d}, the i-th attention head works as follows:

head_i = softmax((X W_q^i)(X W_k^i)^T / √d_h)(X W_v^i),

where n is the sequence length, d is the hidden dimension, d_h = d/h is the per-head dimension, and W_q^i ∈ R^{d×d_h}, W_k^i ∈ R^{d×d_h}, and W_v^i ∈ R^{d×d_h} are the query, key, and value projections. The outputs of the attention heads are concatenated and further transformed by W_o ∈ R^{d×d}:

h_MHA = [head_1; head_2; ...; head_h] W_o,

where h_MHA ∈ R^{n×d} is the output hidden state of the multi-headed attention layer. After that, h_MHA is fed into a two-layer feed-forward network:

h_FFN = σ(h_MHA W_1 + b_1) W_2 + b_2,

where W_1 ∈ R^{d×d_m} and W_2 ∈ R^{d_m×d} are weight matrices, b_1 and b_2 are bias terms, and σ is the activation function. During the forward pass through each Transformer block, the sequence of layers is applied to the input hidden state. For simplicity, we formalize the transformation of each layer as h_out = f(h_in): treating the layer as the operator f, the input hidden state h_in ∈ R^{n×d} is transformed into the output hidden state h_out ∈ R^{n×d}, where n is the input length and d is the dimension.
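For reference, the attention equations above translate directly into a few lines of PyTorch; this sketch is ours and takes the per-head projection matrices as plain tensors:

```python
import math
import torch

def attention_head(X, W_q, W_k, W_v):
    # head_i = softmax(X W_q^i (X W_k^i)^T / sqrt(d_h)) X W_v^i
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)), dim=-1)
    return A @ V

def multi_head_attention(X, W_qs, W_ks, W_vs, W_o):
    # h_MHA = [head_1; ...; head_h] W_o, concatenating along the hidden dimension
    heads = [attention_head(X, wq, wk, wv) for wq, wk, wv in zip(W_qs, W_ks, W_vs)]
    return torch.cat(heads, dim=-1) @ W_o
```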

B.2 Implementation Details of PET Methods
Prompt Prompt-tuning (Lester et al., 2021) prepends N_p tunable soft tokens, i.e., embeddings, to the input sentences and asks the model to predict the probability of the next word. During training, only the newly added embeddings are optimized and the backbone model is frozen.

BitFit BitFit (Ben Zaken et al., 2022) only tunes the bias terms W_b ∈ R^d in the PLM, which lie in the self-attention and layer norm layers.

LoRA LoRA (Hu et al., 2022a) plugs tunable low-rank decomposition matrices into the self-attention layers (cf. Table 1); only these low-rank matrices are optimized during training.

Adapter Adapter (Houlsby et al., 2019a) inserts tunable modules at two locations: (1) after the first feed-forward layer, and (2) after the two consecutive feed-forward layers. During training, only the adapter modules are optimized and the rest of the PLM is frozen.

B.3 Training Configurations of PET Methods
The tunable module θ of a PET method is composed of L tunable weights W (all tunable weights of the specific PET method), which can be expressed as θ = {W_1, W_2, ..., W_L}. We also follow Equation (3) to train the PET methods: during training, we optimize only θ while freezing the rest of the parameters in the PLM. We adopt a batch size of 32 and no warm-up for most of the PET models and tasks. The maximum input length is 128 for single-sentence tasks (SA) and 256 for multi-sentence tasks (NLI, PI, QA, SUM). The maximum generation length is 1 for classification tasks (SA, NLI, PI), 64 for Multi-News, and 128 for SAMSum. On the BERT, BLOOM, and T5 models, we set the learning rates as {3e-4}, {3e-4, 5e-5}, and {1e-4, 1e-3, 1e-2}, respectively, and report the best performance.
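For convenience, the training configurations described above can be collected into a single config sketch; the dictionary layout and the seed values are illustrative, while the hyperparameter values come from the text:

```python
# Sketch of the training configuration; structure and seeds are illustrative.
PET_TRAINING_CONFIG = {
    "batch_size": 32,
    "warmup_steps": 0,  # no warm-up for most PET models and tasks
    "max_input_length": {"single_sentence": 128, "multi_sentence": 256},
    "max_generation_length": {"classification": 1, "Multi-News": 64, "SAMSum": 128},
    "learning_rates": {
        "BERT": [3e-4],
        "BLOOM": [3e-4, 5e-5],
        "T5": [1e-4, 1e-3, 1e-2],
    },
    "seeds": [0, 1, 2],  # 3 random seeds per task (illustrative values)
}
```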

C Arbitrary Parameter-Efficient Tuning (APET) Methods
We introduce a more flexible PET method, the Arbitrary Parameter-Efficient Tuning (APET) method. Its tunable module can have an arbitrary structure, which allows us to explore various module structures (parameter positions) and to more easily control the number of tunable parameters.

C.1 Implementation Details of APET Methods
As previously introduced in § 5.1, the tunable module of the APET method is composed of tunable weights, each of which can be expressed as W. We have three operations to insert a tunable weight W into the PLM, modifying specific layers and their corresponding transformations as follows:

ADD We add the tunable weight W into a PLM layer. The corresponding transformation can be denoted as:

h_out = (f + W)(h_in).

CONCAT We concatenate the tunable weight W with the hidden state or the layer in the PLM. The corresponding transformation can be denoted as:

h_out = f([W; h_in]) or [f; W](h_in).

PLUG We plug the tunable weight W between PLM layers. The corresponding transformation can be denoted as:

h_out = W(f(h_in)).

According to these operations and the corresponding transformations, we can express the APET transformation of a PLM layer as:

h_out = f(h_in) + ∆h. (15)

By comparing Equation (15) with the equations of the previously introduced PET methods, we can clearly see that the PET methods are special cases of the APET method.
To evaluate transferability, we map the original labels of different tasks into a unified label set (e.g., negative/not entailment/false -> 0, positive/entailment/true -> 1). Utilizing a unified label set makes it feasible to evaluate the transferability of the APET method among different types of tasks regardless of the divergence of the original labels.
The results are shown in Figure 7, from which we find that the APET methods (APET_DISCRETE and APET_ADJACENT) can transfer to the same type of tasks, as demonstrated by the darker color along the diagonal of the matrix, and generally perform well both on small-scale PLMs (Figure 7 (a): BERT_SMALL and T5_SMALL) and large-scale PLMs (Figure 7 (b): BERT_LARGE and T5_XXL). However, the lighter colors indicate that APET methods have difficulty performing different types of tasks overall, and both small-scale and large-scale PLMs share this phenomenon. This finding indicates that the power of scale does not necessarily facilitate the generalization ability of APET methods, which is in line with the prevalent assumption that fewer parameters often cause underfitting, whereas more parameters tend to cause overfitting. Nevertheless, the mechanism behind this phenomenon deserves deeper study, and we will systematically analyze it in future work.


Figure 1: Different PET methods have distinct tunable modules, which typically result in noticeable performance differences. However, as the model scale increases, these differences become less significant.

Figure 2: We investigate the average performance of the tuning methods, including Prompt, BitFit, LoRA, Adapter, and full-parameter fine-tuning, on three series of models. As the model scale increases, the performance differences (standard deviation (S.D.)) among tuning methods become smaller.

Figure 3: The tunable modules of APET methods (θ = {W_1, W_2, ..., W_L}) are composed of L tunable weights W with arbitrary structures. There are three operations (ADD, CONCAT, PLUG) for inserting these tunable weights W into a PLM.

Figure 4: The parameter quantity in each group corresponds to that of one of the aforementioned four PET methods. We denote the APET methods with the corresponding numbers of parameters as APET_Prompt, APET_BitFit, APET_LoRA, and APET_Adapter, respectively. Each APET method arbitrarily selects tunable parameters with different random seeds, each random seed representing a different parameter distribution. Here, S.D. means the standard deviation. As the model scale increases, the impact of parameter position on performance becomes minor.

Figure 6: The tunable modules of APET methods are composed of L tunable weights W, which can be expressed as θ = {W_1, W_2, ..., W_L}. We introduce three operations to insert the tunable weight W into the PLM, together with the corresponding transformations.