Parameter-efficient Tuning for Large Language Model without Calculating Its Gradients

Fine-tuning all parameters of large language models (LLMs) requires significant computational resources and is time-consuming. Recent parameter-efficient tuning methods such as Adapter tuning, Prefix tuning, and LoRA allow updating a small subset of parameters in large language models. However, they can only save approximately 30% of the training memory requirements because gradient computation and backpropagation are still necessary for these methods. This paper proposes a novel parameter-efficient tuning method for LLMs without calculating their gradients. Leveraging the discernible similarities between the parameter-efficient modules of the same task learned by both large and small language models, we put forward a strategy for transferring the parameter-efficient modules derived initially from small language models to much larger ones. To ensure a smooth and effective adaptation process, we introduce a Bridge model to guarantee dimensional consistency while stimulating a dynamic interaction be-tween the models. We demonstrate the effectiveness of our method using the T5 and GPT-2 series of language models on the SuperGLUE benchmark. Our method achieves comparable performance to fine-tuning and parameter-efficient tuning on large language models without needing gradient-based optimization. Additionally, our method achieves up to 5.7 × memory reduction compared to parameter-efficient tuning.


Introduction
Large language models such as GPT3 (Brown et al., 2020), GPT4 (OpenAI, 2023), T5-XXL (Raffel et al., 2020), andLLaMA (Touvron et al., 2023) have demonstrated remarkable capabilities in various natural language processing tasks.However, the sheer number of parameters in these models poses challenges for fine-tuning on common hardware.Parameter-efficient tuning methods (e.g., Adapter tuning, Prefix tuning, or LoRA) typically involve adding a small number of parameters to the language model and only fine-tuning the added subset of parameters, achieving comparable performance to full fine-tuning.Adapter tuning learns the task-specific information (Houlsby et al., 2019;Mahabadi et al., 2021b,a) by inserting small taskspecific modules within layers of the Transformer.Prefix tuning (Li and Liang, 2021;Liu et al., 2021b) prepends task-specific trainable prompt tokens to the hidden states within every intermediate Transformer layer.LoRA (Hu et al., 2021) merges the low-rank and trainable matrices with the frozen weights at each layer of the Transformer.In Figure 1(a), we show the comparison of the GPU memory usage during training for fine-tuning, Adapter tuning, LoRA, and Prefix tuning on the Recognizing Textual Entailment (RTE) (Bar-Haim et al., 2014) task.These parameter-efficient tuning methods can save approximately 30% of the GPU memory requirements but still rely on gradient-based optimization, resulting in increased memory demands for LLMs.
To address the above limitations, we propose a novel parameter-efficient tuning method for LLMs without calculating their gradients.Intuitively, both large and small language models (SLMs) can learn similar task-specific characteristics when applied to downstream tasks.We conduct experiments using existing parameter-efficient tuning methods on the RTE task in the SuperGLUE benchmark (Wang et al., 2019) to validate this hypothesis.As depicted in Figure 1(b), we calculate the similarity between the parameter-efficient tuning modules derived from GPT2-XL and GPT2-base or GPT2-medium1 .Specifically, we apply the LoRA method on GPT2-XL and GPT2-medium to obtain the parameter-efficient modules of the two language models and calculate the cosine similarity of the two parameter-efficient modules.We find that the cosine similarity can reach up to 75%.Our observations indicate that these modules exhibit comparable task-specific characteristics throughout the learning process for specific downstream tasks.Inspired by these findings, if we can successfully transfer the task-specific characteristics learned by the small language model to the large language model, we can enrich the task-specific capabilities into the LLMs without needing gradient-based optimization.
In this paper, we first utilize existing parameterefficient tuning methods in a small language model to learn the task characteristics of downstream tasks.Intuitively, we can directly apply the parameter-efficient module obtained from the small language model to the large language model.However, this would face crucial issues of dimension mismatch and limited interaction with the large language model.To address the issue of dimension mismatch, we employ a projection module to align the dimensions of the parameter-efficient modules between SLMs and LLMs.Furthermore, to enrich the interaction between the parameterefficient module and the large language model, we introduce a Bridge model that can retain the knowledge of the large language model while interacting with the parameter-efficient module, obtaining a parameter-efficient module with dimensions matching the large language model.Finally, we seamlessly plug the acquired parameter-efficient module into the large language model for inference.We conduct comprehensive experiments on T5 series (Raffel et al., 2020) and GPT2 series (Radford et al., 2019) of language models to assess the effectiveness of our method using the SuperGLUE benchmark, a widely recognized evaluation benchmark for natural language understanding.The results demonstrate that our method performs on par with fine-tuning and parameter-efficient tuning on large language models without needing gradient-based optimization.Additionally, our proposed method achieves up to 5.7× memory reduction compared to parameter-efficient tuning.Our findings highlight the potential of bridging small and large language models, thereby efficiently leveraging expansive large language models.In summary, our key contributions can be listed as follows: • Our analysis reveals a substantial task similarity when applying parameter-efficient tuning methods to SLMs and LLMs for downstream tasks.
• We propose a gradient-free method to adapt the parameter-efficient modules learned in a small language model to a large language model.
• Extensive experiments on SuperGLUE benchmark under both T5 series and GPT-2 series of language models verify the effectiveness of our proposed method and achieve up to 5.7× memory reduction compared to parameterefficient tuning.

Method
The proposed method utilizes parameter-efficient tuning modules to effectively bridge small and large language models, enriching the task-specific capabilities into the large language model without needing gradient-based optimization.As depicted in Figure 2, the method consists of training and inference.We first employ parameter-efficient tuning methods during the training stage to learn task-specific characteristics in the small language model.Then, we fine-tune the Bridge model and the acquired parameter-efficient module to enhance the knowledge of the parameter-efficient module.Finally, we directly plug the parameter-efficient module into the large language model during the inference stage for efficient predictions.
Plug-in and Bridge Model Fine-tuning: We first utilize existing parameter-efficient tuning methods such as Adapter (Houlsby et al., 2019), LoRA (Hu  (Li and Liang, 2021) in a small language model to learn the task characteristics of downstream tasks.However, directly applying the parameter-efficient modules obtained from the small language model to the large language model would face two issues: dimension mismatch and limited interaction with the large language model.To address the issue of dimension mismatch, we employ a linear projection module W plug as the Plug-in model to align the dimensions of the parameter-efficient module with the large language model.Furthermore, considering that small language models usually have fewer layers than large language models, we address the layer mismatch by duplicating the layers of the parameter-efficient modules.This duplication enables us to achieve layer alignment with the large language model.Intuitively, successfully adapting the parameterefficient modules to the large language models requires a substantial interaction between them and the large language models.To enrich the interaction, we introduce a Bridge model that can retain the knowledge of the large language model while interacting with the parameter-efficient module.We employ the pruning method from Ladderside-tuning (Sung et al., 2022) to obtain such a Bridge model, which involves pruning each layer of the large language model f .We use linear projections to downsample the intermediate activa-tions, including word embeddings, from the large language model f to a lower-dimensional Bridge model g, with a reduction factor of r, where r can be 8, 16, 32, 64, etc.To retain crucial information from the large language model, we leverage Fisher information (Liu et al., 2021a;Sung et al., 2021) to prune the parameters of the large language model and obtain the initial Bridge model g.Fisher information could effectively evaluate the importance of parameters in the large language model.Given the W ∈ R d b ×d l of the backbone network that maps the d l -dim vectors to the d b -dim space, we calculate the importance of each weight vector through where (x i , y i ) are samples from data D.Then, we keep the rows and columns of the W , which have the d b r and d l r importance scores.Through iterations of this process in each layer of the Transformer, we obtain a set of weight matrices r that have undergone pruning 1/r times from the backbone network, and we utilize them to initialize the Bridge model.
Subsequently, in order to make the parameterefficient modules (e.g., Adapter, LoRA, or Prefix tuning) learned in the small language model interact with the Bridge model, we apply the linear

Algorithm 1 Adaptation of Parameter-efficient Modules for Large Language Model
Require: Large language model M , a small language model S, the training set D = {(x 1 , y 1 ), • • •, (x n , y n )}, linear projection modules W plug and W down , and parameter-efficient tuning methods (PEFT) (e.g., Adapter tuning, Prefix tuning, and LoRA) 1: Apply the PEFT on S and fine-tuning S on D to obtain parameter-efficient modules 2: Employ the W plug to align the dimensions of the parameter-efficient modules with M 3: Prune the parameters of the M and get a Bridge model g 4: Apply the linear projection module W down on parameter-efficient modules 5: for each instance ( x 1 , y 1 ) in D do   6: Fine-tune the parameter-efficient modules together with the Bridge model g 7: end for 8: Plug the parameter-efficient modules and the linear projection modules W plug into the M projection module W down on parameter-efficient modules and fine-tune the parameter-efficient modules together with the Bridge model g.This finetuning process enables us to achieve two objectives: obtaining parameter-efficient modules that match the dimensions of the large language model and enriching these modules with knowledge from the large language model.Inference: Once the training of the parameterefficient modules and the Bridge model g is complete, we integrate the trained parameter-efficient modules, enriched with knowledge from the large language model, into the large language model.This integration empowers the large language model to leverage the task-specific knowledge captured by the parameter-efficient modules during the inference process without requiring gradient-based optimization.The complete algorithm is depicted in Algorithm 1.

Experimental Settings
We conduct extensive experiments on eight natural language understanding tasks from the SuperGLUE benchmark, including BoolQ (Clark et al., 2019), CB (De Marneffe et al., 2019), COPA (Roemmele et al., 2011), MultiRC (Khashabi et al., 2018), RTE (Bar-Haim et al., 2014), WiC (Pilehvar and Camacho-Collados, 2019), WSC (Levesque et al., 2012), and ReCoRD (Zhang et al., 2018).For each task, we report the accuracy or F1-score.In our experiments, we evaluate the effectiveness of our method using both the GPT2 series (Radford et al., 2019) of autoregressive language models and the T5 series (Raffel et al., 2020) of sequence-tosequence language models.In the GPT2 series of models, we designate the GPT2-base as the small model and the GPT2-XL as the large model.For the T5 series of models, we classify the T5-base and T5-large as small language models, while the T5-3B and T5-XXL are considered large language models for our experiments.To obtain the Bridge model, we set the reduction factor r = 16.In Table 2, we provide the model parameters of the small language models, large language models, and Bridge models.It is worth noting that the model parameters of the Bridge models are significantly smaller than those of the large language models.The training process of our proposed method is conducted on an NVIDIA A100 GPU with 80GB of memory.
Our objective is to demonstrate that the parameter-efficient modules learned on the small language models can be effectively adapted to the large language models, achieving comparable performance to full fine-tuning and parameter-efficient tuning without needing gradient-based optimization.We list the baselines as follows: Fine-tuning: The vanilla Transformer finetuning.
Adapter tuning: Inserting a small task-specific module between the self-attention module (and the MLP module) and the subsequent residual connection at each Transformer layer (Houlsby et al., 2019).
Prefix tuning: Adding trainable continuous prompt vectors to the Key and Value components of the attention layer at each layer of the Transformer model (Li and Liang, 2021).
LoRA: Merging the low-rank and trainable matrices with the frozen weights at each layer of the Transformer (Hu et al., 2021).
Adapter tuning Plug-in: By applying the Adapter tuning method to small language models, we obtain parameter-efficient modules that can be adapted to large language models.Prefix tuning Plug-in: By applying the Prefix tuning method to small language models, we obtain parameter-efficient modules that can be adapted to large language models.
LoRA Plug-in: By applying the LoRA method to small language models, we obtain parameterefficient modules that can be adapted to large language models.

Experiments on GPT-2 Series of Models
Table 1 shows the performance of our proposed method using the GPT2 series of models (Radford et al., 2019).In Table 1, we present the results obtained with GPT2-base as the small language model and GPT2-XL as the large language model.The Adapter Plug-in, Prefix tuning Plug-in, and LoRA Plug-in are the parameter-efficient modules learned in the GPT2-base model adapted to the GPT2-XL model.As can be seen, in comparison to directly conducting vanilla fine-tuning on the large language model, our method achieves comparable results.Furthermore, compared to parameter-efficient tuning methods applied to the large language model, our method demonstrates comparable performance without the need to finetune any parameters of the large language model.Specifically, it exhibits a slight improvement compared to the prefix-tuning method on the BoolQ and COPA tasks, demonstrating the effectiveness of our method.

Experiments on T5 Series of Models
Table 3 and Table 4 display the performance of our proposed method using the T5 series of models (Raffel et al., 2020).In Table 3, we present the results obtained with T5-base as the small language model and T5-3B as the large language model.The Adapter Plug-in, Prefix tuning Plug-in, and LoRA Plug-in are parameter-efficient modules of our method used to adapt T5-3B.Our method achieves comparable results on all eight Super-GLUE tasks without the need to fine-tune any parameters of the T5-3B, demonstrating the effectiveness of our method.Particularly in BoolQ, CB, RTE, and ReCoRD, our method performs at a level similar to full fine-tuning.Similarly, Table 4 shows the results using T5-large as the small language model and T5-XXL as the large language model, with the same Plug-ins employed to adapt T5-XXL.
Our method achieves comparable results on all eight SuperGLUE tasks without the need to finetune any parameters of the T5-XXL.Furthermore, compared to parameter-efficient tuning methods applied to the large language model, our method demonstrates comparable performance without the need to fine-tune any parameters of the large language model.Especially when compared to the Table 4: Results on the SuperGLUE benchmark.The Adapter Plug-in, Prefix tuning Plug-in, and LoRA Plug-in are the parameter-efficient modules learned in the small language models adapted to the large language models.The full fine-tuning performance of T5-large and T5-XXL models is based on the research conducted by Raffel et al. (2020).0% means we do not update any parameter within the large language models.
prefix-tuning method, our proposed method shows a slight improvement in the CB, COPA, and WSC tasks when applied to the T5-3B model.Similarly, when applied to the T5-XXL model, our method slightly improves the BoolQ, CB, and ReCoRD tasks.
It is evident that as the parameter size of pretrained language models grows, existing parameterefficient tuning methods necessitate the addition of more parameters to the large language model.In contrast, our method does not entail any parameter augmentation but instead optimally harnesses the capabilities of the large language model through a plug-in approach.These findings indicate that our method can effectively utilize the knowledge of large language models without updating the pa-rameters of large language models, which suggests the potential for application to even larger language models.

Importance of the Reduction Factor r
Considering the impact of the reduction factor r on the amount of knowledge retained in the Bridge model, we conduct experiments to analyze its importance.For verification, we select CB, RTE, and WiC in the SuperGLUE benchmark.As indicated in Table 5, we observe a gradual decrease in model performance as r increases.This is because a higher value of r reduces the retained knowledge from the large language model to the Bridge model.According to Table 5: The accuracy on CB, RTE, and WiC with different reduction factors r.We perform the verification on T5-3B using the LoRA Plug-in approach.
are significantly smaller than that of the small language model, our method exhibits a noticeable performance decline during the entire inference process.However, when the model parameters of the Bridge model are comparable to that of the small language model, our method maintains the performance of the model without substantial degradation.To strike a balance between model performance and the parameter of the Bridge model, we believe that selecting r = 16 is a suitable choice.

Memory Usage
Our method not only achieves comparable performance to full fine-tuning and parameter-efficient tuning methods without updating any parameters of the large language models but also significantly achieves an impressive reduction in memory usage.As shown in Table 6 and Table 7, we compare the memory usage between our proposed method and the baseline models.When using the GPT2-XL model, we conduct experiments with a batch size of 8 and a sequence length 512.Our method evidently achieves up to 7.1× memory savings compared to vanilla fine-tuning.Similarly, when utilizing the T5-3B model with a batch size of 1 and a sequence length 512, our method achieves up to 5.1× memory savings compared to vanilla fine-tuning.
In particular, we compare the existing parameterefficient tuning methods, and in GPT2-XL, we can see that our proposed method can achieve significant memory savings.For example, comparing the three parameter-efficient tuning methods of Adapter-tuning, Prefix tuning, and LoRA, our proposed method can achieve 5.3×, 5.6×, and 5.7× memory reduction, respectively.Our method also achieves 2.9× more memory savings compared to Ladder-side Tuning.In T5-3B, our proposed method can achieve 3.6×, 3.5×, and 3.6× memory reduction, respectively, compared to the Adaptertuning, Prefix tuning, and LoRA.Similarly, our method also achieves 1.3× more memory savings compared to Ladder-side Tuning.This demonstrates that our method can be more effectively applied to existing large language models without compromising performance.Furthermore, our proposed method does not slow down the inference speed of the model.By utilizing the plug-in approach, we can directly leverage the knowledge of the large language model during inference without compromising the speed.

Utilize Bridge Model Directly?
A natural question arises: Why not learn the parameter-efficient modules directly on the Bridge model instead of using a smaller language model and then applying the learned efficient modules to the larger language model?Intuitively, given the learned parameter-efficient modules with the Bridge model, we need a projection model to project the dimensions to match the large language model.However, the projection model can only be well learned with another interaction model.In this section, we train the Bridge model by directly initializing the parameter-efficient and lin-

Related Work
Fine-tuning and Parameter-efficient Tuning: Large language models leverage parameterized Transformers as a foundational framework and train them on extensive unsupervised language corpora (Brown et al., 2020;OpenAI, 2023;Touvron et al., 2023;Raffel et al., 2020).Subsequently, specific task objectives are introduced for downstream tasks to perform full fine-tuning on the pretrained language models (Jin et al., 2022(Jin et al., , 2023;;Muennighoff et al., 2022).As the primary method for optimizing pre-trained language models, full fine-tuning involves initializing the model with pretrained weights, updating all model parameters, and storing a separate fully optimized model for each downstream task.However, as the model parameter grows, conducting full fine-tuning on existing computational devices becomes increasingly challenging.In light of this, researchers have started exploring efficient methods for effectively harnessing the power of large language models.As an efficient alternative, parameter-efficient tuning is a promising way of stimulating large language models.Compared to vanilla fine-tuning, parameter-efficient tuning methods only tune a small portion of the model parameters while keeping the rest frozen.Adapter tuning methods, such as those proposed by Houlsby et al. (2019) and Mahabadi et al. (2021b,a), aim to learn taskspecific information by incorporating small-scale task-specific modules into the layers of the Transformer.Prefix tuning methods, as introduced by Li and Liang (2021) and discussed in the work by Liu et al. (2021b), also introduce additional parameters within the Transformer layers.LoRA, proposed by Hu et al. (2021), merges low-rank and trainable matrices with the frozen weights at each layer of the Transformer.BitFit (Ben Zaken et al., 2022) is a simple yet effective method that optimizes the bias terms within a model while keeping the other parameters frozen.However, these existing parameter-efficient tuning methods typically rely on gradient-based optimization and still involve substantial memory usage.Our proposed method, on the other hand, enables substantial memory savings while maintaining comparable performance.Gradient-free Optimization: Recently, Sung et al. (2022) introduced a method that eliminates the need for gradient updates by directly applying a pruned model to downstream tasks, while the method does not fully exploit the knowledge of the large language model.Xiao et al. (2023) propose an efficient transfer learning framework that can adapt large language models to downstream tasks without access to full model parameters, while the method through compute-intensive distillation techniques may be cost-prohibitive for larger models.In contrast, our proposed method enables further utilization of the knowledge contained in the large language model and enables significant memory savings through a simple operation while preserving the performance of large language models.

Conclusion
This paper proposes a novel parameter-efficient tuning method for large language models without calculating their gradients.We first learn the parameter-efficient tuning module for small language models.Then, the learned parameterefficient tuning module is adapted into large language models with a bridge model that handles the dimensionality mismatch and enables interaction between the parameter-efficient tuning module and the large language model.Extensive experiments on the SuperGLUE benchmark demonstrate that our method achieves comparable performance to vanilla fine-tuning and parameter-efficient tuning on large language models without needing gradient-based optimization.We believe that our method offers a potential direction to utilize large language models efficiently and economically.

Limitations
For large language models for which weights cannot be obtained, the proposed methods cannot be directly applied, and there may be certain limitations when applying the proposed methods to language models with different architectures.Continuous exploration and research will be conducted to determine how to apply the proposed methods to different architecture language models, aiming to improve their compatibility and effectiveness across different architectures and language model characteristics.
The proposed method still updates the parameters of PEFT modules based on the gradient on a small language model, and the whole process is a pipeline process, which requires practical training in the early stage before it can be applied to the large model.

Fine
Figure 1: (a) Comparison of the GPU memory usage during training for fine-tuning, Adapter tuning, LoRA, Prefix tuning, and our method on the RTE task.(b) The cosine similarity of the parameter-efficient modules on the GPT-2 series of language models over the RTE task.
Figure 2: (a) In the training stage, we use parameter-efficient tuning methods to learn task-specific characteristics in a smaller model and fine-tune the Bridge model with the acquired parameter-efficient modules.(b) In the inference stage, we directly plug the parameter-efficient modules into the large language model for efficient predictions.

Table 1 :
Results on the SuperGLUE benchmark.The Adapter Plug-in, Prefix tuning Plug-in, and LoRA Plug-in are the parameter-efficient modules learned in the small language models adapted to the large language models.0% means we do not update any parameter within the large language models.

Table 3 :
Raffel et al. (2020)GLUE benchmark.The Adapter Plug-in, Prefix tuning Plug-in, and LoRA Plug-in are the parameter-efficient modules learned in the small language models adapted to the large language models.The full fine-tuning performance of T5-base and T5-3B models is based on the research conducted byRaffel et al. (2020).0% means we do not update any parameter within the large language models.
Table 5, we can observe that when the model parameters of the Bridge model

Table 6 :
Memory Usage (MU) and Memory Reduction (MR) compared to vanilla fine-tuning of our proposed method on a single NVIDIA A100 GPU with 80GB of memory.Batch sizes are 8 and sequence lengths are 512.