Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization

Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data. However, training Large Language Models (LLMs) generally requires updating an enormous number of parameters, which limits the applicability of FL techniques to LLMs in real scenarios. Prompt tuning can significantly reduce the number of parameters to update, but it incurs either performance degradation or low training efficiency. The straightforward utilization of prompt tuning in FL often raises non-trivial communication costs and dramatically degrades performance. In addition, decentralized data is generally non-Independent and Identically Distributed (non-IID), which brings client drift problems and thus poor performance. This paper proposes a Parameter-efficient prompt Tuning approach with Adaptive Optimization, i.e., FedPepTAO, to enable efficient and effective FL of LLMs. First, an efficient partial prompt tuning approach is proposed to improve performance and efficiency simultaneously. Second, a novel adaptive optimization method is developed to address the client drift problems on both the device and server sides to further enhance performance. Extensive experiments based on 10 datasets demonstrate the superb performance (up to 60.8% higher accuracy) and efficiency (up to 97.59% less training time) of FedPepTAO compared with 9 baseline approaches. Our code is available at https://github.com/llm-eff/FedPepTAO.


Introduction
As a promising paradigm to handle decentralized data, Federated Learning (FL) (Kairouz et al., 2021) enables collaborative model training without transferring the raw data across a massive number of devices. As a number of legal restrictions (Official Journal of the European Union, 2016; Californians for Consumer Privacy, 2020) have been implemented, aggregating the decentralized data into a central server or data center becomes complicated or even impossible (Yang et al., 2019). FL generally exploits a parameter server module (Liu et al., 2023b) to manage the distributed model updates on devices, exchanging only the parameters of the updated models, instead of the raw data, between the parameter server and the devices.
Large Language Models (LLMs) (Devlin et al., 2018; Liu et al., 2019b; Brown et al., 2020; Lewis et al., 2019) have achieved major advances in Natural Language Processing (NLP) tasks. The scale of LLMs can range from 110 million to 175 billion parameters, which corresponds to huge communication and computation costs to update parameters during the pre-training process (Sanh et al., 2022; Wang et al., 2022) or fine-tuning process (Ding et al., 2023). Both pre-training and fine-tuning update the whole set of parameters of the language model. Thus, applying FL to the pre-training or fine-tuning process is almost impossible due to the significant communication burden brought by the large number of parameters.
Prompt design (Brown et al., 2020) can lead to excellent performance while freezing the original LLMs. When the LLMs are frozen, only the prompts or prefixes are updated during the tuning process, which can significantly reduce the number of parameters to update. For instance, for a sample from a sentiment analysis task (e.g., "beautiful place!"), a discrete prompt "It was [MASK]." for prompt tuning (Brown et al., 2020) or continuous task-specific vectors for prefix tuning (Li and Liang, 2021) can be concatenated with the input and sent to an LLM, which generates the label of the sample as "terrible" or "great".
Numerous parameter-efficient prompt or prefix tuning approaches have been proposed to tune large language models by updating a few trainable parameters while achieving performance comparable to fine-tuning. In order to avoid human involvement in prompt design, prompt tuning methods (Shin et al., 2020) are proposed to search for proper prompts within a discrete space of words, which corresponds to inferior performance compared with fine-tuning. Continuous prompts, i.e., prefix tuning, can be updated to achieve better performance (Liu et al., 2021; Lester et al., 2021). However, this approach leads to sub-optimal performance for models with fewer than 10 billion parameters. Although P-tuning V2 (Liu et al., 2022d) achieves performance comparable to fine-tuning, it introduces more parameters, which may correspond to heavier communication costs in the FL setting compared with other parameter-efficient tuning approaches. Some other parameter-efficient prompt tuning methods either suffer from low performance with the focus on low-rank hypercomplex adapter layers (Karimi Mahabadi et al., 2021a) or prompts at a single layer (Liu et al., 2022c), or introduce extra computation costs with attention (Asai et al., 2022). In addition, existing FL techniques for fine-tuning large language models typically incur performance degradation or low efficiency due to huge communication costs (Tian et al., 2022; Sun et al., 2022; Zhao et al., 2023).
Adaptive optimization methods, e.g., Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD) with Momentum (SGDM) (Sutskever et al., 2013), have been utilized either on the server side (Duchi et al., 2011; Reddi et al., 2018a) or on the device side (Yuan et al., 2021; Liu et al., 2020; Gao et al., 2021a; Wang et al., 2020) to achieve superior performance in FL. However, the direct application of adaptive optimization methods may incur convergence problems within the training process (Reddi et al., 2018b). Furthermore, applying adaptive optimization on a single side, i.e., either device or server, may correspond to poor performance, while exploiting it on both sides (Jin et al., 2022a) may incur heavy communication costs. In addition, client drift (Karimireddy et al., 2020b) may arise in adaptive optimization due to non-Independent and Identically Distributed (non-IID) data among devices.
In this paper, we propose a Parameter-efficient prompt Tuning approach with Adaptive Optimization, i.e., FedPepTAO, to tune large language models with FL. As transferring the whole set of parameters in all the prompt layers corresponds to heavy communication costs, we propose an efficient and effective method to choose proper layers of prompts based on the importance of each layer. We design a scoring method to identify the importance of each layer according to the tuning impact of the layer on the final convergence accuracy. In addition, we propose an adaptive optimization method on both the server side and the device side with control measures on each device to achieve superb accuracy. We summarize our major contributions as follows:
• We propose a novel parameter-efficient prompt tuning method with an efficient and effective method to choose proper layers of prompts for FL. The subset of layers of prompts can reduce both the communication and computation costs within FL.
• We provide an original adaptive optimization method on both server side and device side with control measures on each device.
• We carry out extensive experimentation based on 10 datasets, which demonstrates the advantages of FedPepTAO in terms of accuracy (up to 60.8% higher) and efficiency (up to 97.59% faster) compared with 9 baseline approaches.

Related Work
As updating all the parameters of a pre-trained LLM consumes a large amount of memory and computation resources, prompt tuning (Brown et al., 2020) and prefix tuning (Li and Liang, 2021) have been proposed to update a few parameters with a frozen language model while achieving performance comparable to fine-tuning. While prompt tuning may correspond to inferior performance with a discrete space of words (Shin et al., 2020), prefix tuning (Liu et al., 2021; Lester et al., 2021) can deal with continuous prompts to achieve better performance. Adapter modules (Houlsby et al., 2019) are exploited to tune large language models with prompts, which may incur heavy computation costs due to the calculation of the feed-forward projection and non-linearity or attention mechanism (Asai et al., 2022). Although an efficient low-rank hypercomplex mechanism (Karimi Mahabadi et al., 2021b) can be utilized to reduce the parameters to update, the performance may degrade. P-tuning V2 achieves performance comparable to fine-tuning (Liu et al., 2022d), in which prompts are added into each layer of the large language model. Prompts can be added at a single layer to further reduce the computation costs (Liu et al., 2022c), which may incur performance degradation and depends on multiple trials with each layer. In addition, the selection of the layer may incur long execution time to verify the impact on the final accuracy. Although the NASWOT algorithm (Mellor et al., 2021) can be exploited to analyze the performance of a neural network architecture, it is only compatible with neural networks based on the ReLU activation function (Nair and Hinton, 2010; Glorot et al., 2011).
Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD) with Momentum (SGDM) (Sutskever et al., 2013) are exploited within FL on the server side (Duchi et al., 2011; Reddi et al., 2018a) or on the device side (Yuan et al., 2021; Liu et al., 2020; Gao et al., 2021a; Wang et al., 2020) to address the client drift problem brought by non-IID data in FL (Karimireddy et al., 2020b). However, the direct application of adaptive optimization on devices may lead to convergence problems (Reddi et al., 2018b). In addition, the application of adaptive optimization on both server and device sides may incur heavy communication costs (Jin et al., 2022a).
Different from previous work, we propose a general scoring method to analyze the correlation between each layer and the output of the large language model, which can represent the importance of each layer. Then, we select the prompt parameters of proper layers to be updated with FL while leaving the prompt parameters of other layers to be adjusted locally with a lossless method, so as to achieve superb performance with limited communication costs. In addition, we introduce control measures on each device to alleviate the client drift problem and propose a novel adaptive optimization method on both server and device sides to further improve the performance.

Problem Formulation
The problem to address in this paper is how to efficiently tune a large language model based on prompt tuning in FL. Given a large language model M with L layers, we add prompts for each layer in M and denote the set of parameters to generate the prompts at Layer l by P_l, with l representing the layer index in M. The whole set of prompts is denoted by P. During the tuning process, the parameters in M are frozen and cannot be updated, while the parameters in P are updated to improve the performance of M.
We consider an FL setting with a parameter server and M devices. We assume that the data for the tuning process of M is distributed among multiple devices. On each Device i, a dataset D_i = {s_i, m_i}^{n_i} is located, with s_i, m_i, and n_i representing a sample, the corresponding label of s_i, and the number of samples in D_i. We denote the total number of samples on all the devices by N, the set of all the samples by S, and that of labels by M. Due to the limited computation capacity, each Device i can only perform the inference of M while updating P_i, with P_i representing the prompt parameters on Device i. In order to reduce communication costs, we enable the exchange of the parameters of the prompts within a subset of selected layers SL_i between Device i and the parameter server, while the other prompt parameters are only updated within each device. We denote the set of prompt parameters in all the devices by P. The problem to address in this paper can be formulated as how to efficiently generate P such that the global loss is minimized:

min_P F(M, P) = Σ_{i=1}^{M} (n_i / N) · F_i(M, P_i),  (1)

where F(M, P) represents the global loss, F_i(M, P_i) = (1/n_i) Σ_{{s_k, m_k} ∈ D_i} f(M, P_i, s_k, m_k) refers to the loss function on Device i, with f(M, P_i, s_k, m_k) calculating the local loss of the combination of the large language model M and prompt parameters P_i on {s_k, m_k}.
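As an illustration, the weighted global objective in Formula 1 can be sketched in a few lines. This is a toy example, not the authors' code; the per-device loss values and sample counts are arbitrary placeholders:

```python
# A minimal sketch of the global objective: the global loss is the
# average of per-device losses weighted by the local sample counts
# n_i / N. The loss values below are arbitrary placeholders.
def global_loss(device_losses, device_sizes):
    n_total = sum(device_sizes)
    return sum((n_i / n_total) * f_i
               for f_i, n_i in zip(device_losses, device_sizes))

# Two devices with 10 and 30 samples: the larger device dominates.
print(round(global_loss([0.2, 0.6], [10, 30]), 6))  # 0.5
```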
For NLP tasks, each sample s_k ∈ S is the input of the large language model and m_k ∈ M is the corresponding label. Each sample s_k is composed of multiple tokens, i.e., s_k = {s_k^1, s_k^2, ..., s_k^t}, where t represents the length of the input. The prompt p consists of multiple tokens p = {p^1, p^2, ..., p^h}, and the corresponding prompt parameters can be trained. The prompts differ across layers. We denote the template by T(·), which defines how to concatenate the input tokens with the prompt. For instance, s_k^p = T(s_k, p) represents the sample combined with the prompt, which contains one [MASK] token. The output of the large language model with the prompts predicts the label m_k, which corresponds to the [MASK] token after applying a verbalizer V(·), i.e., m̂_k = V(o_k), with o_k representing the output of the model and m̂_k referring to the predicted label.
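For concreteness, the template T(·) and verbalizer V(·) from the running sentiment example can be sketched as follows. This is a toy illustration; the label mapping and the assumed model prediction are ours, not from the paper:

```python
# Toy template/verbalizer pair for the sentiment example in the text.
def template(sample: str) -> str:
    # T(s, p): concatenate the input with the discrete prompt "It was [MASK]."
    return f"{sample} It was [MASK]."

def verbalizer(mask_token: str) -> str:
    # V(o): map the token predicted at the [MASK] position to a task label.
    return {"great": "positive", "terrible": "negative"}[mask_token]

prompted = template("beautiful place!")
print(prompted)             # beautiful place! It was [MASK].
# A real frozen LLM would now fill in the [MASK]; assuming it predicts
# "great", the verbalizer yields the final label.
print(verbalizer("great"))  # positive
```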
In this section, we first present the system model of FedPepTAO. Then, we propose the parameter-efficient prompt tuning method and the adaptive optimization method, respectively.

System Model
As shown in Figure 1, we consider a parameter server and multiple devices for the tuning process of FedPepTAO. We assume that a large language model is deployed on each device. For each layer, we insert a prompt module. During the tuning process, the large language model only performs inference, while the prompt modules of each layer perform both the inference of the input and the update of parameters. Within the FL tuning process, the prompt parameters of specific layers are communicated between the devices and the server.
During the FL tuning process of FedPepTAO, the prompt parameters on each device are updated over multiple rounds. Each round consists of five steps. First, a set of devices is selected to perform the update of prompt parameters. Second, these devices receive the corresponding updated prompt parameters of specific layers from the server (①). The selection of the specific layers is based on our parameter-efficient prompt tuning method (see details in Section 3.2) (②-③). Third, the prompt parameters are updated with our adaptive optimization method (see details in Section 3.3) based on the data on each device (④). Fourth, the prompt parameters of specific layers are sent back to the server (⑤). Fifth, the prompt parameters are aggregated on the server with the adaptive optimization method (⑥-⑦).
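The five steps above can be sketched as one round of a heavily simplified training loop. This is a schematic illustration, not the authors' implementation; the dummy `+ 0.1` stands in for the actual local prompt update, and a uniform average replaces the adaptive aggregation:

```python
# Schematic sketch of one FedPepTAO round: device selection, download of
# selected-layer prompts, local update, upload, and server aggregation.
import random

def run_round(server_prompts, devices, selected_layers, sample_size):
    selected = random.sample(devices, sample_size)                 # step 1
    updates = {}
    for dev in selected:
        prompts = {l: server_prompts[l] for l in selected_layers}  # step 2
        updates[dev] = {l: p + 0.1 for l, p in prompts.items()}    # steps 3-4
    # step 5: aggregate the uploaded prompt parameters (uniform average here)
    for l in selected_layers:
        server_prompts[l] = sum(u[l] for u in updates.values()) / len(updates)
    return server_prompts

# Only layer 0 is selected for communication; layer 1 stays local.
print(run_round({0: 0.0, 1: 0.0}, ["d1", "d2", "d3"], [0], 2))  # {0: 0.1, 1: 0.0}
```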

Parameter-efficient Prompt Tuning
We propose a parameter-efficient prompt tuning method to efficiently tune the language model with FL. Instead of synchronizing the full set of prompt parameters, we select a proper set of layers for each device and only exchange the prompt parameters of these layers during the tuning process. In this section, we first propose a scoring method to measure the importance of each layer. Then, we propose a lossless layer selection method to select the proper layers, which reduces the communication costs without performance degradation.
Given the prompt parameters based on any activation function, we can calculate the hidden states of each parameter at each layer of the large language model. With a batch of local data samples S_i = {s_i}^{n_i} mapped through the large language model and the prompt parameters corresponding to the function f^p(s_i), the hidden state corresponding to Node k at the l-th layer is f^p_{k,l}(s_i). Then, the hidden states of Layer l corresponding to Sample s_i are h_{i,l} = (f^p_{1,l}(s_i), f^p_{2,l}(s_i), ...). As the difficulty for a network to learn to separate the input samples has a positive correlation with the similarity of the hidden states (Mellor et al., 2021), we examine the correlation between the hidden states of any two layers by computing the following kernel matrix:

K_i = [Cos(h_{i,a}, h_{i,b})]_{a,b=1}^{L},  (2)

where Cos(·, ·) represents the cosine similarity between two vectors (Dehak et al., 2010). Then, we calculate the eigenvalues of the kernel matrix Λ_i = {λ_{i,1}, λ_{i,2}, ..., λ_{i,L}}, with λ_{i,l} representing the distinction of Layer l compared with other layers based on Sample s_i. Afterward, we compute the score ζ_{i,l} of Layer l with the local dataset on Device i using Formula 3, which can avoid the possibility of an unacceptable performance penalty due to abnormal eigenvalues (Gao et al., 2021b).
where ϵ refers to a small positive value, e.g., 1e−5. We calculate the global score of each layer by leveraging Formula 4, which aggregates the per-device scores on the server.
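Under toy hidden states, the kernel of Formula 2 and its eigenvalues can be sketched as follows. This is a pure-Python illustration under our own assumptions; a real implementation would operate on the model's actual per-layer hidden states and a general eigensolver:

```python
# Sketch of the layer-scoring kernel: pairwise cosine similarities
# between per-layer hidden-state vectors, then the eigenvalues of the
# resulting L x L matrix (here L = 2 for a closed-form solution).
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def kernel(hidden_states):  # hidden_states[l] plays the role of h_{i,l}
    L = len(hidden_states)
    return [[cos(hidden_states[a], hidden_states[b]) for b in range(L)] for a in range(L)]

# For a 2x2 symmetric matrix [[1, c], [c, 1]] (diagonal ~= 1 here),
# the eigenvalues are 1 + c and 1 - c.
def eig2(K):
    return [1 + K[0][1], 1 - K[0][1]]

K = kernel([[1.0, 0.0], [1.0, 1.0]])
print([round(x, 3) for x in eig2(K)])  # [1.707, 0.293]
```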
In order to efficiently tune the large language model without performance degradation, we exploit a lossless method, as shown in Algorithm 1, to select the set of proper layers within FL. Within the first t rounds, the prompt parameters of all the layers are communicated between the server and each device. t can be small, e.g., 5 or 10. At the t-th round, we perform the layer selection. First, we calculate ∆_i as the change of the prompt parameters (Line 3). Then, we calculate the Hessian matrix (based on the efficient PyHessian library (Yao et al., 2020)) of the current model (Line 4) and the corresponding eigenvalues (Line 5), and sort the eigenvalues in ascending order (Line 6). Afterward, we construct a base function in Line 7, with ∇F_i representing the gradients, and calculate the Lipschitz constant of the base function in Line 8. We take the first k that can meet the constraint in Line 9, and calculate the minimum remaining prompt parameter ratio in the selected layers R_i, inspired by (Zhang et al., 2021), which can achieve lossless performance compared with tuning the prompts at all the layers. We calculate the score of each layer in Lines 11-13. The execution of Lines 3-13 can be carried out in parallel on each device. We aggregate the prompt parameter ratio and the scores based on Formula 4 from each device on the server (Line 15). Then, we sort the layers according to the scores in descending order (Line 16). Finally, we add the layers into the selected layer set based on the scores in descending order (Lines 17-19), with Para(SL_i) representing the number of parameters in the selected layer set. In the following rounds, only the prompt parameters in SL are communicated between the devices and the server.
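The final selection step (Lines 16-19) can be sketched as follows, assuming per-layer scores and parameter counts as inputs and the aggregated ratio R as the budget. This is a simplification of Algorithm 1 with made-up numbers, not the authors' exact procedure:

```python
# Sketch of the greedy selection: layers are sorted by global score in
# descending order and added to the selected set until the selected
# prompt parameters reach the ratio R of the total.
def select_layers(scores, layer_params, ratio):
    total = sum(layer_params.values())
    selected, budgeted = [], 0
    for layer in sorted(scores, key=scores.get, reverse=True):
        if budgeted >= ratio * total:
            break
        selected.append(layer)
        budgeted += layer_params[layer]
    return selected

# Toy example: three layers with equal parameter counts, ratio R = 0.5.
print(select_layers({0: 0.9, 1: 0.2, 2: 0.5}, {0: 100, 1: 100, 2: 100}, 0.5))  # [0, 2]
```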

Communication-Efficient Adaptive Optimization
As decentralized data is generally non-IID, we propose a novel communication-efficient adaptive optimization method to achieve superb performance without introducing extra communication costs. In order to achieve excellent performance, we apply adaptive optimization on both the server side, based on momentum (Cutkosky and Mehta, 2020), and the device side, based on Adam (Kingma and Ba, 2015). We reset the first and second momentum buffers to zero at the beginning of each local update (Wang et al., 2021) to avoid extra communication of the momentum variables between the server and the devices. In addition, we maintain a state for each device on the server to avoid the possible client drift problem incurred by non-IID data (Karimireddy et al., 2020b).
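The device-side update can be sketched as a standard Adam loop whose moment buffers start from zero in every round, so no optimizer state crosses the network. This is a generic scalar Adam sketch under our own assumptions, not the authors' exact implementation; the gradients below are placeholders:

```python
# Device-side local update: plain Adam with the first/second moment
# buffers reset to zero at the start of each local update.
import math

def local_adam(w, grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m, v = 0.0, 0.0                      # buffers reset each round
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g        # first moment
        v = b2 * v + (1 - b2) * g * g    # second moment
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# With a constant gradient, each Adam step moves the weight by ~lr.
print(round(local_adam(1.0, [0.5, 0.5, 0.5]), 4))  # 0.997
```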
The algorithm of communication-efficient adaptive optimization for FL prompt tuning is shown in Algorithm 2. Within each round, we first randomly sample a subset of devices (Line 3). Then, the prompt parameters corresponding to the model in the last round are sent to each device (Line 5), and each selected device performs a local update based on Adam (Lines 7-8). Afterward, each selected device returns the accumulated difference of the prompt parameters to the server (Line 10). Please note that the execution of Lines 4-11 can be performed in parallel on each selected device. We aggregate the differences based on Formula 4 (Line 12). Inspired by (Karimireddy et al., 2020b), we calculate the control variate c_i^r (Line 14) and the corresponding difference ∆c_i^r (Line 15) for each device on the server. We aggregate the control variate differences based on Formula 4 (Line 17), and calculate the global control variate (Line 18) and the global gradients (Line 19). Afterward, we update the momentum (Line 21) and prompt parameters (Line 22) for each selected device. Finally, we aggregate the global prompt parameters (Line 24) and momentum (Line 25). The communication cost of Algorithm 2 depends on the size of the prompt parameters, which is similar to that of FedAvg (McMahan et al., 2017), while achieving superb performance.
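The server-side portion of Algorithm 2 can be sketched as follows, keeping only the weighted aggregation of device differences (in the style of Formula 4) and the momentum update, and abstracting away the per-device control variates. All numbers are placeholders; this is not the authors' exact update rule:

```python
# Simplified server-side step: aggregate per-device prompt-parameter
# differences weighted by sample counts, fold them into a momentum
# buffer, and apply a global learning rate.
def server_update(w, deltas, sizes, momentum, beta, eta):
    n = sum(sizes)
    grad = sum((n_i / n) * d_i for d_i, n_i in zip(deltas, sizes))  # weighted aggregation
    momentum = beta * momentum + (1 - beta) * grad                  # server momentum
    return w + eta * momentum, momentum

w, m = server_update(w=1.0, deltas=[0.2, -0.1], sizes=[50, 50], momentum=0.0, beta=0.9, eta=1.0)
print(round(w, 3), round(m, 3))  # 1.005 0.005
```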

Experiments
In this section, we present the experimental results over 9 baselines and 10 commonly-used tasks to demonstrate the advantages of FedPepTAO.
We evaluate FedPepTAO and all other methods on RoBERTa LARGE (Liu et al., 2019b), which consists of 24 layers of transformers followed by a language model head, with 355M pretrained parameters. To demonstrate the adaptability of FedPepTAO, we carried out extra experiments with three additional decoder-based models, i.e., the GPT2 LARGE model (Radford et al., 2019) with 774M parameters on the MRPC, MR, and SST-2 datasets, the LLaMA 3B model (Touvron et al., 2023) on the RTE and MRPC datasets, and the LLaMA 7B model (Touvron et al., 2023) on the MRPC dataset. The backbones of these models are frozen for all methods.

Evaluation of Parameter-efficient Tuning
Formula 4 calculates the global score ζ_l of each transformer layer in the model, based on which the proper layers are selected to enable the communication between devices and the server. The selection of layers is critical to the performance of FL. In this section, we compare our parameter-efficient prompt tuning (PEPT) with three other selection strategies, i.e., selection with ascending order, descending order ((Liu et al., 2022d)), and random order. Our PEPT method can significantly outperform ascending order (up to 5.78%), descending order (up to 1.81%), and random order (up to 2.89%) (see Figure 5 in Appendix). In addition, we conduct experiments and demonstrate that our PEPT method outperforms the random layer selection strategy by 2.93% on RTE, 4.73% on MRPC, and 7.29% on the CoLA dataset (see Appendix B.6 for details).

Evaluation of Adaptive Optimization
To demonstrate the effectiveness of Algorithm 2, we compare our adaptive optimization method with six baseline methods, i.e., FedAvg (McMahan et al., 2017), MomD, momentum on the server side with SGD on the device side (MomS) (Reddi et al., 2018a), and momentum on the server side with control variate and SGD on the device side (MomS+Con) (Reddi et al., 2018a), among others.

Hyperparameter Evaluation
In this section, we evaluate the performance of FedPepTAO with diverse hyperparameters. Additional experiments are in Appendix B.
Impact of server learning rate Due to the high sensitivity of the hyperparameters in NLP tasks, we investigate the impact of diverse server learning rates on the RTE dataset. We analyze the accuracy with the learning rates 1e−2, 5e−3, 1e−3, and 5e−4. We find that the best performance is achieved when lr = 1e−3, which only slightly outperforms lr = 5e−4 by 0.37% (see Figure 2 in the Appendix). This demonstrates that FedPepTAO is easy to fine-tune in practice.
Impact of heterogeneous data distribution Data heterogeneity has always been a common challenge in FL. To investigate the robustness of our approach to different degrees of non-IID data, we conduct experiments under various levels of non-IID degrees. We observe that the accuracy of FedPepTAO is the best with α ranging from 1.0 to 5.0 (see Figure 3 in Appendix), where a smaller α represents a higher degree of heterogeneity. This demonstrates that our approach is robust to different degrees of data heterogeneity.
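Non-IID partitions of this kind are commonly simulated by drawing per-device class proportions from a Dirichlet distribution with concentration α. Assuming this is the setup the experiments parameterize by α (the text does not spell it out), a stdlib-only sketch is:

```python
# Sampling class proportions from Dirichlet(alpha) via normalized Gamma
# draws: smaller alpha yields more skewed per-device label distributions,
# larger alpha approaches a uniform split.
import random

def dirichlet(alpha, k, rng):
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(draws)
    return [d / s for d in draws]

rng = random.Random(0)
skewed = dirichlet(0.1, 4, rng)        # low alpha: mass concentrates on few classes
uniformish = dirichlet(100.0, 4, rng)  # high alpha: close to uniform
print(round(sum(skewed), 6), round(sum(uniformish), 6))  # 1.0 1.0
```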

Conclusion
In this paper, we propose an original parameter-efficient prompt tuning with adaptive optimization approach for large language models with FL, i.e., FedPepTAO. We dynamically calculate the score of each layer to choose proper layers of prompts for FL without accuracy degradation. In addition, we provide a novel adaptive optimization method with control variates on the server to achieve superb performance without extra communication costs. We carry out extensive experiments based on 10 tasks and 9 state-of-the-art baseline approaches, which demonstrate significant advantages of FedPepTAO in terms of accuracy (up to 60.8% higher) and efficiency (up to 97.59% faster).

Limitations
While our method can significantly enhance the performance and efficiency of federated prompt tuning in large language models (LLMs), we should acknowledge the sharing of prompts between clients and the server. Previous research has demonstrated that transferring additional sensitive information in Federated Learning (FL), such as predictions or embeddings, can lead to potential privacy concerns (Che et al., 2022). These concerns can be even more critical in prompt tuning scenarios, as prompts are explicitly tuned with private local data. We anticipate that evaluating and mitigating privacy risks in federated LLM prompt tuning will be an intriguing research area in the future.

B.1 Communication overhead

FedPepTAO reduces the communication overhead between devices and the server by 40% to 41.55% (41.55%, 40%, 40%, and 40% compared with Adapter, P-tuning v2, MomD, and MomS+AdamD, respectively), as illustrated in Table 7. FedPrompt, Prompt-Tuning, IDPG, ATTEMPT, and LPT correspond to lower communication overhead (up to 13%) since they only select one layer during tuning, which results in significantly inferior accuracy (from 17.02% to 60.8% lower) compared with FedPepTAO.

B.2 Impact of server learning rate
Due to the high sensitivity of the hyperparameters in NLP tasks, we investigate the impact of diverse server learning rates on the RTE dataset.

B.3 Impact of heterogeneous data distribution
To investigate the robustness of our approach to different degrees of non-IID data, we conduct experiments under various levels of non-IID degrees. From Figure 3, we observe that the accuracy of FedPepTAO is the best with α ranging from 1.0 to 5.0, where a smaller α represents a higher degree of heterogeneity. This demonstrates that our approach is robust to different degrees of data heterogeneity.

B.4 Impact of device number
To explore the scalability of our approach, we conduct experiments with diverse numbers of devices, i.e., 100, 150, and 200. Figure 4 shows the corresponding accuracy on the RTE dataset. The accuracy gap between the best and worst is only 0.73%, which demonstrates that FedPepTAO is scalable in FL settings.

B.5 Impact of diverse bandwidth
We carry out extra experimentation on two tasks, i.e., Subj and Trec, with modest network bandwidth (reduced to 100 times smaller) on the RoBERTa LARGE model. We find that FedPepTAO maintains its advantages in this setting, i.e., it is up to 98.71%, 94.75%, 80.61%, 95%, 97.43%, 56.52%, and 84.55% faster compared with Adapter, FedPrompt, P-tuning v2, IDPG, ATTEMPT, LPT, and MomS+AdamD, respectively, in achieving the target accuracy, as shown in Table 8. In addition, in order to further validate the impact of randomness on different datasets, we conducted additional experiments on three datasets (RTE, MRPC, and CoLA) with three randomly selected seeds (32, 35, and 37) to testify to the strength of our Parameter-efficient Prompt Tuning method.
Tables 10, 11, and 12 show that FedPepTAO outperforms the random strategy by 2.91%, 4.64%, and 7.29% on the RTE, MRPC, and CoLA datasets, respectively. The above experimental results indicate that our FedPepTAO method can achieve substantial improvement compared with the random strategy. We notice an inverse correlation between the performance gain of our Parameter-efficient Prompt Tuning (PEPT) method and the average sentence length of the three datasets. Specifically, PEPT tends to achieve a smaller performance gain on the datasets with longer average sentence length, as shown in Table 14.

A reasonable explanation is that the datasets with longer average sentence length, such as RTE, often contain more latent information. More/less latent information makes the samples easier/more difficult to be separated.

Figure 1: The system model of FedPepTAO.

Algorithm 2 Communication-Efficient Adaptive Optimization
Require: M: the set of devices; w: the prompt parameters of the initial model; R: the maximum number of global rounds; α: the local step size; β: the momentum parameter; η = {η_1, η_2, ..., η_R}: the set of global learning rates; T = {T_1, T_2, ..., T_M}: the set of local epochs with T_i on each Device i.
Ensure: w^R: the final model.

Figure 2: The impact of server learning rate.

Figure 3: The impact of various heterogeneity degrees.

Figure 4: Evaluation of diverse numbers of devices.

Table 1: The accuracy with FedPepTAO and diverse baseline approaches. All the methods from the GLUE benchmark are evaluated on development sets, while other tasks are evaluated with test sets. The best results are highlighted in bold and the second best are marked with underline. All the results are obtained using RoBERTa LARGE.

Table 2: The tuning time (s) to achieve a target accuracy (85% for QNLI, 92.5% for SST-2, 3% for CoLA, 77% for MRPC, 65% for RTE, 71% for BoolQ, 85% for MPQA, 88% for Subj, 78% for Trec, 91% for MR) with FedPepTAO and diverse baseline approaches. "/" represents that training does not achieve the target accuracy. The best results are highlighted in bold and the second best are marked with underline. All the results are obtained using RoBERTa LARGE.

Table 3: Accuracy and tuning time (s) to achieve target accuracy (75% for MRPC, 81% for MR, and 92.5% for SST-2) on the GPT2 LARGE model. "/" represents that training does not achieve the target accuracy.

Table 4: Accuracy and tuning time (s) to achieve target accuracy (75% for RTE, 81% for MRPC) on the LLaMA 3B model. "/" represents that training does not achieve the target accuracy.

Table 7: The communication overhead (s) between devices and the server with the RoBERTa LARGE model.

Table 8: The tuning time (s) to achieve a target accuracy (88% for Subj, 78% for Trec) on the RoBERTa LARGE model. "/" represents that training does not achieve the target accuracy.
In order to clarify the impact of randomness in our experiments, we conduct three experiments with the random layer selection strategy on the RTE dataset. As shown in Table 9, FedPepTAO outperforms the random strategy with accuracy gains of 2.53%, 2.89%, and 3.25%, respectively, which demonstrates the superior performance of our FedPepTAO method.

Table 9: The performance of FedPepTAO and the random layer selection strategy on the RTE dataset.

Table 10: The performance of FedPepTAO and the random layer selection strategy on the RTE dataset under different random seeds.

Table 11: The performance of FedPepTAO and the random layer selection strategy on the MRPC dataset under different random seeds.

Table 12: The performance of FedPepTAO and the random layer selection strategy on the CoLA dataset under different random seeds.