Boosting Inference Efficiency: Unleashing the Power of Parameter-Shared Pre-trained Language Models

Parameter-shared pre-trained language models (PLMs) have emerged as a successful approach in resource-constrained environments, enabling substantial reductions in model storage and memory costs without significant performance compromise. However, it is important to note that parameter sharing does not alleviate the computational burden associated with inference, thus impeding its practicality in situations characterized by stringent latency requirements or limited computational resources. Building upon neural ordinary differential equations (ODEs), we introduce a straightforward technique to enhance the inference efficiency of parameter-shared PLMs. Additionally, we propose a simple pre-training technique that leads to fully or partially shared models capable of achieving even greater inference acceleration. The experimental results demonstrate the effectiveness of our methods on both autoregressive and autoencoding PLMs, providing novel insights into more efficient utilization of parameter-shared models in resource-constrained settings.


Introduction
In recent years, there has been a significant increase in the number of parameters of PLMs. This began with the advent of BERT (Devlin et al., 2019), containing 340 million parameters, and has escalated to models like T5 (Raffel et al., 2020), GPT-3 (Brown et al., 2020), and PaLM (Chowdhery et al., 2022), with the latter reaching an astounding 540 billion parameters. The trend of PLM expansion has, undeniably, improved performance across numerous tasks. Nonetheless, the corresponding increase in computation and storage requirements has raised substantial barriers for scenarios characterized by stringent latency requirements or resource limitations. While PLMs encompassing merely a few billion parameters, such as LLaMA (Touvron et al., 2023), Vicuna (Chiang et al., 2023), and Alpaca (Taori et al., 2023), have exhibited remarkable capabilities, their application remains constricted in numerous resource-constrained environments.
In contrast to the monumental advances in PLMs, real-world applications often still favor more established models such as BERT and GPT-2 (Radford et al., 2019). These models, despite their relatively fewer parameters, deliver satisfactory performance across many tasks while requiring significantly fewer resources. This balance offers an appealing trade-off between performance and cost. Moreover, parameter sharing techniques have successfully demonstrated that model size can be greatly reduced without significant performance degradation, mitigating the storage burden and yielding better cost-effectiveness. This has sparked an interest in parameter-shared PLMs (PSPLMs) like ALBERT (Lan et al., 2020), a derivative of the BERT architecture that shares parameters across all layers, effectively reducing model size and memory requirements. Still, it is critical to recognize that parameter sharing alone does not guarantee reduced inference time, since the number of layers processed during each forward pass remains unchanged. In other words, while it resolves the storage issue, it does not address the computational challenge.
Early exit techniques promise to reduce the number of layers processed during inference by halting computation at early layers (Zhou et al., 2020; Wang et al., 2022; Schuster et al., 2022). While effective, these methods typically require additional trained classifiers or computationally expensive dot products between the vocabulary matrix and the hidden states at each layer. This circumstance prompts the question: Can a method be proposed to reduce the inference cost without introducing extra modules or computations, and could it be complementary to early exit techniques, allowing their combined use for further acceleration?
In this study, we show that the problem can be well addressed in PSPLMs. Specifically, we illustrate how significant acceleration in PSPLM inference can be achieved through our straightforward yet effective technique. This technique, inspired by the principles of neural ODEs, accelerates inference without necessitating the addition of modules or calculations to the model. Hence, in addition to their inherent storage efficiency, our method notably makes PSPLMs computationally efficient. We also introduce a pre-training method for PSPLMs with a slightly altered forward propagation rule. Experiments reveal that our proposed pre-training method prepares the model for even greater acceleration during inference, and we give a theoretical explanation to aptly support our method.
We further extend the application of our method beyond the domain of fully shared PSPLMs. Our research demonstrates the potential of our acceleration strategy in the context of more complex and capable partially-shared PLMs, and even hints at its applicability to unshared models. This broader applicability shows the flexibility and robustness of our approach. Additionally, our method is not in competition with other acceleration strategies. We demonstrate that our method can be combined orthogonally with early exit techniques, thereby facilitating further acceleration. Remarkably, this synergy of methods makes the partially-shared model surpass its unshared equivalent within an equivalent computational budget.
In essence, our work offers a novel route to accelerate inference for PSPLMs and lays a foundation for unleashing the potential of PSPLMs, offering critical insights for their deployment in resource-constrained settings.

ODE Perspective on Residual Networks
We begin by providing a brief overview of the relationship between residual networks and ODEs, which forms the fundamental basis of our research.
In a T-layer residual network, we denote layer t as f_θt. The update formulation for the hidden state h can be expressed as:

h_{t+1} = h_t + f_θt(h_t)   (1)

Remarkably, this update scheme aligns with Euler's method for solving ODEs. Consider an ODE:

dh(t)/dt = f(h(t), t)   (2)

Euler's method approximates the solution of this ODE from t = 0 to T by dividing the interval into n steps, each with step size s_i, such that Σ_{i=0}^{n−1} s_i = T. The method iteratively computes the following formula from i = 0 to n − 1:

y_{i+1} = y_i + s_i f(y_i, t_i)   (3)

The final value y_n serves as an approximation of the solution to Eq. 2 at time T. The correspondence between Eq. 1 and Eq. 3 is evident. A T-layer residual network can be interpreted as parameterizing the vector field f that characterizes the derivative along the path from the input space to the final output space. The ODE perspective generalizes the concept of depth in residual networks to the continuous domain, where the notion of progression from input to output is captured by continuous time rather than discrete depth or layer index.
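To make the correspondence concrete, the following hypothetical sketch runs a residual forward pass with a toy one-matrix tanh layer standing in for f_θt; with unit step size, each layer is exactly one Euler iteration:

```python
import numpy as np

def f(h, theta):
    # Toy vector field standing in for a residual layer f_theta(h).
    return np.tanh(theta @ h)

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=(4, 4))
h = rng.normal(size=4)  # input embedding = the ODE's initial value

T = 24  # number of layers = total "solution time" when s = 1
for _ in range(T):
    # One residual layer = one Euler step: h_{t+1} = h_t + s * f_theta(h_t)
    h = h + 1.0 * f(h, theta)
```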
During the inference process of a trained model, the vector field remains fixed because the parameters are frozen. As a result, the model's inference can be seen as solving an ODE within this vector field using Euler's method, where the initial value corresponds to the input embedding, and the solution time is T. Furthermore, the pre-norm Transformer architecture is also a type of residual network (see Appendix A). Therefore, the ODE perspective we have presented can be applied to the pre-norm Transformer architecture as well.
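A minimal sketch of why this holds, with toy stand-ins for the attention and feed-forward sublayers: in a pre-norm block, each sublayer is applied as a pure additive update to the residual stream, matching the Euler form above.

```python
import numpy as np

def layer_norm(x):
    # Simplified LayerNorm without learned scale/bias.
    return (x - x.mean()) / (x.std() + 1e-5)

def attn(x):  # toy stand-in for self-attention
    return 0.1 * x

def ffn(x):  # toy stand-in for the feed-forward sublayer
    return 0.1 * np.tanh(x)

def pre_norm_block(h):
    # Pre-norm Transformer block: h <- h + F(LN(h)) for each sublayer,
    # i.e. a residual update of the same form as Euler's method.
    h = h + attn(layer_norm(h))
    h = h + ffn(layer_norm(h))
    return h

out = pre_norm_block(np.arange(8.0))
```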

Enlarging the Step Size
In the process of solving the ODE, the choice of the step size s_i in Eq. 3 has a significant impact on the speed and accuracy of the solution. Given the final time T, a larger step size reduces the number of iterations, resulting in faster computation but decreased accuracy. This trade-off allows us to

Figure 2: An illustration of the difference between models pre-trained with different step sizes (a small step vs. step = 1).
sacrifice a certain degree of solution accuracy in exchange for inference speed. To expedite model inference, we propose employing a larger step size during inference than during training (Fig. 1). Specifically, for PSPLMs, the vector field f at any given time t is parameterized by exactly the same set of parameters. Consequently, the time dependence in θ_t (as shown in Eq. 1) can be omitted, leading to the forward rule h_{t+1} = h_t + f_θ(h_t), where θ now represents the shared layer parameters. By applying different scaling factors β_t > 1 to the original step size s = 1 at different layers, and reducing the number of layers such that Σ_t β_t ≈ T, the model still mathematically solves the ODE from t = 0 to T, albeit using larger step sizes β_t s = β_t at each layer. The updated forward rule can be written as:

h_{t+1} = h_t + β_t s f_θ(h_t)   (4)

In practice, we perform a minimal search to determine a set of suitable {β_t} values. In Section 4.2, we will demonstrate that by simply changing the forward rule to Eq. 4, the inference of existing PSPLMs can already be accelerated while maintaining overall performance to a satisfactory extent.
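A sketch of the scaled-step inference rule (Eq. 4) on a toy shared layer; the names and values here are illustrative, not the paper's implementation:

```python
import numpy as np

def shared_layer(h, theta):
    # Shared vector field f_theta; every iteration reuses the same parameters.
    return np.tanh(theta @ h)

def run(h, theta, betas, s=1.0):
    # Eq. 4: h_{t+1} = h_t + beta_t * s * f_theta(h_t)
    for beta in betas:
        h = h + beta * s * shared_layer(h, theta)
    return h

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.05, size=(4, 4))
h0 = rng.normal(size=4)

full = run(h0, theta, betas=[1.0] * 24)  # 24 iterations, original step size
fast = run(h0, theta, betas=[1.5] * 16)  # 16 iterations; total time 16 * 1.5 = 24
```

The two runs integrate over the same total time, so `fast` approximates `full` with a third fewer layer evaluations.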

Pre-Training with Smaller Step Size
Ideally, if the approximate result ĥ_T obtained using scaled-up step sizes is close to the result h_T obtained with the original step size, the overall performance can be greatly preserved. Conventionally, pre-training of PSPLMs is conducted with a step size of s = 1. However, from a theoretical standpoint, the error analysis of Euler's method suggests that selecting a smaller step size s during pre-training may enable better acceleration during inference. Under certain mild conditions, we prove in Appendix B the following inequality:

∥h_T − ĥ_T∥ ≤ K β* s   (5)

where β* is the largest scaling factor used across all the layers, and K is a constant determined by the model parameters. This inequality indicates that the difference between ĥ_T and h_T is bounded by the magnitude of the largest scaled-up step size β* s employed during inference. Assuming that the value of K is approximately the same for models pre-trained with different step scales, then when the step sizes are scaled up by the same factors, the model pre-trained with a smaller s produces a final approximated hidden state that is closer to the hidden state obtained using its original step size.
Empirically, we will show in Section 4.3 that PSPLMs pre-trained with a reasonably small step size can achieve improved performance when reducing the number of iterations during inference.
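The linear dependence of the error bound on the step size can be checked numerically on a toy ODE with a known solution; this illustrates Euler's global error behavior, not the paper's experiments:

```python
import math

def euler(s, T=1.0):
    # Solve dh/dt = -h, h(0) = 1, from t = 0 to T with step size s.
    h = 1.0
    for _ in range(round(T / s)):
        h = h + s * (-h)
    return h

exact = math.exp(-1.0)  # true solution at T = 1
err_big = abs(euler(0.1) - exact)
err_small = abs(euler(0.01) - exact)
# Global error is O(s): shrinking s by 10x shrinks the error by roughly 10x.
```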

Generalizing to Partially-Shared PLMs
Shared-parameter models employ the same set of parameters to parameterize the derivative at different discrete time steps during training. This property offers the models the ability to generalize from discrete time to continuous time during inference (Eq. 4). On the other hand, unshared models use distinct parameters to parameterize the derivative at each discrete time step, making it challenging to apply a continuous scaling factor β to the step size during inference. For instance, if we use β_0 = 1.3 and s = 1, the model would need to provide the derivative at t = 1.3 at the next iteration, which is unattainable, as the unshared model can only provide the derivative at t = 1 using layer 2 or at t = 2 using layer 3, but not at any intermediate time.
However, we will demonstrate that pre-training a partially-shared PLM with time-dependent parameters represented by a piece-wise linear function can enhance the model's capabilities while benefiting from the acceleration method we introduce in Section 3.1. Given a language model with L layers and step size s, and n (2 ≤ n ≤ L) sets of layer parameters, denoted as θ = {θ_0, θ_1, ..., θ_{n−1}}, we uniformly position the n sets of parameters within the range from 0 to (L − 1)s, with each interval spanning ∆ = (L − 1)s/(n − 1).
To determine the parameters at a specific time t, we define the function P(t) that returns the parameters at time t. We first identify the left and right boundary indices of the interval in which t resides, denoted as l and r, respectively. Subsequently, we perform linear interpolation between θ_l and θ_r to obtain the parameters at time t. P(t) can be formally written as:

P(t) = (1 − w) θ_l + w θ_r, where l = ⌊t/∆⌋, r = ⌈t/∆⌉, w = (t − l∆)/∆

Fig. 3 shows an example of this process. The model is referred to as partially-shared since it does not share all layer parameters; instead, it shares n sets of parameters and uses interpolation to obtain the parameters at different times. Notably, when n = L, it becomes the unshared model, and the fully-shared model is a special case if we allow n = 1. By learning to use linear interpolation to derive parameters for different time steps during training, the model can naturally generalize to the continuous domain and provide derivatives for any time step during inference. We will demonstrate in Section 4.4 that these partially-shared PLMs exhibit notable advantages, including better or comparable performance to their unshared counterparts, as well as the ability to enable accelerated inference through a scaled-up step size.
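A sketch of P(t); the helper name and toy parameter vectors are illustrative:

```python
import numpy as np

def interp_params(thetas, t, delta):
    # P(t): linearly interpolate between the parameter sets bracketing time t.
    # thetas[i] sits at time i * delta, with delta = (L - 1) * s / (n - 1).
    l = int(t // delta)                # left boundary index
    r = min(l + 1, len(thetas) - 1)    # right boundary index
    w = (t - l * delta) / delta        # fractional position inside the interval
    return (1.0 - w) * thetas[l] + w * thetas[r]

# n = 3 parameter sets for an L = 5, s = 1 model: delta = (5 - 1) / (3 - 1) = 2.
thetas = [np.full(2, 0.0), np.full(2, 1.0), np.full(2, 2.0)]
p = interp_params(thetas, t=1.0, delta=2.0)  # halfway between theta_0 and theta_1
```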

Experimental Setups
We investigate the effectiveness of our suggested inference acceleration technique on both autoregressive and autoencoding models. Specifically, we pre-train GPT-2 large models and pre-norm BERT large models, both with shared parameters, under diverse settings, as elucidated in the subsequent sections. All GPT-2 models are pre-trained on the OpenWebText dataset (Radford et al., 2019), and all BERT models are pre-trained on the Pile dataset (Gao et al., 2021). Detailed information on hyperparameters and pre-training configurations can be found in Appendix D. It is important to mention that while we exclusively focus on parameter sharing among layers, our proposed method can be seamlessly incorporated alongside other parameter-reduction techniques such as the embedding factorization used in ALBERT.
For downstream task evaluation, we measure zero-shot perplexity (PPL) on Wikitext-103 (Merity et al., 2017) and zero-shot accuracy on LAMBADA (Paperno et al., 2016) for GPT-2 models. BERT models are fine-tuned separately on tasks including MNLI, SST-2 (Wang et al., 2019), RACE (Lai et al., 2017), SQuAD, and SQuAD 2.0 (Rajpurkar et al., 2016). Configuration details and metrics for these downstream tasks can be found in Appendix E. During inference, we experiment with different iteration counts; for each count, we perform a minimal search on the β for each layer within the set {1.0, 1.1, . . . , 3.0} using Optuna (Akiba et al., 2019), and report the best results. Unless explicitly stated otherwise, both BERT and GPT-2 models mentioned hereafter are parameter-shared.
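The step-scale search can be sketched as follows; the paper uses Optuna, but a simple random search over the same grid conveys the idea (the objective below is a toy stand-in for a downstream metric):

```python
import random

def search_betas(evaluate, n_layers, n_trials=50, seed=0):
    # Sample each layer's beta from {1.0, 1.1, ..., 3.0} and keep the
    # configuration with the highest score, mimicking the minimal search.
    rng = random.Random(seed)
    grid = [1.0 + 0.1 * i for i in range(21)]
    best_betas, best_score = None, float("-inf")
    for _ in range(n_trials):
        betas = [rng.choice(grid) for _ in range(n_layers)]
        score = evaluate(betas)
        if score > best_score:
            best_betas, best_score = betas, score
    return best_betas, best_score

# Toy objective: prefer scales whose total integration time stays near 24.
betas, score = search_betas(lambda b: -abs(sum(b) - 24.0), n_layers=16)
```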

Inherent Highways: Scaling Up Step Sizes
As described in Section 3.1, from the perspective of ODEs, we can naturally accelerate PSPLMs by increasing the step sizes. In other words, there may be inherent highways in PSPLMs, and we may utilize them by increasing the step size and decreasing the number of iterations.
To validate the presence of these inherent highways in PSPLMs, we pre-train GPT-2 large and BERT large models under the conventional setting (i.e., s = 1), and evaluate their inference performance on a variety of downstream tasks with different iteration counts and step sizes. The results are shown in Table 1. Additionally, we compute relative changes in performance for reduced iterations as (p̂ − p)/p, where p represents the performance at the full iteration count and p̂ the performance at the reduced count, and we report these values in parentheses.
Our experimental results reveal that a clever reduction in the iteration count presents an opportunity for substantial computational savings while maintaining most of the model performance. When the iteration count decreases from 24 to 20, the performance impact across all datasets is virtually negligible. For BERT, variations in performance are consistently contained within a margin of ±0.3% across all tasks. Simultaneously, the GPT-2 model shows only a slight increase in perplexity on Wikitext-103, from 33.0 to 33.5. Even with a further reduction in iteration count to 16, the models continue to deliver respectable performance.
For BERT, the majority of tasks report a minimal performance decrease, with the largest decrease appearing in the RACE task at -2.2%. For the GPT-2 model, although the perplexity on Wikitext-103 increases to 35.3 and the accuracy on LAMBADA decreases to 30.9, the performance still stays within an acceptable range.
Overall, these results suggest that the step size per iteration can be scaled up, leading to a reduction in the number of iterations in conventional PSPLMs without a significant compromise on performance. In essence, our approach enables a computational reduction of approximately 1/6 to 1/3. However, further reduction in iterations does result in performance degradation, which we address in the subsequent section.

Acceleration Boost: Mini-Step Pretraining
In Section 3.2, we posited that pre-training PSPLMs with small step sizes may make the models more conducive to acceleration during inference. This section provides empirical validation of these theoretical insights.

Performance Across Downstream Tasks
For a fair comparison, we maintain identical pre-training configurations and data, and train 4 models with step sizes 1, 0.1, 0.05, and 0.01, respectively. The performance is presented in Fig. 4, where we have several noteworthy observations: Small step sizes do not detrimentally affect performance within a reasonable range. We first look at the performance when the iteration count is not reduced (24 in the figures). BERT models pre-trained with smaller step sizes demonstrate comparable, and in some instances superior, performance on various downstream tasks in comparison to the conventionally pre-trained BERT. The only exception is the MNLI task, where the latter model performs marginally better. It should be noted, however, that an extremely small step size of 0.01 negatively impacts the model's performance across all tasks. Overall, a reasonably small step size does not impair the model's capacity.
Small step sizes enhance performance retention when reducing the iteration count. Looking at the performance at 12 iterations across models with varying step sizes, we generally observe a decline, as expected, in comparison to the performance attained at 24 iterations. However, models with smaller step sizes exhibit remarkably better performance retention. Particularly noteworthy are GPT models pre-trained with small step sizes, as they exhibit significantly better zero-shot performance retention on both the LAMBADA and Wikitext-103 tasks. Notably, as we do not introduce any additional computational overhead, the speedups of BERT and GPT are the same as those reported in Table 1, which are almost linear in the reduced number of iterations, while the performance retention is significantly improved.
Reducing iteration count enhances performance on certain datasets. This finding aligns with observations made in earlier studies on early exits, suggesting that preventing models from overthinking can enhance both accuracy and robustness (Zhou et al., 2020; Balagansky and Gavrilov, 2022). However, unlike these previous studies, our approach demonstrates that we can effectively and easily prevent overthinking for models pre-trained with smaller step sizes without auxiliary modules.

Analyzing the Possible Mechanism
We further explore why pre-training with a small step size results in PSPLMs that are more efficiently accelerated during inference. Our analysis reveals two main advantages for models pre-trained with smaller step sizes: Reduced absolute and relative difference when the iteration count is decreased. We decrease the iteration count for all models to 20, 16, and 12, use the searched step scales, and calculate the absolute difference, denoted as ∥h_T − ĥ_T∥, and the relative difference, expressed as ∥h_T − ĥ_T∥/∥h_T∥. Here we keep the notation consistent with Eq. 5. These values represent the difference between the approximated and the original final hidden states.
As demonstrated in Fig. 10, the final hidden state approximation from a model pre-trained with a smaller step size presents a closer resemblance to the original final hidden state, both in terms of absolute and relative difference. These observations suggest that when the number of iterations is decreased, models pre-trained with smaller step sizes could yield results that align more closely with those from models with unreduced iterations on most tasks, thereby better preserving performance.
Enhanced smoothness in the vector field. To further our analysis, we compute the cosine similarity between the derivatives produced by the model at two consecutive iterations, denoted as CosSim(f(h_i), f(h_{i−1})), i ∈ {1, 2, . . . , 23}. The results are plotted in Fig. 5.
Figs. 5 and 8 reveal an increasing trend of cosine similarity as the layer index increases, with smaller step sizes generally resulting in higher cosine similarity in the early layers. Although a step size of 0.1 also shows lower similarities for the first few layers, there is a swift increase as the layer index grows. The cosine similarities for step sizes of 0.01 and 0.05 consistently remain above 0.8 across all layers, suggesting an almost parallel alignment of the derivatives at different times, that is, a smoother vector field. In other words, the paths from the input embedding to the final output are more "straight" for models pre-trained with a small step size, thus allowing us to reduce the number of iterations and enlarge the step size during inference.
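The smoothness diagnostic can be sketched as follows, given the list of derivatives f(h_i) collected during a forward pass (the helper names are illustrative):

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def layer_similarities(derivs):
    # CosSim(f(h_i), f(h_{i-1})) for consecutive iterations; values near 1
    # indicate nearly parallel derivatives, i.e. a "straight" path.
    return [cos_sim(derivs[i], derivs[i - 1]) for i in range(1, len(derivs))]

sims = layer_similarities(
    [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
)  # parallel pair vs. orthogonal pair
```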

Expanding Horizons: Partial Sharing
In this section, we pre-train partially-shared PLMs and apply the method described in Section 3.3 to them. This experimental validation substantiates the possibility of inference acceleration in more complex, partially-shared, and even unshared models.
For BERT large and GPT-2 large, we conduct pre-training with n = 12 sets of parameters and step sizes of 0.1 and 0.05, respectively. We establish baselines by pre-training unshared BERT and GPT models using equivalent configurations, except that the total number of layers is set to 12. This ensures the same number of parameters between our partially-shared models and the baseline models.
The results of downstream tasks are illustrated in Figs. 6 and 9. The unshared model performance is represented by the red dashed line in each figure. As anticipated, partially-shared models, benefiting from the increased number of parameters, perform significantly better than their fully-shared counterparts at the full iteration count. Our focus, however, is on the performance after reducing the iteration count. At 14 iterations, BERT pre-trained with a step size of 0.1 either surpasses or matches the unshared 12-layer baseline across all tasks. However, when the iteration count is further decreased to 12, a performance drop is observed, making it marginally underperform the baseline. Nonetheless, it still achieves over 98% of the baseline performance in most tasks. As for GPT, the model pre-trained with a step size of 0.05 exhibits impressive performance retention on LAMBADA across all iteration counts, consistently beating the unshared baseline. Although perplexity on Wikitext-103 rises with reduced iterations, it remains at an acceptable level.
It is crucial to underline that this iteration reduction is achieved without additional training after pre-training and fine-tuning. While the partially-shared model may lag behind the unshared baseline in some tasks post-reduction, it stays competitive, which is a non-trivial achievement considering that we merely increased the step size during inference. We have also tried the n = 24 setting, which is equivalent to the unshared 24-layer model; the performance retention is satisfactory when the scaling factors are integers. We place the results and analysis in Appendix G.

Rapid Inference: Early Exit Integration
This section aims to demonstrate the potential of combining our approach with the early exit technique to further enhance model performance under reduced iteration counts. We adopt the strategy employed by DeeBERT (Xin et al., 2020), in which internal classifiers are trained at every layer of a frozen, fine-tuned BERT model to predict the final label. Similarly, we train classifiers on the partially-shared BERT that has been pre-trained with s = 0.1. See Appendix F for more details.
Initially, we conduct a step scale search for the model at 24, 20, and 16 iterations, as in Section 4.4. Subsequently, classifiers are trained on these models with their respective iteration counts and searched step scales. During inference, the entropy of the prediction distribution at each layer guides the decision to halt. By adjusting the entropy thresholds, we can manage the trade-off between performance and efficiency.
The results are presented in Fig. 7. Evidently, when the DeeBERT technique is applied to models with reduced iteration counts of 16 and 20, it succeeds in outperforming the unshared 12-layer model under an equivalent computational budget. This finding indicates that the early exit strategy and the reduction of iteration count using large step scales are indeed synergistic. Their integration effectively bolsters performance retention, yielding remarkable inference acceleration.

Related Work
PSPLMs. The increasing size and memory usage of PLMs have prompted research efforts focused on parameter sharing in these models. Several approaches have been proposed, demonstrating the potential to maintain comparable performance while significantly reducing model size. Universal Transformers (Dehghani et al., 2019) and ALBERT (Lan et al., 2020) share parameters across all layers. Takase and Kiyono (2021) propose sharing parameters for every two consecutive layers or sharing layer parameters cyclically, while Xue et al. (2022) propose sharing all layer parameters except bias terms and layer normalization modules. These advanced strategies enhance model capacity at the cost of an increased number of parameters. However, none of these methods reduce the computational cost during inference: the computations required for the inference of shared and unshared PLMs are still identical.
Early Exit. The efficiency of inference in PLMs has become a significant concern for deployment, leading to extensive research efforts focused on inference acceleration. Early exit techniques aim to terminate the inference process in early layers and bear close relevance to our work. Many early exit methods necessitate an internal classifier applied to the intermediate hidden states of the early layers, thus requiring joint training with the PLMs themselves (Zhou et al., 2020; Wang et al., 2022), or training as a separate stage with the PLMs held frozen (Xin et al., 2020; Liu et al., 2020). Our method distinguishes itself by reducing the number of layers during inference without the need for an additional classifier that requires training. Also, our method is complementary to early exit techniques, and the two can be jointly leveraged to accelerate inference.
Neural ODEs. The connection between residual networks and ordinary differential equations has been extensively explored in prior research (E, 2017), where different designs of residual networks can be linked to diverse numerical discretizations of ODEs (Chen et al., 2018; Lu et al., 2018). Neural ODEs extend the concept of residual networks to continuous-depth architectures. In our work, we build upon the ODE perspective of residual networks and propose to accelerate PSPLMs by increasing the step size, and from the error analysis of Euler's method, we propose a simple pre-training technique to enable further inference acceleration.
Hyper-Networks. We adopt linear interpolation of a piece-wise linear function as parameters for different layers to build partially-shared PLMs. This bears resemblance to hypernetworks (Ha et al., 2017), where the parameters of a neural network are generated by a specific function or another neural network. The parameterization of model parameters in a hypernetwork style has found wide application in various domains, including neural architecture search (Brock et al., 2018), meta-learning (Requeima et al., 2019), and neural ODEs (Chen et al., 2018).

Conclusion
In this study, we draw inspiration from the ODE perspective on residual networks. Our research proposes straightforward strategies to expedite the inference process for both fully and partially-shared PLMs. The results of our work reveal that PSPLMs can not only reduce storage and memory costs, but also reduce time costs. Furthermore, when our approach is coupled with the early exit technique, partially-shared PLMs demonstrate superior performance compared to unshared models under the same computational budget. We believe that our methodology harbors substantial potential, particularly in the acceleration of inference in unshared PLMs, a promising avenue for future research. We anticipate the extension of our techniques to the acceleration of large language models encompassing billions of parameters, and look forward to further explorations in this field.

Local Truncation Error. Now we derive the local truncation error T of Euler's method, which is essentially the discrepancy between the real state h(t_1) and the one-step approximated state h_{t_1} starting from h(t_0). It is formally written as:

T_{t_0} = h(t_1) − h_{t_1} = h(t_0) + s h′(t_0) + (s²/2) h″(ξ) − (h(t_0) + s f(h(t_0), t_0)) = (s²/2) h″(ξ), ξ ∈ (t_0, t_1)   (11)

where the Taylor expansion of h(t_1) around t_0 is used, and the linear terms cancel because h′(t_0) = f(h(t_0), t_0). We can further derive h″(t_0) by differentiating h′(t_0) = f(h(t_0), t_0):

h″(t_0) = f_h(h(t_0), t_0) f(h(t_0), t_0) + f_t(h(t_0), t_0)   (12)

Assume that f and its derivatives f_h(h(t_0), t_0) and f_t(h(t_0), t_0) are all continuous and bounded; then there exists a constant M so that

∥h″(t)∥ ≤ M   (13)

Taking Eq. 13 into Eq. 11, we can establish the local truncation error of Euler's method:

∥T_{t_0}∥ ≤ (M/2) s²   (14)

Since M is a constant, the local truncation error of Euler's method is of order O(s²), i.e., the square of the step size.
Global Truncation Error. The global truncation error is the accumulated error from the initial time t_0 to the final time T. To derive the global truncation error, we first define e_{t_i} = h(t_i) − h_{t_i} as the difference between the real state h(t_i) and the approximated state h_{t_i} at time t_i. Recall that we have

h(t_{i+1}) = h(t_i) + s f(h(t_i), t_i) + T_{t_i}   (15)

h_{t_{i+1}} = h_{t_i} + s f(h_{t_i}, t_i)   (16)

where Eq. 16 is Euler's method, and Eq. 15 is Euler's method with the local truncation error term. Subtracting Eq. 16 from Eq. 15, we have

e_{t_{i+1}} = e_{t_i} + s (f(h(t_i), t_i) − f(h_{t_i}, t_i)) + T_{t_i}   (17)

From Eq. 14, we have ∥T_{t_i}∥ ≤ (M/2) s². By substituting it into Eq. 17 and applying the absolute value inequality, we can see that

∥e_{t_{i+1}}∥ ≤ ∥e_{t_i}∥ + s ∥f(h(t_i), t_i) − f(h_{t_i}, t_i)∥ + (M/2) s²   (18)

Because of the assumption that f and its derivatives are continuous and bounded, according to the mean value theorem, we have

f(h(t_i), t_i) − f(h_{t_i}, t_i) = f_h(h*_{t_i}, t_i) e_{t_i}   (19)

where h*_{t_i} is some number between h(t_i) and h_{t_i}, and

∥f_h(h*_{t_i}, t_i)∥ ≤ R   (20)

where R is some constant. By substituting Eq. 20 into Eq. 18, we obtain the relation between the errors of two consecutive steps:

∥e_{t_{i+1}}∥ ≤ (1 + sR) ∥e_{t_i}∥ + (M/2) s²   (21)

For simplicity, let C = (1 + sR). We can then iteratively apply the inequality starting from t = t_0 and e_{t_0} = h(t_0) − h_{t_0} = 0:

∥e_{t_n}∥ ≤ (M/2) s² (C^n − 1)/(C − 1) = (Ms/2R) ((1 + sR)^n − 1)   (22)

and since (1 + sR)^n ≤ e^{nsR} = e^{TR} (using ns = T), we then obtain the error bound of the approximated final result

∥e_{t_n}∥ ≤ (Ms/2R) (e^{TR} − 1)   (23)

which is of order O(s), i.e., linear in the step size. As for the difference between the final approximated state ĥ_T obtained using step sizes scaled by β and the state h_T obtained using the original step size s, the same argument bounds both ∥h(T) − ĥ_T∥ (linear in βs) and ∥h(T) − h_T∥ (linear in s), so by the triangle inequality the difference between h_T and ĥ_T is also bounded, and is linear in the step size.

D Pre-Training Configurations

We use AdamW (Loshchilov and Hutter, 2019) as the optimizer. The model configurations for our BERT and GPT-2 are kept the same as BERT large and GPT-2 large, respectively.

E Configurations for the Downstream Tasks
We adopt the approach employed by the Megatron-LM framework for handling the MNLI and RACE tasks.
For the classification tasks MNLI and SST-2, we utilize the hidden state of the [CLS] token for classification and report accuracy on the development set. In the RACE task, we predict the probability of each answer using the [CLS] token's representation and report accuracy on the test set. For the SQuAD v1.1 and v2.0 tasks, we adhere to BERT's training procedure, applying a span-extraction loss, and report the F1 score on the development set using the official evaluation script.
When fine-tuning BERT on the downstream tasks, we use a linear learning-rate warmup-and-decay schedule. The gradient norm is clipped to 1.0. We apply a dropout rate of 0.1 and a weight decay of 0.01. Further hyperparameter details are documented in Table 4.
For the GPT-2 tasks, we adopt a zero-shot approach. Performance on the LAMBADA task is assessed using cloze accuracy, which involves predicting the last word (not the last token) from the preceding tokens. Performance on Wikitext-103 is measured using perplexity on the test set.
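For reference, perplexity is the exponentiated average negative log-likelihood over tokens. A minimal computation of the metric (with toy probabilities, not the actual Wikitext-103 evaluation pipeline) looks like:

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: natural-log probability the model assigns to each token.
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 8))
```

Lower perplexity means the model assigns higher probability to the held-out text.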

F Configurations for the Early Exit
We employ the methodology from DeeBERT (Xin et al., 2020) for the early-exit experiment. During training, we keep the BERT model and the original classification head fixed and train only the additional classifiers on each layer; the training loss is the sum of the losses from all additional classifiers. During inference, we compute the entropy of the output logits at each layer. If the entropy falls below a predetermined threshold at some layer, we halt the computation and take that layer's prediction as the final prediction. We record the average number of iterations at the point of exit across all instances. The thresholds used in this experiment are [0, 0.01, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5]; higher thresholds induce earlier exits.
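The entropy-based stopping rule can be sketched as follows. This is a simplified scalar illustration of the DeeBERT-style criterion, with per-layer softmax outputs stood in by plain probability lists; it is not the actual implementation:

```python
import math

def entropy(probs):
    # Shannon entropy of a discrete distribution (natural log).
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit(layer_probs, threshold):
    """Stop at the first layer whose output entropy falls below the
    threshold; return (predicted class, exit layer index)."""
    for i, probs in enumerate(layer_probs, start=1):
        if entropy(probs) < threshold:
            return max(range(len(probs)), key=probs.__getitem__), i
    # No layer was confident enough: use the final layer's prediction.
    return max(range(len(probs)), key=probs.__getitem__), len(layer_probs)

# A confident distribution at layer 2 triggers an exit there.
layers = [[0.5, 0.5], [0.95, 0.05], [0.99, 0.01]]
print(early_exit(layers, threshold=0.3))  # (0, 2)
```

With a higher threshold, the entropy condition is satisfied sooner, so the model exits at an earlier layer, trading accuracy for speed.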

G Performance under n = 24 Setting
In this section, we extend our partially-shared model to n = 24 sets of parameters, which renders it equivalent to an unshared 24-layer model; the only difference is that all 24 sets of parameters share the same initialization. This attempt is made only on the GPT-2 large model, and the corresponding results are presented in Table 5.
On close examination, we note a peculiar trend: as the iteration count is reduced from 24 to 12, the zero-shot perplexity on the Wikitext-103 dataset first increases and then decreases. This anomaly may be attributed to the model's inability to learn to use linear interpolation between the endpoint parameters for derivative calculation when n = 24: in this setting, the linear interpolation (Eq. 6) always returns the parameters at one of the endpoints. Consequently, during inference, as we increase the step size, the parameters derived through linear interpolation deviate from the parameters used by the model during training. This divergence is likely responsible for the significant degradation in the model's performance.
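The role of the interpolation can be illustrated with a toy scalar version of the scheme (real parameters are tensors, and the function name and index mapping here are illustrative assumptions, not the paper's exact formulation of Eq. 6):

```python
def interpolated_params(param_sets, t, total_time):
    """Linearly interpolate between the n stored parameter sets to
    obtain the parameters used at time t (toy scalar stand-in)."""
    n = len(param_sets)
    # Map t in [0, total_time] onto the stored index range [0, n - 1].
    pos = t / total_time * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return (1 - frac) * param_sets[lo] + frac * param_sets[hi]

# n = 3 parameter sets spread over the time axis: intermediate t values
# yield genuinely interpolated parameters.
print(interpolated_params([0.0, 10.0, 20.0], t=2, total_time=23))

# With n equal to the number of steps, every training-time t lands
# exactly on a stored set, so interpolation is never exercised.
print(interpolated_params([float(i) for i in range(24)], t=5, total_time=23))
```

This is why the n = 24 model never sees interpolated parameters during training: every query lands exactly on a stored endpoint, and enlarging the step size at inference then produces parameter values the model has never encountered.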
Interestingly, when the scaling factor β for each iteration is set to 2 and the iteration count is reduced to 12, the model yields reasonable performance. We hypothesize that this is because, with integer scaling factors, the derived parameters remain endpoint parameters, which the model learned to handle during training. Nevertheless, the experiment highlights the potential of our method when applied to unshared models.

[Figure 1 graphic; axis label: Area of Correct Prediction]

Figure 1 :
Figure 1: An illustration of reducing the iteration count during inference by enlarging the step size.

[Figure 3 graphic; labels: Area of Correct Prediction, t = 0]

Figure 3 :
Figure 3: An illustration of our partially-shared model. The example shows how a model with L = 24 layers, step size s = 1, and n = 3 sets of parameters determines the layer parameters for t = 2.

Figure 4 :
Figure 4: The inference performance of parameter-shared models pre-trained with different step sizes. (a-e) The accuracy of BERT on MNLI, SST-2, and RACE, and the F1 score on SQuAD and SQuAD v2.0. (f-g) The zero-shot accuracy of GPT-2 on LAMBADA, and the zero-shot perplexity of GPT-2 on Wikitext-103.

Figure 5:
Figure 5: The cosine similarity between the derivatives given by the model at two consecutive iterations.
Figure 6:

Figure 7 :
Figure 7: Performance of early-exit BERT on MNLI.The black dashed line represents partially-shared BERT pre-trained with s = 0.1, while the red dashed line denotes the unshared 12-layer BERT.

Figure 8:
Figure 8: The cosine similarity between the derivatives given by the model at two consecutive iterations.
Figure 9:

Table 1 :
Inference performance of PSPLMs pre-trained with step size 1. Values in parentheses indicate the relative change from non-reduced iteration counts. Speed reflects the acceleration of the forward-pass wall-clock time.

Table 2 :
Hyper-parameters for pre-training all the BERT models in this work.

Table 3 :
Hyper-parameters for pre-training all the GPT-2 models in this work.

Table 4 :
Hyper-parameters for downstream tasks with different step sizes.

Table 5 :
Zero-shot perplexity on Wikitext-103 of the partially-shared model with n = 24.