Enhancing Scalability of Pre-trained Language Models via Efficient Parameter Sharing



Introduction
Recently, pre-trained language models (PLMs) have achieved huge success in a variety of NLP tasks by exploring ever larger model architectures (Raffel et al., 2020; Radford et al., 2019). It has been shown that there potentially exists a scaling law between the model size and model capacity for PLMs (Kaplan et al., 2020), attracting many efforts to enhance performance by scaling up the model size (Chowdhery et al., 2022; Wang et al., 2022b).
As a straightforward approach, we can directly increase the number of layers to improve the model capacity (Wang et al., 2022b; Huang et al., 2020). However, a very deep architecture typically corresponds to a significantly larger model size, leading to high costs in both computation and storage (Gong et al., 2019). Moreover, it is difficult to deploy deep networks in resource-limited settings, even though they usually have stronger model capacity. Therefore, there is an urgent need for a parameter-efficient way of scaling the model depth.
To reduce the parameters in deep networks, weight sharing has proven very useful for designing lightweight architectures (Zhang et al., 2022; Lan et al., 2019). As a representative model using cross-layer parameter sharing, ALBERT (Lan et al., 2019) keeps only about ten percent of the parameters of BERT while maintaining comparable performance. Although the idea of parameter sharing is simple and (to some extent) effective, it has been found that identical weights across different layers are the main cause of performance degradation (Zhang et al., 2022). To address this issue, extra blocks have been designed to increase parameter diversity in each layer (Nouriborji et al., 2022). However, these methods still rely on a rigid architecture of shared layer weights, which limits the model capacity. Besides, it is difficult to optimize very deep models, especially when shared components are involved. Although recent studies (Wang et al., 2022b; Huang et al., 2020) propose improved initialization methods, they do not consider the case with parameter sharing, and are thus likely to yield suboptimal performance on a parameter-sharing architecture.
To address these challenges, in this paper, we propose a highly parameter-efficient approach to scaling PLMs to a deeper model architecture. As the core contribution, we propose a matrix product operator (MPO) based parameter-sharing architecture for deep Transformer networks. Via MPO decomposition, a parameter matrix can be decomposed into central tensors (containing the major information) and auxiliary tensors (containing the supplementary information). Our approach shares the central tensors of the parameter matrices across all layers to reduce the model size, while keeping layer-specific auxiliary tensors to enhance adaptation flexibility. In order to train such a deep architecture, we propose an MPO-based initialization method that utilizes the MPO decomposition results of ALBERT. Further, for the auxiliary tensors of higher layers (beyond the 24 layers available in ALBERT), we propose to set the parameters with scaling coefficients derived from theoretical analysis. We theoretically show that this addresses training instability regardless of the model depth.
Our work provides a novel parameter-sharing way of scaling model depth, which can be generally applied to various Transformer-based models (Zhao et al., 2023). We conduct extensive experiments to evaluate the performance of the proposed model on the GLUE benchmark in comparison to PLMs of varied model sizes (tiny, small, and large). Experimental results demonstrate the effectiveness of the proposed model in reducing the model size and achieving competitive performance. With fewer parameters than BERT BASE, we scale the model depth by a factor of 4x and achieve a GLUE score 0.1 points higher than BERT LARGE.

Related Work
Matrix Product Operators. Matrix product operators (a.k.a. tensor-train operators (Oseledets, 2011)) were proposed for a more effective representation of the linear structure of neural networks (Gao et al., 2020a), and have been used to compress deep neural networks (Novikov et al., 2015), convolutional neural networks (Garipov et al., 2016; Yu et al., 2017), and LSTMs (Gao et al., 2020b; Sun et al., 2020a). Based on MPO decomposition, recent studies designed lightweight fine-tuning and compression methods for PLMs (Liu et al., 2021), developed a parameter-efficient MoE architecture (Gao et al., 2022), over-parametrized PLMs (Gao et al., 2023), and empirically studied the emergent abilities of quantized large language models (Liu et al., 2023). Unlike these works, our work aims to develop a very deep PLM with a lightweight architecture and stable training.
Parameter-Efficient PLMs. Existing efforts to reduce the parameters of PLMs can be broadly categorized into three major lines: knowledge distillation, model pruning, and parameter sharing. For knowledge distillation-based methods (Sanh et al., 2019; Sun et al., 2020b; Liu et al., 2020; Wang et al., 2022a), PLMs are distilled into student networks with much fewer parameters. Pruning-based methods try to remove less important components (Michel et al., 2019; Wang et al., 2020) or very small weights (Chen et al., 2020). Moreover, parameter-sharing methods have been proposed that share all parameters (Lan et al., 2019) or incorporate specific auxiliary components (Reid et al., 2021; Nouriborji et al., 2022). Different from these works, we design an MPO-based architecture that can reduce the model size and enable adaptation flexibility by decomposing the original matrix.
Optimization for Deep Models. Although it is simple to increase the number of layers to scale up model size, it is difficult to optimize very deep networks due to the training instability issue. Several studies have proposed different strategies to overcome this difficulty for training deep Transformer networks, including Fixup (Zhang et al., 2019), which properly rescales standard initialization, T-Fixup (Huang et al., 2020), which proposes a weight initialization scheme, and DeepNorm (Wang et al., 2022b), which introduces a new normalization function. As a comparison, we study how to optimize a deep MPO-based architecture with the parameter-sharing strategy, and explore the use of well-trained PLMs for initialization, which has a different focus from existing work.

Method
In this section, we describe the proposed MPOBERT approach for building deep PLMs via a highly parameter-efficient architecture. Our approach follows the classic weight-sharing paradigm while introducing a principled mechanism for sharing informative parameters across layers and also enabling layer-specific weight adaptation.

Overview of Our Approach
Although weight sharing has been widely explored for building compact PLMs (Lan et al., 2019), existing studies either share all the parameters across layers (Lan et al., 2019) or incorporate additional blocks to facilitate the sharing (Zhang et al., 2022; Nouriborji et al., 2022). They either have limited model capacity with a rigid architecture or require maintaining additional components.
Considering the above issues, we motivate our approach in two respects. First, only informative parameters should be shared across layers, instead of all the parameters. Second, sharing should not affect the capacity to capture layer-specific variations. To achieve this, we utilize MPO decomposition (Liu et al., 2021) to develop a parameter-efficient architecture by sharing informative components across layers and keeping layer-specific supplementary components (Section 3.2). As another potential issue, it is difficult to optimize deep PLMs due to unstable training (Wang et al., 2022b), especially when weight sharing (Lan et al., 2019) is involved. We further propose a simple yet effective method to stabilize the training of PLMs (Section 3.3). Next, we introduce the technical details of our approach.

MPO-based Transformer Layer
In this section, we first introduce MPO decomposition and then describe how to utilize it for building parameter-efficient deep PLMs.

MPO Decomposition
Given a weight matrix W ∈ R^{I×J}, MPO decomposition factorizes it into a product of n local tensors by reshaping the two dimension sizes I and J:

$$\mathrm{MPO}(\mathbf{W}) = \mathcal{T}_{(1)}[i_1, j_1]\, \mathcal{T}_{(2)}[i_2, j_2] \cdots \mathcal{T}_{(n)}[i_n, j_n], \quad (4)$$

where we have $I = \prod_{k=1}^{n} i_k$ and $J = \prod_{k=1}^{n} j_k$, and each $\mathcal{T}_{(k)}[i_k, j_k]$ is a $d_{k-1} \times d_k$ matrix linked to its neighbors through the bond dimensions $d_{k-1}$ and $d_k$ (with $d_0 = d_n = 1$). For simplicity, we omit the bond dimensions in Eq. (4). When n is odd, the middle tensor contains the most parameters (with the largest bond dimensions), while the parameter sizes of the rest decrease with increasing distance from the middle tensor. Following Gao et al. (2022), we further simplify the decomposition results of a matrix as a central tensor C (the middle tensor) and auxiliary tensors {A_i}_{i=1}^{n-1} (the rest of the tensors). As a major merit, such a decomposition can effectively reorganize and aggregate the information of the matrix (Liu et al., 2021): the central tensor C can encode the essential information of the original matrix, while the auxiliary tensors {A_i}_{i=1}^{n-1} serve as its complement to exactly reconstruct the matrix.
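To make the decomposition concrete, the following is a minimal NumPy sketch of an MPO (tensor-train style) factorization of a weight matrix via successive SVDs. The factor shapes, the full-rank (no truncation) setting, and the function name `mpo_decompose` are illustrative assumptions rather than the exact procedure used in our implementation.

```python
import numpy as np

def mpo_decompose(W, in_shape, out_shape):
    """Factorize W (I x J) into n local tensors T_k of shape (d_{k-1}, i_k, j_k, d_k).

    in_shape and out_shape factorize I and J, e.g. 768 -> (3, 4, 8, 4, 2) and
    3072 -> (4, 6, 8, 4, 4); with odd n, the middle tensor is the central tensor C
    and the remaining ones are the auxiliary tensors.
    """
    n = len(in_shape)
    assert np.prod(in_shape) * np.prod(out_shape) == W.size
    # Reshape to (i_1, ..., i_n, j_1, ..., j_n), then interleave to (i_1, j_1, ..., i_n, j_n).
    T = W.reshape(*in_shape, *out_shape)
    T = T.transpose([x for k in range(n) for x in (k, n + k)])

    tensors, d_prev = [], 1
    for k in range(n - 1):
        ik, jk = in_shape[k], out_shape[k]
        mat = T.reshape(d_prev * ik * jk, -1)
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        d_k = len(S)                        # keep full rank here; truncate to compress
        tensors.append(U.reshape(d_prev, ik, jk, d_k))
        T = np.diag(S) @ Vt                 # carry the remainder to the next factor
        d_prev = d_k
    tensors.append(T.reshape(d_prev, in_shape[-1], out_shape[-1], 1))
    return tensors

# Illustrative usage: a 768 x 3072 matrix decomposed into n = 5 tensors.
W = np.random.randn(768, 3072)
factors = mpo_decompose(W, (3, 4, 8, 4, 2), (4, 6, 8, 4, 4))
central, auxiliary = factors[2], factors[:2] + factors[3:]
```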

MPO-based Scaling to Deep Models
Based on MPO decomposition, the essence of our scaling method is to share the central tensor across layers (capturing the essential information) and to keep layer-specific auxiliary tensors (modeling layer-specific variations). Fig. 2 shows the overall architecture of the proposed MPOBERT.
Cross-layer Parameter Sharing. To introduce our architecture, we consider a simplified structure of L layers, each consisting of a single matrix. With the fifth-order MPO decomposition (i.e., n = 5), we can obtain the decomposition results for a weight matrix W^{(l)}, denoted as {A_1^{(l)}, A_2^{(l)}, C^{(l)}, A_3^{(l)}, A_4^{(l)}}, where C^{(l)} and {A_i^{(l)}}_{i=1}^{4} are the central tensor and auxiliary tensors of the l-th layer. Our approach is to set a shared central tensor C across layers, which means that C^{(1)} = C^{(2)} = ... = C^{(L)} = C. As shown in Appendix A.1, the central tensor contains the major proportion of parameters (more than 90%), and thus our method can largely reduce the parameters when scaling a PLM to a very deep architecture. We implement our proposed efficient parameter-sharing strategy upon BERT (Devlin et al., 2018), named MPOBERT, which shares the central tensor across all layers. Note that this strategy can be easily applied to multiple matrices in a Transformer layer, and we omit the discussion of the multi-matrix extension. Another extension is to share central tensors within groups of layers. We implement this layer-grouping strategy upon BERT (Devlin et al., 2018), named MPOBERT+, which divides the layers into multiple groups and sets a unique shared central tensor in each group.
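The following is a minimal PyTorch sketch of this idea, assuming a fifth-order decomposition: one central tensor parameter is created once and passed to every layer, while each layer owns its four auxiliary tensors and rebuilds its weight on the fly. The module name, the shapes, and the contraction order are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

class MPOSharedLinear(nn.Module):
    """Linear layer whose weight is rebuilt from MPO factors (n = 5).

    `central` is a single nn.Parameter shared by every layer; the four
    auxiliary tensors are layer-specific. Factors follow the
    (d_{k-1}, i_k, j_k, d_k) convention of the decomposition sketch above.
    """
    def __init__(self, central, aux_shapes):
        super().__init__()
        self.central = central
        self.aux = nn.ParameterList(
            nn.Parameter(torch.randn(*s) * 0.02) for s in aux_shapes
        )

    def weight(self):
        # Contract A1 * A2 * C * A3 * A4 back into a full (I, J) matrix.
        factors = [self.aux[0], self.aux[1], self.central, self.aux[2], self.aux[3]]
        out = factors[0]
        for f in factors[1:]:
            out = torch.tensordot(out, f, dims=([-1], [0]))   # merge adjacent bonds
        out = out.squeeze(0).squeeze(-1)                      # drop d_0 = d_n = 1
        n = out.dim() // 2
        out = out.permute([2 * k for k in range(n)] + [2 * k + 1 for k in range(n)])
        I = int(torch.tensor(out.shape[:n]).prod())
        return out.reshape(I, -1)

    def forward(self, x):
        return x @ self.weight()

# Usage sketch: one shared central tensor, 48 layer-specific auxiliary sets
# (shapes chosen only so the bond dimensions match; they are not the real ones).
central = nn.Parameter(torch.randn(12, 8, 8, 12) * 0.02)
aux_shapes = [(1, 3, 4, 4), (4, 4, 6, 12), (12, 4, 4, 4), (4, 2, 4, 1)]
layers = nn.ModuleList(MPOSharedLinear(central, aux_shapes) for _ in range(48))
```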
Layer-specific Weight Adaptation. Unlike ALBERT (Lan et al., 2019), our MPO-based architecture enables layer-specific adaptation by keeping layer-specific auxiliary tensors ({A_i^{(l)}}_{i=1}^{4}). These auxiliary tensors are decomposed from the original matrix, instead of being extra blocks (Zhang et al., 2022). They contain only a very small proportion of the parameters, and thus do not significantly increase the model size. Moreover, another merit of MPO decomposition is that these tensors are highly correlated via bond dimensions, so a small perturbation on an auxiliary tensor can affect the whole matrix (Liu et al., 2021). If the downstream task requires more layer specificity, we can further incorporate low-rank adapters (Hu et al., 2021) for layer-specific adaptation. Specifically, we denote Adapter^{(l)} as the low-rank adapter for W^{(l)}. In this way, W^{(l)} can be formulated as a set of tensors: {C, A_1^{(l)}, A_2^{(l)}, A_3^{(l)}, A_4^{(l)}, Adapter^{(l)}}. The parameter scale of the adapters, L × r × d_total, is determined by the layer number L, the rank r, and the shape of the original matrix (d_total = d_in + d_out is the sum of the input and output dimensions of a Transformer layer). Since we employ low-rank adapters, we can effectively control the number of additional parameters from adapters.
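As a quick illustration of how small this overhead is, below is a hedged sketch of a per-layer low-rank adapter together with the L × r × d_total parameter count it implies; the hidden sizes (d_in = d_out = 1024, matching the 48-layer configuration) and the zero-initialization of one factor are assumptions for the example.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Layer-specific low-rank correction: W_eff = W + B @ A, with rank r."""
    def __init__(self, d_in, d_out, r=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.02)   # r x d_in
        self.B = nn.Parameter(torch.zeros(d_out, r))          # d_out x r, zero-init

    def delta(self):
        return self.B @ self.A                                 # rank <= r update

# Parameter budget: L * r * (d_in + d_out) extra parameters in total.
L, r, d_in, d_out = 48, 8, 1024, 1024
print(f"{L * r * (d_in + d_out):,} adapter parameters")       # 786,432 (~0.8M)
```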

Stable Training for MPOBERT
With the above MPO-based approach, we can scale a PLM to a deeper architecture in a highly parameter-efficient way. However, as shown in prior studies (Lan et al., 2019; Wang et al., 2022b), it is difficult to optimize very deep PLMs, especially when shared components are involved. In this section, we introduce a simple yet stable training algorithm for the MPO-based PLM and then discuss how it addresses the training instability issue.

MPO-based Network Initialization
Existing work has found that parameter initialization is important for training deep models (Huang et al., 2020; Zhang et al., 2019; Wang et al., 2022b), as it can help alleviate training instability. To better optimize the scaled PLMs, we propose a specially designed initialization method based on the above MPO-based architecture.
Initialization with MPO Decomposition. Since the MPO-based architecture shares global components (i.e., the central tensors) across all layers, our idea is to employ an existing well-trained PLM based on weight sharing to improve parameter initialization. Here, we use the released 24-layer ALBERT, whose parameters are all shared across layers. The key idea is to perform MPO decomposition on the parameter matrices of the ALBERT model and obtain the corresponding central and auxiliary tensors. We first divide the model into several groups by structure (embedding, attention, and feed-forward network). Then, for each group, we initialize the central tensors with those derived from the MPO decomposition results of ALBERT. Since they are globally shared, only a single copy is needed for initialization regardless of the layer depth. Next, for the auxiliary tensors, we directly copy the auxiliary tensors from the MPO decomposition results of ALBERT.
Scaling the Initialization. A potential issue is that ALBERT only provides a 24-layer architecture, so this strategy no longer supports the initialization of an architecture with more than 24 layers (which has no corresponding auxiliary tensors). As our solution, inspired by Wang et al. (2022b), we avoid exploding updates by incorporating an additional scaling coefficient: the randomly initialized values of the auxiliary tensors in layers beyond the 24th are multiplied by a coefficient of $(2L)^{-\frac{1}{4}}$, where L is the number of layers. We then present a theoretical analysis of training stability.
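The following sketch puts both steps together, assuming the `mpo_decompose` helper from Section 3.2: the central tensor is taken once from the decomposed ALBERT weights, ALBERT-derived auxiliary tensors are copied for the first 24 layers, and deeper layers receive Xavier-style random auxiliary tensors scaled by (2L)^{-1/4}. The function name and the fan-in/fan-out convention for 4-dimensional tensors are illustrative assumptions.

```python
import math
import numpy as np

def init_mpobert_factors(albert_weight, in_shape, out_shape, num_layers):
    """Hypothetical initializer: returns (shared central tensor, per-layer auxiliary tensors)."""
    factors = mpo_decompose(albert_weight, in_shape, out_shape)   # sketch in Section 3.2
    mid = len(factors) // 2
    central = factors[mid]                       # shared across all layers, one copy
    albert_aux = factors[:mid] + factors[mid + 1:]

    scale = (2 * num_layers) ** -0.25            # (2L)^{-1/4}
    per_layer_aux = []
    for l in range(num_layers):
        if l < 24:
            # Layers covered by ALBERT: copy its decomposed auxiliary tensors.
            per_layer_aux.append([a.copy() for a in albert_aux])
        else:
            # Deeper layers: Xavier-style random values shrunk by (2L)^{-1/4}.
            aux = []
            for a in albert_aux:
                fan_in, fan_out = a.shape[0] * a.shape[1], a.shape[2] * a.shape[3]
                bound = math.sqrt(6.0 / (fan_in + fan_out))
                aux.append(np.random.uniform(-bound, bound, a.shape) * scale)
            per_layer_aux.append(aux)
    return central, per_layer_aux
```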

Theoretical Analysis
To understand the issue of training instability from a theoretical perspective, we consider a Transformer-based model F(x, W) with x and W as its input and parameters, and consider the model update △F after one training step. According to Wang et al. (2022b), a large model update (△F) at the beginning of training is likely to cause training instability in deep Transformer models. To mitigate the exploding update problem, the update should be bounded by a constant, i.e., ∥△F∥ = O(1). Next, we study how △F is bounded for MPOBERT.
MPO-based Update Bound. Without loss of generality, we consider a simple case of low-order MPO decomposition: n = 3 in Eq. (4). Following the derivation method in Wang et al. (2022b), we simplify the matrices W, A_1, C, and A_2 to scalars w, u, c, v, which means the parameter w_l at the l-th layer can be decomposed as

$$w_l = u_l \cdot c_l \cdot v_l.$$

Based on these notations, we consider an L-layer Transformer-based model F(x, w) with w = {w_1, w_2, ..., w_L}, where each sub-layer is normalized with Post-LN:

$$x_{l+1} = \mathrm{LN}(x_l + f_l(x_l, w_l)).$$

Then we can prove that ∥△F∥ satisfies a bound expressed in terms of the central and auxiliary tensors (see Theorem A.1 in the Appendix). Since the central tensors (c_l) can be initialized using the pre-trained weights, we can further simplify the bound by reducing them. With some derivations (see Corollary A.2 in the Appendix), we can obtain the condition on u_l and v_l that guarantees ∥△F∥ = O(1). For simplicity, we set

$$u_l = v_l = (2L)^{-\frac{1}{4}}$$

to bound the magnitude of each update independently of the layer number L. In the implementation, we first adopt the Xavier method for initialization, and then scale the parameter values with the coefficient $(2L)^{-\frac{1}{4}}$.
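As a rough sanity check (under the assumption, suggested by the bound in Appendix A.2, that each of the roughly 2L sub-layer terms contributes on the order of $u_l^2 v_l^2$ times bounded quantities), the chosen coefficient makes the total update depth-independent:

```latex
u_l = v_l = (2L)^{-\frac{1}{4}}
\;\Rightarrow\;
u_l^2 v_l^2 = (2L)^{-1}
\;\Rightarrow\;
\|\triangle F\| \lesssim \sum_{l=1}^{2L} u_l^2 v_l^2 \cdot O(1)
= 2L \cdot (2L)^{-1} \cdot O(1) = O(1).
```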
Comparison. Previous research has shown that using carefully designed values for random initialization can improve the training of deep models (Huang et al., 2020; Zhang et al., 2019; Wang et al., 2022b). These methods aim to improve the initialization of general Transformer-based architectures for training from scratch. As a comparison, we explore the use of pre-trained weights and employ the MPO decomposition results for initialization. In particular, Gong et al. (2019) have demonstrated the effectiveness of stacking pre-trained shallow layers for deep models in accelerating convergence, also showing the performance superiority of pre-trained weights over random initialization.

Training and Acceleration
To instantiate our approach, we pre-train a 48-layer BERT model (i.e., MPOBERT 48). For a fair comparison with BERT BASE and BERT LARGE, we adopt the same pre-training corpus (BOOKCORPUS (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2018)) and pre-training tasks (masked language modeling and sentence-order prediction). We first perform MPO decomposition on the weights of ALBERT and employ the initialization algorithm in Section 3.3.1 to set the parameter weights. During training, we keep an updated copy of the central and auxiliary tensors: we optimize them according to the pre-training tasks in an end-to-end way and combine them to derive the original parameter matrix for forward computation (at a relatively small cost of parallel matrix multiplication).
Typically, the speed of the pre-training process is limited by three major factors: arithmetic bandwidth, memory bandwidth, and latency. We further utilize a series of efficiency optimizations to accelerate pre-training, such as mixed-precision training with FP16 (reducing memory and arithmetic bandwidth) and fused implementations of activation and normalization (reducing latency). Finally, we can train the 48-layer MPOBERT at a time cost of 3.8 days (compared with a non-optimized cost of 12.5 days) on our server configuration (8 NVIDIA V100 GPUs with 32GB memory). More training details can be found in the experimental setup (Section 4.1) and Appendix A.3 (Table 6 and Algorithm 2).
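For concreteness, the snippet below sketches the kind of FP16 mixed-precision training step we have in mind; it uses the standard torch.cuda.amp utilities and a Hugging-Face-style `model(**batch).loss` interface as assumptions, and omits the fused activation/normalization kernels and the LAMB optimizer used in the actual pipeline.

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling for FP16

def training_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):   # FP16 compute path
        loss = model(**batch).loss                        # e.g. MLM + SOP losses
    scaler.scale(loss).backward()                         # scale to avoid underflow
    scaler.step(optimizer)                                # unscale grads, then update
    scaler.update()
    return loss.detach()
```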

Experiments
In this section, we first set up the experiments and then evaluate the efficiency of MPOBERT on a variety of tasks with different model settings.

Experimental Setup
Pre-training Setup. For the architecture, we denote the number of layers as L, the hidden size as H, and the number of self-attention heads as A. We report results on the following model variants: MPOBERT 12 (L=12, H=768, A=12), MPOBERT 48 (L=48, H=1024, A=16), and MPOBERT 48+, which implements cross-layer parameter sharing in three distinct groups as discussed in Section 3.2.2. We pre-train all of the models with a batch size of 4096 for 10k steps.
Fine-tuning Datasets. To evaluate the performance of our model, we conduct experiments on the GLUE (Wang et al., 2018) and SQuAD v1.1 (Rajpurkar et al., 2016) benchmarks. Since fine-tuning is typically fast, we run an exhaustive parameter search and choose the model that performs best on the development set to make predictions on the test set. We include the details in the Appendix (see Appendix A.4.1 for the datasets and Appendix A.4.2 for the evaluation metrics).
Baseline Models. We compare our proposed MPOBERT to existing competitive deep PLMs and parameter-efficient models. In order to make fair comparisons, we divide the models into three major categories based on their model sizes:
• Small models (#To < 100M). ALBERT 12 (Lan et al., 2019) is the most representative PLM that achieves competitive results with only 11M parameters. In addition, we consider T5 12 and three compressed models with similar parameter counts, namely MobileBERT (Sun et al., 2020b), DistilBERT (Sanh et al., 2019), and TinyBERT (Jiao et al., 2019). We compare with these compressed models to show the benefit of scaling to deeper models over compressing large models into small variants.
• Base models (#To > 100M). We compare with BERT 12 , XLNet 12 , RoBERTa 12 , and BART 12 in this category. Note that we only include the base variants that have similar model sizes in order to make a fair comparison.
More details about the experiments are described in Appendix A.4.

Main Results
Fully-supervised setting. We present the results of MPOBERT and other baseline models on GLUE and SQuAD for fine-tuning in Table 1.
First, we evaluate MPOBERT's performance in comparison to other models with similar numbers of parameters. In particular, for small models, MPOBERT 48 outperforms the best baseline models and achieves substantial improvements on both the development set (85.8 vs. 84.8 for T5 12 ) and the test set (82.6 vs. 81.2 for ALBERT 24 ). This highlights the benefits of the increased capacity from layer-specific parameters (i.e., the auxiliary tensors and layer-specific adapters) in MPOBERT. Furthermore, for small and base models, the 48-layer MPOBERT consistently achieves better results than all parameter-efficient models, while also achieving comparable results to other 12-layer PLMs with a reduced number of parameters. This demonstrates the significant benefit of scaling along the model depth with layer-specific parameters in MPOBERT.
Second, we assess MPOBERT's parameter efficiency by comparing it to other PLMs of the same model depth. For instance, when considering models with L=12 layers, MPOBERT achieves comparable or even better results (+1.7 over BERT 12 and +0.4 over XLNet 12 ) while having fewer parameters. This further highlights the advantages of MPOBERT's parameter-efficient approach to constructing deep models.
Multitask Fine-tuning Setting. To demonstrate the effectiveness of our proposed parameter-sharing model in learning shared representations across multiple tasks, we fine-tune MPOBERT, BERT, and ALBERT on the multitask GLUE benchmark and report the results in Table 2. Specifically, we design two groups of experiments. (1) Deep vs. shallow models. Compared with BERT 12 , MPOBERT 48 has much deeper Transformer layers but still fewer total parameters (i.e., 75M vs. 110M). We find that MPOBERT 48 achieves an average GLUE score 1.4 points higher than BERT 12 . (2) Central-tensor sharing vs. all-weight sharing. Compared with ALBERT 12 , MPOBERT 12 only shares part of the weights, i.e., the central tensors, while ALBERT 12 shares all of the weights. We find that sharing only the central tensors is more effective than sharing all weights (82.0 vs. 81.4 on MRPC).
Few-shot Learning Setting. We evaluate the performance of our proposed model, MPOBERT, in the few-shot learning setting (Huang et al., 2022). The results in Table 3 show that MPOBERT outperforms BERT, which suffers from over-fitting, and ALBERT, which does not benefit from its reduced number of parameters. These results further demonstrate the superiority of our proposed model in exploiting the potential of a large model capacity under limited-data scenarios.

Detailed Analysis
Analysis of Initialization Methods. This experiment aims to exclude the effect of initializing with pre-trained weights on the fine-tuning results. We plot the performance of the model on SST-2 w.r.t. training steps. In particular, we compare the performance of MPOBERT using different initialization methods (Xavier in Fig. 3(a) and decomposed ALBERT weights in Fig. 3(b)) for pre-training. The results demonstrate that pre-training MPOBERT from scratch requires around 50k steps to achieve performance comparable to BERT BASE, while initializing with the decomposed weights of ALBERT significantly accelerates convergence and leads to clear improvements within the first 10k training steps. In contrast, the gains from continual pre-training for ALBERT are negligible. These results provide assurance that the improvements observed in MPOBERT are not solely attributable to the use of initialized pre-trained weights.
Ablation Analysis. To assess the individual impact of the components in our MPOBERT model, we conduct an ablation study by removing either the layer-specific adapters or the cross-layer parameter-sharing strategy. The results, displayed in Table 4, indicate that the removal of either component results in a decrease in the model's performance, highlighting the importance of both components in our proposed strategy. The results also indicate that cross-layer parameter sharing plays the more crucial role of the two.
Performance Comparison w.r.t. Adapter Rank.
To study the impact of the adapter rank in the layer-specific adapters on MPOBERT's performance, we train MPOBERT with different ranks (4, 8, and 64) and evaluate the models on downstream tasks in Table 5. The results demonstrate that a rank of 8 is sufficient for MPOBERT, which further shows the necessity of layer-specific adapters. However, we also observe a decrease in the performance of the variant with adapter rank 64. This illustrates that further increasing the rank may increase the risk of over-fitting during fine-tuning. Therefore, we set the rank to 8 for MPOBERT in the main results.
Analysis of Linguistic Patterns. To investigate the linguistic patterns captured by MPOBERT, BERT, and ALBERT, we conduct a suite of probing tasks, following the methodology of Tenney et al. (2019). These tasks are designed to evaluate the encoding of surface, syntactic, and semantic information in the models' representations. The results, shown in Fig. 4, reveal that BERT encodes more local syntax in lower layers and more complex semantics in higher layers, while ALBERT does not exhibit such a clear trend. MPOBERT exhibits layer-wise behavior similar to BERT on some tasks (i.e., tasks 0, 2, and 4), and improved results in lower layers on others (i.e., task 3), which is similar to ALBERT. These results demonstrate that MPOBERT captures linguistic information differently from the other models, and that its layer-wise parameters play an important role in this difference.

Conclusion
We develop MPOBERT, a parameter-efficient pre-trained language model that allows for the efficient scaling of deep models without the need for additional parameters or computational resources. We achieve this by introducing an MPO-based parameter-sharing architecture that shares central tensors across layers while keeping layer-specific auxiliary tensors, together with an MPO-based initialization and scaling strategy that stabilizes the training of very deep models.

Limitations
The results presented in our study are limited to the natural language processing tasks and datasets on which we evaluate, and further research is needed to fully understand the interpretability and robustness of our MPOBERT models. Additionally, there is subjectivity in the selection of downstream tasks and datasets, despite our use of widely recognized categorizations from the literature. Furthermore, computational constraints limited our ability to study the scaling behavior of MPOBERT at greater depths, such as 96 layers or more. This is an area for future research.

Ethics Statement
The use of a large corpus for training large language models may raise ethical concerns, particularly regarding the potential for bias in the data. In our study, we take precautions to minimize this issue by utilizing only standard training data sources, such as BOOKCORPUS and Wikipedia, which are widely used in language model training (Devlin et al., 2018; Lan et al., 2019). However, it is important to note that when applying our method to other datasets, the potential bias must be carefully considered and addressed. Further investigation and attention should be given to this issue in future studies.

A Appendix
A.1 Matrix Product Operators
Formally, given a weight matrix W ∈ R^{I×J}, we can factorize its two dimensions into products of natural numbers,

$$I = \prod_{k=1}^{n} i_k, \qquad J = \prod_{k=1}^{n} j_k,$$

and reshape W into a tensor W_{i_1,...,i_n,j_1,...,j_n}. This decomposition can be written as:

$$W_{i_1 \ldots i_n, j_1 \ldots j_n} = \mathcal{T}_{(1)}[i_1, j_1]\, \mathcal{T}_{(2)}[i_2, j_2] \cdots \mathcal{T}_{(n)}[i_n, j_n],$$

where each $\mathcal{T}_{(k)}[i_k, j_k]$ is a $d_{k-1} \times d_k$ matrix and $d_k$ is the bond dimension connecting adjacent tensors (with $d_0 = d_n = 1$). This bond dimension indicates the associative strength between two adjacent tensors. For clarity, we can rewrite the decomposition results as a central tensor C and auxiliary tensors {A_i}_{i=1}^{n-1}. As an important merit, such a decomposition can effectively reorganize and aggregate the information of the matrix (Gao et al., 2020a): the central tensor C can encode the essential information of the original matrix, while the auxiliary tensors {A_i}_{i=1}^{n-1} serve as its complement to precisely reconstruct the matrix.
The bond dimension $d_k$ ($k \in \{1, \ldots, n-1\}$) of the k-th tensor is defined by

$$d_k = \min\Big(\prod_{m=1}^{k} i_m j_m,\; \prod_{m=k+1}^{n} i_m j_m\Big). \quad (5)$$

From Eq. (5), we can see that $d_k$ is large in the middle and small on both sides. Algorithm 1 presents a detailed algorithm for MPO decomposition.
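A small helper (a sketch, assuming the standard tensor-train formula above) makes this profile easy to check numerically:

```python
import numpy as np

def bond_dims(in_shape, out_shape):
    """Maximal bond dimensions d_k for an exact MPO decomposition, per Eq. (5)."""
    n = len(in_shape)
    dims = []
    for k in range(1, n):
        left = np.prod([i * j for i, j in zip(in_shape[:k], out_shape[:k])])
        right = np.prod([i * j for i, j in zip(in_shape[k:], out_shape[k:])])
        dims.append(int(min(left, right)))
    return dims

# A 768 x 3072 matrix with the illustrative factorization used earlier:
print(bond_dims((3, 4, 8, 4, 2), (4, 6, 8, 4, 4)))   # [12, 288, 128, 8] -- peaks in the middle
```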
The MPO representation of W is obtained by factorizing it into a sequential product of local tensors.

A.2 Proofs
Notations. We denote L(·) as the loss function and LN(x) as standard layer normalization with scale γ = 1 and bias β = 0. Let O(·) denote standard Big-O notation, which suppresses multiplicative constants.

The notation $\overset{\Theta}{=}$ stands for an equal bound of magnitude. We aim to study the magnitude of the model update, which we define as ∥△F∥.
Proof of Theorem A.1 (sketch). With Assumption 2 and a Taylor expansion of the model update, the magnitudes of ∂f_l/∂x and ∂f_l/∂θ are bounded via Eq. (8). Since we apply MPO decomposition to θ_l, and, for simplicity, reduce the matrices U, C, V to the scalars u, c, v, Eq. (9) can be reformulated under Assumption 3, which yields the bound on ∥△F∥ in terms of u_l, c_l, and v_l. □
Corollary A.2. Given that we initialize the central tensors c_l in MPOBERT with well-trained weights, it is reasonable to assume that the updates of c_l are well bounded. Then △F satisfies ∥△F∥ = O(1) when, for all i = 1, ..., N, the auxiliary scalars u_i and v_i are chosen appropriately.
Proof. For an N-layer MPOBERT, we bound the contribution of each layer in terms of ∥θ_N∥ and sum over layers. Due to symmetry, we set u_i = u and v_i = v. Thus, from Corollary A.2, setting u = v = (2N)^{-1/4} bounds the magnitude of each update independently of the model depth N, i.e., ∥△F∥ = O(1). □

A.3 Training Details
Here we describe the details of the pre-training process in Algorithm 2. For pre-training, we tune the learning rate in the range of [1.0 × 10^{-6}, 1.0 × 10^{-5}] and use the LAMB optimizer (You et al., 2020). Since fine-tuning is typically fast, we run an exhaustive parameter search (i.e., learning rate in the range of [2.0 × 10^{-6}, 2.0 × 10^{-4}], batch size in {8, 16, 32}) and choose the model that performs best on the development set to make predictions on the test set.

A.3.1 Details of Training Configurations
In this part, we list the training configurations of MPOBERT and other representative PLMs in Table 6. It is important to highlight that MPOBERT holds the promise of substantially enhancing inference speed. This is mainly because inference speed is typically constrained by memory bandwidth, which in turn is restricted by the available memory capacity. To elucidate this point, envision a situation where the model weights used in matrix multiplications are stored in a smaller yet higher-bandwidth memory. In such cases, there is notable potential for a significant speed-up, particularly when compared with the situation where these weights must be fetched from a larger memory with lower bandwidth.

Figure 1: A comparison of our model and representative PLMs in the dimensions of model size and model depth.

Figure 2: Overview architecture of MPOBERT and MPOBERT+. We use blocks with dashed borderlines to represent shared central tensors. Central tensors are shared across all L layers in MPOBERT and within groups in MPOBERT+.

Figure 3: Comparison of the SST-2 accuracy achieved through pre-training from scratch and pre-training with the initialization of decomposed ALBERT weights.
Figure 4: Layer-wise results of MPOBERT, BERT, and ALBERT on the probing tasks.

Table 3: Comparison of few-shot performance.


Table 5: Comparison of different adapter ranks on three GLUE tasks (in percent). "Rank" denotes the adapter rank in MPOBERT.

Table 7: Model performance comparison. The remarkable enhancement in the performance of MPOBERT-48 is clearly demonstrated by the results on the GLUE test set (82.6 vs. 79.1), underscoring that the additional computation it requires is worthwhile. As a result, we recommend prioritizing MPOBERT-48 in scenarios where performance takes precedence.