Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators

This paper presents a novel pre-trained language model (PLM) compression approach based on the matrix product operator (abbreviated as MPO) from quantum many-body physics. The MPO decomposes an original matrix into central tensors (containing the core information) and auxiliary tensors (with only a small proportion of parameters). With the decomposed MPO structure, we propose a novel fine-tuning strategy that updates only the parameters of the auxiliary tensors, and design an optimization algorithm for MPO-based approximation over stacked network architectures. Our approach can be applied to either original or compressed PLMs in a general way, deriving a lighter network while significantly reducing the parameters to be fine-tuned. Extensive experiments have demonstrated the effectiveness of the proposed approach in model compression, especially the reduction in fine-tuning parameters (91% reduction on average). The code to reproduce the results of this paper can be found at https://github.com/RUCAIBox/MPOP.


Introduction
Recently, pre-trained language models (PLMs) (Devlin et al., 2019; Peters et al., 2018; Radford et al., 2018) have made significant progress in various natural language processing tasks. Instead of training a model from scratch, one can fine-tune a PLM to solve a specific task through the paradigm of "pre-training and fine-tuning".
Typically, PLMs are constructed with stacked Transformer layers (Vaswani et al., 2017), involving a huge number of parameters to be learned. Though effective, the large model size makes PLMs impractical for resource-limited devices. Therefore, an increasing number of studies focus on parameter or memory reduction for PLMs (Noach and Goldberg, 2020), including parameter sharing (Lan et al., 2020), knowledge distillation (Sanh et al., 2019), low-rank approximation (Ma et al., 2019) and data quantization (Hubara et al., 2017). However, these studies mainly apply generic parameter reduction techniques to PLM compression, which may not be intrinsically appropriate for the learning paradigm and architecture of PLMs. The compressed parameters are highly coupled, so it is difficult to directly manipulate different parts with specific strategies. For example, most PLM compression methods need to fine-tune the whole network architecture, although only a small proportion of parameters change significantly during fine-tuning.
In this paper, we introduce a novel matrix product operator (MPO) technique from quantum many-body physics for compressing PLMs (Gao et al., 2020). The MPO is an algorithm that factorizes a matrix into a sequential product of local tensors (i.e., multi-way arrays). Here, we call the tensor right in the middle the central tensor and the rest auxiliary tensors. An important merit of the MPO decomposition is structural in terms of information distribution: the central tensor, with most of the parameters, encodes the core information of the original matrix, while the auxiliary tensors, with only a small proportion of parameters, play the role of complementing the central tensor. This property motivates us to investigate whether MPO can be applied to derive a better PLM compression approach: can we compress the central tensor for parameter reduction and update only the auxiliary tensors for lightweight fine-tuning? If this could be achieved, we would derive a lighter network while also reducing the parameters to be fine-tuned.
To this end, we propose an MPO-based compression approach for PLMs, called MPOP. It is developed based on the MPO decomposition technique (Gao et al., 2020; Pirvu et al., 2010). We make two critical technical contributions for compressing PLMs with MPO. First, we introduce a new fine-tuning strategy that only updates the parameters of the auxiliary tensors, so the number of fine-tuning parameters can be largely reduced. We present both theoretical analysis and experimental verification for the effectiveness of this fine-tuning strategy. Second, we propose a new optimization algorithm, called dimension squeezing, tailored for stacked neural layers. Since mainstream PLMs usually consist of multiple Transformer layers, directly applying low-rank approximation with MPO at each layer produces accumulated reconstruction error. The dimension squeezing algorithm gradually performs dimension truncation in a more stable way, dramatically alleviating the accumulated error in the stacked architecture.
To our knowledge, this is the first time that MPO has been applied to PLM compression, and it is well suited for both the learning paradigm and the architecture of PLMs. We conduct experiments to evaluate the effectiveness of the proposed compression approach for ALBERT, BERT, DistilBERT and MobileBERT, respectively, on the GLUE benchmark. Extensive experiments have demonstrated the effectiveness of the proposed approach in model compression, especially the dramatic reduction in fine-tuning parameters (91% reduction on average).

Related Work
We review related work in three aspects.

Pre-trained Language Model Compression.
Since the advent of large-scale PLMs, several variants have been proposed to alleviate their memory consumption. For example, DistilBERT (Sanh et al., 2019) and MobileBERT (Sun et al., 2020c) leveraged knowledge distillation to reduce the BERT network size. SqueezeBERT (Iandola et al., 2020) and Q8BERT (Zafrir et al., 2019) adopted special techniques to substitute the operations or quantize both weights and activations. ALBERT (Lan et al., 2020) introduced cross-layer parameter sharing and low-rank approximation to reduce the number of parameters. More studies (Jiao et al., 2020; Hou et al., 2020; Khetan and Karnin, 2020; Pappas et al., 2020; Sun et al., 2020a) can be found in the comprehensive survey (Ganesh et al., 2020).
Tensor-based Network Compression. Tensor-based methods have been successfully applied to neural network compression. For example, MPO has been utilized to compress the linear layers of deep neural networks (Gao et al., 2020). Sun et al. (2020b) used MPO to compress an LSTM model on acoustic data. Novikov et al. (2015) coined the idea of reshaping the weights of fully-connected layers into high-dimensional tensors and representing them in Tensor Train (TT) (Oseledets, 2011) format, which was extended to other network architectures (Garipov et al., 2016; Yu et al., 2017; Tjandra et al., 2017; Khrulkov et al., 2019). Ma et al. (2019) adopted block-term tensor decomposition to compress Transformer layers in PLMs.
Lightweight Fine-tuning. In the past, lightweight fine-tuning was performed without considering parameter compression. As a typical approach, trainable modules are inserted into PLMs. For example, a "side" network is fused with the PLM via summation in (Zhang et al., 2020), and adapter-tuning inserts task-specific layers (adapters) between each layer of a PLM (Houlsby et al., 2019; Rebuffi et al., 2017). In contrast, several studies consider removing parameters from PLMs. For example, some model weights are ablated away by training a binary parameter mask (Radiya-Dixit and Wang, 2020).
Our work builds on these studies, but takes a new perspective by designing a PLM compression algorithm that enables lightweight fine-tuning. This is the first time that MPO has been applied to PLM compression, and we make two major technical contributions for achieving lightweight fine-tuning and stable optimization.

Preliminary
In this paper, scalars are denoted by lowercase letters (e.g., $a$), vectors by boldface lowercase letters (e.g., $\mathbf{v}$), matrices by boldface capital letters (e.g., $\mathbf{M}$), and high-order (order three or higher) tensors by boldface Euler script letters (e.g., $\mathcal{T}$). An $n$-order tensor $\mathcal{T}_{i_1, i_2, \ldots, i_n}$ can be considered a multi-dimensional array with $n$ indices $\{i_1, i_2, \ldots, i_n\}$.
Matrix Product Operator. Originating from quantum many-body physics, the matrix product operator (MPO) is a standard algorithm that factorizes a matrix into a sequential product of multiple local tensors (Gao et al., 2020; Pirvu et al., 2010). Formally, given a matrix $\mathbf{M} \in \mathbb{R}^{I \times J}$, its MPO decomposition into a product of $n$ local tensors can be represented as:

$$\text{MPO}(\mathbf{M}) = \prod_{k=1}^{n} \mathcal{T}_{(k)}[d_{k-1}, i_k, j_k, d_k], \quad (1)$$

where $\prod_{k=1}^{n} i_k = I$, $\prod_{k=1}^{n} j_k = J$, and $d_0 = d_n = 1$. We use the concept of bond to connect two adjacent tensors (Pirvu et al., 2010). The bond dimension $d_k$ is defined by:

$$d_k = \min\Big(\prod_{m=1}^{k} i_m j_m,\ \prod_{m=k+1}^{n} i_m j_m\Big). \quad (2)$$

From Eq. (2), we can see that $d_k$ is large in the middle and small on both sides. We present a detailed algorithm for MPO decomposition in Algorithm 1. In this case, we refer to the tensor right in the middle as the central tensor, and the rest as auxiliary tensors. Figure 1 presents an illustration of MPO decomposition; we use $n = 5$ in this paper.
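The MPO decomposition of a matrix into $n$ local tensors can be sketched with sequential SVDs, in the style of a tensor-train factorization. Below is a minimal NumPy sketch (the function names and axis bookkeeping are ours, not from the paper):

```python
import numpy as np

def mpo_decompose(M, in_dims, out_dims):
    """Factorize M (I x J) into n local tensors of shape
    (d_{k-1}, i_k, j_k, d_k) via sequential SVDs, where
    prod(in_dims) = I and prod(out_dims) = J."""
    n = len(in_dims)
    T = M.reshape(list(in_dims) + list(out_dims))
    # Interleave axes so each (i_k, j_k) pair is contiguous.
    T = T.transpose([ax for k in range(n) for ax in (k, n + k)])
    tensors, d_prev = [], 1
    for k in range(n - 1):
        T = T.reshape(d_prev * in_dims[k] * out_dims[k], -1)
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        d_k = len(S)  # full bond dimension (no truncation here)
        tensors.append(U.reshape(d_prev, in_dims[k], out_dims[k], d_k))
        T, d_prev = np.diag(S) @ Vt, d_k
    tensors.append(T.reshape(d_prev, in_dims[-1], out_dims[-1], 1))
    return tensors

def mpo_reconstruct(tensors):
    """Contract the local tensors back into the original matrix."""
    n = len(tensors)
    out = tensors[0]
    for T in tensors[1:]:
        out = np.tensordot(out, T, axes=([-1], [0]))
    out = out[0, ..., 0]  # drop the trivial d_0 = d_n = 1 bonds
    # (i_1, j_1, ..., i_n, j_n) -> (i_1, ..., i_n, j_1, ..., j_n)
    out = out.transpose(list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2)))
    rows = int(np.prod(out.shape[:n]))
    return out.reshape(rows, -1)
```

Without truncation, the contraction of the local tensors reproduces the original matrix exactly.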
Algorithm 1 MPO decomposition for a matrix.

With Eq. (1), we can exactly reconstruct the original matrix $\mathbf{M}$ through the product of the derived local tensors. Following (Gao et al., 2020), we can truncate the $k$-th bond dimension $d_k$ (see Eq. (1)) of the local tensors to $d'_k$ for low-rank approximation, where $d_k > d'_k$. We can set different values for $\{d'_k\}_{k=1}^{n}$ to control the expressive capacity of the MPO-based reconstruction. The truncation error induced by the $k$-th bond dimension, denoted by $\epsilon_k$ (called the local truncation error), can be efficiently computed as:

$$\epsilon_k = \sqrt{\sum_{j=d'_k+1}^{d_k} \sigma_j^2}, \quad (3)$$

where $\{\sigma_j\}$ are the singular values of the bipartition matrix $\mathbf{M}[i_1 j_1 \ldots i_k j_k,\ i_{k+1} j_{k+1} \ldots i_n j_n]$. Then the total truncation error satisfies:

$$\|\mathbf{M} - \widetilde{\mathbf{M}}\|_F^2 \le \sum_{k=1}^{n-1} \epsilon_k^2. \quad (4)$$

The proof can be found in the supplementary materials 1 . Eq. (4) indicates that the reconstruction error is bounded by the sum of the squared local truncation errors, which is easy to estimate in practice. Suppose that we have truncated the dimensions of the local tensors from $\{d_k\}_{k=1}^{n}$ to $\{d'_k\}_{k=1}^{n}$; the compression ratio introduced by quantum many-body physics (Gao et al., 2020) can then be computed as:

$$\rho = \frac{\sum_{k=1}^{n} d'_{k-1}\, i_k\, j_k\, d'_k}{I \times J}. \quad (5)$$

The smaller the compression ratio $\rho$ is, the fewer parameters are kept in the MPO representation; conversely, a larger $\rho$ keeps more parameters and yields a smaller reconstruction error. When $\rho > 1$, the decomposed tensors have more parameters than the original matrix.
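Truncating one bond and computing the compression ratio can be sketched as follows (a minimal illustration, assuming an SVD at the bond being truncated; helper names are ours):

```python
import numpy as np

def truncate_bond(U, S, Vt, d_keep):
    """Truncate one bond: keep the d_keep largest singular values.
    The local truncation error eps_k is the Frobenius norm of the
    discarded tail of the singular-value spectrum."""
    eps_k = float(np.sqrt(np.sum(S[d_keep:] ** 2)))
    return U[:, :d_keep], S[:d_keep], Vt[:d_keep, :], eps_k

def compression_ratio(tensors, I, J):
    """rho = (total #params of the local tensors) / (I * J).
    rho > 1 means the MPO holds more parameters than the matrix."""
    return sum(T.size for T in tensors) / (I * J)
```

For a single SVD truncation, the reconstruction error equals the local truncation error exactly; across stacked bonds, only the bound of Eq. (4) holds.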

Approach
So far, most pre-trained language models (PLMs) are developed based on stacked Transformer layers (Vaswani et al., 2017). Based on such an architecture, it has become the standard paradigm to first pre-train PLMs and then fine-tune them on task-specific data. The parameters of PLMs can generally be represented in matrix format. Hence, it is natural to apply MPO-based approximation to compress the parameter matrices in PLMs by truncating tensor dimensions.
In particular, we propose two major improvements for MPO-based PLM compression, which can largely reduce the fine-tuning parameters and effectively improve the optimization of stacked architecture, respectively.

Lightweight Fine-tuning with Auxiliary Tensors
Due to the high coupling of parameters, previous PLM compression methods usually need to finetune all the parameters. As a comparison, the MPO approach decomposes a matrix into a list of local tensors, which makes it potentially possible to consider fine-tuning different parts with specific strategies. Next, we study how to perform lightweight fine-tuning based on MPO properties.
Parameter Variation from Pre-Training. To motivate our solution for lightweight fine-tuning, we first conduct an empirical experiment to check the degree to which parameters vary before and after fine-tuning. Here, we adopt the standard pre-trained BERT (Devlin et al., 2019) and fine-tune it on the SST-2 task (Socher et al., 2013). We first compute the absolute difference for each parameter value and then compute the ratio of parameters at different variation levels. The statistics are reported in Table 1. As we can see, most parameters vary little, especially in the word embedding layer. This finding has also been reported in previous studies (Khetan and Karnin, 2020). As discussed in Section 3, after MPO decomposition, the central tensor contains the majority of the parameters, while the auxiliary tensors contain only a small proportion. This merit inspires us to fine-tune only the parameters in the auxiliary tensors while keeping the central tensor fixed. If this approach were feasible, it would largely reduce the parameters to be fine-tuned.
Theoretical Analysis. Here we introduce the entanglement entropy from quantum mechanics (Calabrese and Cardy, 2004) as the metric to measure the information contained in MPO bonds. It is similar to the entropy in information theory, but replaces probabilities with the normalized singular values produced by SVD. This is well suited for measuring the information of a matrix, since singular values often correspond to important information implicitly encoded in the matrix, and the importance is positively correlated with the magnitude of the singular values. Following (Calabrese and Cardy, 2004), the entanglement entropy $S_k$ corresponding to the $k$-th bond can be calculated by:

$$S_k = -\sum_{j=1}^{d_k} v_j \log v_j, \quad (6)$$

where $\{v_j\}_{j=1}^{d_k}$ denote the normalized SVD eigenvalues of $\mathbf{M}[i_1 j_1 \ldots i_k j_k,\ i_{k+1} j_{k+1} \ldots i_n j_n]$. The entanglement entropy $S_k$ is an increasing function of the dimension $d_k$, as described in (Gao et al., 2020). Based on Eq. (2), the central tensor has the largest bond dimensions, corresponding to the largest entanglement entropy. This indicates that most of the information in the original matrix is concentrated in the central tensor. Furthermore, the larger a dimension is, the larger the updating effect will be. According to (Pirvu et al., 2010), it is also guaranteed in principle that any change to one tensor is transmitted to the whole set of local tensors. Thus, optimizing either the central tensor or the auxiliary tensors would have almost the same effect after convergence for PLMs.
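The entanglement entropy of a bond can be sketched as the entropy of the normalized singular-value spectrum of the corresponding bipartition matrix. A minimal sketch, following the text's description (the exact normalization used in the paper may differ):

```python
import numpy as np

def entanglement_entropy(M):
    """Entropy of the normalized singular-value spectrum of M, used as
    a proxy for the information carried across an MPO bond."""
    s = np.linalg.svd(M, compute_uv=False)
    v = s / s.sum()            # normalize like a probability vector
    v = v[v > 0]               # drop zero values (0 * log 0 := 0)
    return float(-(v * np.log(v)).sum())
```

A matrix with a flat spectrum (e.g., the identity) has maximal entropy, while a rank-one matrix has zero entropy, matching the intuition that larger bonds carry more information.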
Based on the above analysis, we speculate that the information affected during fine-tuning is mainly encoded in the auxiliary tensors, which is why the overall variations are small. Therefore, for lightweight fine-tuning, we first perform the MPO decomposition on a parameter matrix, and then only update its auxiliary tensors on the downstream task while keeping the central tensor fixed. Experimental results in Section 5.2 will demonstrate that this approach is indeed effective.
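In code, lightweight fine-tuning amounts to marking only the auxiliary tensors as trainable. The sketch below (our illustration; the tensor shapes are invented for a hypothetical 768×768 matrix and are not from the paper) splits a 5-tensor MPO into a frozen central tensor and trainable auxiliary tensors:

```python
import numpy as np

def split_trainable(mpo_tensors):
    """Split an MPO tensor list into the frozen central tensor (the
    middle entry, e.g. index 2 of 5) and the trainable auxiliary ones."""
    central_idx = len(mpo_tensors) // 2
    central = mpo_tensors[central_idx]
    auxiliary = [T for k, T in enumerate(mpo_tensors) if k != central_idx]
    return central, auxiliary

# Illustrative shapes (d_{k-1}, i_k, j_k, d_k) for a 768 x 768 matrix;
# the central tensor holds the bulk of the parameters.
shapes = [(1, 4, 4, 16), (16, 4, 4, 256), (256, 3, 3, 256),
          (256, 4, 4, 16), (16, 4, 4, 1)]
tensors = [np.zeros(s) for s in shapes]
central, aux = split_trainable(tensors)
trainable = sum(T.size for T in aux)
frozen = central.size
```

In a PyTorch implementation this would amount to calling `requires_grad_(False)` on the central tensor while leaving the auxiliary tensors trainable.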

Dimension Squeezing for Stacked Architecture Optimization
Most PLMs are stacked with multiple Transformer layers. Hence, a major problem with directly applying MPO to compress PLMs is that the reconstruction error tends to accumulate and be amplified exponentially with the number of layers. It is thus important to develop a more stable optimization algorithm tailored to the stacked architecture.
Fast Reconstruction Error Estimation. Without loss of generality, we consider a simple case in which each layer contains exactly one parameter matrix to be compressed. Assume that there are $L$ layers, so we have $L$ parameter matrices in total, denoted by $\{\mathbf{M}^{(l)}\}_{l=1}^{L}$. Let $\mathcal{C}^{(l)}$ denote the corresponding central tensor, with a specific dimension $d^{(l)}$, after decomposing $\mathbf{M}^{(l)}$ with MPO. Our idea is to select one central tensor at a time and reduce its dimension by one, under the selection criterion that this truncation leads to the least reconstruction error. However, exactly evaluating the reconstruction error of the original matrix is time-consuming. Instead, we can utilize the error bound $\sum_{k=1}^{n-1} \epsilon_k^2$ from Section 3 for a fast estimation of the resulting reconstruction error. In this case, only one $\epsilon_k$ changes, and it can be efficiently computed via the pre-computed eigenvalues.
Fast Performance Gap Computation. At each step, we compute the performance gap before and after the dimension reduction ($d^{(l)} \to d^{(l)} - 1$) as the stopping criterion. To obtain the performance $\hat{p}$ after dimension reduction, we need to fine-tune the truncated model on the downstream task. We can again utilize the lightweight fine-tuning strategy from Section 4.1 to obtain $\hat{p}$ by only tuning the auxiliary tensors. If the performance gap $p - \hat{p}$ exceeds a threshold $\Delta$ or the iteration number exceeds the predefined limit, the algorithm terminates. Such an optimization algorithm is more stable for optimizing stacked architectures, since it gradually reduces the dimensions while considering both the reconstruction error and the performance gap. It is similar to the learning of variable matrix product states (Iblisdir et al., 2007) in physics, which optimizes the tensors one by one in sequence. In comparison, our algorithm dynamically selects the matrix to truncate and is more suitable for PLMs.
Algorithm 2 presents the complete procedure for our algorithm. In practice, there are usually multiple parameter matrices to be optimized at each layer. This can be handled in a similar way: at each step, we select the matrix to truncate among all the considered matrices across layers.
Algorithm 2 Training with dimension squeezing.
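The greedy loop described above can be sketched as follows. This is our own minimal illustration, not the paper's exact Algorithm 2: the `Layer` interface and the `evaluate`/`fine_tune_aux` callbacks are invented for exposition.

```python
class Layer:
    """One compressible matrix: its pre-computed singular values at the
    truncated bond, and the current bond dimension."""
    def __init__(self, singular_values):
        self.singular_values = singular_values
        self.dim = len(singular_values)

def dimension_squeeze(layers, evaluate, fine_tune_aux, delta, max_iters):
    """Repeatedly truncate the bond whose removal adds the least
    estimated error, stopping when performance drops too far."""
    p = evaluate()                       # performance before squeezing
    for _ in range(max_iters):
        candidates = [l for l in layers if l.dim > 1]
        if not candidates:
            break
        # Cheap error estimate: truncating d -> d-1 discards the
        # smallest kept singular value sigma_d.
        best = min(candidates,
                   key=lambda l: l.singular_values[l.dim - 1] ** 2)
        best.dim -= 1                    # d^(l) -> d^(l) - 1
        fine_tune_aux()                  # lightweight fine-tuning step
        p_hat = evaluate()
        if p - p_hat > delta:            # performance gap too large
            best.dim += 1                # roll back and stop
            break
    return layers
```

With two toy layers, the loop first squeezes the layer whose smallest kept singular value is smallest, then moves to the other layer once further truncation there would cost more.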

Overall Compression Procedure
Generally speaking, our approach can compress any PLM with a stacked architecture consisting of parameter matrices, even an already compressed PLM. In other words, it can work with existing PLM compression methods to achieve a better compression performance. Here, we select ALBERT (Lan et al., 2020) as a representative compressed PLM and apply our algorithm to it.
The procedure can be summarized as follows. First, we obtain the learned (complete) ALBERT model and perform the MPO decomposition on the three major parameter matrices, namely the word embedding matrix, the self-attention matrices and the feed-forward matrices 2 . Each matrix is decomposed into a central tensor and auxiliary tensors. Next, we perform lightweight fine-tuning to update the auxiliary tensors until convergence on downstream tasks. Then, we apply the dimension squeezing optimization algorithm to the three central tensors, i.e., we select one matrix for truncation at each step. After each truncation, we fine-tune the compressed model to further stabilize its performance. This process repeats until the performance gap or the iteration number exceeds the pre-defined threshold.
In this way, we expect that ALBERT can be further compressed. In particular, it can be fine-tuned more efficiently, with only a small number of parameters to be updated. Section 5.2 will demonstrate this.

Discussion
In mathematics, MPO-based approximation can be considered as a special low-rank approximation method. Now, we compare it with other low-rank approximation methods, including SVD (Henry and Hofrichter, 1992), CPD (Hitchcock, 1927) and Tucker decomposition (Tucker, 1966).
We present the categorization of these methods in Table 2. For PLM compression, low-rank decomposition is performed only once, while forward propagation is computed repeatedly. Hence, we compare their inference time complexities. All of these methods can be formulated as tensor-based decompositions (i.e., a list of tensors for factorization) or matrix decompositions, and we characterize their time complexities with common parameters. Indeed, MPO and Tucker represent two categories of low-rank approximation methods. Generally, the algorithm capacity grows with $n$ (more tensors). When $n > 3$, MPO has a smaller time complexity than Tucker decomposition. Moreover, SVD can be considered a special case of MPO with tensor dimension $n = 2$, and CPD is a special case of Tucker in which the core tensor is super-diagonal.
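As a concrete illustration of the parameter counts behind these comparisons, the sketch below computes the MPO parameter count $\sum_k d_{k-1}\, i_k\, j_k\, d_k$ for a hypothetical 768×768 layer (the factorization and bond dimensions are chosen purely for illustration, not taken from the paper):

```python
def mpo_num_params(in_dims, out_dims, bond_dims):
    """sum_k d_{k-1} * i_k * j_k * d_k with d_0 = d_n = 1."""
    d = [1] + list(bond_dims) + [1]
    return sum(d[k] * i * j * d[k + 1]
               for k, (i, j) in enumerate(zip(in_dims, out_dims)))

# A 768 x 768 layer factorized into n = 3 local tensors with truncated
# bond dimensions of 16: 38,912 parameters vs 589,824 in the full matrix.
mpo = mpo_num_params((8, 12, 8), (8, 12, 8), (16, 16))
full = 768 * 768
```

Larger $n$ and smaller bond dimensions shrink the count further, at the cost of more contractions at inference time.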
In practice, we do not need to strictly follow the original matrix size. Instead, it is easy to pad additional zero entries to enlarge the matrix rows or columns, so that we can obtain different MPO decomposition plans. It has been demonstrated that different decomposition plans lead to almost the same results (Gao et al., 2020). In our experiments, we adopt an odd number of local tensors for MPO decomposition, i.e., five local tensors (see supplementary materials). Note that MPO decomposition can work with other compression methods: it can further reduce the parameters of matrices compressed by other methods, and meanwhile largely reduce the parameters to be fine-tuned.

Experiments
In this section, we first set up the experiments, and then report the results and analysis.

Experimental Setup
Datasets. We evaluate the effectiveness of our approach MPOP in compressing and fine-tuning PLMs on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019). GLUE is a collection of 9 datasets for evaluating natural language understanding systems. Following (Sanh et al., 2019), we report the macro-score (the average of individual scores, which differs slightly from the official GLUE score, since Spearman correlations are reported for STS-B and accuracy scores for the other tasks) on the development set of each task by fine-tuning MPOP.
Baselines. Our baseline methods include:
• BERT (Devlin et al., 2019): The 12-layer BERT-base model pre-trained on the Wikipedia corpus released by Google.
• ALBERT (Lan et al., 2020): It yields a highly compressed BERT variant with only 11.6M parameters, while maintains competitive performance, which serves as the major baseline.
• MobileBERT (Sun et al., 2020c): It is equipped with bottleneck structures and a carefully designed balance between self-attentions and feedforward networks.
All these models are released by Huggingface 3 . We select these baselines because they are widely adopted and have a diverse coverage of compression techniques. Note that we do not directly compare our approach with other competitive methods (Tambe et al., 2020) that require special optimization tricks or techniques (e.g., hardware-level optimization).

Table 3: Performance on GLUE benchmark obtained by fine-tuning ALBERT and MPOP. "ALBERT_pub" and "ALBERT_rep" denote the results from the original paper (Lan et al., 2020) and those reproduced by us, respectively. "#Pr" and "#To" denote the number (in millions) of pre-trained parameters and total parameters, respectively.
Implementation. The original ALBERT paper only reported results on SST-2 and MNLI in GLUE, so we reproduce complete results, denoted as "ALBERT_rep", with the Huggingface implementation (Wolf et al., 2020). Based on the pre-trained parameters provided by Huggingface, we also reproduce the results of BERT, DistilBERT and MobileBERT. To ensure a fair comparison, we adopt the same network architecture. For example, the number of self-attention heads, the hidden dimension of the embedding vectors, and the max length of the input sentence are set to 12, 768 and 128, respectively.

Experimental Results
Note that our focus is to illustrate that our approach can improve either original (uncompressed) or compressed PLMs. In our main experiments, we adopt ALBERT as the major baseline, and report the comparison results in Table 3.

Comparison with ALBERT. As shown in Table 3, our approach MPOP is very competitive on the GLUE benchmark, and it outperforms ALBERT on all tasks (except MNLI) with a higher overall score of 79.7. Looking at the last column, compared with ALBERT, MPOP reduces total parameters by 22% (#To). In particular, it results in a significant reduction of pre-trained parameters by 91% (#Pr). Such a reduction is remarkable for lightweight fine-tuning, dramatically improving fine-tuning efficiency. Zooming in on specific tasks, the improvements over ALBERT are larger on the CoLA, RTE and WNLI tasks. An interesting explanation is that RTE and WNLI have small training sets (fewer than 4k samples). The lightweight fine-tuning strategy seems to work better with limited training data, which enhances the capacity of PLMs and prevents overfitting on downstream tasks.
Ablation Results. Our approach incorporates two novel improvements: lightweight fine-tuning with auxiliary tensors and optimization with dimension squeezing. We now study their effect on the final performance. Here we consider three variants for comparison: (1) MPOP_full and MPOP_full+LFA are full-rank MPO representations (without reconstruction error) that fine-tune all the tensors and only the auxiliary tensors, respectively. This comparison examines whether fine-tuning only the auxiliary tensors leads to a performance decrease. (2) MPOP_dir directly optimizes the compressed model without the dimension squeezing algorithm. This variant examines whether our optimization algorithm is more suitable for stacked architectures. Table 3 (last three rows) shows the results of these ablations. In particular, the dimension squeezing algorithm plays a key role in improving our approach (there is a significant performance decrease for MPOP_dir), since it is tailored to stacked architectures. Comparing MPOP_full with MPOP_full+LFA, we note that fine-tuning all the parameters seems to have a negative effect on performance. Compared with ALBERT, we speculate that fine-tuning a larger model is more likely to overfit on small datasets (e.g., RTE and MRPC). These results show that our approach is able to further compress ALBERT with fewer fine-tuning parameters; in particular, it also helps improve the capacity and robustness of PLMs.

Detailed Analysis
In this section, we perform a series of detailed analysis experiments for our approach.
Evaluation with Other BERT Variants. In general, our approach can be applied to either uncompressed or compressed PLMs. We have evaluated its performance with ALBERT; now we test it with other BERT variants, namely the original BERT, DistilBERT and MobileBERT. The latter two are knowledge distillation based methods, and the distilled models can also be represented in parameter-matrix format. We apply our approach to all three variants. Table 4 presents a comparison of the three variants before and after the application of MPOP. As we can see, our approach substantially reduces the network parameters, especially the parameters to be fine-tuned. Note that DistilBERT and MobileBERT are already highly compressed models. These results show that our approach can further improve other compressed PLMs.
Evaluation on Different Fine-Tuning Strategies.
Experiments have shown that our approach is able to largely reduce the number of parameters to be fine-tuned. Here we consider a simpler method to reduce the fine-tuning parameters, i.e., only fine-tuning the last layers of BERT. This experiment reuses the settings of BERT (12 layers) and of our approach on BERT (i.e., MPOP_B in Table 4). We fine-tune the last 1-3 layers of BERT and compare the performance with our approach MPOP_B. From Table 5, we can see that this simple strategy performs much worse than our approach, especially on the RTE task. Our approach provides a more principled way of lightweight fine-tuning: by updating the auxiliary tensors, it can better adapt to the task-specific loss and thus achieve better performance.
Evaluation on Low-Rank Approximation. As introduced in Section 4.4, MPO is a special low-rank approximation method, so we first compare its compression capacity with other low-rank approximation methods. As shown in Table 2, MPO and Tucker decomposition represent two main categories of low-rank approximation methods. We select CPD (Hitchcock, 1927) for comparison, because general Tucker decomposition (Tucker, 1966) cannot obtain results within reasonable memory. Our evaluation task is to compress the word embedding matrix of the released "bert-base-uncased" model 4 . As shown in Figure 2(a), MPO achieves a smaller reconstruction error at all compression ratios, which shows that MPO is superior to CPD. Another hyper-parameter in our MPO decomposition is the number of local tensors ($n$). We further perform the same evaluation with different numbers of local tensors ($n = 3, 5, 7$). From Figure 2(b), it can be observed that our method is relatively stable with respect to the number of local tensors. Overall, a larger $n$ requires a higher time complexity but yields more flexible decompositions. Thus, we set $n = 5$ as a trade-off between flexibility and efficiency.

Table 5: Comparison of different fine-tuning strategies on three GLUE tasks. The subscript number in BERT(·) denotes the index of the layers to be fine-tuned.

Figure 2: The x-axis denotes the compression ratio (ρ in Eq. (5)) and the y-axis denotes the reconstruction error, measured in the Frobenius norm.

Conclusion
We proposed an MPO-based PLM compression method. With MPO decomposition, we were able to effectively reorganize and aggregate the core information of a matrix in the central tensor. Inspired by this, we designed a novel fine-tuning strategy that only needs to fine-tune the parameters in the auxiliary tensors. We also developed a dimension squeezing training algorithm for optimizing low-rank approximation over stacked network architectures. Extensive experiments have demonstrated the effectiveness of our approach, especially in reducing fine-tuning parameters. We also empirically found that this style of fine-tuning is more robust and generalizes better on small training datasets. To our knowledge, this is the first time that MPO decomposition has been applied to compress PLMs. In future work, we will explore more decomposition structures for MPO.