Numerical Optimizations for Weighted Low-rank Estimation on Language Models

Singular value decomposition (SVD) is one of the most popular compression methods that approximate a target matrix with smaller matrices. However, standard SVD treats all parameters within the matrix with equal importance, which is a simple but unrealistic assumption. The parameters of a trained neural network model may affect the task performance unevenly, which suggests non-equal importance among the parameters. Compared to SVD, a decomposition method aware of parameter importance is the more practical choice in real cases. Unlike standard SVD, weighted value decomposition is a non-convex optimization problem that lacks a closed-form solution. We systematically investigated multiple optimization strategies to tackle the problem and examined our method by compressing Transformer-based language models. Further, we designed a metric to predict when SVD may introduce a significant performance drop, for which our method can be a rescue strategy. Extensive evaluations demonstrate that our method performs better than current SOTA methods in compressing Transformer-based language models.


Introduction
Transformer-based language models such as BERT (Devlin et al., 2018) have obtained significant success in a variety of Natural Language Processing tasks, such as language modeling (Radford et al., 2018), text classification (Wang et al., 2018), question answering (Rajpurkar et al., 2016), and summarization (Liu, 2019). Despite their success, these models usually contain millions or even billions of parameters, pre-trained on large corpora. However, the downstream tasks may only focus on a specific scenario, such that only a small fraction of the parameters in the big Transformer model will contribute to the performance of the target task. Also, the massive size of Transformer models prohibits their deployment to resource-constrained devices. Therefore, compression of Transformer-based language models has attracted extensive interest.
Low-rank factorization (Golub and Reinsch, 1971; Noach and Goldberg, 2020) aims to approximate each parameter matrix in the trained model by two smaller matrices. This line of compression strategy naturally inherits the knowledge of the big trained model without expensive generic retraining, and is orthogonal to other compression approaches such as knowledge distillation (Sun et al., 2019; Sanh et al., 2019; Jiao et al., 2019) or quantization (Shen et al., 2020; Zhao et al., 2021).
However, applying standard SVD to approximate the learned weights often results in a significant task performance drop. Previous work shows that this phenomenon may be caused by a strong assumption held by standard SVD: that the parameters in the matrix are equally crucial to the performance (Hsu et al., 2021). Also, it has been observed that different parameters in Transformer models have different impacts on the overall task performance (Shen et al., 2020).
Following FWSVD (Hsu et al., 2021), we utilize Fisher information (Pascanu and Bengio, 2014) to weigh the importance of parameters, so that the objective of matrix factorization jointly considers the matrix reconstruction error and the target task performance. In standard SVD, all local minima are global, ensuring a closed-form globally optimal solution (Srebro and Jaakkola, 2003). This property no longer holds for our new objective weighted by Fisher information. Without a closed-form solution, we resort to numerical optimization methods to minimize the weighted objective. As our method can provide a more accurate solution than FWSVD (Hsu et al., 2021), we name our proposed method TFWSVD (True Fisher Weighted SVD). Our results reveal that the hybrid optimizer we call Adam_SGD best fits our problem, with its switching point estimated by a row-based analytic solution. We also investigated the scenarios where SVD fails, guided by a metric we introduce to measure the variance of parameter importance, with the example of analyzing the matrices within Transformer blocks.
In summary, this work makes the following contributions: (1) we provide several optimization methods to search for the best numerical solution to low-rank estimation weighted by Fisher information; (2) we perform extensive evaluations on various language tasks, showing that our TFWSVD achieves better performance than SOTA compression methods and can further compress already compact models; (3) through the analysis of factorizing sub-structures inside Transformer blocks, we provide guidance on when SVD may fail but TFWSVD can retain the performance.

Model Compression with SVD
Singular value decomposition (SVD) decomposes a matrix W ∈ R^{N×M} into three matrices: W = U S V^T, where U ∈ R^{N×l}, V ∈ R^{M×l}, and l is the rank of matrix W. S is a diagonal matrix of non-zero singular values diag(σ_1, ..., σ_l), where σ_1 ≥ σ_2 ≥ ... ≥ σ_l. U_r, S_r, and V_r represent the truncated matrices with rank r, which approximate the original matrix with a smaller total number of parameters.
The computation of a linear layer in neural networks can be rewritten as below with input data X ∈ R^{1×N}, weight matrix W ∈ R^{N×M}, and bias b ∈ R^{1×M}: Y = XW + b ≈ (X U_r S_r) V_r^T + b. The typical implementation of factorization is to replace the large W with two smaller linear layers: 1) the weight matrix of the first layer is U_r S_r, which has Nr parameters and no bias; 2) the weight matrix of the second layer is V_r^T, with Mr parameters plus the bias. The truncation happens when r is less than l. Since the total number of parameters for approximating W is Nr + Mr, the reduction in the number of parameters is NM − (Nr + Mr).
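To make the replacement concrete, here is a minimal NumPy sketch of factorizing a weight matrix into the two smaller layers described above (the matrix sizes and the function name are our own illustrative choices, not the paper's implementation):

```python
import numpy as np

def factorize_linear(W, r):
    """Approximate W (N x M) by two smaller matrices via truncated SVD.

    Returns W1 = U_r S_r (N x r) and W2 = V_r^T (r x M), so the layer
    y = x @ W + b can be replaced by y = (x @ W1) @ W2 + b.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = U[:, :r] * s[:r]   # N x r: first layer's weights, no bias
    W2 = Vt[:r, :]          # r x M: second layer's weights, keeps the bias
    return W1, W2

# Parameter counts: N*M before vs. (N + M)*r after truncation.
N, M, r = 768, 3072, 64
rng = np.random.default_rng(0)
W = rng.standard_normal((N, M))
W1, W2 = factorize_linear(W, r)
saved = N * M - (N * r + M * r)  # reduction in the number of parameters
```

With r well below l = min(N, M), the two layers store (N + M)r parameters instead of NM; in this toy setting the reduction exceeds 2.1M parameters.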

Fisher information
A classical way to measure the importance of parameters is through the observed information, i.e., Fisher information. It measures the amount of information that an observable dataset D carries about a model parameter w. The accurate values of Fisher information are generally intractable, since the computation requires marginalizing over the data D. In practice, the empirical Fisher information is estimated as follows:

Î_w = Σ_{d_i ∈ D} (∂L(d_i)/∂w)².   (3)

Given a target task objective L (e.g., cross-entropy for a classification task), the estimated information Î_w accumulates the squared gradients over the training data d_i ∈ D. The parameters that cause a large absolute gradient of the task objective will have a large value in Î_w, and are considered important to the target task.
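A sketch of this estimator in NumPy follows; the toy linear model and squared-error loss are our own assumptions for illustration, not the paper's task objective:

```python
import numpy as np

def empirical_fisher(grad_fn, data):
    """Empirical Fisher information: accumulate the squared
    per-example gradients of the task loss over the dataset."""
    acc = None
    for d in data:
        g = grad_fn(d)  # gradient of L(d_i) w.r.t. the parameters w
        acc = g ** 2 if acc is None else acc + g ** 2
    return acc

# Toy setting (an assumption): linear model with squared-error loss
# L(d) = 0.5 * (w.x - y)^2, whose gradient w.r.t. w is (w.x - y) * x.
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])
data = [(rng.standard_normal(2), rng.standard_normal()) for _ in range(100)]
grad_fn = lambda d: (w @ d[0] - d[1]) * d[0]
I_hat = empirical_fisher(grad_fn, data)  # one importance value per parameter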

Related works
Reports of applying SVD to the Transformer layers are scarce. Several previous works applied SVD to compress the word embedding layer (Chen et al., 2018a; Acharya et al., 2019). Although Noach and Goldberg (2020) combined knowledge distillation to fine-tune the resulting compressed model, they did not address the issue of poor performance when fine-tuning is not applied. Experiments show that our proposed method can retain most of the performance, providing a much better initialization for the fine-tuning. The use of Fisher information has appeared in many problem settings that also need to estimate the importance of model parameters, for example, to avoid catastrophic forgetting in continual learning (Kirkpatrick et al., 2017; Hua et al., 2021) or in model pruning (Liu et al., 2021; Molchanov et al., 2019b). However, none of these works has explored its potential in assisting low-rank approximation for model compression.
Most previous work seeking numerical solutions for low-rank approximation is designed for unweighted cases, with applications such as predicting missing values in recommendation systems (Yu et al., 2014; Zhou et al., 2008). Also, a few attempts have been made to solve the weighted low-rank approximation problem through an EM-based algorithm (Srebro and Jaakkola, 2003) or alternating least squares (He et al., 2016).
The closest previous work to this paper is FWSVD (Hsu et al., 2021), which points out that the "even importance" assumption held by SVD may cause a performance drop. FWSVD also utilizes Fisher information to weigh the importance of parameters. However, during the decomposition process, FWSVD assumes that parameters within each weight matrix row share the same importance value, which is still a strong assumption. Experimental results show that our TFWSVD can find more accurate solutions than FWSVD, as each parameter is associated with its own importance in TFWSVD.

Low-rank factorization objective weighted by Fisher information
The objective of generic low-rank approximation is to minimize the Frobenius norm ||W − AB||², i.e., the sum of squared differences between a reconstructed matrix AB and the target matrix W. As mentioned above, singular value decomposition (SVD) can solve this problem efficiently by setting A = US and B = V^T. As the importance of each element w_ij in W can be calculated through its Fisher information, we would like to find the reconstructed matrix AB that minimizes the weighted Frobenius distance J(A, B) as follows (⊗ denotes element-wise multiplication):

J(A, B) = Σ_{i,j} Î_{w_ij} (w_ij − (AB)_ij)² = || √(Î_W) ⊗ (W − AB) ||².   (4)

To prevent overfitting, L2 regularization terms controlled by a parameter λ can be added to the objective, so that Equation (4) can be rewritten as:

J(A, B) = || √(Î_W) ⊗ (W − AB) ||² + λ (||A||² + ||B||²).   (5)
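As a sanity reference, the weighted objective can be written directly in NumPy (a minimal sketch; `weighted_objective` is our own name):

```python
import numpy as np

def weighted_objective(W, I_hat, A, B, lam=0.0):
    """J(A, B): squared reconstruction errors scaled element-wise by the
    Fisher information weights, plus optional L2 regularization."""
    R = W - A @ B                # element-wise residual
    J = np.sum(I_hat * R ** 2)   # Fisher-weighted squared error
    return J + lam * (np.sum(A ** 2) + np.sum(B ** 2))
```

When every weight in Î_W equals one and λ = 0, J(A, B) reduces to the plain squared Frobenius distance that standard SVD minimizes.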

Optimization methods
SVD has an analytic solution, since all of its local minima are global. However, this no longer holds once weights are introduced. Without a closed-form solution, we discuss several numerical optimization methods to minimize J(A, B).

Alternating Least Squares
Although the optimization problems in (4) and (5) are non-convex, they can be converted to quadratic problems with globally optimal solutions if A or B is fixed. Therefore, Alternating Least Squares (ALS) is suitable for solving such problems (Hastie et al., 2015). ALS alternately optimizes A or B while keeping the other fixed, decreasing J(A, B) until convergence. When the other matrix is fixed, minimizing J(A, B) with respect to A or B is equivalent to solving independent weighted ridge regressions over the rows of A (or the columns of B), which leads to the closed-form solutions:

a_i = (B diag(Î_W[i,:]) B^T + λΣ)^{−1} B diag(Î_W[i,:]) W[i,:]^T,   (6)

b_j = (A^T diag(Î_W[:,j]) A + λΣ)^{−1} A^T diag(Î_W[:,j]) W[:,j],   (7)

where Σ is the identity matrix, a_i is the i-th row of A, b_j is the j-th column of B, and Î_W[i,:] and Î_W[:,j] are the Fisher information vectors of the i-th row and j-th column of the original matrix W, respectively.
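One half-step of this procedure can be sketched as follows (our own NumPy sketch of the row-wise closed form above; `als_update_A` is a hypothetical helper, and the symmetric update of B is analogous):

```python
import numpy as np

def als_update_A(W, I_hat, B, lam=1e-3):
    """ALS half-step: with B (r x M) fixed, each row a_i of A solves an
    independent Fisher-weighted ridge regression over the columns of B."""
    N, r = W.shape[0], B.shape[0]
    A = np.empty((N, r))
    for i in range(N):
        D = I_hat[i, :]                            # weights of row i
        G = (B * D) @ B.T + lam * np.eye(r)        # r x r normal matrix
        A[i] = np.linalg.solve(G, (B * D) @ W[i])  # closed-form row update
    return A
```

Each half-step minimizes the weighted objective exactly over one factor, so alternating the two updates monotonically decreases J(A, B).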

Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) has also been shown to be effective for matrix factorization problems (Koren et al., 2009). Specifically, in our problem, each update of SGD can be represented as:

a_i ← a_i + η (e_{i,j} b_j − λ a_i),  b_j ← b_j + η (e_{i,j} a_i − λ b_j),   (8)

where η is the learning rate, and e_{i,j} = Î_{w_{i,j}} (w_{i,j} − a_i^T b_j). More generally, the iterations of SGD can be described as:

h^{(k)} = h^{(k−1)} − η ∇J(h^{(k−1)}),   (9)

where h^{(k)} denotes the k-th iterate, which can be substituted by a_i or b_j.
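A single entry-wise update can be sketched as below (our own NumPy sketch of the update rule above; `sgd_step` is a hypothetical helper name):

```python
import numpy as np

def sgd_step(W, I_hat, A, B, i, j, eta=0.01, lam=0.0):
    """One SGD update on entry (i, j): the Fisher-weighted error
    e_ij = I_ij * (w_ij - a_i . b_j) drives the updates of a_i and b_j."""
    e = I_hat[i, j] * (W[i, j] - A[i] @ B[:, j])
    a_i, b_j = A[i].copy(), B[:, j].copy()   # read both before writing
    A[i]    += eta * (e * b_j - lam * a_i)
    B[:, j] += eta * (e * a_i - lam * b_j)
    return A, B
```

Sweeping over all (i, j) entries in random order gives one SGD epoch over the matrix.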

Adaptive Moment Estimation
SGD scales the gradient uniformly in all directions, which can make the training process inefficient and sensitive to the learning rate. Several adaptive methods have been proposed to overcome this shortcoming, among which Adaptive Moment Estimation (Adam) is one of the most widely used approaches (Kingma and Ba, 2015). Following the form of the SGD updates shown in (9), the Adam update iterations can be written as:

h^{(k)} = h^{(k−1)} − η m^{(k−1)} / (√(v^{(k−1)}) + ε),   (10)

where h^{(k)} and η are the same as in Equation (9), and m^{(k−1)} and v^{(k−1)} are the exponential moving averages of the gradient and the squared gradient:

m^{(k−1)} = β₁ m^{(k−2)} + (1 − β₁) ∇J(h^{(k−1)}),  v^{(k−1)} = β₂ v^{(k−2)} + (1 − β₂) (∇J(h^{(k−1)}))².   (11)

Although Adam requires minimal tuning and enjoys fast initial progress, it is not without faults. Recent work has shown that the solutions found by Adam can generalize much worse than those found by SGD (Akiba et al., 2017; Ida and Fujiwara, 2020).

Adam Switching to SGD
Previous studies show that switching from Adam to SGD may improve performance; however, the switching point is crucial for the overall performance and is usually task-dependent (Ida and Fujiwara, 2020). Here we propose a simple method to calculate the switching point for our Fisher-information-weighted matrix factorization problem.
Although weighted SVD does not have a closed-form solution when each element has its own weight, the optimization problem (5) does have a closed form in the case where elements within the same row share the same weight (Hsu et al., 2021). Therefore, we can calculate an approximate solution to the optimization problem (5) based on row-wise Fisher information, which serves as the "threshold" for our switching point from Adam to SGD (Hsu et al., 2021). If we define the importance of row i to be the row sum Î_{W_i} = Σ_j Î_{W_ij}, and the diagonal matrix Î = diag(√Î_{W_1}, ..., √Î_{W_N}), then the optimization problem of Equation (4) can be written as:

Ĵ(A, B) = || Î (W − AB) ||².   (12)

Optimization problem (12) can be solved by standard SVD on ÎW. If we denote svd(ÎW) = (U*, S*, V*), then the solution of Equation (12) is A = Î^{−1} U* S* and B = V*^T. The value of Ĵ(A, B) serves as our switching point from Adam to SGD: the training process is optimized by Adam while the current loss is larger than Ĵ(A, B), and SGD takes over once the loss falls below Ĵ(A, B).
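The switching-point computation can be sketched as follows (a NumPy sketch under our own conventions: the row importances enter through their square roots so that the plain Frobenius objective on the scaled matrix matches the row-weighted objective, and the threshold is the element-wise weighted objective evaluated at the row-wise solution):

```python
import numpy as np

def rowwise_switch_threshold(W, I_hat, r):
    """Row-wise analytic solution and the Adam->SGD switching threshold.

    Rows share one importance (their Fisher row sum), which reduces the
    weighted problem to a standard truncated SVD on a scaled matrix."""
    row_w = I_hat.sum(axis=1)    # per-row importance
    s = np.sqrt(row_w)[:, None]  # sqrt so ||s*(W - AB)||^2 is row-weighted
    U, S, Vt = np.linalg.svd(s * W, full_matrices=False)
    A = (U[:, :r] * S[:r]) / s   # undo the row scaling
    B = Vt[:r, :]
    # Threshold: the element-wise weighted objective at this solution.
    J_hat = np.sum(I_hat * (W - A @ B) ** 2)
    return A, B, J_hat
```

During training, Adam runs while the current loss exceeds this threshold, and SGD takes over once the loss falls below it.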
Besides the hard threshold calculated via (12), we also set a soft threshold that restricts our unweighted reconstruction error to the same order of magnitude as that of SVD. Experiments in Section 4.5 show that our switching point balances the speed and convergence of the optimization process well.

Metric measuring when SVD may fail
Besides an accurate solution to J(A, B), whether TFWSVD can obtain a performance gain also depends on the properties of the target matrix W itself. TFWSVD is designed to capture the differing importance of parameters; however, if the parameters in W contribute equally to the model performance, then standard SVD should be good enough. Driven by these factors, we are interested in this question: is there a method that can "foresee" when SVD will fail and TFWSVD can help retain performance?
Given a target matrix W, we propose a simple but effective metric called Fisher information variance φ(W), calculated as the variance of the L_p normalization of its corresponding Fisher information Î_W:

φ(W) = Var( Î_W / ||Î_W||_p ).   (13)

As shown in Section 4.6, this metric can qualitatively measure whether the target matrix is too challenging for SVD and therefore needs help from TFWSVD.
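A minimal sketch of the metric follows; the choice of norm order p = 2 is our assumption, as is the function name:

```python
import numpy as np

def fisher_variance(I_hat, p=2):
    """phi(W): variance of the L_p-normalized Fisher information of the
    matrix. A larger value means more uneven parameter importance, i.e.
    a matrix where plain SVD is more likely to cause a large drop."""
    v = I_hat.ravel().astype(float)
    v = v / np.linalg.norm(v, ord=p)   # L_p normalization
    return float(np.var(v))
```

For example, a matrix whose parameters are all equally important has φ(W) = 0, while one dominated by a few important parameters has a much larger φ(W).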

Language tasks and datasets
We evaluate our proposed methods and baselines on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and a token classification task. More details about the datasets and tasks can be found in Appendix A.

Implementation details and baselines
For generic compact methods (MiniLM, DistilBERT, and TinyBERT), we use the models provided by the original authors as the initialization, then directly fine-tune them on the training data of the target task. The fine-tuning is optimized by Adam with a learning rate of 2 × 10^−5 and a batch size of 32 on one GPU.
Besides FWSVD (Hsu et al., 2021) and our proposed TFWSVD, we also provide a baseline using first-order Taylor expansion for value decomposition (TVD). The details of TVD can be found in Appendix B.
For low-rank factorization methods (TFWSVD, FWSVD, TVD, and SVD), we use the pre-trained 12-layer BERT model (Devlin et al., 2018) as the starting point. The large BERT model is then fine-tuned on the task-specific data. Next, we apply the low-rank factorization, followed by another round of fine-tuning. We report results with and without fine-tuning to reveal the native performance of low-rank factorization.
To make a fair comparison, only the linear layers in the Transformer blocks are compressed in this work. The non-Transformer modules, such as the token embedding, are not compressed. Previous works (Chen et al., 2018a) have shown significant success in applying low-rank factorization to compress the embedding layer, which occupies 23.4M (21.3%) parameters in the standard BERT model. Thus, the results reported in this paper could be further improved by applying our method to non-Transformer modules.
All of our implementations are built on top of the HuggingFace Transformers library (Wolf et al., 2020). Settings not mentioned here use the default configuration of the library. We report results directly on the dev set for all datasets, as hyper-parameter search is not involved in our experimental evaluations.

Performance comparisons with SOTA
Table 1 reports the results on the GLUE tasks and one NER task, CoNLL. Our TFWSVD with 66.5M parameters obtains a G-Avg score of 83.1 and an A-Avg score of 84.4, which are better than the scores of SOTA models (MiniLMv2, TinyBERT6, DistilBERT) requiring generic re-training. TFWSVD consistently yields good results on all the tasks, while the other generic re-training methods display obvious performance variance among different tasks. For example, TinyBERT6 is good at the STSB task but poor at CoLA; conversely, DistilBERT has strong performance on CoLA but is weak at STSB.
In the comparisons among low-rank factorization methods (TFWSVD, FWSVD, TVD, and SVD), our TFWSVD beats the other methods with clearly better performance both with and without fine-tuning. One interesting phenomenon is that TVD yields better results than SVD without fine-tuning; however, after fine-tuning, its advantage disappears, and SVD achieves better average scores (G-Avg and A-Avg). This is not surprising. Similar to our proposed TFWSVD, TVD is also a loss-aware method, so it should outperform the loss-unaware SVD. But this gap can be narrowed or even eliminated by fine-tuning, since SVD can also "see" the loss in that case. Therefore, among loss-aware methods, the weighting metric itself plays an important role in keeping the performance advantage. Also, TFWSVD obtains better performance than FWSVD, which indicates that FWSVD's assumption that parameters in the same row share the same importance is too aggressive.

Under high compression rates
In this part, we compare the low-rank methods under high compression rates. Because TVD did not show an apparent advantage over SVD, here we mainly focus on comparing our proposed TFWSVD, FWSVD, and standard SVD.
As can be seen from Table 2, TFWSVD always enjoys obvious advantages over the other two methods. Also, the performance gap between TFWSVD and FWSVD grows as the compression rate goes higher. In fact, under the extremely compact setting of 37.2M parameters, FWSVD shows worse performance than SVD. This phenomenon further suggests that the row-based importance assumption held by FWSVD may hurt performance, while the advantage of TFWSVD persists and becomes more prominent at the high compression rates of 49.9M and 37.2M. Especially in the scenario without fine-tuning, which best reveals the pure performance of low-rank factorization, TFWSVD achieves performance scores almost double those of FWSVD and SVD.

Optimization methods
In this part, we compare the optimization procedures mentioned in Section 3.2 to identify the best optimizer for our approximation problem.

ALS and SGD
In order to update the latent vectors, ALS needs O(r²) time to form each r × r matrix, with an additional O(r³) time to solve the least-squares problem. Therefore, to reconstruct the target matrix W ∈ R^{N×M} with rank r, the time complexity of one ALS iteration is O((M + N)r³ + MNr²). It has been pointed out that ALS can be sped up by updating each row of A or B independently in parallel (Zhou et al., 2008). For SGD, the time complexity per iteration is only O(MNr). Compared to ALS, SGD appears faster in terms of per-iteration time complexity; however, it typically requires more iterations than ALS to achieve relatively good performance (Yu et al., 2014). As shown in Table 3, in order to obtain performance close to that of Adam/Adam_SGD, ALS and SGD need 50∼60 times more steps, which makes them impractical for real-world Transformer compression. Therefore, in the rest of this part, we focus on comparing the performance of Adam and Adam_SGD.

Adam and Adam_SGD
The goal of the hybrid optimizer Adam_SGD is to combine the benefits of Adam (fast initial progress and minimal effort in hyperparameter tuning) and SGD (good convergence and generalization).
As seen from Figure 2, Adam and Adam_SGD share the same trajectory in the initial steps. After the switching point (around step 22400 in Figure 2), Adam_SGD converges to a low-error solution (1.28E-06, as shown in Table 3), which is much smaller than the row-based analytic solution (9.87E-06). In contrast, Adam fluctuates and ends with a much larger error of 1.06E-05. These phenomena demonstrate the effectiveness of Adam_SGD in solving the weighted Frobenius distance optimization problems (4) and (5). Moreover, the reconstruction errors of the final solutions obtained by Adam_SGD are 5∼10 times smaller than those of the row-wise approximations.

Fisher information variance
What is the secret behind TFWSVD's good performance on Transformer-based model compression?
In this part, we utilize the Fisher information variance φ(W) introduced in Section 3.3 to reveal this secret by analyzing the sub-structures inside the Transformer blocks.
According to the implementation of the HuggingFace Transformers library (Wolf et al., 2020), there are five kinds of linear layers within the Transformer block, which can be split into two groups by their dimensions: the Query, Key, Value, and Multi-head Attention layers are matrices with the dimension of 768 × 768, while the two feed-forward layers, called Intermediate and Output, are 768 × 3072 in dimension. Figure 1 plots the performance changes while varying the rank ratio for the matrices with the dimension of 768 × 768, when only one type of sub-structure is decomposed. More results are plotted in Figure 3 in the Appendix.
Compared to the overall performance comparison in Section 4.3, the purpose of this experiment is to evaluate the performance of SVD and TFWSVD on the finer-level sub-structures within Transformer blocks. Taking Figure 1a for example, the yellow line denoting "Value" means that only the "Value" sub-structures are decomposed by SVD, while the other types of sub-structures are kept the same as in the original model. We calculate the Fisher information variance φ(W) via Equation (13), and mark the values beside the corresponding sub-structures. Several observations can be made from Figure 1.
Different matrices have different sensitivity to SVD. As shown in Figure 1a, the Attention_out layer is relatively easy to compress: even with standard SVD, it can still achieve good performance at a rank ratio as low as 0.1. Compressing the Intermediate matrix is rather difficult, as its performance drops to 17% at a rank ratio of 0.1.

The Fisher information variance φ(W) can "foresee" the performance of SVD. In Figure 1a, decomposing sub-structures with larger φ(W) via SVD always causes a more serious performance drop. In particular, the performance changes of the Query and Key sub-structures are almost identical, and their φ(W) values are extremely close (1.16E-03 for Query and 1.17E-03 for Key). This implies that the metric φ(W) can well reflect the variance of parameter importance within the matrix, and therefore can be a good performance indicator for SVD.

TFWSVD always helps improve the performance. Figure 1b shows that applying TFWSVD brings significant performance gains to all the sub-structures. Especially for the challenging Intermediate matrix (Figure 3b in the Appendix), TFWSVD achieves an excellent performance of 60% at the low rank ratio of 0.1, which is a 200% improvement over the corresponding SVD performance of 17%.

Compress the already compact models
The matrix factorization direction is thought to be orthogonal to other compression methods such as knowledge distillation. But in practice, performance drops are often observed when combining the different lines of compression technologies. Table 4 reports the results of applying TFWSVD, FWSVD, and SVD to further compress the lightweight models. In general, TFWSVD can reduce the parameters of the compact models by another 30%, with even improved performance. In fact, performance gains from applying TFWSVD are observed on all compact models in Table 4, while both SVD and FWSVD cause performance drops, more or less, when combined with those compact models. These results indicate that SVD and FWSVD may not integrate well with other compression technologies due to the strong assumptions they hold, while our TFWSVD can best explore the potential of combining matrix factorization with other lines of compression methods.

Discussion
The incorrect predictions from the trained model bring larger gradients than the correctly labeled examples, which means these incorrect predictions may be better choices for computing Fisher information. This differs from our intuition, but it is not surprising, since all these examples reflect the features of the trained parameters. In fact, the mislabeled examples may better "describe" the features of the trained model (for example, these examples lie around the decision boundary). Also, we can use the incorrect predictions alone to estimate the Fisher information and further reduce computation time. More details can be found in Appendix C.

Conclusion
Unlike SVD, there is no closed-form solution for the weighted low-rank estimation problem, which therefore has to be approximated via numerical optimization methods. We managed to obtain practical solutions through our hybrid Adam_SGD optimizer with a specially designed switching point. Our TFWSVD consistently works better than other low-rank factorization methods (FWSVD, TVD, and SVD). Compared to SOTA methods that require expensive generic re-training, our TFWSVD shows more stable performance across various tasks. Also, TFWSVD can efficiently compress the already compact models even further. We also investigate the properties of the target matrix to identify where SVD may fail and TFWSVD can be the rescuer. We believe our TFWSVD could be the best alternative to SVD for language model compression.

A Details of tasks and datasets
We include two single-sentence tasks: CoLA (Warstadt et al., 2018), measured in Matthew's correlation, and SST-2 (Socher et al., 2013), measured in classification accuracy; three sentence similarity tasks: MRPC (Dolan et al., 2005), measured in F-1 score, STS-B (Cer et al., 2017), measured in Pearson-Spearman correlation, and QQP (Chen et al., 2018b), measured in F-1 score; and two natural language inference tasks: MNLI (Williams et al., 2018), measured in classification accuracy averaged over the matched and mismatched subsets, and QNLI (Rajpurkar et al., 2016), measured in accuracy. The token classification task we use is named entity recognition (NER) on the CoNLL-2003 dataset (Sang and De Meulder, 2003). In summary, our evaluation includes eight different natural language tasks.

B TVD
In this section, we provide the details of the baseline using first-order Taylor expansion for value decomposition (TVD). Following (Hou et al., 2020; Voita et al., 2019; Molchanov et al., 2019a), we utilize the first-order Taylor expansion as an alternative importance score for matrices:

I_w = |L_w − L_{w=0}|   (14a)
    = |w · ∂L/∂w − R_{w=0}|   (14b)
    ≈ |w · ∂L/∂w|.   (14c)

As shown in Equation (14a), the intuition behind TVD is that the importance of a parameter w can be measured by the change in the training loss when removing this parameter. If we ignore the remainder R_{w=0}, then we can simply calculate the importance via Equation (14c), i.e., the absolute product of the parameter value and its first-order gradient.

C Effect of incorrect predictions
In this section, we evaluate whether the incorrect predictions have negative impacts on the estimation of Fisher information. To achieve this goal, we report the performance of classification tasks in Table 5. In summary, the wrongly labeled examples generate larger Fisher information, but this does not mean that the Fisher information learned from the incorrectly labeled data is "wrong". Instead, the mislabeled examples are better choices than the correct predictions, producing better results with fewer computations.

D Training Time
This part discusses the training time of the different approaches mentioned in this paper. TFWSVD versus FWSVD, TVD, and SVD. First, we compare the time costs of the low-rank estimation methods SVD, TVD, FWSVD, and our proposed TFWSVD. In general, SVD is the fastest method and can be done immediately, as it has a closed-form solution. FWSVD is the second fastest method, which needs extra time for the Fisher information calculation. TFWSVD and TVD cost more time in the numerical optimization process.

2. TFWSVD versus FWSVD: TFWSVD needs additional time for factorizing the weighted matrices through numerical optimization. The time cost of factorization is decided by the number of parameters in a model, and is fixed for all its downstream tasks. For GLUE tasks trained with the BERT model, TFWSVD costs 1.5 more V100 GPU hours than FWSVD.
3. TFWSVD versus TVD: TFWSVD and TVD cost about the same time, as the two approaches are almost identical except for the weighting scheme.
TFWSVD versus generic re-trained models. Generic re-trained compact models such as DistilBERT and MiniLMv2 require a large amount of re-training time. For example, DistilBERT needs 720 V100 GPU hours to re-train a pre-trained BERT model. Compared to these methods, our TFWSVD is much faster, since TFWSVD can be applied to the directly downloaded BERT model without expensive re-training.

Figure 1 :
Figure 1: The performance of SVD and TFWSVD on the STSB task, when only factorizing a particular type of sub-structure (Key, Query, Value, Attention) in Transformer blocks. The red dashed line denotes the original performance. The numbers marked beside the lines are the metric φ(W) calculated by Equation (13). The values of φ(W) can well predict the performance of SVD: a matrix with a larger φ(W) always ends up with a larger performance drop after applying SVD.

Figure 2 :
Figure 2: Numerical experiments comparing Adam and Adam_SGD on STSB.For Adam_SGD, the switching point from Adam to SGD is around step 22400.

Figure 3 :
Figure 3: The performance of SVD and TFWSVD on the STSB task, when only factorizing a particular type of feed-forward sub-structure (Intermediate or Output) in Transformer blocks.

Table 1 :
Results on CoNLL and the GLUE benchmark. G-Avg means the average over the GLUE tasks, and A-Avg denotes the average over all tasks, including CoNLL. Our method is the best performer in terms of both average scores.

Table 2 :
Results on CoNLL and the GLUE benchmark with high compression rates. Compared to Table 1, the advantages of TFWSVD over the other two low-rank estimation methods are enlarged in the high compression rate settings.

Table 3 :
Weighted error and standard error of different methods at their final stages. The weighted error is J(A, B) in (4), and the standard error is ||W − AB||². Adam and Adam_SGD are trained for 50,000 steps, while ALS is trained for 2.5 million steps and SGD for 3 million steps.

Table 4 :
Results of further compressing the compact models. TFWSVD successfully reduces the size of the lightweight models, and achieves slightly better performance than the original compact models.
Table 5 reports the results when we use incorrectly or correctly predicted examples to estimate Fisher information. Several observations can be made. First, the final performances are close, whether using correct-only examples, incorrect-only examples, or all examples. This demonstrates that all kinds of examples can reflect the importance of parameters to some extent. Meanwhile, the performance using all examples is always the best, confirming that the empirical Fisher information is better estimated with more data. Second, using the incorrect-only examples generates bigger importance values and better performance than using correct-only predictions. Although only 1-2% of examples are incorrectly predicted, choosing these examples to estimate Fisher information produces numbers close to those generated using all examples. This is because Fisher information is calculated via the loss, and incorrect predictions produce larger losses than correct predictions. Compared to using correct-only examples, computation on incorrect-only examples may even bring better results for most tasks.
1. FWSVD versus SVD: Compared to SVD, FWSVD needs extra time for the Fisher information calculation. The time for this process is similar to one epoch of regular training. For example, the SST-2 task in this paper takes about 8 minutes to calculate the Fisher information. This process is generally fast, and it can be further reduced to around 5 seconds if we only use incorrect predictions (e.g., 1% of all examples, as mentioned in Appendix C).

Table 5 :
Performance comparison of using correctly/incorrectly labeled examples in the estimation of Fisher information. All results here are without fine-tuning. #Examples denotes the number of corresponding examples, I-AVG means the average importance score, and F1 and ACC are task-specific performance metrics.