Weight Distillation: Transferring the Knowledge in Neural Network Parameters

Knowledge distillation has been proven effective in model acceleration and compression. It transfers knowledge from a large neural network to a small one by using the large network's predictions as targets for the small network. But this approach ignores the knowledge inside the large network, e.g., its parameters. Our preliminary study, as well as the recent success of pre-training, suggests that transferring parameters is a more effective way to distill knowledge. In this paper, we propose Weight Distillation to transfer the knowledge in the parameters of a large neural network to a small neural network through a parameter generator. On the WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks, our experiments show that weight distillation learns a small network that is 1.88∼2.94× faster than the large network with competitive BLEU performance. For a fixed size of the small network, weight distillation outperforms knowledge distillation by 0.51∼1.82 BLEU points.


Introduction
Knowledge Distillation (KD) is a popular model acceleration and compression approach (Hinton et al., 2015). It assumes that a lightweight network (i.e., student network, or student for short) can learn to generalize in the same way as a large network (i.e., teacher network, or teacher for short). To this end, a simple method is to train the student network with predicted probabilities of the teacher network as its targets.
But KD has a limitation: the student network can only access the knowledge in the predictions of the teacher network. It does not consider the knowledge in the teacher network's parameters. These parameters contain billions of entries that the teacher network uses to make predictions, yet in KD the student only learns from those predictions, which cover at most thousands of categories. This results in an inferior student network, since it learns from limited training signals. Our analysis in Section 5.1 shows that KD performs better if we simply cut off parts of the parameters of the teacher to initialize the student. This implies that the knowledge in parameters is complementary to KD but missed by it. It also agrees with the recent success of pre-training (Yang et al., 2019; Devlin et al., 2019), where parameter reuse plays the main role. Based on this observation, a superior student is expected if all parameters in the teacher network could be exploited. However, this imposes a great challenge, as the student network is too small to fit the whole teacher network.
To fully utilize the teacher network, we propose Weight Distillation (WD) to transfer all the parameters of the teacher network to the student network, even if they have different numbers of weight matrices and (or) these weight matrices are of different shapes. We first use a parameter generator to predict the student network parameters from the teacher network parameters. After that, a finetuning process is performed to improve the quality of the transferred parameters. See Fig. 1 for a comparison of KD and WD.
We test the WD method in a well-tuned Transformer-based machine translation system. The experiments are run on three machine translation benchmarks, including the WMT16 English-Romanian (En-Ro), NIST12 Chinese-English (Zh-En), and WMT14 English-German (En-De) tasks. With a similar speedup, the student network trained by WD achieves BLEU improvements of 0.51∼1.82 points over KD. With similar BLEU performance, the student network trained by WD is 1.11∼1.39× faster than KD. More interestingly, WD turns out to be very effective in improving the student network when its model size is close to the teacher network. On the WMT14 En-De test data, our WD-based system achieves a strong result (a BLEU score of 30.77) while being 1.88× faster than the big teacher network.

Transformer
In this work, we choose the Transformer (Vaswani et al., 2017) for study because it is one of the state-of-the-art neural models in natural language processing. The Transformer is a Seq2Seq model consisting of an encoder and a decoder. The encoder maps an input sequence to a sequence of continuous representations, and the decoder maps these representations to an output sequence. Both the encoder and the decoder are composed of an embedding layer and multiple hidden layers. The decoder has an additional output layer at the end.
The hidden layer in the encoder consists of a self-attention sub-layer and a feed-forward network (FFN) sub-layer. The decoder has an additional encoder-decoder attention sub-layer between the self-attention and the FFN sub-layers. For more details, we refer the reader to (Vaswani et al., 2017).
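As a concrete reference, the FFN sub-layer described above can be sketched in a few lines of NumPy (a minimal sketch; the function name and random weights are ours, and a real implementation adds residual connections, layer normalization, and dropout):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward sub-layer: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Transformer-base shapes: model width 512, FFN hidden size 2048 (4x width)
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 512))                 # 10 positions
W1, b1 = rng.standard_normal((512, 2048)) * 0.02, np.zeros(2048)
W2, b2 = rng.standard_normal((2048, 512)) * 0.02, np.zeros(512)
y = ffn(x, W1, b1, W2, b2)
assert y.shape == (10, 512)                        # width is preserved
```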

Knowledge Distillation
KD encourages the student network to produce outputs close to the outputs of the teacher network. KD achieves this by minimizing

L_KD = L(y_T, y_S)    (1)

where L is the cross-entropy loss, y_T is the teacher prediction, T is the set of teacher parameters, y_S is the student prediction, and S is the set of student parameters. In practice, Eq. 1 serves as a regularization term. A more effective KD variant for Seq2Seq models was proposed by Kim and Rush (2016): they replace the predicted distributions y_T with sequences generated by the teacher network.
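Concretely, the KD term and its use as a regularizer can be sketched in NumPy (a minimal sketch; the function names and the interpolation weight `alpha` are ours):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits):
    """Eq. 1: cross-entropy L(y_T, y_S), using the teacher's predicted
    distribution y_T as a soft target for the student."""
    y_t = softmax(teacher_logits)
    log_y_s = np.log(softmax(student_logits))
    return float(-(y_t * log_y_s).sum(axis=-1).mean())

def regularized_loss(teacher_logits, student_logits, true_ids, alpha=0.5):
    """KD used as a regularizer: mix the KD term with the standard
    cross-entropy against the ground-truth token ids."""
    log_y_s = np.log(softmax(student_logits))
    ce = float(-log_y_s[np.arange(len(true_ids)), true_ids].mean())
    return alpha * kd_loss(teacher_logits, student_logits) + (1 - alpha) * ce
```

By Gibbs' inequality, the KD term is minimized (down to the teacher's entropy) exactly when the student's distribution matches the teacher's.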

The Parameter Generator
The proposed parameter generator transforms the teacher parameters T to the student parameters S. It is applied to the encoder and decoder separately.
The process is simple: it first groups weight matrices in the teacher network into different subsets, and then each subset is used to generate a weight matrix in the student network. Though using all teacher weights to predict student weights is possible, its efficiency becomes an issue. For instance, the number of parameters in a simple linear transformation would be the product of the numbers of entries in its input and output; in our case the input and output contain billions of entries (from the teacher and student weights), making it intractable to keep this simple linear transformation in memory. Grouping is an effective way to reduce it to light-weight transformation problems. Here we take the encoder as an example for the following discussion.

Figure 2: A running example of the parameter generator. We take the transformation of W_1 in Eq. 2 from the teacher to the student as an example. The teacher (stacked large cubes on the left) contains L_t = 6 weights (W_1), with each weight from a different layer. W_1 (a single cube) in the teacher has an input dimension I_t of 512 and an output dimension O_t of 2048. The student (stacked small cubes on the right) contains only L_s = 2 weights (W_1), with input dimension I_s = 256 and output dimension O_s = 1024.

Weight Grouping
The left of Fig. 2 shows an example of weight grouping for one group with two subsets. Before the discussion, we define a weight class as a weight matrix in the network formulation, and a weight instance as an instantiation of a weight class. Take the FFN as an example. Its formulation is defined as:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2    (2)

where W_1, b_1, W_2, and b_2 are learnable weight matrices. In this case, W_1 in Eq. 2 defines a weight class. All the corresponding weight matrices from FFNs in different layers of the network are then instantiations of this W_1 weight class. In this sense, a weight class determines the role of its instantiations by design, e.g., extracting features for W_1 in Eq. 2. This means that when transferring parameters, different weight classes will contribute little to each other, as they have different roles. Therefore, when predicting a student weight matrix, it is sufficient to consider only the teacher weight matrices of the same weight class, which makes the prediction efficient. Our parameter generator thus groups the teacher weight matrices by the weight class they belong to, i.e., each weight class clusters all its instantiations to form its own group. In the previous example, the W_1 weight class forms a group [T_1, T_2, ..., T_{L_t}], where each T_i is the W_1 weight instance in the i-th FFN and L_t is the number of layers in the teacher network. These weight matrices are then used to generate the W_1 weight instances in the student network.
The parameter generator further divides each group into smaller subsets of weight matrices from adjacent layers, because adjacent layers function similarly (Jawahar et al., 2019) and so do their weights. This additionally makes the later transformation more light-weight. Specifically, given a group of L_t weight matrices, the parameter generator splits it into L_s subsets, where L_s is the number of layers in the student network. For example, the i-th subset of the group of the W_1 weight class in the previous example is [T_{(i-1)L_t/L_s+1}, T_{(i-1)L_t/L_s+2}, ..., T_{iL_t/L_s}]. This subset is used to generate the weight matrix S_i, which corresponds to the W_1 weight instance in the i-th FFN of the student network.
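The splitting step can be sketched as follows (a toy illustration; the function name is ours). Each group of L_t same-class matrices, ordered by layer, is cut into L_s subsets of adjacent layers:

```python
def split_into_subsets(group, num_student_layers):
    """Split a group of L_t same-class weight matrices (ordered by layer)
    into L_s subsets of adjacent layers; subset i later generates the
    student weight S_i."""
    L_t, L_s = len(group), num_student_layers
    assert L_t % L_s == 0, "assumes L_t is a multiple of L_s"
    k = L_t // L_s  # matrices per subset
    return [group[i * k:(i + 1) * k] for i in range(L_s)]

# with L_t = 6 teacher layers and L_s = 2 student layers (as in Fig. 2):
subsets = split_into_subsets(['T1', 'T2', 'T3', 'T4', 'T5', 'T6'], 2)
```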

Weight Transformation
Given a subset of teacher weight matrices, the parameter generator then transforms them to the desired student weight matrix, as shown in the right of Fig. 2.
Let us consider the process of generating the weight matrix S ∈ R^{I_s×O_s} from the subset [T_1, T_2, ..., T_{L_t/L_s}] with each T_i ∈ R^{I_t×O_t}, where I_s and O_s are the input and output dimensions of the student weight matrix, and I_t and O_t are the input and output dimensions of the teacher weight matrices. The parameter generator first stacks all weight matrices in this subset into a tensor T̂ ∈ R^{I_t×O_t×L}, where L = L_t/L_s. Then it uses three learnable weight matrices W_I ∈ R^{I_t×I_s}, W_O ∈ R^{O_t×O_s}, and W_L ∈ R^{L×1} to map T̂ along each of its three dimensions:

T̃ = T̂ ×_1 W_I ×_2 W_O ×_3 W_L ∈ R^{I_s×O_s×1}

where ×_n denotes the mode-n tensor-matrix product. Finally, we transform T̃ (with the 1 in its shape eliminated) to produce S, as follows:

S = W ⊙ tanh(T̃) + B    (6)

where W and B are learnable weight matrices of the parameter generator and have the same shape as T̃. ⊙ denotes the Hadamard product. The tanh function provides non-linearity; W and B are used to scale and shift the tanh output to any desirable value. Note that we do not share W_I, W_O, W_L, W, and B when generating different S. If the encoder is of the same size in both the teacher and student networks, only Eq. 6 is needed to map each weight matrix from the teacher network to the student network.
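Under the shapes above, the transformation can be sketched with sequential mode products in NumPy (a sketch under our reading of the equations; the function name is ours, and toy dimensions stand in for the real ones):

```python
import numpy as np

def generate_student_weight(T_hat, W_I, W_O, W_L, W, B):
    """Map a stacked teacher subset T_hat (I_t x O_t x L) to a student
    weight S (I_s x O_s): three mode products shrink each dimension,
    then a learned scale/shift of a tanh non-linearity (Eq. 6) gives S."""
    T1 = np.einsum('iol,ia->aol', T_hat, W_I)   # mode-1: I_t -> I_s
    T2 = np.einsum('aol,ob->abl', T1, W_O)      # mode-2: O_t -> O_s
    T3 = np.einsum('abl,lc->abc', T2, W_L)      # mode-3: L   -> 1
    T_tilde = T3.squeeze(-1)                    # drop the singleton layer axis
    return W * np.tanh(T_tilde) + B             # Hadamard scale and shift

# toy dimensions standing in for (I_t, O_t) = (512, 2048), (I_s, O_s) = (256, 1024)
rng = np.random.default_rng(0)
I_t, O_t, L, I_s, O_s = 8, 16, 3, 4, 8          # L = L_t / L_s
S = generate_student_weight(
    rng.standard_normal((I_t, O_t, L)),
    rng.standard_normal((I_t, I_s)),            # W_I
    rng.standard_normal((O_t, O_s)),            # W_O
    rng.standard_normal((L, 1)),                # W_L
    np.ones((I_s, O_s)),                        # W (scale)
    np.zeros((I_s, O_s)))                       # B (shift)
```

With the scale fixed to ones and the shift to zeros, the output is simply tanh of the compressed tensor, so every entry of S lies in (-1, 1); training W and B lifts this restriction.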

Training
There are two training phases in WD. In the first phase (Phase 1), we train the parameter generator π = {W_I, W_O, W_L, W, B} to predict the student network parameters S; in the second phase (Phase 2), we fine-tune the generated student network S to obtain better results. Phase 2 is necessary because the parameter generator is simply a feed-forward network with one hidden layer and thus does not have enough capacity to produce a good enough student network at once. A more sophisticated parameter generator is an alternative, but it is expensive due to its large input and output spaces. The task of Phase 1 is to minimize the loss of the student network with parameters S predicted by the parameter generator π from the teacher parameters T. The objective of Phase 1 is:

π* = argmin_π [α · L(y_T, y_π) + (1 − α) · L(y, y_π)]    (7)

where L is the cross-entropy loss, y_T is the teacher prediction, y_π is the prediction of the student network generated by the parameter generator π, y is the ground truth, and α is a hyper-parameter that balances the two losses and is set to 0.5 by default. The first term of Eq. 7 is the KD loss as in Eq. 1, and the second term is the standard loss. The objective of Phase 2 has the same form as Eq. 7, except that it optimizes S instead of π:

S* = argmin_S [α · L(y_T, y_S) + (1 − α) · L(y, y_S)]    (8)


Experiments

Datasets
We evaluate our methods on the WMT16 English-Romanian (En-Ro), NIST12 Chinese-English (Zh-En), and WMT14 English-German (En-De) tasks. For the En-Ro task, we use the WMT16 English-Romanian dataset (610K pairs). We choose newsdev-2016 as the validation set and newstest-2016 as the test set. For the Zh-En task, we use the 1.8M-sentence Chinese-English bitext provided within NIST12 OpenMT. We choose the evaluation data of mt06 as the validation set and mt08 as the test set. For the En-De task, we use the WMT14 English-German dataset (4.5M pairs). We share the source and target vocabularies. We choose newstest-2013 as the validation set and newstest-2014 as the test set.
For all datasets, we tokenize every sentence using the script in the Moses toolkit and segment every word into subword units using Byte-Pair Encoding (Sennrich et al., 2016). The number of the BPE merge operations is set to 32K. We remove sentences with more than 250 subword units (Xiao et al., 2012). In addition, we evaluate the results using multi-bleu.perl.

Model Setup
For all machine translation tasks, we experiment with the Transformer-base (base) setting. We additionally run the Transformer-big (big) (Vaswani et al., 2017) and Transformer-deep (deep) settings on the large En-De dataset. All systems consist of a 6-layer encoder and a 6-layer decoder, except that the Transformer-deep encoder has 48 layers (depth). The embedding size (width) is set to 512 for Transformer-base/deep and 1,024 for Transformer-big. The FFN hidden size equals 4× the embedding size in all settings. We stop training when the model stops improving on the validation set. All experiments are done on 8 NVIDIA TITAN V GPUs with mixed-precision training (Micikevicius et al., 2018). At test time, the model is decoded with a beam width of 4/6/4, a length normalization weight of 1.0/1.0/0.6, and a batch size of 64 for the En-Ro/Zh-En/En-De tasks with half-precision. Note that our method can also be seen as an advanced version of Tucker decomposition (Tucker, 1966), so we also implement a baseline based on Tucker decomposition. Unfortunately, this model does not converge to a good optimum and performs extremely poorly.
For the KD baseline, we adopt Kim and Rush (2016)'s method, which has proven to be the most effective for Seq2Seq models (Kim et al., 2019). It generates pseudo data from the source side of the bilingual corpus. The choices of student networks are based on the observation that the encoder has a greater impact on performance while the decoder dominates the decoding time (Kasai et al., 2020). Therefore, we vary the depth and width of the decoder. We test two student network configurations: TINY halves the decoder width and uses a 1-layer decoder (the fastest WD student network with performance close to the teacher network); SMALL uses a 2-layer decoder whose width is the same as the teacher network's (the fastest KD student network with performance close to the teacher network).
All hyper-parameters of WD are identical to the baseline system, except that WD uses 1/4 of the warmup steps in Phase 2. For the parameter generator initialization, we use the method of Glorot and Bengio (2010). All results are the average of three identical runs with different random seeds.

Table 1 shows the results of different approaches on different student networks with Transformer-base as the teacher network. In all three tasks and for different-sized student networks, WD outperforms KD by 0.77, 1.57, and 0.66 BLEU points on En-Ro, Zh-En, and En-De on average. Our method (TINY) can obtain performance similar to the teacher network with only half of its parameters and is 2.57∼2.80× faster, while KD (SMALL) uses more parameters and has only a 1.94∼2.26× speedup in the same case. We attribute the success of WD to the parameter generator using the parameters of the teacher network to provide a good initialization for the student network: Phase 1 behaves like an initialization, and the effectiveness of a good initialization has been widely proven (Erhan et al., 2010; Mishkin and Matas, 2016). Interestingly, both KD and WD surpass the teacher network when the student network size is close to the teacher network (SMALL). This is because KD has a form similar to data augmentation (Gordon and Duh, 2019).

Table 2 shows the results of larger networks, i.e., Transformer-big/deep. The phenomenon here is similar to that in Table 1. The acceleration on Transformer-big is more obvious than on Transformer-base (2.94× vs. 2.57× for TINY and 2.10× vs. 1.95× for SMALL in WD). This is because the decoder in Transformer-big occupies a larger portion of the decoding time than in Transformer-base. But the acceleration on Transformer-deep is less obvious than on Transformer-base (2.13× vs. 2.57× for TINY and 1.88× vs. 1.95× for SMALL in WD), as a deeper encoder consumes more inference time.
Moreover, compared with such a strong Transformer-deep teacher, WD (SMALL) can still outperform it by 1.34 BLEU points with a 1.88× speedup, achieving state-of-the-art performance.

Analysis
To better understand WD, we conduct a series of experiments on the NIST12 Zh-En validation set with the Transformer-base teacher.

Initialization Study
To test whether KD misses knowledge in parameters, we initialize the student network with the teacher parameters. If the teacher and student networks have different depths, we initialize the student network with the bottom layers of the teacher network (Sanh et al., 2019). If they have different widths, we slice the teacher weight matrices to fit the student network. Table 3 shows that initializing the student networks with the teacher parameters improves KD, supporting our claim that knowledge in parameters is complementary to KD but missed by it. We also see that WD outperforms this simple initialization, which implies that using all teacher parameters helps to obtain a better student.
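The baseline initialization tested here can be sketched as follows (a toy sketch; the function name is ours): keep the bottom layers of the teacher and slice each weight matrix down to the student's widths.

```python
import numpy as np

def init_student_from_teacher(teacher_layers, depth, width_in, width_out):
    """Initialize a shallower/narrower student from the teacher: keep the
    bottom `depth` layers and slice each weight matrix to the student's
    input/output widths."""
    return [W[:width_in, :width_out] for W in teacher_layers[:depth]]

# teacher: 6 layers of 4x8 matrices; student: 2 layers of 2x4 matrices
teacher = [np.arange(32, dtype=float).reshape(4, 8) + i for i in range(6)]
student = init_student_from_teacher(teacher, depth=2, width_in=2, width_out=4)
```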

Sensitivity Analysis
The left part of Fig. 3 studies how sensitive the performance (BLEU) of different methods is to various levels of inference speedup (obtained by varying decoder depth and width). It shows that WD lies in the upper right of the figure, which means that WD produces student networks that are consistently faster and better. We also investigate how sensitive different methods are to the training hyper-parameters, i.e., the learning rate and warmup steps. Here we focus on Phase 2 of WD, as it directly impacts the final performance. The middle part of Fig. 3 shows that WD can tolerate learning rates in a wide range, as its performance does not vary much. However, a very large learning rate still negatively impacts the performance. The right part of Fig. 3 is the opposite: WD is more sensitive to the warmup steps than to the learning rate. This is because more warmup steps run the network with a high learning rate for a longer period, and a high learning rate has been shown to be harmful in the middle part of Fig. 3.

Table 4 studies which weight matrices in the teacher network are the most effective. This is achieved by training the parameter generator with only the intended weight matrices and without the KD loss term in Eq. 7. We see that using any weight matrix brings a significant improvement over the baseline. This observation shows that the weight matrices in the teacher network do contain abundant knowledge. Among these, the encoder weight matrices produce the most significant result, which agrees with previous studies claiming that the encoder is more important than the decoder (Bapna et al., 2018).

Compression Study
As the previous experiments focus on a light-weight decoder for acceleration, the compression is limited, as the encoder remains large. To examine the effectiveness of WD on model compression, we shrink the depth and width of the encoder and decoder simultaneously. As shown in Table 5, WD outperforms KD under various compression ratios (ranging from 1.00× to 3.40×). Note that decreasing the width brings more significant compression. This is because a large portion of the parameters comes from the embedding matrices and the output projection, and the sizes of these matrices are determined by the width and a fixed vocabulary size.

Training Efficiency
Fig. 4 studies the training efficiency of WD by comparing the final BLEU scores when the two training phases end at different epochs. As shown in Fig. 4, Phase 1 has little impact on Phase 2, because Phase 2 converges to optima with similar BLEU scores once Phase 1 runs for a few epochs (say, 3 epochs). If we run Phase 1 longer, Phase 2 converges faster. This suggests that Phase 1 already transfers the knowledge in the teacher parameters within the first few epochs, and the remaining epochs merely do the fine-tuning (Phase 2) job. This implies that the training of WD is efficient: we can train the parameter generator for just several epochs, then fine-tune the generated network as in KD, and finally obtain a much better result than KD.
Though we could train the parameter generator for just a few epochs as suggested, Phase 1 is still time-consuming. The reasons are twofold: 1) the parameter generator consumes a lot of memory, so we have to resort to gradient accumulation; 2) the parameter generator involves many large matrix multiplications. For the experiments in Table 1 and Table 2, it takes 0.66 days on average for WD to finish training, whereas the teacher network baseline takes 0.55 days and both the student network baseline and KD take 0.31 days.


Related Work

Knowledge Distillation
Knowledge distillation (Hinton et al., 2015; Freitag et al., 2017) is a widely used model acceleration and compression technique (Jiao et al., 2019; Sanh et al., 2019). It treats the network predictions as the knowledge learned by the teacher network, since these predicted distributions contain ranking information on the similarities among categories. It then transfers this knowledge to the student network by enforcing the student network to produce similar predictions. Follow-up work extends this idea by providing the student network with more knowledge from different sources. FitNets (Romero et al., 2015) uses not only the predictions but also the intermediate representations learned by the teacher network to supervise the student network. For the Seq2Seq model, Kim and Rush (2016) propose to use the generated sequences as sequence-level knowledge to guide the student network training. Moreover, self-knowledge distillation (Hahn and Choi, 2019) even shows that knowledge (representations) from the student network itself can improve the performance.
Our weight distillation, on the other hand, explores a new source of knowledge and a new way to leverage this knowledge. It transfers the knowledge in parameters of the teacher network to the student network via a parameter generator. Therefore, it is orthogonal to other knowledge distillation variants.

Transfer Learning
Transfer learning aims at transferring knowledge from a source domain to a target domain. Based on what knowledge is transferred to the model in the target domain, transfer learning methods can be classified into three categories (Pan and Yang, 2010): instance-based methods reuse certain parts of the data in the source domain (Jiang and Zhai, 2007; Dai et al., 2007); feature-based methods use the representation from the model learned in the source domain as the input (Peters et al., 2018; Gao et al., 2008); parameter-based methods directly fine-tune the model learned in the source domain with the target domain data (Yang et al., 2019; Devlin et al., 2019).
Perhaps the most related work is that of Platanios et al. (2018). Their method falls into the parameter-based category. They use a universal parameter generator to share knowledge among translation tasks. This parameter generator produces a translation model from a given language-specific embedding. Though we similarly employ the idea of a parameter generator, our weight distillation aims at transferring knowledge from one model to another rather than from one translation task to another. Therefore, our parameter generator takes a model instead of a language-specific embedding as its input and is used only once.

Conclusion
In this work, we propose weight distillation to transfer knowledge in the parameters of the teacher network to the student network. It generates the student network from the teacher network via a parameter generator. Our experiments on three machine translation tasks show that weight distillation consistently outperforms knowledge distillation by producing a faster and better student network.
For En-Ro, the training set consists of 0.6M bilingual sentence pairs. The validation set newsdev-2016 contains 1,999 pairs and the test set newstest-2016 contains 1,999 pairs. For Zh-En, the training set consists of 1.8M bilingual sentence pairs. The validation set mt06 contains 1,664 pairs and the test set mt08 contains 1,357 pairs. For En-De, the training set consists of 4.5M bilingual sentence pairs. The validation set newstest-2013 contains 3,000 pairs and the test set newstest-2014 contains 3,003 pairs.

Runtime. To compare approaches, we report the number of updates and the runtime for each. For the baseline models (i.e., Teacher, TINY, and SMALL) and KD, we record their runtime in the Phase 1 entry because they only need to be trained once.
One can observe in Table 7 that Phase 2 of WD generally consumes similar or less time, with fewer updates, than the other approaches. This is because the model is already close to the optimum before the fine-tuning (Phase 2). Table 7 also shows that the number of updates in Phase 1 of WD is much smaller than for other approaches, yet its training time is much longer. This phenomenon is more obvious in Transformer-deep models, because one step in Phase 1 of WD is 2.11× slower than one step in Phase 2 of WD.

Decoder. We also investigate how WD's performance (on the validation set) and speed change given different decoder depths and widths. We use the speed of WD to compute the speedup for different decoder depths and widths. Although the actual speedup of KD will not be exactly the same as that of WD due to their different decoding results, they are close.
As shown in Table 8, WD is robust to different-sized decoders, with both BLEU and speed significantly outperforming KD. WD consistently outperforms KD by about 1 BLEU point under various decoder depths and widths. Interestingly, we find that pruning layers degrades the performance more than shrinking the width, but it provides a higher speedup. Taking the student network with depth 2 and width 512 as an example, if we shrink the depth from 2 to 1, there is a decrease of 1.21 BLEU points in WD but a 1.12× speedup. When we shrink the width from 512 to 256, it leads to a more moderate decrease of 0.59 BLEU points yet only a 1.06× speedup. This might be because layers are computed sequentially, while wider matrices enjoy the parallel computation acceleration provided by modern GPUs.

Loss. In Table 7, we observe that WD generates student networks that are superior to KD's. We believe that this is because WD converges to a better optimum. To examine this hypothesis, we study its loss in Fig. 5. As can be seen, WD does obtain much lower training and validation losses than KD. We also see that Phase 1 already outperforms KD at the end. Given that Phase 1 does the initialization job for Phase 2 and that Phase 2 is exactly KD, the way WD works can be treated as providing a good start.