Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, where the knowledge hides in KD is still not clear, which may hinder the development of KD. In this work, we first unravel this mystery from an empirical perspective and show that the knowledge comes from the top-1 predictions of teachers, which also helps us build a potential connection between word- and sequence-level KD. Further, we point out two inherent issues in vanilla word-level KD based on this finding. Firstly, the current objective of KD spreads its focus to whole distributions to learn the knowledge, yet lacks special treatment on the most crucial top-1 information. Secondly, the knowledge is largely covered by the golden information due to the fact that most top-1 predictions of teachers overlap with ground-truth tokens, which further restricts the potential of KD. To address these issues, we propose a new method named Top-1 Information Enhanced Knowledge Distillation (TIE-KD). Specifically, we design a hierarchical ranking loss to enforce the learning of the top-1 information from the teacher. Additionally, we develop an iterative KD procedure to infuse more additional knowledge by distilling on the data without ground-truth targets. Experiments on WMT’14 English-German, WMT’14 English-French and WMT’16 English-Romanian demonstrate that our method can respectively boost Transformer_{base} students by +1.04, +0.60 and +1.11 BLEU scores and significantly outperforms the vanilla word-level KD baseline. Besides, our method shows higher generalizability on different teacher-student capacity gaps than existing KD techniques.


Introduction
In recent years, neural machine translation (NMT) has made marvelous progress in generating high-quality translations (Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017; Liang et al., 2021b, 2022), especially with some exquisite and deep model architectures (Wei et al., 2020; Li et al., 2020; Liu et al., 2020; Wang et al., 2022). Despite their amazing performance on translation tasks, high computational and deployment costs still prevent these models from being applied in real life. On this problem, knowledge distillation (KD) (Liang et al., 2008; Hinton et al., 2015; Kim and Rush, 2016; Wu et al., 2020; Chen et al., 2020; Wang et al., 2021; Liang et al., 2021a) is regarded as a promising solution for model compression, which aims to transfer the knowledge from these strong teacher models into compact student models.
* Yufeng Chen is the corresponding author.
Generally, there are two categories of KD techniques, i.e., word-level KD (Hinton et al., 2015; Kim and Rush, 2016; Wang et al., 2021) and sequence-level KD (Kim and Rush, 2016). (1) Word-level KD is conducted on each target token, where it shrinks the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) between the predicted distributions from the student and the soft targets from the teacher. In these soft targets, the knowledge was previously deemed to come from the probability relationship between negative candidates (i.e., the correlation information) (Hinton et al., 2015; Tang et al., 2020; Jafari et al., 2021). (2) Sequence-level KD instead requires no soft target and directly encourages students to maximize the sequence probability of the final translation decoded by the teacher. Although both techniques work quite differently, they still achieve similarly superior effectiveness. Therefore, we raise two heuristic questions on KD in NMT:
• Q1: Where does the knowledge actually come from during KD in NMT?
• Q2: Is there any connection between the word- and the sequence-level KD techniques?
To answer these two questions, we conduct an empirical study that starts from word-level KD to find out where the knowledge hides in the teacher's soft targets and then explore whether the result can be expanded to sequence-level KD. As a result, we summarize several intriguing findings:
i. Compared to the correlation information, the information of the teacher's top-1 predictions (i.e., the top-1 information) actually determines the benefit of word-level KD (§3.1).
ii. The correlation information can be successfully learned by students during KD but fails to improve their final performance (§3.2).
iii. Extending the top-1 information to top-k does not lead to further improvement (§3.3).
iv. The top-1 information is important even when the teacher is under-confident in its top-1 predictions (§3.4).
v. Similar importance of the top-1 information can also be verified on sequence-level KD (§3.5).
These findings sufficiently prove that 1) the knowledge actually comes from the top-1 information of the teacher during KD in NMT, and 2) the two kinds of KD techniques can be connected from the perspective of the top-1 information.
On these grounds, we further point out that there are two inherent issues in vanilla word-level KD. Firstly, as the source of teachers' knowledge, the top-1 information receives no special treatment in the training objective of vanilla word-level KD since the KL divergence directly optimizes the entire distribution. Secondly, since most top-1 predictions of strong teachers overlap with ground-truth tokens (see the first row of Tab. 1), the additional knowledge from teachers beyond the golden information is poor and the potential of word-level KD is largely limited (see the second row of Tab. 1). To address these issues, we propose a new KD method named Top-1 Information Enhanced Knowledge Distillation (TIE-KD) for NMT. Specifically, we first design a hierarchical ranking loss that can enforce the student model to learn the top-1 information through ranking the top-1 predictions of the teacher as its own top-1 predictions. Moreover, we develop an iterative KD procedure to expose more input data without ground-truth targets for KD to exploit more knowledge from the teacher.

Datasets | En-De | En-Fr | En-Ro
Top-1 Overlap Rate | 68% | 78% | 94%
∆ from Word-level KD | +0.61 | +0.13 | +0.18

Table 1: The overlap rates between the top-1 predictions of teachers and ground-truth tokens on WMT'14 English-German (En-De), WMT'14 English-French (En-Fr) and WMT'16 English-Romanian (En-Ro) and the corresponding improvement (∆) of BLEU scores brought by word-level KD on the test set of these tasks.
We evaluate our TIE-KD method on three WMT benchmarks, i.e., WMT'14 English-German (En-De), WMT'14 English-French (En-Fr) and WMT'16 English-Romanian (En-Ro). Experimental results show that our method can boost Transformer_{base} students by +1.04, +0.60 and +1.11 BLEU scores, respectively, and significantly outperforms the vanilla word-level KD approach. Besides, we test the performance of existing KD techniques in NMT and our TIE-KD under different teacher-student capacity gaps and show the stronger generalizability of our method on various gaps.
Our contributions are summarized as follows:
• To the best of our knowledge, we are the first to explore where the knowledge hides in KD for NMT and unveil that it comes from the top-1 information of the teacher, which also helps us build a connection between word- and sequence-level KD.
• Further, we point out two issues in vanilla word-level KD and propose a novel KD method named Top-1 Information Enhanced Knowledge Distillation (TIE-KD) to address them. Experiments on three WMT benchmarks demonstrate its effectiveness and superiority.
• We investigate the effects of current KD techniques in NMT under different teacher-student capacity gaps and show the stronger generalizability of our approach to various gaps.

Neural Machine Translation
Given a source sentence with M tokens $x = \{x_1, x_2, \ldots, x_M\}$ and the corresponding target sentence with N tokens $y = \{y_1, y_2, \ldots, y_N\}$, NMT models are trained to maximize the probability of each target token conditioning on the source sentence by the cross-entropy (CE) loss:
$$\mathcal{L}_{ce} = -\sum_{j=1}^{N} \log p(y^*_j \mid y_{<j}, x; \theta), \tag{1}$$
where $y^*_j$ and $y_{<j}$ denote the ground-truth target and the target-side previous context at time step j, respectively, and $\theta$ is the model parameter.

Word-level Knowledge Distillation
Word-level KD (Kim and Rush, 2016) aims to minimize the KL divergence between the output distributions of the teacher model and the student model on each target token. Formally, given the probability distribution q(·) from the teacher model, the KL divergence-based loss is formulated as follows:
$$\mathcal{L}_{kl} = \sum_{j=1}^{N} D_{KL}\big(q(y_j \mid y_{<j}, x; \theta_t) \,\|\, p(y_j \mid y_{<j}, x; \theta_s)\big), \tag{2}$$
where $\theta_t$ and $\theta_s$ denote the model parameters of the teacher and the student, respectively. Then, the overall loss function of word-level KD is the linear interpolation between the CE loss and the KL divergence loss:
$$\mathcal{L}_{word\text{-}kd} = (1 - \alpha)\,\mathcal{L}_{ce} + \alpha\,\mathcal{L}_{kl}. \tag{3}$$
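The interpolation between the CE loss and the KL divergence loss can be illustrated on a single sentence. The following is a minimal sketch, assuming plain Python lists stand in for the per-step probability distributions; the function names (`ce_loss`, `kl_loss`, `word_kd_loss`) and the clipping constant `eps` are illustrative choices and do not come from the paper's code.

```python
import math

def ce_loss(student_probs, gold_ids):
    """Eq. (1): negative log-probability of each gold token, summed over steps."""
    return -sum(math.log(p[y]) for p, y in zip(student_probs, gold_ids))

def kl_loss(teacher_probs, student_probs, eps=1e-12):
    """Eq. (2): sum over time steps of KL(q || p), with the teacher q as reference."""
    total = 0.0
    for q, p in zip(teacher_probs, student_probs):
        # eps only guards against log(0) in this toy setting (an assumption).
        total += sum(qi * math.log(qi / max(pi, eps)) for qi, pi in zip(q, p) if qi > 0)
    return total

def word_kd_loss(teacher_probs, student_probs, gold_ids, alpha=0.5):
    """Eq. (3): linear interpolation of the CE loss and the KL loss."""
    return (1 - alpha) * ce_loss(student_probs, gold_ids) + alpha * kl_loss(teacher_probs, student_probs)

# Toy example: one time step over a three-token vocabulary.
teacher = [[0.7, 0.2, 0.1]]
student = [[0.5, 0.3, 0.2]]
loss = word_kd_loss(teacher, student, gold_ids=[0])
```

With `alpha=0` the function reduces to the plain CE loss, and when the student matches the teacher exactly the KL term vanishes.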

Sequence-level Knowledge Distillation
Sequence-level KD (Kim and Rush, 2016) encourages the student model to imitate the sequence probabilities of the translations from the teacher model.
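To this end, it optimizes the student model through the following approximation:
$$\mathcal{L}_{seq\text{-}kd} = -\sum_{y \in \mathcal{Y}} Q(y \mid x; \theta_t) \log P(y \mid x; \theta_s) \approx -\log P(\hat{y} \mid x; \theta_s), \tag{4}$$
where $\mathcal{Y}$ denotes the hypothesis space of the teacher and $\hat{y}$ is the approximate result through the teacher's beam search.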
3 Probing the Knowledge of KD in NMT
In this section, we start from word-level KD and offer exhaustive empirical analyses on 1) the determining information in word-level KD (§3.1); 2) whether the correlation information has been learned (§3.2); 3) whether there are more benefits when extending the top-1 to top-k information (§3.3) and 4) the importance of the top-1 information on soft targets with different confidence (§3.4). Then we expand the conclusion to sequence-level KD (§3.5) and lastly revisit KD for NMT from a novel view (§3.6).

Which Information Determines the Performance of Word-level KD?
In word-level KD, the relative probabilities between negative candidates in the soft targets from the teacher contain rich correlation information, which was previously deemed to carry knowledge from the teacher (Hinton et al., 2015; Tang et al., 2020; Jafari et al., 2021). However, in practice, strong teachers usually have high confidence in their top-1 predictions while retaining little probability mass for other candidates. Hence, to study the mystery of KD, it is necessary to first investigate the real effects of the correlation information and the top-1 prediction information during KD and then figure out which one actually determines the performance of KD.
To this end, during word-level KD, we separately remove the top-1 information and the correlation information from the original soft targets of the teacher (as depicted in Fig. 1) and then observe the corresponding performance. Besides the BLEU score, we also introduce a new metric, namely the Top-1 Agreement (TA) rate, which calculates the overlap rate of the top-1 predictions between the student and the teacher on each position under the teacher-forcing mode. As shown in Tab. 2, the performance slightly increases when we remove the probabilities of all other candidates except for the top-1 ones in soft targets (see Fig. 1(b)). However, when we only remove the top-1 information and keep the remaining correlation information (see Fig. 1(c)), the performance of KD drops close to the baseline without any KD. Moreover, we observe that the TA rates are well correlated with the final BLEU scores among these students. Therefore, we conjecture that the top-1 information is the one that actually determines the performance of word-level KD (answer to Q1).
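The Top-1 Agreement rate described above can be sketched in a few lines. This is only an illustration under the assumption that both models have already been run in teacher-forcing mode on the same target prefix, so that each position has one student distribution and one teacher distribution; the helper names are hypothetical.

```python
def argmax(dist):
    """Index of the most probable token (first index wins on ties)."""
    return max(range(len(dist)), key=dist.__getitem__)

def top1_agreement(student_probs, teacher_probs):
    """Fraction of positions where student and teacher share the same top-1 token."""
    agree = sum(argmax(p) == argmax(q) for p, q in zip(student_probs, teacher_probs))
    return agree / len(teacher_probs)

# Toy example: three positions over a three-token vocabulary.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]
student = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.6, 0.3, 0.1]]
ta_rate = top1_agreement(student, teacher)  # the models agree on 2 of 3 positions
```

Note that the metric ignores the probability values themselves and only compares the identity of the top-ranked token at each position.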

Can Student Models Really Learn the Correlation Information?
To further confirm the above conjecture, we examine whether the student models have successfully learned the correlation information of the teacher during KD. To achieve this, we design two metrics to measure the ranking similarities between token rankings from the student and the teacher, named top-k edit distance and top-k ranking distance.
Top-k Edit Distance. Given the top-k predictions of the teacher at time step j as $[y^{t_1}_j, \ldots, y^{t_k}_j]$ and the ones of the student as $[y^{s_1}_j, \ldots, y^{s_k}_j]$, the top-k edit distance can be expressed as:
$$D_{edit} = \frac{1}{N} \sum_{j=1}^{N} f\big([y^{t_1}_j, \ldots, y^{t_k}_j], [y^{s_1}_j, \ldots, y^{s_k}_j]\big),$$
where f(·,·) calculates the edit distance.
Top-k Ranking Distance. For each $y^{t_i}_j$ in $[y^{t_1}_j, \ldots, y^{t_k}_j]$, this metric measures the average ranking distance between its original rank i from the teacher and the corresponding rank from the student, denoted as $r_s(y^{t_i}_j)$:
$$D_{rank} = \frac{1}{Nk} \sum_{j=1}^{N} \sum_{i=1}^{k} \big| i - r_s(y^{t_i}_j) \big|.$$
We compare the students above based on these two metrics and list the results in Tab. 3. Clearly, the students perform better on both $D_{edit}$ and $D_{rank}$ when the soft targets contain correlation information ((a), (c) vs. (b), (d)), indicating that student models can successfully learn the correlation information from the teacher. However, this ranking performance fails to bring better performance of KD, as measured by BLEU scores. Thus, these results negate the previous perception that the correlation information carries the knowledge during KD, which also supports our conjecture in Sec. 3.1.
Table 3: Ranking similarities between the students and the teachers and the corresponding BLEU scores (%).
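The two ranking metrics can be sketched as follows. This is a rough illustration rather than the authors' implementation: it assumes that f is the Levenshtein distance over token lists, that both metrics are averaged over the N positions (and over the k teacher tokens for the ranking distance), and that the student's rank of a token is taken from its full sorted distribution. All function names are hypothetical.

```python
def top_k(dist, k):
    """Token ids sorted by descending probability, truncated to the first k."""
    return sorted(range(len(dist)), key=lambda i: -dist[i])[:k]

def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def topk_edit_distance(teacher_probs, student_probs, k):
    """Average edit distance between teacher and student top-k lists."""
    pairs = zip(teacher_probs, student_probs)
    return sum(edit_distance(top_k(q, k), top_k(p, k)) for q, p in pairs) / len(teacher_probs)

def topk_ranking_distance(teacher_probs, student_probs, k):
    """Average |teacher rank - student rank| over the teacher's top-k tokens."""
    total = 0
    for q, p in zip(teacher_probs, student_probs):
        order = sorted(range(len(p)), key=lambda i: -p[i])
        student_rank = {tok: r for r, tok in enumerate(order, 1)}
        total += sum(abs(i - student_rank[tok]) for i, tok in enumerate(top_k(q, k), 1))
    return total / (len(teacher_probs) * k)

# Toy example: the student swaps the teacher's two most probable tokens.
teacher = [[0.5, 0.3, 0.15, 0.05]]
student = [[0.3, 0.5, 0.05, 0.15]]
d_edit = topk_edit_distance(teacher, student, k=2)
d_rank = topk_ranking_distance(teacher, student, k=2)
```

Both metrics are zero when the student reproduces the teacher's ordering exactly, and both grow as the student's ranking drifts away from it.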

Does Knowledge Increase with Top-k Information?
As the importance of the top-1 information for transferring knowledge in word-level KD has been validated, we further investigate whether more knowledge can be exploited by extending the top-1 information to the top-k information. The results in Tab. 4 give a negative answer: more information does not bring significantly more knowledge. Thus, we can believe that almost all the knowledge of the teacher in word-level KD comes from the teacher's top-1 information, even though the whole distribution is distilled to the student.

Does Top-1 Information Work in All Soft Targets?
Although the previous results have coarsely located the knowledge in word-level KD on the top-1 information of the teacher, it is still not clear whether this holds for all types of soft targets, especially when the teacher is under-confident in its top-1 predictions. Towards this end, we divide the soft targets of the teacher into three intervals (Wang et al., 2021) based on their top-1 probabilities: (0.0, 0.4], (0.4, 0.7], and (0.7, 1.0). Then we separately conduct the same KD processes as described in Fig. 1, only using the soft targets in one of these intervals. Surprisingly, the results in Fig. 2 show that even when the teacher is not so confident (i.e., $q_{max} \le 0.7$) in its top-1 predictions, using only the top-1 information (i.e., the blue bars) still achieves better performance than using the full information in corresponding soft targets. However, in these cases, removing the top-1 information in soft targets largely degrades the performance of the students. We conjecture that these under-confident top-1 predictions of the teacher can serve as hints for students to learn the difficult ground-truth labels, while the correlation information in these cases carries more noise than real knowledge for students.
Table 5: BLEU scores (%) of sequence-level KD on the validation set of the WMT'14 En-De task when we separately use the top-1 and the non-top-1 targets of the teacher in the teacher's translations during KD.
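Selecting the soft targets that fall into one confidence interval can be sketched as below. This is a small illustration only: the interval boundaries follow the text, but treating the last interval as closed at 1.0 and the helper names are assumptions made here.

```python
# Interval boundaries on the teacher's top-1 probability, read as (low, high].
# Closing the last interval at 1.0 is an assumption of this sketch.
INTERVALS = [(0.0, 0.4), (0.4, 0.7), (0.7, 1.0)]

def interval_index(teacher_dist):
    """Index of the confidence interval that contains the top-1 probability."""
    q_max = max(teacher_dist)
    for idx, (low, high) in enumerate(INTERVALS):
        if low < q_max <= high:
            return idx
    raise ValueError("top-1 probability outside (0, 1]")

def select_positions(teacher_probs, idx):
    """Positions whose soft targets fall into interval idx; KD would use only these."""
    return [j for j, q in enumerate(teacher_probs) if interval_index(q) == idx]

# Toy example: four positions with different teacher confidence.
teacher = [[0.35, 0.35, 0.3], [0.6, 0.3, 0.1], [0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
low_conf = select_positions(teacher, 0)   # positions 0 and 3
mid_conf = select_positions(teacher, 1)   # position 1
high_conf = select_positions(teacher, 2)  # position 2
```

In an actual training loop, the KD loss would then be masked to the selected positions while the remaining positions contribute no distillation signal.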

Expanding to Sequence-level KD
Inspired by the analyses on word-level KD, we move on to sequence-level KD and decompose its loss function in Eq. (4) into a word-level form:
$$\mathcal{L}_{seq\text{-}kd} \approx -\sum_{j} \log p(\hat{y}_j \mid \hat{y}_{<j}, x; \theta_s),$$
where $\hat{y}_j$ is the teacher-decoded target for students at time step j. Considering the similar word-level form, it is intuitive to speculate that the top-1 information may also matter in sequence-level KD. To verify this, we divide the targets $\hat{y}_j$ into the top-1 and the non-top-1 predictions of the teacher and investigate the respective effects of these targets by separately using them during sequence-level KD. As shown in Tab. 5, there is only a negligible performance change when we only use the top-1 targets for KD (row 1 vs. row 2). However, if we only use the non-top-1 targets, the BLEU score drastically drops (row 1 vs. row 3). Moreover, considering the different proportions of the two kinds of targets in the teacher's translations (i.e., 70% vs. 30%), we also use a fixed part (the same amount as the non-top-1 targets) of the top-1 targets for a fair comparison, and the performance is still steady (row 2 vs. row 4) and much better than using only the non-top-1 targets (row 3 vs. row 4). Interestingly, by adding additional word-level top-1 information to the non-top-1 part, the performance of sequence-level KD further improves (row 1 vs. row 5). Therefore, we can also confirm the importance of the top-1 information in sequence-level KD.
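One way to picture the split of teacher-decoded targets is sketched below. It rests on an assumption that is not spelled out in the text: a target counts as a top-1 target when it equals the argmax of the teacher's own distribution at that step, with the teacher run in teacher-forcing mode on its beam-search output. The function names are hypothetical.

```python
import math

def argmax(dist):
    """Index of the most probable token (first index wins on ties)."""
    return max(range(len(dist)), key=dist.__getitem__)

def split_by_teacher_top1(decoded_ids, teacher_probs):
    """Split positions of the teacher's translation into top-1 and non-top-1 targets."""
    top1_pos, non_top1_pos = [], []
    for j, (tok, q) in enumerate(zip(decoded_ids, teacher_probs)):
        (top1_pos if tok == argmax(q) else non_top1_pos).append(j)
    return top1_pos, non_top1_pos

def masked_nll(student_probs, decoded_ids, positions):
    """Word-level form of the sequence-level KD loss, restricted to some positions."""
    return -sum(math.log(student_probs[j][decoded_ids[j]]) for j in positions)

# Toy example: the beam output picks a non-top-1 token at position 1.
decoded = [0, 2, 1]
teacher = [[0.6, 0.3, 0.1], [0.5, 0.2, 0.3], [0.1, 0.8, 0.1]]
top1_pos, non_top1_pos = split_by_teacher_top1(decoded, teacher)
loss_top1_only = masked_nll(teacher, decoded, top1_pos)  # teacher reused as a stand-in student
```

Training on `top1_pos` only or on `non_top1_pos` only corresponds to the two restricted settings compared in the experiment.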
3.6 Rethinking KD in NMT from the Perspective of the Top-1 Information
Through the above analyses, we verify the importance of the teacher's top-1 information on both KD techniques, which actually reflects a potential connection between them. A brief theoretical analysis on this connection is provided in Appendix A. In short, the two kinds of techniques share a unified objective that imparts the teachers' top-1 predictions to student models at each time step. Thus, we believe that they are well connected on their similar working mechanisms (answer to Q2).
Further, we revisit word-level KD from this perspective and find two inherent issues. Firstly, the KL divergence-based objective in vanilla word-level KD directly optimizes whole distributions of students, while lacking specialized learning of the most important top-1 information. Secondly, since the top-1 predictions of the teacher mostly overlap with the ground-truth targets, the knowledge from the teacher is largely covered by the ground-truth information, which largely limits the potential of word-level KD. Therefore, we claim that the performance of the current word-level KD approach is far from perfect and the solutions to these problems are urgently needed.

Top-1 Information Enhanced Knowledge Distillation for NMT
To address the aforementioned issues in word-level KD, in this section, we introduce our method named Top-1 Information Enhanced Knowledge Distillation (TIE-KD), which includes a hierarchical ranking loss to boost the learning of the top-1 information from the teacher (§4.1) and an iterative knowledge distillation procedure to exploit more knowledge from the teacher (§4.2).

Hierarchical Ranking Loss
To help student models better grasp the top-1 information during distillation, we design a new loss named hierarchical ranking loss. To gently achieve this goal, we first encourage the student to rank the teacher's top-k predictions as its own top-k predictions and then rank the teacher's top-1 prediction over these top-k predictions. Formally, given the student's top-k predictions as $[y^{s_1}_j, \ldots, y^{s_k}_j]$ and the teacher's top-k predictions as $[y^{t_1}_j, \ldots, y^{t_k}_j]$, the hierarchical ranking loss $\mathcal{L}_{hr}$ can be expressed as:
$$\mathcal{L}_{hr} = \sum_{j=1}^{N} \sum_{u=1}^{k} \sum_{v=1}^{k} \mathbb{1}\{q(y^{t_u}_j) > q(y^{s_v}_j)\}\,\big(p(y^{s_v}_j) - p(y^{t_u}_j)\big),$$
where p(·) and q(·) are the probabilities from the student model and the teacher model, respectively, and $\mathbb{1}\{\cdot\}$ is an indicator function.
In this way, the student model can be enforced to rank the top-1 predictions of the teacher to its own top-1 places, and thus it can explicitly enhance the learning of the knowledge from the teacher. Then, we add this loss to the original KL divergence loss, i.e., Eq. (2), denoted as $\mathcal{L}_{kl}$, forming a new loss for KD:
$$\mathcal{L}_{kd} = \mathcal{L}_{kl} + \mathcal{L}_{hr}. \tag{7}$$

Algorithm 1 Iterative Knowledge Distillation
Input: source and target data in current mini-batch (x, y); student model S; teacher model T; iteration times N
1: Initialize y^0 = y; L_kd = 0
2: Compute L_ce based on Eq. (1)
3: for i in 1, 2, ..., N do
4:     p^i ← S(x, y^{i-1})    ▷ probability distributions from the student model
5:     q^i ← T(x, y^{i-1})    ▷ probability distributions from the teacher model
6:     Compute L^i_kd(p^i, q^i) based on Eq. (7)
7:     L_kd ← L_kd + L^i_kd
8:     y^i ← argmax(p^i)    ▷ student predictions as inputs in the next iteration
9: end for
10: L_word-kd ← (1 − α) L_ce + (α / N) L_kd
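The hierarchical ranking loss can be sketched on toy distributions as follows. This is a sketch under stated assumptions, not the authors' code: it assumes the pairwise form in which, for every pair of a teacher top-k token and a student top-k token where the teacher gives the former a strictly higher probability, the term p(student token) − p(teacher token) is added, summed over all positions and pairs without any hinge or margin. The helper names are hypothetical.

```python
def top_k(dist, k):
    """Token ids sorted by descending probability, truncated to the first k."""
    return sorted(range(len(dist)), key=lambda i: -dist[i])[:k]

def hierarchical_ranking_loss(teacher_probs, student_probs, k=5):
    """Pairwise ranking term over the teacher's and the student's top-k tokens."""
    loss = 0.0
    for q, p in zip(teacher_probs, student_probs):
        for t in top_k(q, k):          # teacher's top-k tokens
            for s in top_k(p, k):      # student's top-k tokens
                if q[t] > q[s]:        # teacher prefers t over s ...
                    loss += p[s] - p[t]  # ... so push the student's p[t] above p[s]
    return loss

# Toy example: the student ranks the teacher's top-1 token (id 0) last.
teacher = [[0.7, 0.2, 0.1]]
student = [[0.2, 0.5, 0.3]]
l_hr = hierarchical_ranking_loss(teacher, student, k=2)
```

Because the teacher's top-1 token has the highest teacher probability, it is compared against every student top-k token, which is what drives it towards the student's top-1 place. In this unhinged form the value can become negative once the student already respects the teacher's ordering.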

Iterative Knowledge Distillation
Given that the large overlap between the top-1 predictions and ground-truth targets limits the amount of additional knowledge from the teacher during word-level KD, introducing data without ground-truth targets for KD could be helpful to mitigate this issue. Inspired by previous studies on decoder-side [...], we develop an iterative KD procedure for this purpose.
Table 6: BLEU and COMET (Rei et al., 2020) scores (%) on three translation tasks. Results with † are taken from the original papers. Others are our re-implementation results using the released code with the same setting in Sec. 5.2 for a fair comparison. We report average results over 3 runs with random initialization. Results with * are statistically (Koehn, 2004) better than the vanilla Word-KD with p < 0.01.
Specifically, as shown in Algorithm 1, at each training step, we conduct KD for N iterations (line 3), by using the predictions of the student in the current iteration as the decoder-side inputs for KD in the next iteration (line 8). Generally, these predictions can be regarded as similar but new inputs compared to the original target inputs. Meanwhile, there is no ideal ground-truth target for these inputs since they are usually not well-formed sentences. Then during each iteration, we collect the loss of KD according to Eq. (7) (lines 4-7) and average it across all the iterations (line 10). Since all the supervision signals are from the teacher after the first iteration, the knowledge of the teacher model is exploited further during the following iterations and thus more of the potential of word-level KD can be released.
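The control flow of the iterative procedure can be sketched with toy models. This is a simplified illustration under several assumptions: `student` and `teacher` are hypothetical callables that map a source and a list of decoder-side input tokens to one probability distribution per position, the usual shifting of decoder inputs is ignored, and the per-iteration KD loss defaults to a plain KL term (the ranking term of Eq. (7) could be passed in through `kd_step_loss`).

```python
import math

def argmax(dist):
    """Index of the most probable token (first index wins on ties)."""
    return max(range(len(dist)), key=dist.__getitem__)

def kl_step_loss(p, q, eps=1e-12):
    """Sum over positions of KL(q || p); a stand-in for the per-iteration KD loss."""
    return sum(
        qi * math.log(qi / max(pi, eps))
        for q_j, p_j in zip(q, p)
        for qi, pi in zip(q_j, p_j)
        if qi > 0
    )

def iterative_kd_loss(student, teacher, x, y, alpha=0.5, n_iters=3, kd_step_loss=kl_step_loss):
    """Loss of one training step following the structure of Algorithm 1."""
    p_gold = student(x, y)
    l_ce = -sum(math.log(p_j[tok]) for p_j, tok in zip(p_gold, y))  # CE on gold targets
    dec_inputs, l_kd = list(y), 0.0                                  # start from gold inputs
    for _ in range(n_iters):
        p = student(x, dec_inputs)                # student distributions
        q = teacher(x, dec_inputs)                # teacher distributions
        l_kd += kd_step_loss(p, q)                # accumulate the KD loss
        dec_inputs = [argmax(p_j) for p_j in p]   # student predictions feed the next iteration
    return (1 - alpha) * l_ce + alpha / n_iters * l_kd

# Toy models that ignore their inputs and always return the same distribution.
def toy_teacher(x, dec_inputs):
    return [[0.7, 0.2, 0.1] for _ in dec_inputs]

def toy_student(x, dec_inputs):
    return [[0.5, 0.3, 0.2] for _ in dec_inputs]

loss = iterative_kd_loss(toy_student, toy_teacher, x=[0], y=[0, 0])
```

Only the first iteration sees the gold target as decoder input; every later iteration is supervised by the teacher alone, which is where the additional knowledge is expected to come from.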

Datasets
We conduct experiments on three commonly-used WMT tasks, i.e., the WMT'14 English to German (En-De), WMT'14 English to French (En-Fr) and WMT'16 English to Romanian (En-Ro).For all these tasks, we share the source and the target vocabulary and segment words into subwords using byte pair encoding (BPE) (Sennrich et al., 2016) with 32k merge operations.More statistics of the datasets can be found in Appendix C.1.

Implementation Details
All our experiments are conducted based on the open-source toolkit fairseq (Ott et al., 2019) with FP16 training (Ott et al., 2018). By default, we follow the big/base setting (Vaswani et al., 2017) to implement the teacher/student models in our experiments. More training and evaluation details can be found in Appendix C.2. For word-level KD-based methods, we set the α in Eq. (3) to 0.5 following Kim and Rush (2016). For our method, we set top-k in Sec. 4.1 to 5 and iteration time N in Sec. 4.2 to 3 on all three tasks. The selection of top-k and N is shown in Appendix D.

Main Results
We compare our proposed method with existing KD techniques in NMT (a detailed description of these compared techniques can be found in Appendix C.3) on three WMT tasks. To make the results more convincing, we report both BLEU and COMET (Rei et al., 2020) scores in Tab. 6. Using Transformer_{big} as the teacher, our method can boost the Transformer_{base} students by +1.04/+0.60/+1.11 BLEU scores and +4.52/+2.57/+4.80 COMET scores on three tasks, respectively. Compared to the vanilla Word-KD baseline, our method can outperform it significantly on all translation tasks, which verifies the effectiveness of our proposed solutions. Additionally, as a word-level KD method, our TIE-KD can outperform Seq-KD on all three tasks and even achieves fully competitive results with the teacher on En-Ro, which demonstrates that the potential of Word-KD can be largely released by our method.

Ablation Study
To separately verify the effectiveness of our solutions for the two issues in vanilla word-level KD, we conduct an ablation study on the WMT'14 En-De task and record the results in Tab. 7. When only adding the hierarchical ranking loss to vanilla word-level KD, the BLEU scores and the TA rates gain by +0.3/+0.22 and +0.32/+0.47 on the validation/test set, respectively. It reflects that KL divergence only provides a loose constraint on the learning of the top-1 information from the teacher, while our hierarchical ranking loss helps to explicitly grasp this core information. When only using iterative KD, the student also improves by +0.36/+0.25 BLEU scores and +0.18/+0.28 TA rates. It indicates that our iterative KD can effectively release the potential of word-level KD by introducing data without ground-truth targets. When combined together, the two solutions finally compose our TIE-KD and can yield further improvement on both metrics. Therefore, the two issues in word-level KD are orthogonal and our proposed solutions are complementary to each other.

Combination With Sequence-Level KD
According to Kim and Rush (2016), word-level KD can be well combined with sequence-level KD to yield better performance. As a word-level KD approach, our TIE-KD can also theoretically be combined with sequence-level KD. We verify this on the WMT'14 En-De task and list the results in Tab. 8. Like Word-KD, our TIE-KD can also achieve better performance when combined with Seq-KD and is also better than "Word-KD + Seq-KD", indicating the superiority of our method and its high compatibility with sequence-level KD.

Can a Stronger Teacher Teach a Better Student in NMT?
Among the prior literature on KD (Guo et al., 2020; Jafari et al., 2021; Qiu et al., 2022), a general consensus is that a large teacher-student capacity gap may harm the quality of KD. We also check this problem in NMT by using teachers of three model sizes. Besides the default configuration (i.e., Transformer_{big}) in our experiments above, we also add the Transformer_{base} setting as the weaker teacher and the Transformer_{deep-big} setting with 18 encoder layers and 6 decoder layers as the stronger teacher. We compare our method with word- and sequence-level KD under these teachers in Fig. 3 and draw several conclusions:
(1) The stronger teacher can bring improvement to sequence-level KD but fails to improve word-level KD, where the reason may be the less additional knowledge from the stronger teacher due to its higher top-1 accuracy (68%→70%).
(2) As a word-level KD method, our TIE-KD instead brings conspicuous improvement with the stronger teacher, indicating that our method can exploit more knowledge from the teacher.
(3) Under the weaker teacher, the student from our method even significantly surpasses the teacher, while other methods are largely limited by the performance of the teacher, demonstrating the high generalizability of our TIE-KD to different teacher-student capacity gaps.
6.4 Why is the Top-1 Information Important in KD?
The decoding process of language generation models can be regarded as a sequential decision-making process (Yu et al., 2017; Arora et al., 2022). As mentioned in Sec. 3.5, during decoding, beam search tends to pick the top-1 predictions of the NMT model on each beam and finally selects the most probable beam. Thus, the top-1 information (including both the top-1 word index and its corresponding probability) of the teacher model largely represents its decision on each decoding step, which is exactly what we expect the student model to learn from the teacher through KD in NMT. Therefore, the top-1 information can be seen as the embodiment of the knowledge of the teacher model in NMT tasks and should be emphatically learned by the student models.
7 Related Work
Kim and Rush (2016) first introduce word-level KD for NMT and further propose sequence-level KD for better performance. Afterward, Wang et al. (2021) investigate the effectiveness of different types of tokens in KD and propose selective KD strategies. Moreover, Wu et al. (2020) distill the internal hidden states of the teacher models into the students and also obtain promising results. In the field of non-autoregressive machine translation (NAT), KD from autoregressive models has become a de facto standard to improve the performance of NAT models (Gu et al., 2017; Zhou et al., 2019; Gu et al., 2019). Also, KD has been used to enhance the performance of multilingual NMT (Tan et al., 2019; Sun et al., 2020). Besides, similar ideas can be found when introducing external information to NMT models. For example, Baziotis et al. (2020) use language models as teachers for low-resource NMT models. Chen et al. (2020) distill the knowledge from fine-tuned BERT into NMT models. Feng et al. (2021) and Zhou et al. (2022) leverage KD to introduce future information to the teacher-forcing training of NMT models.
Differently, in this work, 1) we aim to explore where the knowledge hides in KD and unveil that it comes from the top-1 information of the teacher and further improve KD from this perspective; 2) we try to build a connection between two kinds of KD techniques in NMT and reveal their common essence, providing new directions for future work.

C.1 Statistics of the Datasets
For the En-De task, the training data contains nearly 4.5M sentence pairs. We choose newstest2013 and newstest2014 as the validation set and the test set, respectively. For the En-Fr task, 35.8M sentence pairs remain after the cleaning procedure. Then we choose newstest2013 and newstest2014 as the validation set and the test set, respectively. For the En-Ro task, we directly use the pre-processed data from Mehta et al. (2020) and there are about 608K sentence pairs in the training data. Then newsdev2016 is selected as the validation set and newstest2016 is the test set. The overall statistics of the datasets are listed in Table 9.
To obtain stronger teachers and enlarge the gaps between teacher models and student models, we train teachers for 50% more steps than the corresponding students. Then we use the checkpoint with the highest BLEU of the teacher on the validation set to conduct distillation.

Evaluation. During inference, we set beam size to 4 and length penalty to 0.6 for En-De and En-Fr. For En-Ro, we set beam size to 5 and length penalty to 1.2. For a more convincing evaluation, we use multi-bleu.perl to calculate case-sensitive BLEU and unbabel-comet to calculate COMET scores (Rei et al., 2020) for all three tasks. For student models, we average the last 5 checkpoints for evaluation following Vaswani et al. (2017). We use the paired bootstrap resampling method (Koehn, 2004) for the statistical significance test. For the En-De task and the En-Fr task, we evaluate and save the checkpoint every 5000 training steps. For the En-Ro task, since the models tend to overfit, we only train students for 20 epochs and save the checkpoint after every epoch.

C.3 Compared Systems and Hyperparameters
Transformer. We follow the standard base/big model configurations (Vaswani et al., 2017) to implement the student/teacher models.
Seq-KD. Kim and Rush (2016) also propose a sequence-level KD approach that directly substitutes the original target-side training data with the translations of the teacher from beam search. In our experiments, the hyperparameters of beam search are kept the same as in the inference stage.
BERT-KD. Chen et al. (2020) propose to distill the knowledge from BERT (Devlin et al., 2018) for text generation tasks.
Seer Forcing. Feng et al. (2021) design a seer forcing method for NMT to distill future information to the teacher forcing. Following the suggestion in Feng et al. (2021), we set the α in their paper to 0.5 for both En-De and En-Fr, and 0.25 for En-Ro. Besides, we set the seer dropout to 0.1 for En-De and En-Fr and 0.2 for En-Ro.
CBBGCA. Zhou et al. (2022) also propose to distill bi-directional contextual information in CMLM for uni-directional training of NMT based on the confidence of the NMT model.
Annealing KD. This is our implementation of the method in Jafari et al. (2021), which gradually anneals the temperature of the teacher during KD. Different from the original paper, we use the KL divergence as the loss function of KD instead of Mean Squared Error (MSE) due to its better performance on NMT tasks. In our carefully chosen recipe, we set the max temperature to 1.1 and gradually reduce it to 1.0 during the first 2/3 epochs. Then we use vanilla CE loss to train the student model for the remaining 1/3 epochs.
Selective-KD. Wang et al. (2021) investigate the effectiveness of different data for distillation and propose a knowledge selection method for selecting more valuable data for word-level KD. In our experiments, we choose the global-level selection that performs better according to Wang et al. (2021).

D Hyperparameter Selection
D.1 Effect of Hierarchical Ranking Range k
In this section, we investigate the effect of k in hierarchical ranking loss on our method. We search k in [3, 5, 10, 20] and compare their performance on the validation set of the WMT'14 En-De task. As shown in Fig. 4, our method performs best when k is set to 5. Thus, we set k to 5 for all three tasks in our experiments.

D.2 Effect of Iteration Times N
Since our method includes several iterations of KD, we further investigate the effects of the iteration times on the performance of our method. Intuitively, with more iteration times, more knowledge will be exploited from the teacher, while the computational cost will also increase. To check this, we try each iteration time in [1, 2, 3, 4] and record the corresponding performance and training time in Fig. 5. It is obvious that the performance of our method gradually improves with N increasing, while the training time per step also linearly increases. Balancing the cost and the performance, we choose 3 as the final iteration time.
Figure 1: Removing different information from the original soft targets provided by the teacher during word-level KD. Note that the soft target in "w/o KD" is equivalent to the soft target of label smoothing.

Figure 2: BLEU scores (%) of KD with different information in three intervals of soft targets on the validation set of the WMT'14 En-De task.

Figure 3: Performance of KD techniques with different teacher models on the test set of the WMT'14 En-De task.

Figure 4: BLEU scores of our method with different k on the validation set of the WMT'14 En-De task.

Figure 5: BLEU scores of our method with different iteration times N on the validation set of the WMT'14 En-De task and the corresponding training costs.

Table 2: Top-1 Agreement rates (%) and BLEU scores (%) of different soft targets during KD on the validation sets of the three tasks. Deeper colors represent better performance on the corresponding metrics.

Table 4: BLEU scores (%) of word-level KD with top-k information on the validation set of the three tasks. |V| is the vocabulary size.

Table 8: Combination with sequence-level KD and word-level KD methods on the WMT'14 En-De task.

Table 9: Statistics of the datasets for three WMT tasks.

Table 10: Training hyperparameters and model configurations of our experiments.