Continual Knowledge Distillation for Neural Machine Translation

While many parallel corpora are not publicly accessible for data copyright, data privacy and competitive differentiation reasons, trained translation models are increasingly available on open platforms. In this work, we propose a method called continual knowledge distillation to take advantage of existing translation models to improve one model of interest. The basic idea is to sequentially transfer knowledge from each trained model to the distilled model. Extensive experiments on Chinese-English and German-English datasets show that our method achieves significant and consistent improvements over strong baselines under both homogeneous and heterogeneous trained model settings and is robust to malicious models.


Introduction
Current neural machine translation (NMT) systems often face the following situation: parallel corpora are not publicly accessible, but trained models are increasingly available. On the one hand, many data owners are unwilling to share their parallel corpora with the public for data copyright, data privacy, and competitive differentiation reasons, leading to recent interest in federated learning for NMT (Wang et al., 2021b; Roosta et al., 2021). On the other hand, trained NMT models are increasingly available on platforms such as Hugging Face (https://huggingface.co) and Opus-MT (https://opus.nlpl.eu/Opus-MT), since these models can be used directly without public access to the original training data.
As a result, a question naturally arises: can we take advantage of increasingly available trained NMT models to enhance one NMT model of interest? In this work, we propose a method called Continual Knowledge Distillation (CKD) to address this problem for NMT. As shown in Figure 1, we assume that multiple trained NMT models (i.e., teachers) are available to "educate" one NMT model of interest (i.e., student) in a sequential manner, which means that teacher models to arrive in the future are not accessible at the current time step. We also assume that the training set of the student model, a transfer set, and a test set are available, but the training sets of the teachers are unavailable. CKD aims to continually improve the translation performance of the student model on the test set by sequentially distilling knowledge from each incoming teacher model to the student model.

As its name suggests, CKD lies at the intersection of knowledge distillation (Hinton et al., 2015) and continual learning (Kirkpatrick et al., 2017). On the one hand, CKD differs from standard knowledge distillation in that knowledge is transferred from teacher models to the student model asynchronously instead of synchronously. As a result, the knowledge transferred to the student model from previous teacher models can be overridden by an incoming teacher model, which is often referred to as the catastrophic forgetting problem (Kirkpatrick et al., 2017). The situation is aggravated when not all teacher models convey knowledge beneficial to the student model. On the other hand, CKD differs from conventional continual learning methods by focusing on learning one task (i.e., enhancing the student model) rather than learning many different tasks. The learning process is still very challenging compared with standard continual learning because the original training data of the teacher models is inaccessible to the student model. Consequently, we have to resort to knowledge distillation at each time step to make the most of the teacher models.
To address the aforementioned challenges, we propose to fuse two knowledge sources for the student model at each time step: filtering the new knowledge from the current teacher model (i.e., knowledge filtration) and inheriting the old knowledge from the previous student model (i.e., knowledge inheritance) simultaneously. Experimental results show that our method significantly and consistently outperforms strong baselines under both homogeneous and heterogeneous teacher settings for Chinese-to-English and German-to-English translation. Our method is also robust to malicious teachers.

Problem Statement
Let Θ = {θ*_1, θ*_2, . . .} be a sequence of frozen trained NMT models (i.e., teacher models), where θ*_t denotes the t-th teacher model. Let ϕ_0 be an NMT model of interest (i.e., student model) and ϕ_t be the student model at time step t. We use x = x_1, . . ., x_I to denote a source-language sentence, y = y_1, . . ., y_J to denote a target-language sentence, and y_<j = y_1, . . ., y_{j−1} to denote a partial translation. D_train = {⟨x^(m), y^(m)⟩}_{m=1}^{M} denotes the training set of the student model. D_trans = {⟨x^(n), y^(n)⟩}_{n=1}^{N} denotes the transfer set that a teacher model uses to "educate" the student model. D_test is a test set used to evaluate the student model. We use BLEU(D_test, ϕ_t) to denote the BLEU score the student model at time step t obtains on the test set.
Given an initial student model ϕ 0 , our goal is to maximize BLEU(D test , ϕ t ) by taking advantage of Θ, D train , and D trans .
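The sequential setup above can be sketched as a simple loop. This is an illustrative sketch, not the paper's implementation: `distill_step` is a hypothetical stand-in for one round of distillation, and the "models" in the toy usage are plain numbers.

```python
def continual_kd(student, teachers, d_train, d_trans, distill_step):
    """Sequentially distill each incoming teacher into the student.

    At time step t only the current teacher and the previous student are
    visible; teachers arriving in the future are never touched early.
    `distill_step` is a placeholder for one distillation round.
    """
    history = [student]
    for teacher in teachers:
        student = distill_step(student, teacher, d_train, d_trans)
        history.append(student)
    return student, history

# Toy usage: "models" are scalars and a "distillation round" nudges the
# student halfway toward the current teacher.
final, hist = continual_kd(
    student=0.0,
    teachers=[4.0, 2.0, 6.0],
    d_train=None,
    d_trans=None,
    distill_step=lambda s, t, tr, ts: 0.5 * s + 0.5 * t,
)
```

The point of the sketch is the access pattern: each iteration sees exactly one teacher, matching the assumption that future teachers are unavailable.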

Training Objective
As shown in Figure 1, the student model ϕ_t at time step t is determined by the current teacher model θ*_t, which encodes new knowledge, and the previously learned student model φ_{t−1}, which encodes previously learned knowledge. Therefore, the overall training objective of CKD is composed of three loss functions:

L(ϕ_t) = ℓ_CE(ϕ_t, D_train) + ℓ_KF(ϕ_t, θ*_t, D_trans) + λ ℓ_KI(ϕ_t, φ_{t−1}, D_trans),   (1)

where ℓ_CE(ϕ_t, D_train) is the standard cross entropy loss defined as

ℓ_CE(ϕ_t, D_train) = − Σ_{m=1}^{M} Σ_{j=1}^{J^(m)} log P(y_j^(m) | y_<j^(m), x^(m); ϕ_t),   (2)

where J^(m) is the length of the m-th target sentence y^(m). In Eq. 1, ℓ_KF(ϕ_t, θ*_t, D_trans) is a knowledge filtration loss (see Sec. 2.3) that filters the knowledge transferred from θ*_t, ℓ_KI(ϕ_t, φ_{t−1}, D_trans) is a knowledge inheritance loss (see Sec. 2.4) that inherits the knowledge transferred from φ_{t−1}, and λ is a hyper-parameter that balances the preference between receiving new knowledge and inheriting old knowledge.
Therefore, the learned student model at time step t can be obtained by

φ_t = argmin_{ϕ_t} { L(ϕ_t) }.   (3)
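The overall objective can be sketched numerically as follows. This is a minimal sketch under our own conventions (token distributions as plain Python lists, helper names ours), not code from the paper.

```python
import math

def cross_entropy_loss(dists, targets):
    """Cross entropy summed over token positions (in the spirit of the
    standard CE term on D_train): dists[j] is the model's distribution
    at position j, targets[j] is the gold token index."""
    return -sum(math.log(d[y]) for d, y in zip(dists, targets))

def ckd_objective(l_ce, l_kf, l_ki, lam):
    """Eq. 1: CE on D_train, plus knowledge filtration against the
    current teacher, plus lambda-weighted knowledge inheritance from
    the previous student."""
    return l_ce + l_kf + lam * l_ki
```

Here λ trades off new knowledge (the filtration term) against old knowledge (the inheritance term), exactly as described in the text.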

Knowledge Filtration
In standard knowledge distillation (Hinton et al., 2015), an important assumption is that the teacher model is "stronger" than the student model, i.e., the teacher model contains knowledge that can help improve the student model. Unfortunately, this assumption does not necessarily hold in our problem setting because it is uncertain what the next incoming teacher model will be. As a result, there are two interesting questions:

1. How do we know whether the teacher model contains knowledge useful to the student model?
2. How do we locate and transfer the useful knowledge from the teacher model to the student model?
Intuitively, the teacher and the student can take the same "test paper" in order to find where the teacher can help the student. Figure 2 shows an example. Given a (Romanized) Chinese sentence and its English translation, both the teacher and student models predict every target word y_j given the source sentence x and the partial translation y_<j.

Figure 2: An example that illustrates how to find where a teacher model can help a student model. Given a sentence pair from the transfer set, both the teacher and student models try to predict a target word given the source sentence and the partial translation. How well a model predicts can be quantified as a real-valued number. The target words on which the teacher performs better than the student are highlighted in red; other words are highlighted in blue.
The quality of a prediction can be quantified as a real-valued number.If the teacher model performs better than the student model on a target word (e.g., "Prof."), it is likely that the teacher model contains knowledge useful to the student model in this case.
On the contrary, the teacher model is probably not more knowledgeable than the student model regarding this case if its prediction is worse than that of the student (e.g., "will").
More formally, we use Q(y_j, y_<j, x, ϕ) to quantify how well a student model predicts a target token. It can be defined in the following ways:

1. Token entropy: calculating the entropy of the predicted distribution over target tokens, without using the ground-truth token:

Q(y_j, y_<j, x, ϕ) = − Σ_{y∈Y} P(y | y_<j, x; ϕ) log P(y | y_<j, x; ϕ),   (4)

where Y is the vocabulary of the target language.
2. Hard label matching: checking whether the predicted token is identical to the ground truth:

Q(y_j, y_<j, x, ϕ) = δ(argmax_{y∈Y} P(y | y_<j, x; ϕ), y_j),   (5)

where δ(y, y′) returns 1 if y is identical to y′ and 0 otherwise.
3. Token-level cross entropy: calculating the token-level cross entropy of the ground-truth token using the given model:

Q(y_j, y_<j, x, ϕ) = log P(y_j | y_<j, x; ϕ).   (6)
The quantification function for a teacher model Q(y j , y <j , x, θ) can be defined likewise.
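The three candidate quantification functions can be sketched as follows. This is an illustrative sketch under our own conventions: `dist` is a predicted distribution over the target vocabulary given `y_<j` and `x`, represented as a list of probabilities, and the function names are ours.

```python
import math

def q_entropy(dist):
    """Candidate 1 (token entropy, Eq. 4): entropy of the predicted
    distribution; the ground-truth token is not needed."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def q_hard_match(dist, gold):
    """Candidate 2 (hard label matching, Eq. 5): 1 if the argmax
    prediction equals the ground-truth token index, else 0."""
    return 1 if max(range(len(dist)), key=dist.__getitem__) == gold else 0

def q_log_prob(dist, gold):
    """Candidate 3 (token-level cross entropy, Eq. 6): log-probability
    assigned to the ground-truth token (higher means a better
    prediction)."""
    return math.log(dist[gold])
```

Note that only the last two candidates look at the ground truth, which is consistent with their stronger correlation with BLEU reported in Sec. 3.2.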
Since the transfer set D_trans can be equivalently seen as a collection of token-level tuples

D_trans = {⟨y_j^(n), y_<j^(n), x^(n)⟩ | 1 ≤ n ≤ N, 1 ≤ j ≤ J^(n)},   (7)

it can be divided into two parts depending on the comparison between the predictions of the teacher and student models: a positive subset D+_trans and a negative subset D−_trans. A tuple ⟨y_j, y_<j, x⟩ is a positive instance that belongs to D+_trans if the teacher's prediction outscores the student's under Q. Otherwise, it is a negative instance that belongs to D−_trans. After splitting the transfer set into two parts, it is natural to apply standard knowledge distillation using the positive subset D+_trans:

ℓ_KD(ϕ_t, θ*_t, D+_trans) = − Σ_{⟨y_j, y_<j, x⟩ ∈ D+_trans} Σ_{y∈Y} P(y | y_<j, x; θ*_t) log P(y | y_<j, x; ϕ_t).   (8)

However, one problem is that D+_trans may be very small in practice, making training efficiency very low.
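The two-way split can be sketched as below. The interfaces are hypothetical: `q_teacher` and `q_student` stand for the quantification function Q evaluated under the teacher and the student, respectively, and the "tuples" in the toy usage are just labels.

```python
def split_transfer_set(tuples, q_teacher, q_student):
    """Divide token-level tuples <y_j, y_<j, x> of the transfer set into
    a positive subset (teacher outscores student under Q) and a
    negative subset (everything else)."""
    pos, neg = [], []
    for tup in tuples:
        (pos if q_teacher(tup) > q_student(tup) else neg).append(tup)
    return pos, neg

# Toy usage, echoing Figure 2: the teacher predicts "Prof." better than
# the student but predicts "will" worse.
pos, neg = split_transfer_set(
    ["Prof.", "will"],
    q_teacher={"Prof.": -0.2, "will": -2.0}.get,
    q_student={"Prof.": -1.5, "will": -0.5}.get,
)
```

Standard KD is then applied only to `pos`, while `neg` feeds the negative loss described next.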
Therefore, instead of discarding the negative subset D−_trans, we introduce a new loss function to make the most of the negative instances. In analogy to humans, teachers can educate students by telling them what not to do. We expect the student model to learn from D−_trans in the same way. Our intuition is that erroneous tokens with a high probability in the teacher model's output distribution are critical, because the student is prone to making the same mistakes. Pushing the output distribution of the student model away from the poor target distribution may enable the student model to avoid making the same mistakes. As a result, D−_trans can be leveraged effectively and the overall learning efficiency is improved significantly. Accordingly, the negative KD loss on the negative subset is defined as

ℓ_NEG(ϕ_t, θ*_t, D−_trans) = Σ_{⟨y_j, y_<j, x⟩ ∈ D−_trans} max(0, α + Σ_{y∈Y} P(y | y_<j, x; θ*_t) log P(y | y_<j, x; ϕ_t)),   (9)

where α is a hyper-parameter that controls the activation of the loss: it is active only while the student distribution remains close to the poor teacher distribution. Finally, the knowledge filtration loss is the combination of the two functions:

ℓ_KF(ϕ_t, θ*_t, D_trans) = ℓ_KD(ϕ_t, θ*_t, D+_trans) + ℓ_NEG(ϕ_t, θ*_t, D−_trans).   (10)

Knowledge Inheritance
To circumvent the catastrophic forgetting problem, we introduce a loss function to inherit the knowledge learned up to the previous time step for the current student model:

ℓ_KI(ϕ_t, φ_{t−1}, D_trans) = − Σ_{⟨y_j, y_<j, x⟩ ∈ D_trans} Σ_{y∈Y} P(y | y_<j, x; φ_{t−1}) log P(y | y_<j, x; ϕ_t).   (11)
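A hedged sketch of this inheritance term, under the assumption that it is word-level distillation with the frozen previous student's distribution as the target (distributions represented as plain Python lists, helper name ours):

```python
import math

def knowledge_inheritance_loss(prev_dists, cur_dists):
    """Cross entropy between the previous student's (frozen) output
    distribution and the current student's, summed over token positions
    of the transfer set. prev_dists[j] / cur_dists[j] are the two
    models' distributions at position j."""
    return -sum(q * math.log(p)
                for p_prev, p_cur in zip(prev_dists, cur_dists)
                for q, p in zip(p_prev, p_cur) if q > 0.0)
```

When the current student matches the previous one exactly, the loss reduces to the previous student's own entropy, so minimizing it pulls the current student toward the distribution it had already learned.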

Experiments
To evaluate the effectiveness of our method, we conduct experiments on Chinese-to-English and German-to-English translation under three representative settings: homogeneous, heterogeneous, and malicious teachers.

Setup
Configurations. For the Chinese-to-English translation experiments under the homogeneous teacher setting, both the teachers and the student are Transformer-base models (Vaswani et al., 2017). Besides model architecture, there are a few other factors that may affect performance, e.g., teacher performance, student performance, model domain, and the order in which the teachers arrive. To investigate the impact of model performance and model domain, we leverage five parallel corpora of representative domains as shown in Table 1, among which two are million-scale, one is middle-scale, and the other two are small-scale. Correspondingly, five Transformer-base models are trained on these corpora, denoted as A, B, C, D, and E, respectively. Intuitively, A and B are well-trained while D and E are under-trained due to their training data sizes. To investigate the impact of the order of teachers, we enumerate all six permutations of A, B, and C. In addition, we append D and E to the end of each permutation to simulate the "weak" teacher scenario. Therefore, we have six configurations in total. Specifically, we use a string like "ABDE → C" to denote a configuration, which means that C is the student, A, B, D, and E are the teachers, and A arrives first, then B, and so on. For simplicity, we use the training set of C as both the training set D_train and the transfer set D_trans in CKD, and the test set of C is leveraged as D_test. The goal in this configuration is to improve the performance of C on D_test. In summary, the six configurations are "BCDE→A", "CBDE→A", "ACDE→B", "CADE→B", "ABDE→C", and "BADE→C".
For clarity, the differences of other aforementioned settings with this one will be given in the corresponding sections later.
Evaluation. We leverage the following two metrics to evaluate our method:

• ∆BLEU: measuring the translation quality gain of the student model over its initial version, which should be as high as possible. ∆BLEU from step 0 to t is defined as B(ϕ_t) − B(ϕ_0).
• Accumulative Degradation (AD): measuring the accumulated occasional quality degradation over all steps, which should be avoided as much as possible. AD from step 1 to t is defined as

AD = Σ_{i=1}^{t} max(0, B(ϕ_{i−1}) − B(ϕ_i)),   (12)

where B(•) denotes BLEU(D_test, •).
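Both metrics can be computed from the trajectory of BLEU scores across steps, e.g.:

```python
def delta_bleu(bleus):
    """Quality gain at the last step relative to the initial student:
    B(phi_t) - B(phi_0). bleus[i] is the BLEU score at step i."""
    return bleus[-1] - bleus[0]

def accumulative_degradation(bleus):
    """Accumulated step-wise BLEU drops, counting only steps where
    quality decreased; lower is better."""
    return sum(max(0.0, a - b) for a, b in zip(bleus, bleus[1:]))
```

For example, the trajectory 30.0 → 31.0 → 29.5 → 32.0 ends with a gain of 2.0 BLEU but an AD of 1.5, since the drop at the third step still counts even though it is later recovered.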
Baselines. Our method is compared with the following baseline methods:

• Knowledge Distillation (KD) (Khayrallah et al., 2018) for NMT, which trivially applies vanilla knowledge distillation to each token.
• Elastic Weight Consolidation (EWC) (Saunders et al., 2019;Thompson et al., 2019) which is a representative continual learning method that adds an EWC term as a penalty to alleviate catastrophic forgetting.
• Continual Learning for NMT (CL-NMT) (Cao et al., 2021) which is a representative work on multi-step continual learning in NMT.
3 BLEU score is computed using multi-bleu.perl on the corresponding test set for each student model.

Quantification Function Selection
We first evaluate the three candidates for the quantification function Q defined in Sec. 2.3. A proper Q should correlate well with model performance and generalize well to a wide range of domains. To this end, we collect six widely used datasets of different domains and varying sizes and evaluate the correlations between the candidates and corpus-level BLEU scores on them. The Pearson correlation coefficients for token entropy (Eq. 4), hard label matching (Eq. 5), and token-level cross entropy (Eq. 6) are −0.5622, 0.8091, and 0.7792, respectively. Both hard label matching and token-level cross entropy are strongly correlated with corpus-level BLEU. However, hard label matching cannot break a tie when the teacher and student models' predictions are both correct or both incorrect. Therefore, we adopt token-level cross entropy as Q in the rest of this work. Examples and more discussions can be found in Appendix C.
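The Pearson correlation used for this selection can be computed with a standard textbook implementation (this is generic statistics code, not code from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g., a candidate Q's scores vs. corpus-level BLEU across datasets.
    Assumes neither series is constant (non-zero standard deviation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A perfectly monotone-linear relation yields ±1; values such as the reported 0.8091 indicate a strong but imperfect linear relation with BLEU.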

Chinese-to-English Translation
Homogeneous Teacher Setting. In this setting, the student and all teacher models share the same architecture, Transformer-base. Due to space limitations, we only show the results of three configurations in Table 2. The full results for all configurations can be found in Appendix D.1. From Table 2 we can observe that: (1) Our method achieves improvements over the initial student model in all steps and configurations, and outperforms all baselines significantly. This indicates that our method is effective for leveraging diverse teacher models to continually improve the performance of the student model on its test dataset.
(2) Our method achieves zero or near-zero accumulative degradation (AD) scores in all configurations, indicating that our method is also effective at retaining acquired knowledge. Especially, when encountering model D (step 3), nearly all baselines face severe quality degradation compared with step 2, while our method even achieves a gain in ACDE → B, which further justifies its effectiveness.
(3) All baselines perform poorly after four steps of distillation, indicating that the problem we aim to solve is challenging. Specifically, KD, the worst one, suffers from severe performance degradation, with averaged ∆BLEU and AD scores of −7.09 and 9.88, respectively. We argue this is because KD implicitly assumes that the teacher models are helpful, making it vulnerable to less beneficial knowledge provided by them. EWC is designed to alleviate catastrophic forgetting and achieves better ∆BLEU and AD scores than KD. However, EWC still fails to improve over the initial student model, i.e., all its ∆BLEU scores are negative. CL-NMT is specially designed for multi-step continual learning in NMT and achieves the best ∆BLEU and AD scores among the baselines. Nevertheless, its average ∆BLEU score is significantly smaller than ours (0.40 vs. 1.70) and its average AD score is significantly worse than ours (2.77 vs. 0.05). Overall, the problem to be solved is challenging and our method is remarkably more effective than the baselines.
(4) Despite the promising results, slight performance degradation can still be observed occasionally for our method.Therefore, there is still room for further improvement on retaining acquired knowledge.
Heterogeneous Teacher Setting. Using logits as the medium to transfer and retain knowledge, our approach is model-agnostic and scalable. To verify this, we replace the Transformer-base teacher models with RNN (Bahdanau et al., 2014) and Transformer-big (Vaswani et al., 2017) models, and repeat the experiments in Table 2 with the other settings identical. Table 3 shows results similar to Table 2: our method outperforms all baselines significantly and also achieves zero or near-zero AD scores, indicating that it extends to different model architectures. Interestingly, all the baselines encounter serious performance degradation while the ∆BLEU of our method is nearly zero, indicating that distilling knowledge from a teacher of a completely different architecture may be extremely difficult. It deserves more thorough investigation, which we leave as future work.

Malicious Teacher Setting. Robustness to malicious models is critical in our scenario because only the parameters, rather than the training data, of the teachers are available. We simulate malicious teacher models by shuffling the outputs of a well-trained model within a batch so that the model answers almost completely wrong with high confidence. We repeat the experiments in Table 2 with the other settings identical. As shown in Table 4, our approach is far less affected by the malicious model under three different teacher model architectures. Moreover, directly detecting and skipping malicious models to save computational resources could be explored further.
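The malicious-teacher simulation (shuffling a well-trained model's outputs within a batch) can be sketched as below. This is illustrative: the batching scheme, seed, and the output objects are our own stand-ins.

```python
import random

def make_malicious_outputs(batch_outputs, seed=0):
    """Simulate a malicious teacher by permuting a well-trained model's
    outputs within a batch: each source sentence then receives a
    confident but misaligned output."""
    rng = random.Random(seed)
    idx = list(range(len(batch_outputs)))
    while True:
        rng.shuffle(idx)
        # retry until no output stays aligned with its own source
        # (trivially satisfied for batches of size < 2)
        if len(idx) < 2 or all(i != j for i, j in enumerate(idx)):
            break
    return [batch_outputs[i] for i in idx]
```

The permuted teacher keeps its original (confident) output distributions; only the pairing with the inputs is broken, which is what makes it "answer almost completely wrong with high confidence."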

Larger Scale Chinese-to-English Translation
We scale up the dataset size of the Chinese-to-English translation experiment under the homogeneous teacher setting from one million to ten million. Other settings are similar to the original experiments and are detailed in Appendix D.2. As shown in Table 5, our method remains effective while all baseline methods fail to achieve a positive quality gain (∆BLEU). This demonstrates that the performance of the baseline methods does not improve as the data size and model performance increase, while our method remains valid. Thus, our method scales to corpora of different sizes.

German-to-English Translation
We also conduct experiments on German-to-English datasets. Models are trained on four different datasets from different domains. Other settings are similar to the Chinese-to-English experiments and are detailed in Appendix D.3. The average values for each of the homogeneous, heterogeneous, and malicious teacher settings are reported in Table 6. Despite the large domain differences of the datasets, only our method consistently obtains BLEU gains and zero or near-zero AD scores, exceeding the baselines and demonstrating that our approach is effective for different language pairs.

Ablation Study
Table 7 shows the effect of the negative KD loss ℓ_NEG (Eq. 9) in knowledge filtration and the knowledge inheritance loss ℓ_KI. Results at the first step (t = 1) and a later step (t = 4) for Chinese-to-English translation under the homogeneous teacher setting are reported. We can observe that: 1. Removing either ℓ_NEG (row 2) or ℓ_KI (row 4) hurts the performance, indicating that both of them are effective.
2. Comparing row 1 with row 2, we can conclude that the negative subset of the transfer set, where the teacher performs worse than the student (D−_trans), also contains valuable non-trivial knowledge. Furthermore, trivially applying the vanilla KD loss ℓ_KD on D−_trans (row 2 vs. row 3) brings no gain. Therefore, our proposed negative KD loss is effective at making less beneficial knowledge useful.
3. Without ℓ KI , the performance drops severely, especially at a later step, verifying that knowledge inheritance is essential for retaining acquired knowledge.

Comparison with Multi-teacher Knowledge Distillation
Multi-teacher KD (Freitag et al., 2017), a.k.a. ensemble KD, generally requires all teachers to be available at the same time, which violates the definition of our problem and may result in enormous computational and memory costs as the number of teachers grows. Moreover, it is also non-trivial to adapt it to our scenario due to the potentially unbeneficial knowledge provided by teachers. Therefore, we do not include it as a major baseline in the experiments above. Nevertheless, in this section, we still provide a comparison of our method with vanilla multi-teacher KD, which averages the outputs of all teachers as the target distribution. The BLEU score of vanilla multi-teacher KD averaged over six configurations is 30.49, lower than our 31.18, indicating that our method is superior to vanilla multi-teacher KD even though the comparison setting favors it. More details on the comparison in terms of task definition, robustness, and storage requirements are analyzed in Appendix D.4.
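The averaged target distribution of vanilla multi-teacher KD can be sketched in a few lines (distributions as plain lists; this is a generic sketch of ensemble averaging, not the paper's code):

```python
def average_teacher_distributions(teacher_dists):
    """Vanilla multi-teacher KD target for one token position: the
    element-wise mean of all teachers' output distributions. Unlike
    CKD, this requires every teacher simultaneously in memory."""
    k = len(teacher_dists)
    return [sum(d[i] for d in teacher_dists) / k
            for i in range(len(teacher_dists[0]))]
```

The memory argument is visible directly in the signature: all `k` teacher distributions must exist at once, whereas CKD only ever holds the current teacher and the previous student.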

Related Work
Knowledge Distillation. Knowledge distillation (KD) is the most widely used technique for transferring knowledge between models (Hinton et al., 2015). Despite their effectiveness, conventional KD methods usually implicitly assume that the teacher model is superior or complementary to the student model (Gou et al., 2021). Although Qin et al. (2022) recently allowed a big model to learn from small models, they still require that the small models be better than the big model for the given tasks and datasets. However, this assumption does not necessarily hold in our scenario due to the diversity of teacher models. Multi-teacher KD (Freitag et al., 2017; You et al., 2017; Fukuda et al., 2017; Mirzadeh et al., 2020; Liu et al., 2020), which distills knowledge from multiple teachers simultaneously, is highly related to this work. Generally, multi-teacher KD requires all teachers to be available at the same time, which results in enormous extra memory consumption as the number of teachers grows. More importantly, new teachers may be released constantly (Wolf et al., 2020), which cannot be foreseen. Therefore, multi-teacher KD methods are not feasible in our scenario. L2KD (Chuang et al., 2020) leverages sequential KD to continually learn new tasks, which has different goals and challenges compared with our scenario. Another line of related work is selective distillation (Gu et al., 2020; Wang et al., 2021a; Shi and Radu, 2022), which selects data and losses to accelerate KD or enhance model robustness. In contrast, we select data for conducting different ways of distillation in our proposed method.
Continual Learning. Continual learning (CL) for neural machine translation (NMT) aims at learning knowledge of new domains (Thompson et al., 2019; Liang et al., 2021; Cao et al., 2021) or languages (Neubig and Hu, 2018; Garcia et al., 2021; Huang et al., 2022) without forgetting old knowledge. Our scenario also requires learning new knowledge but instead focuses on improving the performance of the student on its test set. Moreover, alleviating the negative impact of the less beneficial knowledge conveyed by "weak" teachers is essential in our scenario, which is hardly explored in CL for NMT. While our scenario is a multi-step process, multi-step CL is less explored in NMT (Cao et al., 2021; Liang et al., 2021). Zeng et al. (2019) address a similar task of adapting from multiple out-of-domain models to a single in-domain model. Nevertheless, they assume the training data of the out-of-domain models is available, which is inaccessible in our scenario. Besides, leveraging high-resource language NMT models to improve low-resource language translation has also attracted intensive efforts (Neubig and Hu, 2018; Lakew et al., 2019; Liu et al., 2021; Huang et al., 2022), which can be a future extension of our method.

Conclusion and Future Work
To take advantage of increasingly available trained neural machine translation (NMT) models to improve one model of interest, we propose a novel method named continual knowledge distillation. Specifically, knowledge from the trained models is transferred to the model of interest via knowledge distillation in a sequential manner. Extensive experiments on two language pairs under homogeneous, heterogeneous, and malicious teacher settings show the effectiveness of our proposed method.
In the future, we will further explore the effect of the teacher model order.It is also worth involving more sophisticated methods in knowledge filtration, such as gradient-based and meta-learning-based methods.
Moreover, it is also a promising research direction to exchange knowledge among all the models such that all of them achieve improvement.

Limitations
There are some limitations that have yet to be addressed. Since we use the predicted probability distributions of the model output as the medium for continual KD for NMT, the vocabularies of the models need to be consistent. Overcoming this would allow continual KD for NMT to be extended to models with different language pairs and different modalities. Also, although our approach is robust to malicious models, there are more diverse and sophisticated attacks in the real world that require more research on defense. In addition, the teacher and student models must be trained on the same language pair. Further studies can consider more general scenarios without the above limitations. Besides the sequential manner, other approaches are worth exploring to transfer knowledge from models rather than from their training data. For example, it is also possible to explore various distillation methods, such as organizing teacher models into batches or pipelines.

Ethics Statement
In practice, a provider may publicly release a model but may not wish its knowledge to be transferred into another one. Applying our method to such models will raise ethical concerns related to model stealing (He et al., 2022). How to detect this kind of misconduct still needs further exploration. Although sharing knowledge without exposing private data is one of the potential benefits of our method, models produced by our method are still vulnerable to attacks such as membership inference (Hisamoto et al., 2020), and the private training data could still be stolen from the model.

B Effect of Loss Weights

By adjusting k_a and k_b, we can regulate the weights of the positive and negative losses. As shown in Table 9, we keep the original settings since no significant performance improvement is found when adjusting k_a : k_b.

C Exploring the Knowledge Filtration Quantification Function
C.1 Examples
In Table 10, we show three examples to demonstrate how the default quantification function (token-level cross entropy) works in knowledge filtration.
• In the first case, we apply standard knowledge distillation because the teacher model assigns a higher probability to the ground-truth token "decorations" than the student does, indicating a better distribution from the former.
• In the second case, the output from the teacher model is discarded because the negative KD loss exceeds the threshold.It might be a reasonable choice since the output of the teacher is too far from the ground truth token.
• In the third case, the teacher model has slightly worse predictions than the student, motivating the student model not to make similar mistakes.

C.2 Alternatives to Quantification Function
The advantage of token-level cross entropy is that the predictions corresponding to the tokens in the transfer set D_trans can be divided into two mutually disjoint parts depending on the comparison between the predictions of the teacher and student models. In contrast, hard label matching divides D_trans according to whether the teacher and student models predict the ground-truth token correctly, which results in four parts due to ties, as shown in Table 11. Are the advantages of these two metrics beneficial for our task? Is it possible to combine their beneficial properties? To answer these questions, we define several metrics in Table 11 to compare the two at a fine-grained level. The effects of these metrics are shown in Table 12. We find that token-level cross entropy always performs better because fewer samples are discarded, such that more knowledge is transferred in knowledge distillation.
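The four-way partition induced by hard label matching can be sketched as follows (the `teacher_correct` / `student_correct` predicates are hypothetical stand-ins for "the model's argmax equals the ground truth"):

```python
def partition_by_hard_match(tuples, teacher_correct, student_correct):
    """Partition token-level tuples into four parts by whether the
    teacher and/or the student predict the ground truth correctly
    (cf. Table 11). The (True, True) and (False, False) parts are the
    ties that hard label matching cannot break, whereas a real-valued
    score like token-level cross entropy yields a clean two-way split."""
    parts = {(True, True): [], (True, False): [],
             (False, True): [], (False, False): []}
    for tup in tuples:
        parts[(teacher_correct(tup), student_correct(tup))].append(tup)
    return parts

# Toy usage with three hypothetical tuples.
parts = partition_by_hard_match(
    ["t1", "t2", "t3"],
    teacher_correct={"t1": True, "t2": False, "t3": True}.get,
    student_correct={"t1": True, "t2": False, "t3": False}.get,
)
```

Only the off-diagonal parts carry a usable teacher-vs-student signal under hard matching, which is why fewer samples survive than under the cross-entropy criterion.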

Table 1 :
The domain, training, and evaluation corpora of the five Transformer-base models used in the Chinese-to-English experiments. More details of the datasets are provided in Appendix A.

Table 2 :
Results of Chinese-to-English translation under the homogeneous teacher setting. "BCDE→A" denotes that A is the student model and B, C, D, and E are the teacher models in steps 1 to 4, respectively. "∆" denotes ∆BLEU compared with step 0 (i.e., the initial student model), and ∆BLEU scores are also reported as subscript numbers. "AD" is the accumulative degradation defined in Eq. 12, which is the lower the better. The last two columns are numbers averaged row-wise. Best results in step 4 are in bold.

Table 3 :
Results of Chinese-to-English translation under the heterogeneous teacher setting in step 1, averaged over six configurations. "Base" and "Big" denote Transformer-base and Transformer-big models, respectively, and "X→Y" denotes that X is the teacher and Y is the student.

Table 4 :
Results of Chinese-to-English translation under the malicious teacher setting in step 1, averaged over six configurations."(M)" is short for "malicious".

Table 5 :
Results of extending the training data of the Chinese-to-English teacher models to ten million scale under the homogeneous teacher setting, averaged over six configurations.

Table 6 :
Results of German-to-English translation in step 1, averaged over all six setting groups.

Table 7 :
Ablation study on Chinese-to-English translation under the homogeneous teacher setting. BLEU scores averaged over six configurations are reported.