How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?

While non-autoregressive (NAR) models are showing great promise for machine translation, their use is limited by their dependence on knowledge distillation from autoregressive models. To address this issue, we seek to understand why distillation is so effective. Prior work suggests that distilled training data is less complex than manual translations. Based on experiments with the Levenshtein Transformer and the Mask-Predict NAR models on the WMT14 German-English task, this paper shows that different types of complexity have different impacts: while reducing lexical diversity and decreasing reordering complexity both help NAR learn better alignment between source and target, and thus improve translation quality, lexical diversity is the main reason why distillation increases model confidence, which affects the calibration of different NAR models differently.


Introduction and Background
When training NAR models for neural machine translation (NMT), sequence-level knowledge distillation (Kim and Rush, 2016) is key to matching the translation quality of autoregressive (AR) models (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019). Knowledge distillation was first proposed to obtain small student models that match the quality of a higher-capacity teacher model (Liang et al., 2008; Hinton et al., 2015). Sequence-level knowledge distillation (SLKD) trains the student model p(y | x) to approximate the teacher distribution q(y | x) by minimizing the following objective:

$$\mathcal{L}_{\text{SEQ-KD}} = -\sum_{y \in \mathcal{Y}} q(y \mid x) \log p(y \mid x) \approx -\sum_{y \in \mathcal{Y}} \mathbb{1}[y = \hat{y}] \log p(y \mid x),$$

where $\mathcal{Y}$ represents the space of all possible target sequences, and $\hat{y}$ is the output of running beam search with the teacher model q.

However, we do not yet have a clear picture of how SLKD impacts NAR training. Ren et al. (2020) show that SLKD reduces the degree of dependency between target tokens. Gu et al. (2018) hypothesize that SLKD reduces the number of modes in the output distribution (alternative translations for a source), a hypothesis supported by experiments that use multiway parallel data to simulate the modes (Zhou et al., 2020). Zhou et al. (2020) also investigate the impact of data complexity on NAR translation quality: they generate distilled data of varying complexity with AR models of different capacity and show that higher-capacity NAR models require more complex distilled data to achieve better translation quality. They further show that generating distilled references with a mixture of experts (Shen et al., 2019) improves NAR translation quality. However, training samples can be complex in different ways, and it remains unclear how different types of data complexity alter the internal workings of NAR models and their translation quality. We also anticipate that data complexity may impact the uncertainty and calibration of NAR models, a question that remains understudied for NAR models, unlike for AR models (Ott et al., 2018; Wang et al., 2020).

This paper focuses on two types of data complexity: lexical diversity and degree of word reordering. We expose two state-of-the-art NAR models, Mask-Predict (Ghazvininejad et al., 2019) and the Levenshtein Transformer (Gu et al., 2019), to distilled references of varying complexity on the WMT14 German-English task. Experiments show that decreasing reordering complexity and reducing lexical diversity via distillation both help NAR models learn better alignment between source and target, and thus improve translation quality. Further analysis shows that knowledge distillation lowers model uncertainty by reducing lexical diversity, which affects the calibration of Mask-Predict and Levenshtein Transformer models in opposite directions.
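To make the SLKD objective concrete, the sketch below shows how distilled training data is typically generated; `teacher.translate` is a hypothetical stand-in for any beam-search decoding API, not the setup used in this paper.

```python
# Minimal sketch of sequence-level knowledge distillation (SLKD) data
# generation: the teacher's beam-search output replaces the reference,
# so the student is trained on (x, y_hat) pairs.

def build_distilled_corpus(teacher, source_sentences, beam_size=5):
    """Decode every training source with the AR teacher and keep the
    1-best hypothesis as the new (distilled) reference."""
    distilled = []
    for x in source_sentences:
        y_hat = teacher.translate(x, beam=beam_size)  # 1-best from beam search
        distilled.append((x, y_hat))
    return distilled

# The NAR student is then trained on `distilled` exactly as it would be
# on the real parallel data, realizing the approximation
# L_SEQ-KD ~ -log p(y_hat | x).
```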

Generating Diverse Distilled References
We measure distilled corpus complexity with:

• Word Reordering Degree, computed as the average fuzzy reordering score (FRS) (Talbot et al., 2011) over all sentence pairs. FRS is an MT evaluation metric introduced to distinguish significant changes in the reordering rules of MT systems on syntactically distant language pairs. A higher FRS indicates that the hypothesis is more monotonically aligned to the source. Zhou et al. (2020) show that distilled data has a higher FRS than the real data, which may benefit NAR models.

• Lexical Diversity, which captures the diversity of target word choices given a source word. We compute the lexical diversity LD(d) of the distilled corpus d by averaging the entropy of target words y conditioned on a source word x:

$$LD(d) = \frac{1}{|V_x|} \sum_{x \in V_x} \mathcal{H}(y \mid x),$$

where $V_x$ denotes the source vocabulary and $\mathcal{H}(y \mid x)$ is the entropy of the target words aligned to the source word x.

To isolate the impact of complexity factors, we seek to control the faithfulness F(d) of the distilled data d to the real parallel data r. We compute it as the KL divergence between the word alignment distributions of the real data r and the distilled data d:

$$F(d) = \frac{1}{|V_x|} \sum_{x \in V_x} D_{\mathrm{KL}}\big(p_r(y \mid x) \,\|\, p_d(y \mid x)\big).$$

Distilled Sample Generation. To encourage diversity according to the corpus-level metrics above, we select distilled references for each source from the k-best list of AR hypotheses,1 using instantiations of the following score:

$$\text{score}(\hat{y} \mid x, y) = \lambda \, \text{sim}(\hat{y}, y) + (1 - \lambda) \, \text{cxty}(\hat{y}, x),$$

where the similarity sim(ŷ, y) measures how faithful the hypothesis ŷ is to the original reference y, and the complexity cxty(ŷ, x) captures the relationship between the target sequence ŷ and the source sequence x. The similarity function is the smoothed sentence-level BLEU (Chen and Cherry, 2014) w.r.t. the original reference. We use three different complexity functions: 1) FRS, 2) a word-alignment score2 that measures complexity at the word level, and 3) an NMT score3 that measures complexity at the sentence level.
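As an illustration, the sketch below computes LD(d) from word-aligned sentence pairs and selects a distilled reference from a k-best list with the interpolated score. It assumes alignments are available as (source word, target word) pairs per sentence (e.g., extracted with fast_align); all names are ours, not the paper's implementation.

```python
import math
from collections import Counter, defaultdict

def lexical_diversity(aligned_pairs_per_sentence):
    """LD(d): average, over source words, of the entropy of the
    target words they align to."""
    counts = defaultdict(Counter)
    for pairs in aligned_pairs_per_sentence:
        for src, tgt in pairs:
            counts[src][tgt] += 1
    entropies = []
    for tgt_counts in counts.values():
        total = sum(tgt_counts.values())
        h = -sum((c / total) * math.log(c / total) for c in tgt_counts.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)

def select_reference(kbest, x, y, sim, cxty, lam=0.5):
    """Pick the hypothesis maximizing lam * sim(y_hat, y) + (1 - lam) * cxty(y_hat, x)."""
    return max(kbest, key=lambda y_hat: lam * sim(y_hat, y) + (1 - lam) * cxty(y_hat, x))
```

Varying λ and the choice of cxty trades off faithfulness to the original reference against the desired type of complexity in the selected distilled reference.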

Experimental Settings
Set-Up. We use the En-De and De-En datasets from WMT14 (Bojar et al., 2014) with the same preprocessing steps as Zhou et al. (2020). We evaluate translation quality with case-sensitive tokenized BLEU,4 using the Moses tokenizer.
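For reference, here is a minimal sketch of this evaluation protocol using the sacrebleu and sacremoses libraries (our choice of tooling to illustrate the metric, not necessarily the scripts used in the paper):

```python
import sacrebleu
from sacremoses import MosesTokenizer

# Toy example data; in practice these are detokenized system outputs
# and test-set references.
hyps = ["The cat sits on the mat ."]
refs = ["The cat sat on the mat ."]

tok = MosesTokenizer(lang="en")
hyps_tok = [tok.tokenize(h, return_str=True) for h in hyps]
refs_tok = [tok.tokenize(r, return_str=True) for r in refs]

# tokenize="none" because Moses tokenization was already applied;
# lowercase=False keeps the metric case-sensitive.
bleu = sacrebleu.corpus_bleu(hyps_tok, [refs_tok], tokenize="none", lowercase=False)
print(f"BLEU = {bleu.score:.1f}")
```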
Models. We use two state-of-the-art NAR models:

• Mask-Predict (MaskT) (Ghazvininejad et al., 2019) uses a masked language model (Devlin et al., 2019) to generate the target sequence by iteratively masking out and regenerating the subset of tokens that the model is least confident about.

• Levenshtein Transformer (LevT) (Gu et al., 2019) generates the target sequence through iterative insertion and deletion steps.

All AR and NAR models adopt the base Transformer architecture (Vaswani et al., 2017). We train all models with a batch size of 64,800 tokens for a maximum of 300,000 steps and select the best checkpoint based on validation perplexity (see Appendix for details). During inference, we set the maximum number of iterations to 10. All word alignments in this paper are generated automatically using fast_align (Dyer et al., 2013).5
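For intuition, here is a conceptual sketch of Mask-Predict's iterative inference with a linear masking schedule. `model(src, tgt)` is a hypothetical single-pass NAR decoder call; this simplifies the published algorithm and is not the authors' code.

```python
import torch

def mask_predict(model, src, tgt_len, mask_id, T=10):
    """Iterative mask-predict inference (conceptual sketch)."""
    tgt = torch.full((tgt_len,), mask_id, dtype=torch.long)  # start fully masked
    probs = torch.zeros(tgt_len)
    for t in range(T):
        logits = model(src, tgt)                   # (tgt_len, vocab); hypothetical call
        new_probs, new_tokens = logits.softmax(-1).max(-1)
        masked = tgt.eq(mask_id)
        tgt[masked] = new_tokens[masked]           # re-predict only masked positions
        probs[masked] = new_probs[masked]
        n = int(tgt_len * (T - t - 1) / T)         # linearly decaying mask count
        if n == 0:
            break
        remask = probs.topk(n, largest=False).indices  # least confident tokens
        tgt[remask] = mask_id
    return tgt
```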

Preliminary: SLKD Helps NAR Learn Word Alignment
Table 2: Translation quality on WMT14 De-En. In the bottom two groups, models are trained on distilled data with similar faithfulness (Faith) but varying degrees of reordering (FRS) and lexical diversity (Lex-Div). ↓ marks significant drops compared to the first row in each group, based on the paired bootstrap test at p < 0.05 (Clark et al., 2011).

Our work is motivated by the hypothesis that SLKD helps NAR models learn (implicit) alignment between source and target words. We first test this hypothesis by evaluating the effect of SLKD on two datasets: a) the En-De train/dev/test sets from WMT14, and b) a synthetic version of the same task, where word alignment information is embedded by pre-reordering the source words so that they are monotonically aligned with the target words (in train/dev/test sets). While SLKD improves BLEU by +2.4 on the original En-De task, it has no benefit on the synthetic task (Table 1). This supports our hypothesis and is consistent with other findings on real data: Ghazvininejad et al. (2019) and Zhou et al. (2020) showed that SLKD improves the quality of NAR models more on syntactically distant language pairs such as German-English than on Romanian-English. Furthermore, Ran et al. (2019) showed that automatically pre-reordering the source words improves the translation quality of NAR models. However, unlike in our experiment, SLKD is still needed in real translation scenarios, as exactly pre-reordering the source is not feasible at test time. Thus, we turn to understanding how distilled data helps NAR models on real translation tasks.
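As an illustration of the synthetic pre-reordering, the toy sketch below sorts source tokens by the mean position of their aligned target words; the exact pre-ordering heuristic is our assumption, not necessarily the one used in the experiment.

```python
def prereorder(src_tokens, alignment):
    """alignment: list of (src_idx, tgt_idx) pairs, e.g. from fast_align.
    Returns the source reordered to be monotonically aligned to the target."""
    tgt_pos = {}
    for s, t in alignment:
        tgt_pos.setdefault(s, []).append(t)
    # Sort by mean aligned target position; unaligned tokens keep their
    # original position, with the source index as a tie-breaker.
    def key(j):
        return (sum(tgt_pos[j]) / len(tgt_pos[j]) if j in tgt_pos else j, j)
    order = sorted(range(len(src_tokens)), key=key)
    return [src_tokens[j] for j in order]

# German "weil er das Buch gelesen hat" vs. English "because he has read the book":
src = "weil er das Buch gelesen hat".split()
align = [(0, 0), (1, 1), (2, 4), (3, 5), (4, 3), (5, 2)]
print(prereorder(src, align))  # ['weil', 'er', 'hat', 'gelesen', 'das', 'Buch']
```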

Reduced Lexical Diversity in SLKD Improves Translation Quality
We have shown that, similar to the effect of pre-reordering, SLKD benefits NAR training by reducing the difficulty of learning the source-target alignment. However, apart from the word reordering degree, reducing the lexical diversity on the target side can also reduce the difficulty of learning the alignment. In this section, we investigate how the two types of data complexity affect how well NAR models capture the source-target alignment, and therefore translation quality. SLKD impacts both complexity types: the first two rows of Table 2 show that SLKD increases FRS by +0.09, reduces lexical diversity by −0.18, and boosts the BLEU of MaskT and LevT by 1.6-3.0 points over their counterparts trained on real data.
We then compare NAR models trained on distilled data with varying degrees of reordering and lexical diversity while controlling for faithfulness (2nd and 3rd group of rows in Table 2). While the absolute BLEU deltas are small, BLEU decreases significantly as the lexical diversity increases, despite the reduced degree of reordering. This indicates that the harm from increased lexical diversity outweighs the benefit of a lower degree of reordering.
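For completeness, the significance test cited with Table 2 can be sketched as generic paired bootstrap resampling over test sentences (a simplified version of the test, not the authors' script):

```python
import random
import sacrebleu

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=0):
    """Fraction of resamples on which system B outscores system A."""
    rng = random.Random(seed)
    n = len(refs)
    wins_b = 0
    for _ in range(n_samples):
        sample = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        bleu_a = sacrebleu.corpus_bleu([hyps_a[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        bleu_b = sacrebleu.corpus_bleu([hyps_b[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        wins_b += bleu_b > bleu_a
    return wins_b / n_samples  # > 0.95 roughly corresponds to p < 0.05
```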

SLKD Increases Confidence of Source-Target Attention
To better understand how SLKD helps NAR learn the alignment between source and target, we measure how the confidence of the source-target attention changes over decoding iterations. Following Voita et al. (2019), we define the confidence of attention heads as the average of the maximum attention weights over source tokens, where the average is taken over target tokens. Higher confidence scores indicate that the model is more certain about which parts of the source sequence to attend to when predicting the target tokens. As seen in Figure 2, SLKD increases the confidence of source-target attention for both MaskT and LevT. The increase is larger for MaskT than for LevT. For LevT, SLKD increases the attention confidence the most at early decoding iterations. At later iterations, as the model becomes more confident about which source tokens to attend to given the target tokens generated at previous iterations, the impact of SLKD becomes smaller.
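This confidence statistic is simple to compute. A minimal sketch for a single attention matrix follows (averaging over heads, layers, and sentences is omitted):

```python
import torch

def attention_confidence(attn):
    """attn: (tgt_len, src_len) attention matrix, rows sum to 1.
    Max over source tokens, then mean over target tokens."""
    return attn.max(dim=-1).values.mean().item()

attn = torch.tensor([[0.8, 0.1, 0.1],   # a confident row
                     [0.4, 0.3, 0.3]])  # a diffuse row
print(attention_confidence(attn))       # (0.8 + 0.4) / 2 = 0.6
```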

Next, we separate the impact of lexical diversity and word reordering (Figure 3). Reducing both types of complexity leads to more concentrated source-target attention at early iterations. By contrast, models trained on more lexically and syntactically diverse data have more distributed source-target attention at early iterations, and the attention becomes more concentrated at later iterations as more target tokens are generated.
Overall, these results suggest that reducing lexical diversity and the degree of word reordering both help NAR models find the source-target alignment and thus reduce the error rate at the early decoding stage.

Reduced Lexical Diversity in SLKD Improves Model Confidence

Ott et al. (2018) show that the intrinsic uncertainty of translation, due to the existence of multiple semantically equivalent translations for the same source, is a source of uncertainty in AR models' output distributions. We hypothesize that these effects might be amplified in NAR models, yet little is known about the confidence and calibration of NAR models. We measure the impact of SLKD on model uncertainty using the average token probability of the models' translation outputs, and the inference Expected Calibration Error (ECE) (Wang et al., 2020), which measures how well the model's confidence in a prediction matches the correctness of that prediction.

As shown in Table 3, both MaskT and LevT become more confident when trained with SLKD. However, SLKD causes MaskT to be overconfident and hurts its calibration by +11% ECE.6 By contrast, SLKD changes LevT from underconfident to slightly overconfident, improving its calibration with −5% lower ECE.

Next, we isolate the impact of lexical diversity and degree of word reordering on model uncertainty.7 We measure the average token probability of MaskT and LevT trained on data with varying lexical diversity but close FRS scores (Figure 1a), and vice versa (Figure 1b). Decreasing lexical diversity by −0.02 significantly reduces model uncertainty by 2.1-4.6%, whereas the impact of word reordering degree is small: increasing FRS by +0.08 only increases the average uncertainty by 0.8-1.5%. By contrast, SLKD boosts FRS by +0.09 over the real data. This suggests that reduced lexical diversity is the main reason why SLKD increases model confidence in lexical choice, which raises concerns, since Ding et al. (2021) showed that lexical choice errors are also propagated from AR to NAR models through SLKD.
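The ECE computation can be sketched generically as follows (a standard binned formulation; Wang et al.'s inference ECE additionally specifies how token-level correctness is obtained, which we abstract away here):

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Bin token predictions by confidence; ECE is the bin-size-weighted
    average gap between mean confidence and accuracy per bin."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            err += in_bin.mean() * abs(confidences[in_bin].mean() - correct[in_bin].mean())
    return err

# A model that is 90% confident but only 70% accurate is overconfident:
print(ece([0.9] * 10, [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))  # 0.2
```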

Conclusion
We investigated the effect of knowledge distillation in NAR models trained on distilled data that differs along two types of complexity: lexical diversity and degree of word reordering. Reducing lexical diversity and decreasing the word reordering degree both boost the confidence of source-target attention, suggesting that they help NAR models learn the alignment between source and target. Furthermore, distillation increases model confidence by reducing lexical diversity, which improves calibration for LevT but leads to much worse calibration for MaskT. These findings reveal a connection between distillation and existing techniques to improve NAR via pre-reordering (Ran et al., 2019) or integrating external alignment information into the source-target attention (Li et al., 2019).8 Our findings are based on experiments on the WMT14 English-German corpus, which is widely used in the NAR translation literature and has interesting typological properties. While we expect these findings to hold for other tasks that exhibit similar degrees of reordering and lexical diversity, it remains to be seen to what degree they generalize to other language pairs and data settings.
We hope that this work will inspire future research on understanding the positive and negative impacts of knowledge distillation on NAR models, as well as on more advanced approaches to improving NAR by integrating lexical choice and word reordering knowledge. Our work also calls for future work on improving the calibration of NAR models.

B Model and Training Details
All AR and NAR models adopt the base Transformer architecture (Vaswani et al., 2017) with $d_{\text{model}} = 512$, $d_{\text{hidden}} = 2048$, $n_{\text{heads}} = 8$, $n_{\text{layers}} = 6$, and $p_{\text{dropout}} = 0.3$. We tie the source and target embeddings with the output layer weights (Press and Wolf, 2017; Nguyen and Chiang, 2018). We use label smoothing of 0.1. We train the models using Adam (Kingma and Ba, 2015) with an initial learning rate of 0.0005 and a batch size of 64,800 tokens for a maximum of 300,000 steps. We select the best checkpoint based on validation perplexity. The total number of parameters is 65M for the AR model, 66M for MaskT, and 91M for LevT. Training takes around 230 hours for each NAR model and 110 hours for each AR model on 4 Tesla P40 GPUs. Table 4 shows the corpus-level metric scores, test BLEU, and validation perplexity of MaskT and LevT trained on various distilled versions of the WMT14 De-En training data generated through diverse reference generation (Section 2).
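As a rough sanity check on these parameter counts, the sketch below approximates the base architecture with PyTorch's generic Transformer; the joint BPE vocabulary size of ~37k and the single tied embedding matrix are our assumptions.

```python
import torch.nn as nn

VOCAB, D_MODEL = 37000, 512

# Base Transformer body: 6 encoder and 6 decoder layers, 8 heads, FFN 2048.
body = nn.Transformer(d_model=D_MODEL, nhead=8,
                      num_encoder_layers=6, num_decoder_layers=6,
                      dim_feedforward=2048, dropout=0.3)
embed = nn.Embedding(VOCAB, D_MODEL)  # shared src/tgt/output embedding, counted once

n_params = sum(p.numel() for p in body.parameters()) + embed.weight.numel()
print(f"{n_params / 1e6:.0f}M parameters")  # ~63M, close to the 65M AR model above
```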

D Reference Generation Examples
We show that the k-best list generated by the AR model using beam search is both lexically and syntactically diverse, through a random example selected from the training set (Table 5).