Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation

Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models. However, there is a discrepancy in low-frequency words between the distilled and the original data, which leads to more errors when predicting low-frequency words. To alleviate the problem, we directly expose the raw data to NAT by leveraging pretraining. By analyzing directed alignments, we found that KD makes low-frequency source words align with targets more deterministically but fails to align sufficient low-frequency words from target to source. Accordingly, we propose reverse KD to rejuvenate more alignments for low-frequency target words. To make the most of authentic and synthetic data, we combine these complementary approaches as a new training strategy for further boosting NAT performance. We conduct experiments on five translation benchmarks over two advanced architectures. Results demonstrate that the proposed approach can significantly and universally improve translation quality by reducing translation errors on low-frequency words. Encouragingly, our approach achieves 28.2 and 33.9 BLEU points on the WMT14 English-German and WMT16 Romanian-English datasets, respectively. Our code, data, and trained models are available at https://github.com/longyuewangdcu/RLFW-NAT.


Introduction
Recent years have seen a surge of interest in non-autoregressive translation (NAT, Gu et al., 2018), which can improve decoding efficiency by predicting all tokens independently and simultaneously. The non-autoregressive factorization breaks conditional dependencies among output tokens, which prevents a model from properly capturing the highly multimodal distribution of target translations. As a result, the translation quality of NAT models often lags behind that of autoregressive translation (AT, Vaswani et al., 2017) models. To balance the trade-off between decoding speed and translation quality, knowledge distillation (KD) is widely used to construct new training data for NAT models (Gu et al., 2018). Specifically, target sentences in the distilled training data are generated by an AT teacher, which makes it easier for NAT to acquire more deterministic knowledge and achieve significant improvement.
* Liang Ding and Longyue Wang contributed equally to this work. Work was done when Liang Ding and Xuebo Liu were interning at Tencent AI Lab.
Previous studies have shown that distillation may lose some important information in the original training data, leading to more errors in predicting low-frequency words. To alleviate this problem, Ding et al. (2021b) proposed to augment NAT models with the ability to learn lost knowledge from the original data. However, their approach relies on external resources (e.g. word alignment) and human-crafted priors, which limits the applicability of the method to a broader range of tasks and languages. Accordingly, we directly expose the raw data to NAT by leveraging pretraining, without intensive modification to model architectures (§2.2). Furthermore, we analyze bilingual links in the distilled data from two alignment directions (i.e. source-to-target and target-to-source). We found that KD makes low-frequency source words align with targets more deterministically but fails to align low-frequency words from target to source due to information loss. Inspired by this finding, we propose reverse KD to recall more alignments for low-frequency target words (§2.3). We then concatenate the two kinds of distilled data to maintain the advantages of deterministic knowledge and low-frequency information. To make the most of authentic and synthetic data, we combine three complementary approaches (i.e. raw pretraining, bidirectional distillation training and KD finetuning) as a new training strategy for further boosting NAT performance (§2.4).
We validated our approach on five translation benchmarks (WMT14 En-De, WMT16 Ro-En, WMT17 Zh-En, WAT17 Ja-En and WMT19 En-De) over two advanced architectures (Mask-Predict, Ghazvininejad et al., 2019; Levenshtein Transformer, Gu et al., 2019). Experimental results show that the proposed method consistently improves translation performance over the standard NAT models across languages and advanced NAT architectures. Extensive analyses confirm that the performance improvement indeed comes from better lexical translation accuracy, especially on low-frequency tokens.
Contributions Our main contributions are:
• We show the effectiveness of rejuvenating low-frequency information by pretraining NAT models from raw data.
• We provide a quantitative analysis of bilingual links to demonstrate the necessity to improve low-frequency alignment by leveraging both KD and reverse KD.
• We introduce a simple and effective training recipe to accomplish this goal, which is robustly applicable to several model structures and language pairs.
2 Rejuvenating Low-Frequency Words

Preliminaries
Non-Autoregressive Translation Given a source sentence $x$, an AT model generates each target word $y_t$ conditioned on the previously generated ones $y_{<t}$, leading to high latency at the decoding stage. In contrast, NAT models break this autoregressive factorization by producing target words in parallel. Accordingly, the probability of generating $y$ is computed as:
$$p(y|x) = p_L(T|x; \theta) \prod_{t=1}^{T} p(y_t|x; \theta) \qquad (1)$$
where $T$ is the length of the target sequence, which is usually predicted by a separate conditional distribution $p_L$. The parameters $\theta$ are trained to maximize the likelihood of a set of training examples according to $\mathcal{L}(\theta) = \arg\max_{\theta} \log p(y|x; \theta)$. Typically, most NAT models are implemented upon the framework of Transformer (Vaswani et al., 2017).
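To make the conditional independence assumption concrete, the following minimal sketch scores a target sentence as a length probability plus a sum of per-position token log-probabilities. The function name `nat_log_prob` and the toy distributions are hypothetical illustrations, not part of the released code.

```python
import math

def nat_log_prob(length_probs, token_probs, target):
    """Score a target under a NAT-style factorization.

    length_probs: dict mapping a candidate length to its probability
                  (stands in for a separate length predictor).
    token_probs:  list with one dict per position, mapping token -> probability.
    target:       list of target tokens.
    """
    target_len = len(target)
    log_p = math.log(length_probs[target_len])
    # Each position is scored independently of all other target tokens.
    for position, token in enumerate(target):
        log_p += math.log(token_probs[position][token])
    return log_p

# Toy distributions that mix two valid translations ("thank you", "many thanks").
length_probs = {2: 0.8, 3: 0.2}
token_probs = [
    {"thank": 0.5, "many": 0.5},
    {"you": 0.5, "thanks": 0.5},
]
consistent = nat_log_prob(length_probs, token_probs, ["thank", "you"])
mixed = nat_log_prob(length_probs, token_probs, ["thank", "thanks"])
```

In this toy setting the inconsistent output "thank thanks" receives exactly the same score as "thank you", which is one way to picture the multimodality problem that motivates distillation.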
Knowledge Distillation Gu et al. (2018) pointed out that NAT models suffer from the multimodality problem, where the conditional independence assumption prevents a model from properly capturing the highly multimodal distribution of target translations. Thus, sequence-level knowledge distillation is introduced to reduce the modes of training data by replacing the original target-side samples with sentences generated by an AT teacher (Gu et al., 2018; Ren et al., 2020). Formally, the original parallel data Raw and the distilled data $\overrightarrow{\mathrm{KD}}$ can be defined as follows:
$$\mathrm{Raw} = \{(x_i, y_i)\}_{i=1}^{N} \qquad (2)$$
$$\overrightarrow{\mathrm{KD}} = \{(x_i, f_{s \to t}(x_i)) \mid x_i \in \mathrm{Raw}_S\}_{i=1}^{N} \qquad (3)$$
where $f_{s \to t}$ represents an AT-based translation model trained on Raw data for translating text from the source to the target language, and $N$ is the total number of sentence pairs in the training data. As shown in Figure 1(a), well-performing NAT models are generally trained on $\overrightarrow{\mathrm{KD}}$ data instead of Raw.
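The construction of distilled data described above can be sketched in a few lines. The helper `build_forward_kd` and the dictionary-based `toy_teacher` are hypothetical stand-ins; a real setup would call a trained AT model in place of the lookup.

```python
def build_forward_kd(raw_pairs, teacher_s2t):
    """Keep each source sentence and replace its reference with the teacher output."""
    return [(source, teacher_s2t(source)) for source, _reference in raw_pairs]

# Toy raw data in which one source sentence has two valid references (two modes).
raw = [("danke", "thanks"), ("danke", "thank you"), ("hallo", "hello")]

# Hypothetical deterministic teacher: always emits a single translation per source.
toy_teacher = {"danke": "thank you", "hallo": "hello"}.get

forward_kd = build_forward_kd(raw, toy_teacher)
raw_modes = {tgt for src, tgt in raw if src == "danke"}
kd_modes = {tgt for src, tgt in forward_kd if src == "danke"}
```

Here `raw_modes` contains two lexical choices for the same source while `kd_modes` contains one, which illustrates how distillation reduces modes while keeping the number of sentence pairs unchanged.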

Pretraining with Raw Data
Motivation Gao et al. (2018) showed that more than 90% of words have a frequency lower than 10e-4 in the WMT14 En-De dataset. This token imbalance problem biases translation models towards overfitting to frequent observations while neglecting low-frequency ones (Gong et al., 2018; Nguyen and Chiang, 2018). Thus, the AT teacher $f_{s \to t}$ tends to generate more high-frequency tokens and fewer low-frequency tokens when constructing the distilled data $\overrightarrow{\mathrm{KD}}$. On the one hand, KD can reduce the modes in training data (i.e. multiple lexical choices for a source word), which lowers the intrinsic uncertainty and learning difficulty for NAT (Ren et al., 2020), making it easier to acquire more deterministic knowledge. On the other hand, KD aggravates the imbalance between high-frequency and low-frequency words in training data and loses some important information that originates in the raw data. Ding et al. (2021b) revealed this side effect of distilled training data, which causes lexical choice errors for low-frequency words in NAT models. Accordingly, they introduced an extra bilingual data-dependent prior objective to augment NAT models with the ability to learn the lost knowledge from raw data. We use their findings as our departure point, but rejuvenate low-frequency words in a simpler and more direct way: directly exposing raw data to NAT via pretraining.
Figure 1: An illustration of different strategies for training NAT models: (a) Traditional Training, (b) Raw Pretraining, (c) Bidirectional Distillation Training. "distill" and "reverse distill" indicate sequence-level knowledge distillation with forward and backward AT teachers, respectively. The data block in transparent color means source- or target-side data are synthetically generated. Best viewed in color.
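The kind of token-imbalance statistic cited in the motivation can be measured with a simple count over a tokenized corpus. The sketch below is only illustrative: the function name `share_of_rare_types`, the whitespace tokenization and the toy threshold are assumptions, and the cited study may have used a different procedure.

```python
from collections import Counter

def share_of_rare_types(sentences, threshold):
    """Fraction of word types whose relative frequency is below `threshold`."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    total_tokens = sum(counts.values())
    rare_types = [w for w, c in counts.items() if c / total_tokens < threshold]
    return len(rare_types) / len(counts)

# Toy corpus: "the" occurs 4 times, "cat" and "dog" once each (6 tokens in total).
corpus = ["the cat", "the dog", "the the"]
share = share_of_rare_types(corpus, threshold=0.2)
```

Running the same function on the raw targets and on the distilled targets would give a rough view of how far distillation shifts the vocabulary towards frequent words.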

Our Approach Many studies have shown that pretraining can transfer knowledge and data distributions, especially for rare categories, hence improving model robustness (Hendrycks et al., 2019; Mathis et al., 2021). Here we want to transfer the distribution of lost information, e.g. low-frequency words. As illustrated in Figure 1(b), we propose to first pretrain NAT models on Raw data and then continue training them on $\overrightarrow{\mathrm{KD}}$ data. The raw data maintain the original distribution, especially for low-frequency words. Although it is difficult for NAT to learn high-mode data, pretraining can acquire general knowledge from authentic data, which may help subsequent tasks to be learned better and faster. Thus, we stop pretraining early, once the model achieves 90% of the best performance on raw data in terms of BLEU score (Platanios et al., 2019).¹ In order to keep the merits of low modes, we further train the pretrained model on the distilled data $\overrightarrow{\mathrm{KD}}$. As it is easy for NAT to learn deterministic knowledge, we finetune the model for the remaining steps. For fair comparison, the total number of training steps of the proposed method is the same as that of the traditional one. In general, we expect that this training recipe can provide a good trade-off between raw and distilled data (i.e. high-modes and complete vs. low-modes and incomplete).
¹ In preliminary experiments, we tried another simple strategy: early stopping at a fixed step according to the size of the training data (e.g. training 70K En-De and early stopping at 20K / 30K / 40K, respectively). We found that both strategies achieve similar performance.
Table 2: An example in different kinds of data. "Raw" means the original data while "$\overrightarrow{\mathrm{KD}}$" and "$\overleftarrow{\mathrm{KD}}$" indicate synthetic data distilled by KD and reverse KD, respectively. The subscript "S" or "T" is short for source- or target-side. The low-frequency words are highlighted with colors and italics are incorrect translations.

Bidirectional Distillation Training
Analyzing Bilingual Links in Data KD simplifies the training data by replacing low-frequency target words with high-frequency ones. This makes it easier to align source words to target ones, resulting in high bilingual coverage (Jiao et al., 2020). Due to the information loss, we argue that KD gives low-frequency target words fewer opportunities to align with source ones. To verify this, we propose a method to quantitatively analyze bilingual links from two directions, where low-frequency words are aligned from source to target (s → t) or in the opposite direction (t → s).
The method can be applied to different types of data. Here we take s → t links in Raw data as an example to illustrate the algorithm. First, given the WMT14 En-De parallel corpus, we employ an unsupervised word alignment method (Och and Ney, 2003) to produce a word alignment, and then we extract aligned links whose source words are low-frequency (called s → t LFW Links). Second, we randomly select a number of samples from the parallel corpus. For better comparison, the subset should contain the same i in Equation (2) as that of the other types of datasets (e.g. i in Equation (3) for $\overrightarrow{\mathrm{KD}}$). Finally, we calculate recall, precision and F1 scores based on low-frequency bilingual links for the subset. Recall (R) represents how many low-frequency source words can be aligned to targets. Precision (P) means how many aligned low-frequency links are correct according to human evaluation. F1 is the harmonic mean of precision and recall. Similarly, we can analyze t → s LFW Links by considering low-frequency targets.
... with worse alignment quality (79.1 vs. 80.6) in $\overrightarrow{\mathrm{KD}}$ than those in Raw. This confirms our claim that KD harms NAT models due to the loss of low-frequency target words. Inspired by these findings, it is natural to assume that reverse KD exhibits complementary properties. Accordingly, we conduct the same analysis on $\overleftarrow{\mathrm{KD}}$ data, and found better t → s links but worse s → t links compared with Raw. Take the Zh-En sentence pair in Table 2 for example: $\overrightarrow{\mathrm{KD}}$ retains the source-side low-frequency Chinese word "海克曼" (Raw$_S$) but generates the high-frequency English word "Heckman" instead of the golden "Hackman" ($\overrightarrow{\mathrm{KD}}_T$). On the other hand, $\overleftarrow{\mathrm{KD}}$ preserves the low-frequency English word "Hackman" (Raw$_T$) but produces the high-frequency Chinese word "哈克曼" ($\overleftarrow{\mathrm{KD}}_S$).
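The recall, precision and F1 statistics over low-frequency links described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the function `lfw_link_scores` and its data layout are hypothetical, the alignment links are taken as given (for example from an external aligner), and the set `judged_correct` stands in for the human evaluation of link correctness.

```python
def lfw_link_scores(sentences, low_freq_words, judged_correct):
    """Recall, precision and F1 over s -> t links of low-frequency source words.

    sentences:      list of (source_tokens, links), links being a set of
                    (source_index, target_index) pairs.
    low_freq_words: set of source words treated as low-frequency.
    judged_correct: set of (sentence_index, source_index, target_index) links
                    marked correct (a stand-in for human evaluation).
    """
    n_lfw = n_aligned = n_links = n_correct = 0
    for sent_idx, (source_tokens, links) in enumerate(sentences):
        for src_idx, token in enumerate(source_tokens):
            if token not in low_freq_words:
                continue
            n_lfw += 1
            token_links = [(s, t) for (s, t) in links if s == src_idx]
            if token_links:
                n_aligned += 1
            for s, t in token_links:
                n_links += 1
                if (sent_idx, s, t) in judged_correct:
                    n_correct += 1
    recall = n_aligned / n_lfw if n_lfw else 0.0
    precision = n_correct / n_links if n_links else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1

# Toy data: "Hackman" is low-frequency; it is aligned in the first sentence only.
sentences = [
    (["Hackman", "won"], {(0, 0), (1, 1)}),
    (["Hackman", "left"], {(1, 1)}),
]
recall, precision, f1 = lfw_link_scores(sentences, {"Hackman"}, {(0, 0, 0)})
```

The same function covers the t → s direction if target tokens and reversed links are passed in place of the source-side inputs.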
Our Approach Based on the analysis results, we propose to train NAT models on bidirectional distillation by concatenating two kinds of distilled data. Reverse distillation replaces the source sentences in the original training data with synthetic ones generated by a backward AT teacher. Following Equation (3), $\overleftarrow{\mathrm{KD}}$ can be formulated as:
$$\overleftarrow{\mathrm{KD}} = \{(f_{t \to s}(y_i), y_i) \mid y_i \in \mathrm{Raw}_T\}_{i=1}^{N} \qquad (4)$$
where $f_{t \to s}$ represents an AT-based translation model trained on Raw data for translating text from the target to the source language. Figure 1(c) illustrates the training strategy. First, we employ both the $f_{s \to t}$ and $f_{t \to s}$ AT models to generate $\overrightarrow{\mathrm{KD}}$ and $\overleftarrow{\mathrm{KD}}$ data, respectively. Considering the complementarity of the two kinds of distilled data, we combine $\overrightarrow{\mathrm{KD}}$ and $\overleftarrow{\mathrm{KD}}$ as new training data for NAT models. We expect that 1) distilled data can maintain the advantages of low modes; and 2) bidirectional distillation can recall more LFW links in both directions with better alignment quality, leading to overall improvements. Besides, Nguyen et al. (2020) claimed that combining different distilled data (generated by various models trained with different seeds) improves data diversification for NMT, and we leave this for future work.
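The bidirectional distillation data described above can be assembled as a plain concatenation of forward and reverse distilled pairs. In this sketch the function `build_bidirectional_kd` and the dictionary-based teachers are hypothetical; the toy outputs reuse the Zh-En example discussed in the analysis.

```python
def build_bidirectional_kd(raw_pairs, teacher_s2t, teacher_t2s):
    """Concatenate forward KD (synthetic targets) and reverse KD (synthetic sources)."""
    forward_kd = [(src, teacher_s2t(src)) for src, _tgt in raw_pairs]
    reverse_kd = [(teacher_t2s(tgt), tgt) for _src, tgt in raw_pairs]
    return forward_kd + reverse_kd

raw = [("海克曼", "Hackman")]

# Hypothetical teachers that prefer high-frequency variants, as in the example.
toy_s2t = {"海克曼": "Heckman"}.get
toy_t2s = {"Hackman": "哈克曼"}.get

bidirectional = build_bidirectional_kd(raw, toy_s2t, toy_t2s)
```

The combined set keeps the low-frequency source word in the forward pair and the low-frequency target word in the reverse pair, and it is twice the size of the raw data.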

Combining Both of Them: Low-Frequency Rejuvenation (LFR)
We have proposed two parallel approaches to rejuvenate low-frequency knowledge from authentic (§2.2) and synthetic (§2.3) data, respectively. Intuitively, we combine both of them to further improve the model performance.
From the data view, the two presented training strategies are: Raw → $\overrightarrow{\mathrm{KD}}$ (Raw Pretraining) and $\overrightarrow{\mathrm{KD}}$ + $\overleftarrow{\mathrm{KD}}$ (Bidirectional Distillation Training). Considering the effectiveness of pretraining (Mathis et al., 2021) and clean finetuning (Wu et al., 2019), we introduce a combined pipeline, Raw → $\overrightarrow{\mathrm{KD}}$ + $\overleftarrow{\mathrm{KD}}$ → $\overrightarrow{\mathrm{KD}}$, as our best training strategy. There are many possible ways to implement the general idea of combining the two approaches. The aim of this paper is not to explore the whole space but simply to show that one fairly straightforward implementation works well and the idea is reasonable. Nonetheless, we compare possible strategies for combining the two approaches and demonstrate their complementarity in §3.3. In the main experiments (§3.2), we validate the combination strategy, namely Low-Frequency Rejuvenation (LFR).
Models We validated our research hypotheses on two state-of-the-art NAT models:

• Mask-Predict (MaskT, Ghazvininejad et al. 2019) that uses the conditional mask LM (Devlin et al., 2019) to iteratively generate the target sequence from the masked input. We followed its optimal settings to keep the iteration number as 10 and length beam as 5.
• Levenshtein Transformer (LevT, Gu et al. 2019) that introduces three steps: deletion, placeholder and token prediction. The number of decoding iterations adapts to certain conditions.
We closely followed previous works to apply sequence-level knowledge distillation to NAT (Kim and Rush, 2016 ... 128K tokens/batch). In this work, we empirically adopt a large-batch strategy (i.e. 480K tokens/batch) to reduce the training steps for NAT (i.e. 70K). Accordingly, the learning rate warms up to $1 \times 10^{-7}$ for 10K steps, and then decays for 60K steps with the cosine schedule (Ro-En models only need 4K and 21K steps, respectively). For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction, and apply weight decay of 0.01 and label smoothing with $\epsilon = 0.1$. We use the Adam optimizer (Kingma and Ba, 2015) to train our models. We followed common practice (Ghazvininejad et al., 2019; Kasai et al., 2020) to evaluate the performance on an ensemble of the top 5 checkpoints to avoid stochasticity. Note that the total training steps of the proposed approach (in §2.2∼2.4) are identical to those of the standard training (in §2.1). Taking the best training strategy (Raw → $\overrightarrow{\mathrm{KD}}$ + $\overleftarrow{\mathrm{KD}}$ → $\overrightarrow{\mathrm{KD}}$) as an example, we empirically set the training steps of the three stages to 20K, 20K and 30K, respectively. Ro-En models need 8K, 8K and 9K steps in the corresponding training stages.

Comparison with Previous Work
Table 3 lists the results of previous competitive NAT models (Gu et al., 2018; Lee et al., 2018; Kasai et al., 2020; Gu et al., 2019; Ghazvininejad et al., 2019) on the WMT16 Ro-En and WMT14 En-De benchmarks. We implemented our approach on top of two advanced NAT models (i.e. Mask-Predict and Levenshtein Transformer). Compared with standard NAT models, our training strategy significantly and consistently improves translation performance (BLEU↑) across different language pairs and NAT models. Besides, the improvements in translation performance are mainly due to an increase in translation accuracy on low-frequency words (ALF↑), which reconfirms our claims. For instance, our method significantly improves the standard Mask-Predict model by +0.8 BLEU with a substantial +3.6 increase in ALF score. Encouragingly, our approach pushes the existing NAT models to achieve new SOTA performance (i.e. 28.2 and 33.9 BLEU on En-De and Ro-En, respectively).
It is worth noting that our data-level approaches neither modify the model architecture nor add an extra training loss, and thus do not increase latency ("Speed"), maintaining the intrinsic advantages of non-autoregressive generation. We must admit that our strategy does increase the amount of computing resources required, because we must train $f_{t \to s}$ AT teachers to build the $\overleftarrow{\mathrm{KD}}$ data.

Results on Other Language Pairs
Table 4 lists the results of NAT models on the Zh-En and Ja-En language pairs, which belong to different language families (i.e. Indo-European, Sino-Tibetan and Japonic). Compared with baselines, our method significantly and incrementally improves the translation quality in all cases. For Zh-En, LFR achieves on average a +0.8 BLEU improvement over the traditional training, along with an average increase of +3.0% in accuracy on low-frequency word translation. For the long-distance language pair Ja-En, our method still improves the NAT model by on average +0.7 BLEU with on average +2.2 ALF. Furthermore, NAT models with the proposed training strategy perform closely to their AT teachers (i.e. 0.2 ∆BLEU). This shows the effectiveness and universality of our method across language pairs.
Table 4: Performance on other language pairs, including WMT17 Zh-En and WAT17 Ja-En. "†" indicates a statistically significant difference (p < 0.05) from the corresponding baselines.
Table 6: Performance on different scales of training data. The small and medium datasets are sampled from the large WMT19 En-De dataset, and evaluations are conducted on the same test set. "†" indicates a statistically significant difference (p < 0.05) from the corresponding baselines.

Results on Domain Shift Scenario
The lexical choice must be informed by linguistic knowledge of how the translation model's input data maps onto words in the target domain. Since low-frequency words get lost in traditional NAT models, the problem of lexical choice is more severe under the domain shift scenario (i.e. models are trained on one domain but tested on other domains). Thus, we evaluate the WMT14 En-De models on five out-of-domain test sets (Müller et al., 2020), including the law, medicine, IT, Koran and movie subtitle domains. As shown in Table 5, standard NAT models suffer large performance drops in terms of BLEU score (i.e. on average -2.9 BLEU relative to the AT model). By inspecting these outputs, we found a large number of translation errors on low-frequency words, most of which are domain-specific terms. In contrast, our approach improves translation quality (i.e. on average -1.4 BLEU relative to the AT model) by rejuvenating low-frequency words to a certain extent, showing that LFR increases the domain robustness of NAT models.

Results on Different Data Scales
To confirm the effectiveness of our method across different data sizes, we further experiment on three En-De datasets at different scales. The small- and medium-scale training data are randomly sampled from the WMT19 En-De corpus, containing about 1.0M and 4.5M sentence pairs, respectively. The large-scale one is collected from WMT19 and consists of 36M sentence pairs. We report the BLEU scores on the same test set, newstest2019, for fair comparison. We employ the base model to train the small-scale AT teacher, and the big model with a large-batch strategy (i.e. 458K tokens/batch) to build the AT teachers for the medium and large scales. As seen in Table 6, our method improves NAT models across different sizes of datasets, especially on the large scale (+0.9), showing the robustness and effectiveness of our approach.
Table 7: Complementary to other work. "Combination" indicates combining "+Raw Data Prior" proposed by Ding et al. (2021b) with our "+Low-Frequency". Experiments are conducted on WMT14 En-De.
Complementary to Related Work Ding et al. (2021b) is relevant to our work; it introduced an extra bilingual data-dependent prior objective to augment NAT models with the ability to learn low-frequency words in raw data. Our method is complementary to theirs because we only change the data and training strategies (model-agnostic). As shown in Table 7, the two approaches yield comparable performance in terms of BLEU and ALF. Besides, their combination can further improve BLEU as well as ALF scores (i.e. +0.3 and +0.6). This illustrates the complementarity of model-level and data-level approaches to rejuvenating low-frequency knowledge for NAT models.

Analysis
We conducted extensive analyses to better understand our approach. All results are reported on the Mask-Predict models.
Accuracy of Lexical Choice To understand where the performance gains come from, we conduct fine-grained analysis on lexical choice. We divide "All" tokens into three categories based on their frequency, including "High", "Medium" and "Low". Following Ding et al. (2021b), we measure the accuracy of lexical choice on different frequency of words. Table 8 shows the results. Takeaway: The majority of improvements on translation accuracy is from the low-frequency words, confirming our hypothesis.
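A frequency-bucketed accuracy of the kind reported here can be computed along the following lines. This is only a sketch under assumptions: the function `accuracy_by_frequency`, the count thresholds `low_max` and `high_min`, and the per-token correctness flags are hypothetical, and the exact bucketing and matching procedure of Ding et al. (2021b) may differ.

```python
def accuracy_by_frequency(records, train_counts, low_max, high_min):
    """Lexical-choice accuracy per frequency bucket.

    records:      list of (word, is_correct) pairs from an evaluation set.
    train_counts: dict mapping a word to its count in the training data.
    low_max:      words with count <= low_max fall into "Low" (assumed threshold).
    high_min:     words with count >= high_min fall into "High" (assumed threshold).
    """
    stats = {"High": [0, 0], "Medium": [0, 0], "Low": [0, 0]}
    for word, is_correct in records:
        count = train_counts.get(word, 0)
        if count <= low_max:
            bucket = "Low"
        elif count >= high_min:
            bucket = "High"
        else:
            bucket = "Medium"
        stats[bucket][0] += int(is_correct)
        stats[bucket][1] += 1
    # Buckets without any evaluated token are reported as None.
    return {b: (c / n if n else None) for b, (c, n) in stats.items()}

train_counts = {"the": 1000, "market": 50, "Hackman": 2}
records = [
    ("the", True), ("the", True),
    ("market", True), ("market", False),
    ("Hackman", False), ("Hackman", True), ("Hackman", False),
]
accuracy = accuracy_by_frequency(records, train_counts, low_max=5, high_min=500)
```

Comparing such per-bucket numbers between a KD baseline and another training strategy shows which frequency range accounts for a change in overall accuracy.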

Low-Frequency Words in Output
We expect to recall more low-frequency words in the translation output. As shown in Table 9, we calculate the ratio of low-frequency words in generated sentences. As seen, KD biases the NAT model towards gen-
Table 8: Analysis of words of different frequencies in terms of accuracy of lexical choice. We split "All" words into "High", "Medium" and "Low" categories. Shades of cell color represent differences between ours and KD.

Effects of Variant Training Strategies
As discussed in §2.4, we carefully investigate alternative training approaches in Table 10. We make the total number of training steps identical to that of vanilla NAT models, and report both BLEU and ALF scores. As seen, all variant strategies perform better than the standard KD method in terms of both BLEU and ALF scores, confirming the necessity of our work. Takeaway: 1) Pretraining is more effective than combination for utilizing data manipulation strategies; 2) raw data and bidirectional distilled data are complementary to each other; 3) it is indispensable to finetune models on $\overrightarrow{\mathrm{KD}}$ in the last stage.
Table 11: Analysis of AT models in terms of the accuracy of lexical choice on WMT14 En-De. We split "All" words into "High", "Medium" and "Low" categories.
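The three-stage schedule of the best strategy (raw pretraining, bidirectional distillation training, then finetuning on forward KD) can be expressed as a simple mapping from training step to data stage. The stage lengths below are the 20K, 20K and 30K steps reported in the experimental setup; the stage labels and the function `stage_for_step` are hypothetical names for illustration.

```python
# Stage lengths in training steps, as reported for the En-De setup (70K in total).
STAGES = [("Raw", 20_000), ("KD+reverse-KD", 20_000), ("KD", 30_000)]

def stage_for_step(step, stages=STAGES):
    """Return the data stage that is active at a 0-based training step."""
    boundary = 0
    for name, length in stages:
        boundary += length
        if step < boundary:
            return name
    raise ValueError("step exceeds the total number of training steps")

total_steps = sum(length for _name, length in STAGES)
first_stage = stage_for_step(0)
second_stage = stage_for_step(20_000)
last_stage = stage_for_step(total_steps - 1)
```

A training loop could call `stage_for_step` once per update to decide which dataset to draw the next batch from, which keeps the total step budget equal to that of standard KD training.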
Our Approach Works for AT Models Although our work is designed for NAT models, we also investigated whether our LFR method works for general cases, e.g. autoregressive models. We used Transformer-BIG as the teacher model. For fair comparison, we leverage Transformer-BASE as the student model, which shares the same model capacity with the NAT student (i.e. MaskT). The results are listed in Table 11. As seen, AT models also suffer from the problem of low-frequency words when using knowledge distillation, and our approach also works for them. Takeaway: Our method works well for general cases through rejuvenating more low-frequency words.

Related Work
Low-Frequency Words Benefiting from continuous representations learned from the training data, NMT models have shown promising performance. However, Koehn and Knowles (2017) point out that the translation of low-frequency words is still one of the key challenges for NMT according to Zipf's law (Zipf, 1949). For AT models, Arthur et al. (2016) address this problem by integrating a count-based lexicon, and Nguyen and Chiang (2018) propose an additional lexical model, which is jointly trained with the AT model. Recently, rare words have been adaptively re-weighted during training. The lexical choice problem is more serious for NAT models, since 1) the lexical choice errors (low-resource words in particular) of AT distillation will propagate to NAT models; and 2) NAT lacks target-side dependencies and thus misses necessary target-side context. In this work, we alleviate this problem by solving the first challenge.
Data Manipulation Our work is related to previous studies on manipulating training data for NMT. Bogoychev and Sennrich (2019) show that forward- and backward-translations (FT/BT) could both boost model performance, where FT plays the role of domain adaptation and BT makes the translation fluent. Fadaee and Monz (2018) sample the monolingual data with more difficult words (e.g. rare words) to perform BT, achieving significant improvements compared with randomly sampled BT. Nguyen et al. (2020) diversify the data by applying FT and BT multiple times. However, different from AT, the prerequisite of training a well-performing NAT model is to perform KD. We compared with related works in Table 10 and found that our approach consistently outperforms them. Note that all the ablation studies focus on exploiting the parallel data without augmenting additional data.
Non-Autoregressive Translation A variety of approaches have been exploited to bridge the performance gap between NAT and AT models. Some researchers proposed new model architectures (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Kasai et al., 2020), aided models with additional signals (Ran et al., 2019; Ding et al., 2020), introduced sequential information (Shao et al., 2019; Guo et al., 2020; Hao et al., 2021), and explored advanced training objectives (Du et al., 2021). Our work is close to the research line on training methods. Ding et al. (2021b) revealed the low-frequency word problem in distilled training data, and introduced an extra Kullback-Leibler divergence term derived by comparing the lexical choice of the NAT model with that embedded in the raw data. Ding et al. (2021a) propose a simple and effective training strategy, which progressively feeds different granularities of data into NAT models by leveraging curriculum learning.

Conclusion
In this study, we propose simple and effective training strategies to rejuvenate the low-frequency information in the raw data. Experiments show that our approach consistently and significantly improves translation performance across language pairs and model architectures. Notably, domain shift is an extreme scenario for diagnosing low-frequency translation, and our method significantly improves performance in it. Extensive analyses reveal that our method improves the accuracy of lexical choices for low-frequency source words and also recalls more low-frequency words in translations, which confirms our claim.