Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade

Fully non-autoregressive neural machine translation (NAT) simultaneously predicts all tokens with a single forward pass of the neural network, which significantly reduces inference latency at the expense of a quality drop compared to the Transformer baseline. In this work, we aim to close the performance gap while maintaining the latency advantage. We first inspect the fundamental issues of fully NAT models, and adopt dependency reduction in the learning space of output tokens as the basic guidance. Then, we revisit methods in four different aspects that have proven effective for improving NAT models, and carefully combine these techniques with necessary modifications. Our extensive experiments on three translation benchmarks show that the proposed system achieves new state-of-the-art results for fully NAT models, and obtains performance comparable to autoregressive and iterative NAT systems. For instance, one of the proposed models achieves 27.49 BLEU points on WMT14 En-De with an approximately 16.5× speed-up at inference time.


Introduction
State-of-the-art neural machine translation (NMT) systems are based on autoregressive models (Bahdanau et al., 2015; Vaswani et al., 2017), where each generation step depends on the previously generated tokens. This sequential nature inevitably leads to latency at inference time. Non-autoregressive neural machine translation models (NAT, Gu et al., 2018a) attempt to generate output sequences in parallel to speed up the decoding process. The incorrect independence assumption nevertheless prevents NAT models from properly learning the dependencies between target tokens in the real data distribution, resulting in a performance drop compared to autoregressive (AT) models. One popular solution to improve translation accuracy is to sacrifice the speed-up by incorporating an iterative refinement process into NAT models, through which the model explicitly learns the conditional distribution over partially observed reference tokens. However, recent studies (Kasai et al., 2020b) indicated that iterative NAT models seem to lose the speed advantage after carefully tuning the layer allocation of AT models. For instance, an AT model with a deep encoder and a shallow decoder (12-1) obtains latency similar to iterative NAT models without hurting translation accuracy.
* Equal contribution.
Several works (Saharia et al., 2020; Qian et al., 2020) have recently been proposed to improve fully NAT models, though the performance gap with the iterative ones remains. How to build a fully NAT model with competitive translation accuracy calls for more exploration. In this work, we first argue that the key to successfully training a fully NAT model is to perform dependency reduction in the learning space of output tokens (§ 2). With this guiding principle, we revisit various methods that reduce the dependencies among target tokens as much as possible, from four different perspectives: the training corpus (§ 3.1), the model architecture (§ 3.2), the training objective (§ 3.3) and the learning strategy (§ 3.4). No single aspect can remove the target-token dependencies entirely; the performance gap cannot be nearly closed unless we take full advantage of all of these techniques.
We validate our fully NAT model on standard translation benchmarks covering 5 translation directions. The proposed system achieves new state-of-the-art results for fully NAT models. Moreover, compared to the Transformer baseline, our models achieve comparable performance with over 16× speed-up at inference time.

Motivation
Given an input sequence x = x_1 ... x_T, an autoregressive model (Bahdanau et al., 2015; Vaswani et al., 2017) predicts the target y = y_1 ... y_{T'} sequentially based on the conditional distribution p(y_t | y_{<t}, x_{1:T}; θ), which tends to suffer from high latency in generation, especially for long sequences. In contrast, non-autoregressive machine translation (NAT, Gu et al., 2018a), proposed for speeding up inference by generating all tokens in parallel, has recently been on trend due to its parallelizable nature on devices such as GPUs and TPUs. A typical NAT system assumes conditional independence in the output token space, that is,

p_θ(y|x) = ∏_{t=1}^{T'} p_θ(y_t | x_{1:T}),  (1)

where θ denotes the parameters of the model. Typically, NAT models are also built on the Transformer encoder-decoder, with the causal attention mask removed on the decoder side. However, as noted in Gu et al. (2018a), the independence assumption generally does not hold in the real data distribution for sequence generation tasks such as machine translation (Ren et al., 2020), and the failure to capture such dependencies between target tokens leads to serious performance degradation in NAT. This is a fairly understandable but fundamental issue of NAT modeling, which can be easily shown with the toy example in Figure 2. Consider a simple corpus with only two examples, AB and BA, each of which appears with a 50% chance; it is designed to represent the dependency that symbols A and B should co-occur. Although such a simple distribution can be instantly captured by any autoregressive model, learning the vanilla NAT model with maximum likelihood estimation (MLE, Eq. (1)) assigns probability mass to incorrect outputs (AA, BB) even though these samples never appear during training. In practice, the dependencies in real translation corpora are much more complicated. As shown in Figure 1, despite the inference speed-up, training a vanilla NAT model of comparable size in the same way as the Transformer leads to a quality drop of over 10 BLEU points.
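The toy example can be verified numerically. The sketch below (illustrative only; function names are ours) fits the per-position marginals, which is what MLE under the independence factorization converges to, and shows that half of the probability mass leaks onto the never-observed outputs AA and BB.

```python
from collections import Counter
from itertools import product

# Toy corpus from the paper: "AB" and "BA", each appearing with probability 0.5.
corpus = ["AB", "BA"]

def fit_positional_marginals(samples):
    # MLE under the NAT independence assumption fits a marginal per position.
    length = len(samples[0])
    marginals = []
    for t in range(length):
        counts = Counter(s[t] for s in samples)
        total = sum(counts.values())
        marginals.append({tok: c / total for tok, c in counts.items()})
    return marginals

def nat_prob(output, marginals):
    # Factorized probability: product of per-position marginals, as in Eq. (1).
    p = 1.0
    for t, tok in enumerate(output):
        p *= marginals[t].get(tok, 0.0)
    return p

marginals = fit_positional_marginals(corpus)
for y in ["".join(p) for p in product("AB", repeat=2)]:
    print(y, nat_prob(y, marginals))
# The model puts 25% of its mass on each of "AA" and "BB",
# neither of which ever appears in the corpus.
```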
To ease the modeling difficulty, recent state-of-the-art NAT systems (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Kasai et al., 2020a; Shu et al., 2020; Saharia et al., 2020) trade speed for accuracy by incorporating an iterative refinement process into non-autoregressive prediction, and have already achieved comparable or even better performance than their autoregressive counterparts. Nevertheless, Kasai et al. (2020b) showed that autoregressive models with a deep encoder and a shallow decoder can readily outperform strong iterative models at similar latency, indicating that the latency advantage of iterative NAT has been overestimated.
By contrast, while maintaining a clear speed advantage, fully NAT systems (models that make parallel predictions with a single neural network forward pass) still lag behind in translation quality and have not been fully explored in the literature (Libovický and Helcl, 2018; Ma et al., 2019). This motivates us to investigate various approaches to push the limits of learning a fully NAT model towards autoregressive models, regardless of architecture choices (Kasai et al., 2020b).

Methods
In this section, we discuss several important ingredients for training a fully NAT model. As discussed in § 2, we argue that the guiding principle for designing any NAT model is to perform as much dependency reduction as possible in the output space so that the remaining distribution can be captured by the NAT model. For example, iterative models (Ghazvininejad et al., 2019) explicitly reduce the dependencies between output tokens by learning the conditional distribution over observed reference tokens. The overall framework of training our fully NAT system is presented in Figure 3.

Data: Knowledge Distillation
The most effective dependency reduction technique is knowledge distillation (KD) (Hinton et al., 2015; Kim and Rush, 2016), which was first applied to NAT by Gu et al. (2018a) and has been widely employed in all subsequent NAT systems: we replace the original target samples with sentences generated from a pre-trained autoregressive model. As shown in prior analysis, KD simplifies the training data; the generated targets contain less noise and are aligned to the inputs more deterministically. It has also been shown that the capacity of the teacher model should be constrained to match the desired NAT model to avoid further degradation from KD, especially for weak NAT students without iterative refinement. We treat Transformer base as the default teacher model for generating the distilled corpus.
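The sequence-level distillation pipeline can be sketched in a few lines. Note that `teacher_translate` below is a stub standing in for beam-search decoding with a pre-trained AT teacher; no real API is implied.

```python
def teacher_translate(src, beam_size=5):
    # Stub: in practice this runs the autoregressive teacher with beam
    # search and returns the 1-best translation of `src`.
    return src.upper()  # placeholder "translation"

def distill_corpus(parallel_data):
    # Sequence-level KD: replace each reference target with the teacher's
    # own translation, yielding a simpler, more deterministic corpus
    # for the NAT student to learn from.
    return [(src, teacher_translate(src)) for src, _ in parallel_data]

data = [("guten tag", "good day"), ("danke", "thanks")]
print(distill_corpus(data))
```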

Model: Latent Variables
In contrast to iterative NAT, dependency reduction can be done with (nearly) zero additional cost at inference time by utilizing latent variables as part of the model. In such cases, the output tokens y_{1:T'} are modeled as conditionally independent given latent variables z, which are typically predicted from the inputs x_{1:T}:

p_θ(y|x) = ∫ p_θ(z|x) ∏_{t=1}^{T'} p_θ(y_t | z, x) dz,  (2)

where z can be pre-defined by external tools (e.g., fertility in Gu et al. (2018a)), or jointly optimized with the NAT model using normalizing flows (Ma et al., 2019) or variational auto-encoders (VAEs) (Kaiser et al., 2018; Shu et al., 2020).
In this work, we follow the formulation proposed in Shu et al. (2020), where continuous latent variables z ∈ R^{T×D} are modeled as spherical Gaussians at the encoder output of each position. Like standard VAEs (Kingma and Welling, 2013), we train such a model to maximize the evidence lower bound (ELBO) with a posterior network q_φ:

log p_θ(y|x) ≥ E_{z∼q_φ(z|x,y)}[log p_θ(y|z, x)] − KL(q_φ(z|x, y) ‖ p_θ(z|x)),  (3)

where we use an encoder-decoder architecture similar to the main model to encode q_φ(z|x, y). In our implementation, only the parameters of the embedding layers are shared between θ and φ.
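The KL term of the ELBO has a closed form when both the posterior and the (spherical Gaussian) prior are diagonal Gaussians. A minimal sketch, with our own function name and a log-variance parameterization:

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ),
    # the term that penalizes the posterior q_phi against the prior p_theta.
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

# KL vanishes when posterior and prior coincide.
print(kl_diag_gaussians([0.1, -0.2], [0.0, 0.0], [0.1, -0.2], [0.0, 0.0]))  # 0.0
```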

Loss Function: Latent Alignments
Standard sequence generation models are trained with the cross-entropy (CE) loss, which compares the model's output with the target tokens at each corresponding position. However, as NAT ignores the dependencies in the output space, it is almost impossible for such models to accurately handle token offsets. For instance, while barely affecting the meaning, simply changing "Vielen Dank !" to ", Vielen Dank" causes a huge penalty for fully NAT models.
To ease these limitations of the loss computation, recent works have proposed to consider latent alignments between target positions, and either "actively" search for the best alignment (AXE) or marginalize over all alignments with dynamic programming (CTC; Libovický and Helcl, 2018; Saharia et al., 2020). The dependency is reduced because the NAT model is free to choose the best prediction regardless of the target offsets.
In this work, we put our major focus on the CTC loss (Graves et al., 2006), considering its superior performance and its flexibility for variable-length prediction. Formally, under the conditional independence assumption, CTC efficiently sums over all valid aligned sequences a from which the target y can be recovered, and marginalizes them in the log-likelihood:

log p_θ(y|x) = log Σ_{a∈Γ(y)} p_θ(a|x),  (4)

where Γ(y) denotes the set of valid alignments, and Γ^{-1}(a) is the collapse function that recovers the target sequence by first collapsing consecutive repeated tokens and then removing all blank tokens. Furthermore, it is straightforward to apply the same CTC loss to latent variable models by replacing the likelihood term in Eq. (3) with the CTC loss. Note that both CTC and AXE make strong assumptions of monotonic alignment, which makes it impossible for them to remove all dependencies between target tokens in the real distribution.
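The collapse function Γ⁻¹ and the alignment set Γ(y) are easy to make concrete. The sketch below uses brute-force enumeration in place of CTC's dynamic programming, which is only feasible for toy sizes; "_" denotes the blank token.

```python
from itertools import product

BLANK = "_"

def collapse(alignment):
    # Γ^{-1}: merge consecutive repeated tokens, then drop blanks.
    out = []
    prev = None
    for tok in alignment:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

def valid_alignments(target, length, vocab="ab"):
    # Brute-force Γ(y): every length-`length` alignment that collapses
    # to `target`. CTC computes the same sum with dynamic programming.
    return [a for a in product(vocab + BLANK, repeat=length)
            if collapse(a) == target]

print(collapse("aa_ab_"))              # "aab"
print(len(valid_alignments("ab", 3)))  # 5
```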

Learning: Glancing Targets
Although it is possible to directly optimize the fully NAT model towards the target sequence log p_θ(y|x) (or the ELBO if using latent variables), Ghazvininejad et al. (2019) showed that test-time performance improves when the NAT model is trained with randomly sampled reference tokens as inputs:

log p_θ(y | m ⊙ y, x),  m ∼ γ(l, y),  l ∼ U[0, |y|],

where m is the mask and γ is the sampling function given the number of masked tokens l. As mentioned earlier, we suspect that explicitly modeling the distribution conditioned on observed tokens assists dependency reduction in the output space.
Curriculum Learning Naively applying random masks to every training example may cause a severe mismatch between training and testing. Qian et al. (2020) proposed GLAT, a curriculum learning strategy in which the ratio of observed target tokens is tied to the inference quality of the fully NAT model. More precisely, instead of sampling uniformly, we sample l by

l ∼ g(f_ratio · D(ŷ, y)),  ŷ = argmax_y log p_θ(y|x),  (5)

where D is the discrepancy between the model prediction and the target sequence, e.g., the Levenshtein distance (Levenshtein, 1966), and f_ratio is a hyperparameter that adjusts the mask ratio. The original formulation (Qian et al., 2020) utilized a deterministic mapping g, while we use a stochastic function such as a Poisson distribution to enable sampling a wider range of lengths, including "no glancing". The original GLAT (Qian et al., 2020) assumes access to the gold length so that it can glance at the target by placing the target word embeddings at the corresponding inputs. It is natural to use GLAT for models with the CE or AXE loss, but it is incompatible with CTC, as we always require the inputs to be longer than the targets. To enable GLAT training, we glance target tokens from the Viterbi-aligned tokens â = argmax_{a∈Γ(y)} p_θ(a|x), which have the same length as the decoder inputs.
Intuitively, a poorly trained model observes more target tokens. As the model improves and generates higher-quality sequences, more words are masked, which helps the model gradually learn to generate the whole sentence.
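A minimal sketch of one glancing step under our stochastic variant. The discrepancy here is a simple Hamming distance standing in for Levenshtein, and all names are illustrative, not the authors' implementation.

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's algorithm; fine for the small rates used here.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def glance(target, prediction, f_ratio=0.5, rng=random):
    # Discrepancy D: Hamming distance between the model's first-pass
    # prediction and the target (a stand-in for Levenshtein distance).
    d = sum(t != p for t, p in zip(target, prediction))
    # Stochastic mapping: l ~ Poisson(f_ratio * D), so l = 0 ("no glancing")
    # is sampled with nonzero probability, unlike the deterministic rule.
    l = min(sample_poisson(f_ratio * d, rng), len(target))
    observed = set(rng.sample(range(len(target)), l))
    # Reveal l target tokens to the decoder; mask the rest.
    return [tok if i in observed else "<mask>" for i, tok in enumerate(target)]

random.seed(0)
print(glance(["vielen", "dank", "!"], ["danke", "dank", "."]))
```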

Summary
To sum up, we summarize all the proposed aspects and methods in Table 1, which effectively help close the performance gap for fully non-autoregressive models. Note that none of these solutions alone perfectly removes the harmful dependencies in the output space of NAT models. In practice, we find that these methods target different aspects of dependency reduction, which suggests that their improvements may be complementary.

Experiments
We perform extensive experiments on three challenging translation datasets by combining all the mentioned techniques to check (1) whether the proposed aspects of dependency reduction are complementary; and (2) how far we can minimize the gap between a fully non-autoregressive model and its autoregressive counterpart.
Knowledge Distillation Following previous efforts, the NAT models in our experiments are trained on distilled data generated from pre-trained Transformer models (base for WMT14 EN↔DE and WMT16 EN↔RO, big for WMT20 JA→EN) using beam search with a beam size of 5.
Decoding At inference time, the most straightforward way is to generate the sequence with the highest probability at each position. The outputs of the CTC-based NAT models require an additional collapse step Γ^{-1}, which can be done almost instantly.
A relatively more accurate method is to decode multiple sequences and rescore them in parallel to obtain the best candidate, i.e., noisy parallel decoding (NPD, Gu et al., 2018a). Furthermore, CTC-based models are also capable of decoding sequences using beam search (Libovický and Helcl, 2018), optionally combined with n-gram language models (Heafield, 2011; Kasner et al., 2020). More precisely, we search within a beam to approximately find the optimal y* that maximizes

log p_θ(y|x) + α · log p_LM(y) + β · log |y|,  (6)

where α and β are hyperparameters for the language model score and the word insertion bonus, respectively. In principle, this is no longer non-autoregressive, as beam search is a sequential process by nature. However, it does not involve any neural network computation and can be implemented efficiently in C++.
(footnote: https://github.com/bitextor/bicleaner)
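The rescoring criterion of Eq. (6) amounts to one score per candidate. A toy sketch with made-up log-probabilities, showing how the word insertion bonus β can favor longer outputs:

```python
import math

def combined_score(log_p_model, log_p_lm, length, alpha=0.5, beta=1.0):
    # Eq. (6): log p_theta(y|x) + alpha * log p_LM(y) + beta * log|y|.
    return log_p_model + alpha * log_p_lm + beta * math.log(length)

# Hypothetical candidates: (text, model log-prob, LM log-prob).
candidates = [
    ("thanks a lot", -1.2, -2.0),
    ("thanks", -0.9, -1.0),
]
best = max(candidates,
           key=lambda c: combined_score(c[1], c[2], len(c[0].split())))
print(best[0])
# With beta = 1.0 the insertion bonus outweighs the lower model score,
# so the longer candidate wins.
```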
Baselines We adopt the Transformer (AT) and existing NAT approaches (see Table 2) for comparison. For Transformers, besides the standard base and big architectures (Vaswani et al., 2017) as baselines, we also compare with the deep-encoder shallow-decoder Transformer suggested by Kasai et al. (2020b), which follows the parameterization of base with 12 encoder layers and 1 decoder layer (base (12-1) for short).
Evaluation BLEU (Papineni et al., 2002) is used to evaluate translation performance for all models. Following prior work, we compute tokenized BLEU for EN↔DE and EN↔RO, and use SacreBLEU (Post, 2018) for JA→EN.
In this work, we use three measures to fully investigate the translation latency of all models:
• L^GPU_1: translation latency when running the model with one sentence at a time on a single GPU. For most existing work on NAT models, L^GPU_1 is the de facto measure, matching applications such as instantaneous machine translation.
• L^CPU_1: the same as L^GPU_1 but running the model without GPU speed-up. It is less friendly to NAT models that rely on parallelism, but closer to real deployment scenarios.
• L^GPU_max: the same as L^GPU_1 on GPU but running the model with batches containing as many sentences as possible. In this case, hardware memory bandwidth is taken into account.
We measure the wall-clock time for translating the whole test set, and report the time averaged over sentences as the latency measure.
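This measurement protocol can be sketched as follows (illustrative; `batch_size=1` corresponds to the L_1 settings, and a large `batch_size` to L^GPU_max):

```python
import time

def measure_latency(translate, sentences, batch_size=1):
    # Wall-clock time to translate the whole test set,
    # reported as the average per-sentence latency in milliseconds.
    start = time.perf_counter()
    for i in range(0, len(sentences), batch_size):
        translate(sentences[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return elapsed / len(sentences) * 1000.0

# Usage with a trivial stand-in "model".
latency_ms = measure_latency(lambda batch: [s.upper() for s in batch],
                             ["hello world", "danke", "vielen dank"])
print(f"{latency_ms:.4f} ms/sentence")
```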

Implementation Details
We design our fully NAT model with the hyperparameters of the base Transformer: 8-512-2048 (8 attention heads, model dimension 512, feed-forward dimension 2048; Vaswani et al., 2017). For EN→DE experiments, we also implement the NAT model in the big size (8-1024-4096) for comparison. For experiments using variational autoencoders (VAE), we use the last-layer encoder hidden states to predict the mean and variance of the prior distribution. The latent dimension D is set to 8, and the predicted z are linearly projected and added to the encoder outputs. Following Shu et al. (2020), we use a 3-layer encoder-decoder as the posterior network, and apply free-bits annealing (Chen et al., 2016) to avoid posterior collapse. By default, we set the decoder input length to 3× the source length for CTC, while using the gold length for the other objectives (CE, AXE). We also learn an additional length predictor when CTC is not used. In both cases, we use SoftCopy (Wei et al., 2019), which interpolates the encoder outputs into the decoder inputs based on the relative distance between source and target positions. We set the mask ratio f_ratio = 0.5 for GLAT training.
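A sketch of SoftCopy-style interpolation. The exact weighting in Wei et al. (2019) may differ; the distance-based softmax below is an illustrative assumption, and all names are ours.

```python
import math

def soft_copy(enc_outputs, tgt_len, tau=0.3):
    # Decoder input at target position j is a distance-weighted average of
    # encoder outputs: w_ij proportional to exp(-|i - center_j| / tau),
    # where center_j maps j onto the source axis.
    src_len = len(enc_outputs)
    dim = len(enc_outputs[0])
    inputs = []
    for j in range(tgt_len):
        center = j * src_len / tgt_len
        logits = [-abs(i - center) / tau for i in range(src_len)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        weights = [e / z for e in exps]
        inputs.append([sum(w * h[d] for w, h in zip(weights, enc_outputs))
                       for d in range(dim)])
    return inputs

# Two source states, four decoder positions (e.g., CTC up-sampling by 2x).
enc = [[1.0, 0.0], [0.0, 1.0]]
dec_inputs = soft_copy(enc, tgt_len=4)
print(len(dec_inputs), len(dec_inputs[0]))  # 4 2
```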
For both AT and NAT models, we set the dropout rate to 0.3 for EN↔DE and EN↔RO, and 0.1 for JA→EN. We apply weight decay of 0.01 as well as label smoothing of 0.01. All models are trained for 300K updates using Nvidia V100 GPUs with a batch size of approximately 128K tokens. We measure validation BLEU every 1000 updates, and average the 5 best checkpoints to obtain the final model. We measure GPU latency by running the models on a single Nvidia V100 GPU, and CPU latency on an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz with 80 cores. All models are implemented in fairseq (Ott et al., 2019).

Results
WMT'14 EN↔DE & WMT'16 EN↔RO We report the performance of our fully NAT model compared with AT and existing NAT approaches (including both iterative and fully NAT models) in Table 2. Iterative NAT models with a sufficient number of iterations generally outperform fully NAT baselines by a certain margin, as they are able to recover generation errors by explicitly modeling dependencies between (partially) generated tokens. However, their speed advantage is relatively marginal compared to AT base (12-1), which is itself about 2.5 times faster than the AT baseline.
Conversely, our fully NAT models readily achieve over 16× speed-up on EN→DE by restricting translation to a single iteration. Surprisingly, simply training NAT with KD and the CTC loss already beats the state of the art for single-iteration NAT models across all four directions. Moreover, combining it with either latent variables (VAE) or glancing targets (GLAT) further closes the performance gap or even outperforms the AT results on both language pairs. For example, our best model achieves 27.49 BLEU on WMT14 EN-DE, almost identical to the AT performance (27.48), while being 16.5 times faster at inference time. Table 2 also indicates the difficulty of learning NAT on each dataset. For instance, EN↔RO is relatively easy, where "KD + CTC" is enough to reduce target dependencies; by contrast, applying VAE or GLAT helps capture non-monotonic dependencies and improves results by 0.5 ∼ 1 BLEU points on EN↔DE. For both datasets, we only need a single greedy generation pass to achieve translation quality similar to AT beam-search results.
WMT'20 JA→EN We also present results for training the fully NAT model on the more challenging WMT'20 JA→EN benchmark, which is much larger (13M pairs) and noisier. In addition, JA is linguistically distant from EN, which makes it harder to learn mappings between the two. Consequently, both AT (12-1) and our fully NAT models (see Table 3) become less confident and tend to produce shorter translations (BP < 0.9), and in turn underperform the AT teacher even when trained with KD.
Beam search & NPD The performance of our NAT models can be further boosted by allowing additional search (beam search) or re-ranking (NPD) after prediction. For CTC beam search, we use a fixed beam size of 20 and grid-search α and β (Eq. (6)) based on performance on the validation set. The language model is trained directly on the distilled target sentences to avoid introducing additional information. For noisy parallel decoding (NPD), we draw multiple z from the learned prior distribution, and use the teacher model (AT big) to select the best z according to the corresponding translation.
As shown in Table 3, despite similar GPU latency (L^GPU_1), beam search is much more effective than NPD with re-ranking, especially when combined with a 4-gram LM, where we achieve a BLEU score of 21.41, beating the teacher model with an 11× speed-up. More importantly, by controlling the insertion bonus (the third term in Eq. (6)) via β in beam search, we gain explicit control to improve BP and output longer translations. We gain another half point by combining NPD and beam search. For a fair comparison, we also report latency on CPUs, where the ability to leverage the parallelism of the device is limited. The speed advantage drops rapidly for NAT models, especially for NAT with NPD; however, beam search still maintains around 100 ms latency, over 2× faster than the AT (12-1) systems, with higher translation quality.
Quality vs. Latency We perform a full investigation of the trade-off between translation quality and latency across AT, iterative NAT, and our fully NAT models. The results are plotted in Figure 4. For fully NAT models, no beam search or NPD is used. In all three setups, our fully NAT models obtain a superior trade-off compared with AT and iterative NAT models. Iterative NAT models (LevT and CMLM) require multiple iterations to achieve reliable performance at the cost of latency, especially for L^CPU_1 and L^GPU_max, where iterative NAT performs similarly to or even worse than AT base (12-1), leaving fully NAT models in a better position in the quality-latency trade-off. Figure 4 also shows that the speed advantage of fully NAT models shrinks in the L^CPU_1 and L^GPU_max setups, where parallelism is constrained. NAT models, particularly those trained with CTC, cost more computation and memory than AT models with a shallow decoder. For instance, when computing L^GPU_max, we notice that the maximum allowed batch is 120K tokens for AT base (12-1), while we can only feed 15K tokens at a time for NAT with CTC due to the up-sampling step, even though the NAT models still win in actual wall-clock time. We mark this as a limitation for future research.

Ablation Study
Impact of variant techniques Our fully NAT models benefit from dependency reduction techniques in four aspects (data, model, loss function and learning). We analyze their effect on translation accuracy through various combinations.

Distillation corpus According to prior analysis, the best teacher model depends on the capacity of the NAT model. We report the performance of models trained on real data and on distilled data generated from AT base and big models in Table 5. For base models, both AT (12-1) and NAT achieve better accuracy with distillation, while AT benefits more from moving from base to big distilled data. On the contrary, the NAT model improves only marginally, indicating that in terms of modeling capacity, our fully NAT model is still weaker than an AT model even with 1 decoder layer. It is not possible to further boost NAT performance simply by switching the target to a better distillation corpus. Nonetheless, it is possible to simultaneously increase the NAT capacity by training the model in the big size. As shown in Table 5, we can then achieve superior accuracy compared to AT (12-1) with little effect on translation latency (L^GPU_1).
Upsampling Ratio (λ) for CTC Loss To meet the length requirements of the CTC loss, we upsample the encoder output by a factor of 3 in our experiments. We also explore other possible values and report the performance in Table 6. A higher upsampling ratio provides a larger alignment space, leading to better accuracy. Nevertheless, once the ratio is large enough, further increases no longer improve performance. Thanks to the high degree of parallelism, L^GPU_1 latency is similar across these ratios. However, a larger ratio causes a clear slowdown on CPU, or on GPU with large batches.

Discussion and Future work
In this section, we revisit the proposed four techniques for fully NAT models. Despite the success in closing the gap with autoregressive models on certain benchmarks, we still see limitations when using non-autoregressive systems, as summarized in Table 1.
We, along with most prior research, have repeatedly found that knowledge distillation (KD) is an indispensable dependency reduction component, especially for training fully NAT models. Nevertheless, we argue that due to its model-agnostic nature, KD may lose key information that is useful for the model to translate. Moreover, Anonymous (2021) pointed out that KD causes lexical-choice errors for low-frequency words in NAT models. Therefore, alternative methods that improve the training of NAT models on raw targets, such as GANs (Bińkowski et al., 2019) or domain-specific discriminators (Donahue et al., 2020), might be a future direction.
Apart from KD, we also notice that the CTC loss is another key component for boosting the performance of fully NAT models across all datasets. As discussed in § 4.2, however, the need for up-sampling constrains the use of our model on very long sequences or on mobile devices with limited memory. In future work, it is possible to explore models that hierarchically up-sample the length with a dynamic ratio to optimize memory usage.
Lastly, both the VAE and GLAT experiments show that training NAT models only with losses based on monotonic alignments (e.g., CTC) is helpful but not sufficient. For difficult pairs such as JA-EN, it may be better to adopt stronger models that capture richer dependency information, such as normalizing flows (van den Oord et al., 2018; Ma et al., 2019) or non-parametric approaches (Gu et al., 2018b).

Related Work
Besides iterative and fully NAT models, some other works try to improve the decoding speed of machine translation models from other angles. One research line mixes AT and NAT models. For example, a semi-autoregressive model has been proposed that adopts non-autoregressive decoding locally but keeps the autoregressive property globally. Conversely, Kong et al. (2020) and Ran et al. (2020) introduced locally autoregressive NAT models that retain the non-autoregressive property globally.
Alternatively, some works try to improve the decoding speed of AT models directly. For example, model quantization and pruning have been widely studied as ways to improve decoding speed (See et al., 2016; Junczys-Dowmunt et al., 2018; Kim et al., 2019; Aji and Heafield, 2020). Teacher-student training can also improve the translation accuracy of a student model with faster decoding speed (Junczys-Dowmunt et al., 2018).

Conclusion
In this work, we aim to minimize the performance gap between fully NAT and AT models. We investigate dependency reduction methods from four perspectives and carefully unite them with necessary revisions. Experiments on three translation benchmarks demonstrate that our proposed models achieve state-of-the-art results among fully NAT models.