DePA: Improving Non-autoregressive Translation with Dependency-Aware Decoder

Non-autoregressive machine translation (NAT) models have lower translation quality than autoregressive translation (AT) models because NAT decoders do not depend on previous target tokens in the decoder input. We propose a novel and general Dependency-Aware Decoder (DePA) to enhance target dependency modeling in the decoder of fully NAT models from two perspectives: decoder self-attention and decoder input. First, we propose an autoregressive forward-backward pre-training phase before NAT training, which enables the NAT decoder to gradually learn bidirectional target dependencies for the final NAT training. Second, we transform the decoder input from the source language representation space to the target language representation space through a novel attentive transformation process, which enables the decoder to better capture target dependencies. DePA can be applied to any fully NAT models. Extensive experiments show that DePA consistently improves highly competitive and state-of-the-art fully NAT models on widely used WMT and IWSLT benchmarks by up to 1.88 BLEU gain, while maintaining the inference latency comparable to other fully NAT models.


Introduction
Autoregressive translation (AT) systems achieve state-of-the-art (SOTA) performance for neural machine translation (NMT), and the Transformer (Vaswani et al., 2017) encoder-decoder is the prevalent architecture. In AT systems, each generation step depends on previously generated tokens, resulting in high inference latency when the output is long. Non-autoregressive translation (NAT) models (Gu et al., 2018) significantly accelerate inference by generating all target tokens independently and simultaneously. However, this independence assumption leads to degradation in accuracy compared to AT models, as NAT models cannot properly learn target dependencies. Dependency in prior works and our work takes its standard definition in NLP, i.e., syntactic relations between words in a sentence.
The mainstream NAT models fall into two categories: iterative NAT models and fully NAT models. Iterative NAT models (Gu et al., 2019; Ghazvininejad et al., 2019; Lee et al., 2018) improve translation accuracy by iteratively refining translations at the expense of slower decoding speed. In contrast, fully NAT models (Gu et al., 2018; Bao et al., 2022) have a great latency advantage over AT models by making parallel predictions in a single decoding round, but they suffer from lower translation accuracy. In this paper, we aim at improving the translation accuracy of fully NAT models while preserving their latency advantage.
Previous research (Gu and Kong, 2021) argues that reducing dependencies is crucial for training a fully NAT model effectively, as it allows the model to more easily capture target dependencies. However, dependency reduction limits the performance upper bound of fully NAT models, since the models may struggle to generate complex sentences. Previous studies show that multi-modality is the main problem that NAT models suffer from (Huang et al., 2021; Bao et al., 2022), i.e., the target tokens may be generated based on different possible translations, often causing over-translation (token repetitions), under-translation (source words not translated), and wrong lexical choice for polysemous words. The F-NAT outputs in Table 1 show all three multi-modality error types from the highly competitive fully NAT model GLAT (Qian et al., 2021) trained with only forward dependency modeling (F-NAT) in our experiments. We observe that the lack of complete dependency modeling can cause multi-modality errors. For example, for the source text (in German) "Woher komme ich?" in the last column of Table 1, "Woher" means both "where" and "how".

Under-Translation
  Target Reference: We 've done it in 300 communities around the world .
  F-NAT: We did it the world in 300 communities .
  FB-NAT: We 've done it in 300 communities around the world .

Over-Translation
  Target Reference: Some people just wanted to call him King .
  F-NAT: Some people just wanted to call him him king .
  FB-NAT: Some people just wanted to call him king .

Wrong Lexical Choice
  Target Reference: Where am I from ? Who am I ?
  F-NAT: How do I come from ? Who am I ?
  FB-NAT: Where do I come from ? Who am I ?

Table 1: Case studies of our proposed FBD approach on the highly competitive fully NAT model GLAT (Qian et al., 2021) for alleviating three types of multi-modality errors on the IWSLT16 DE-EN validation set. Repetitive tokens are in red. Source words that are not semantically translated are in bold and underlined. Wrong lexical choices (for polysemous words) and redundant words are in blue. F-NAT denotes modeling only forward dependencies while FB-NAT denotes modeling both forward and backward dependencies, the same as the models in Table 5. Case studies of our proposed IT approach are in the Appendix.
The NAT model modeling only forward dependency (F-NAT) incorrectly translates "woher" into "how" and outputs "How do I come from?"; whereas the model modeling both forward and backward dependency (FB-NAT) translates it correctly into "Where do I come from?". Therefore, instead of dependency reduction, we propose a novel and general Dependency-Aware Decoder (DePA), which enhances the learning capacity of fully NAT models and enables them to learn complete and complex forward and backward target dependencies in order to alleviate the multi-modality issue.
Firstly, we enhance the NAT decoder to learn complete target dependencies by exploring decoder self-attention. We believe that previous works (Guo et al., 2020a) incorporating only the forward dependency modeled by AT models into NAT models are inadequate to address multi-modality. Therefore, we propose an effective forward-backward dependency modeling approach, denoted by FBD, as an autoregressive forward-backward pre-training phase before NAT training, using curriculum learning. The FBD approach implements triangular attention masks and takes different decoder inputs and targets in a unified framework, training the model to attend to previous or future tokens and to learn both forward and backward dependencies.
Secondly, we enhance target dependency modeling within the NAT decoder from the perspective of the decoder input. Most prior NAT models (Gu et al., 2018; Wei et al., 2019) use a copy of the source text embedding as the decoder input, which is independent of the target representation space and hence makes target dependency modeling difficult. We transform the initial decoder input from the source language representation space to the target language representation space through a novel attentive transformation process, denoted by IT. Previous works on transforming the decoder input cannot guarantee that the decoder input is in the exact target representation space, resulting in differences from the true target-side distribution. Our proposed IT ensures that the decoder input is in the exact target representation space and hence enables the model to better capture target dependencies.
Our contributions can be summarized as follows: (1) We propose a novel and general Dependency-Aware Decoder (DePA) for fully NAT models. For DePA, we propose a novel approach, FBD, for learning both forward and backward dependencies in the NAT decoder, through which the target dependencies can be better modeled. To the best of our knowledge, our work is the first to successfully model both forward and backward target-side dependencies explicitly for fully NAT models. We also propose a novel decoder input transformation approach (IT). IT could ease target-side dependency modeling and enhance the effectiveness of FBD. DePA is model-agnostic and can be applied to any fully NAT models. (2) Extensive experiments on WMT and IWSLT benchmarks demonstrate that our DePA consistently improves the representative vanilla NAT model (Gu et al., 2018), the highly competitive fully NAT model GLAT (Qian et al., 2021), and the SOTA fully NAT model CTC-DSLP-MT (Huang et al., 2021) by up to 1.88 BLEU, while maintaining inference latency comparable to other fully NAT models.

Related Work
Forward and Backward Dependencies Prior works explore bidirectional decoding to improve modeling of both forward and backward dependencies in phrase-based statistical MT (Finch and Sumita, 2009) and RNN-based MT (Zhang et al., 2018). For NAT, Guo et al. (2020a) and Wei et al. (2019) use forward auto-regressive models to guide NAT training. Liu et al. (2020) introduces an intermediate semi-autoregressive translation task to smooth the shift from AT training to NAT training. However, backward dependencies are rarely investigated in NAT.
Decoder Input of Fully NAT Models The decoder input of AT models consists of previously generated tokens. However, selecting an appropriate decoder input for fully NAT models can be challenging. Most prior NAT models (Gu et al., 2018; Wei et al., 2019) use a uniform copy (Gu et al., 2018) or a soft copy (Wei et al., 2019) of the source text embedding as the decoder input, which is independent of the target representation space and hence hinders target dependency modeling. Methods such as GLAT (Qian et al., 2021) and those of Guo et al. (2020a,b) attempt to make the NAT decoder input similar to the target representation space by substituting certain positions in the decoder input with the corresponding target embeddings. However, this creates a mismatch between training and inference. Guo et al. (2019) uses phrase-table lookup and linear mapping to make the decoder input closer to the target embedding, but this method still leaves a gap between the decoder input and the real target-side distribution.
Fully NAT Models Various approaches have been proposed to address multi-modality for fully NAT models. Gu et al. (2018) uses knowledge distillation (KD) (Kim and Rush, 2016) to reduce dataset complexity. Libovickỳ and Helcl (2018) and Saharia et al. (2020) use connectionist temporal classification (CTC) (Graves et al., 2006) for latent alignment. CRFs have also been used to model target positional contexts. Kaiser et al. (2018) and Shu et al. (2020) incorporate latent variables to guide generation, similar to VAEs (Kingma and Welling, 2013). Guo et al. (2020c) initializes NAT decoders with pretrained language models. Huang et al. (2021) proposes CTC with Deep Supervision and Layer-wise Prediction and Mixed Training (CTC-DSLP-MT), setting a new SOTA for fully NAT models on the WMT benchmarks. DA-Transformer represents hidden states in a directed acyclic graph to capture dependencies between tokens and generate multiple possible translations. In contrast, our DePA utilizes forward-backward pre-training and a novel attentive transformation of the decoder input to enhance target dependency modeling. Under the same settings and with KD, DA-Transformer performs only comparably to CTC-DSLP-MT; moreover, the performance of DA-Transformer benefits notably from using Transformer-big for KD while CTC-DSLP-MT uses Transformer-base for KD. DDRS w/ NMLA (Shao and Feng, 2022) benefits greatly from using diverse KD references while CTC-DSLP-MT uses only a single KD reference. Hence, CTC-DSLP-MT is still the current SOTA for fully NAT models on the WMT benchmarks.
Non-autoregressive Models Besides fully NAT models, iterative NAT models have been proposed, such as iterative refinement of target sentences (Lee et al., 2018), masking and re-predicting words with low probabilities (Ghazvininejad et al., 2019), edit-based methods that iteratively modify the decoder output (Stern et al., 2019; Gu et al., 2019), and parallel refinement of every token (Kasai et al., 2020). Iterative NAT models improve translation accuracy at the cost of slower speed. Non-autoregressive models are practically important due to their high efficiency. Beyond MT, they are applied to various tasks such as image captioning (Gao et al., 2019), automatic speech recognition (Chen et al., 2019), and text-to-speech synthesis (Oord et al., 2018).

Problem Formulation
NMT can be formulated as a sequence-to-sequence generation problem. Given a sequence X = {x_1, ..., x_N} in the source language, a sequence Y = {y_1, ..., y_T} in the target language is generated following the conditional probability P(Y|X). NAT models are proposed to speed up generation by decoding all the target tokens in parallel, using the conditionally independent factorization

P(Y|X) = P_L(T|X) ∏_{t=1}^{T} P(y_t|X),   (1)

where the target sequence length T is modeled by the conditional distribution P_L, and dependence on previously generated target tokens is removed. Compared to AT models, NAT models speed up inference significantly at the expense of translation quality, because the conditional independence assumption in Eq. 1 enables parallel decoding but lacks explicit modeling of dependencies between target tokens. To enhance target dependency modeling, we propose two innovations: incorporating both forward and backward dependency modeling into the training process (Section 3.2) and transforming the decoder input into the target representation space (Section 3.3).
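To make the factorization in Eq. 1 concrete, the following minimal PyTorch sketch shows how a fully NAT model predicts a length and then emits every target token in one parallel pass; the toy modules (a linear "decoder", random length logits, a uniform-copy decoder input) are hypothetical stand-ins for the real components, not our implementation.

    import torch
    import torch.nn as nn

    d, vocab, src_len = 8, 16, 5
    enc_out = torch.randn(src_len, d)            # encoder states for X
    length_logits = torch.randn(src_len + 3)     # stand-in for P_L(T | X)
    decoder = nn.Linear(d, vocab)                # stand-in for the NAT decoder stack

    T = int(length_logits.argmax()) + 1          # pick the target length T once
    idx = torch.linspace(0, src_len - 1, T).round().long()
    dec_in = enc_out[idx]                        # "uniform copy" of source states as decoder input
    logits = decoder(dec_in)                     # one parallel pass, shape (T, vocab)
    y_hat = logits.argmax(-1)                    # each y_t is chosen independently of the others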

Target Dependency Modeling with Curriculum Learning (FBD)
Prior work (Guo et al., 2020a) utilizes the forward dependency in AT models to initialize model parameters for NAT. However, as discussed in Section 1, for fully NAT models, modeling only forward dependency is inadequate for addressing the multi-modality problem (Finch and Sumita, 2009; Zhang et al., 2018) (the F-NAT rows in Table 1). Our innovations include incorporating both forward and backward dependency modeling into NAT models, via triangular attention masks in a unified framework through curriculum learning (Figure 1), and investigating the efficacy of different curricula. In Figure 1, the NAT decoder phase denotes standard NAT training of any NAT decoder Dec. The Forward Dependency and Backward Dependency phases serve as pre-training for NAT training, learning left-to-right and right-to-left dependencies to initialize NAT models with better dependencies. The Forward Dependency and Backward Dependency training phases apply the same upper triangular attention mask on Dec. We use KD data from AT models for each phase, but the inputs and the targets are different. The Forward Dependency training phase uses y_1 to predict y_2 and so on. The Backward Dependency training phase reverses the target sequence and uses y_2 to predict y_1 and so on. The NAT training phase uses the features of each word to predict the word itself. We make the following hypotheses: (1) Considering the nature of languages, learning forward dependency in Phase 1 is easier for the model for language generation.
(2) Modeling backward dependency relies on learned forward dependency knowledge, hence it should come in the second phase. In fact, we observe the interesting finding that the best curriculum remains forward-backward-forward-NAT (FBF-NAT) for both left-branching and right-branching languages, supporting our hypotheses. We speculate that NAT training may benefit from another forward dependency modeling phase in Phase 3 because the left-to-right order is more consistent with the characteristics of natural languages, hence adding the second forward dependency modeling phase after FB (i.e., FBF) smooths the transition to the final NAT training. Detailed discussions are in Section 4.3.
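As an illustration of the FBD curriculum, the following sketch shows how the forward and backward pre-training phases can share one triangular attention mask while differing only in the ordering of decoder inputs and targets; the function names and the boolean-mask convention (True = may attend) are our own assumptions, not the paper's released code.

    import torch

    def causal_mask(T: int) -> torch.Tensor:
        # Triangular mask: position t may attend only to positions <= t.
        return torch.tril(torch.ones(T, T, dtype=torch.bool))

    def make_phase_batch(y: torch.Tensor, phase: str):
        # y: KD target token ids, shape (T,). Returns (decoder_input, target, mask).
        T = y.size(0)
        if phase == "forward":                 # learn left-to-right dependency: y_1 -> y_2, ...
            return y[:-1], y[1:], causal_mask(T - 1)
        if phase == "backward":                # reverse the sequence: y_2 -> y_1, ...
            y_rev = torch.flip(y, dims=[0])
            return y_rev[:-1], y_rev[1:], causal_mask(T - 1)
        return y, y, None                      # "nat": parallel prediction, no causal mask

    # FBF-NAT curriculum: forward -> backward -> forward -> final NAT training.
    curriculum = ["forward", "backward", "forward", "nat"]
    y = torch.tensor([5, 9, 2, 7])
    batches = [make_phase_batch(y, phase) for phase in curriculum]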

Decoder Input Transformation (IT) for Target Dependency Modeling
Given the initial decoder input z as a copy of source text embedding, we propose to directly select relevant representations from target embedding to form a new decoder input z ′ (Figure 2). z is used as the query and the selection is implemented as a learnable attention module. The learnable parameters bridge the gap between training and inference while the selection guarantees consistency between the decoder input matrix and the target representation space (i.e., the output embedding matrix of the decoder). This way, the decoder input is in the exact target-side embedding space and more conducive to modeling target dependencies for NAT models than previous approaches using source text embedding or transformed decoder input.
Decoder Input Transformation To transform z into the target representation space, we apply an attention mechanism between z and the output embedding matrix Emb ∈ R^{d×v}, where d and v denote the sizes of the hidden states and the target vocabulary.
Since NAT models usually have an embedding matrix Emb covering both the source and target vocabularies, we first conduct a filtering process to remove the source vocabulary (mostly not used by the decoder) from the decoder output embedding matrix (the linear layer before the decoder softmax). We build a dictionary that contains only the target-side tokens in the training set. We then use this dictionary to filter Emb and obtain the new output embedding matrix of the decoder Emb' ∈ R^{d×v'}, where v' denotes the size of the filtered vocabulary. This filtering process guarantees that Emb' is strictly from the target representation space. The attention process starts with a linear transformation:

z^l = W_q z,   (2)

where W_q ∈ R^{d×d} is learnable. Next, dot-product attention is performed with z^l as the query and Emb' as both key and value:

Sim = (z^l)^T Emb',   (3)

where Sim represents the similarity between each z^l_i and each embedding in the target vocabulary. Finally, we compute the new decoder input z' as a weighted sum of target embeddings based on their similarity values:

z' = Emb' softmax(Sim)^T.   (4)

Since z' is a linear combination of the columns of Emb', which is strictly in the target representation space, z' is also strictly in the target representation space; hence using z' as the decoder input provides a more solid basis for target dependency modeling.
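The following minimal sketch illustrates the computation of Eqs. 2-4 in a row-major convention (tokens in rows) with a toy hidden size and vocabulary; it is our own illustrative re-implementation under these assumptions, not the released code, so module and variable names are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderInputTransform(nn.Module):
        # Maps the copied-source decoder input z into the target embedding space
        # by attending over the filtered target embedding matrix Emb'.
        def __init__(self, d: int):
            super().__init__()
            self.w_q = nn.Linear(d, d, bias=False)        # W_q of Eq. 2

        def forward(self, z: torch.Tensor, emb_tgt: torch.Tensor) -> torch.Tensor:
            # z: (T, d) copied source embeddings; emb_tgt: (v', d) filtered target embeddings.
            z_l = self.w_q(z)                             # Eq. 2: linear transformation
            sim = z_l @ emb_tgt.t()                       # Eq. 3: dot-product similarity, (T, v')
            attn = F.softmax(sim, dim=-1)                 # normalize over the target vocabulary
            return attn @ emb_tgt                         # Eq. 4: weighted sum, stays in target space

    # Toy usage: 6 decoder positions, hidden size 8, filtered target vocabulary of 20 tokens.
    it = DecoderInputTransform(d=8)
    z_prime = it(torch.randn(6, 8), torch.randn(20, 8))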

Target-side Embedding Compression
To reduce the computational cost of IT, we propose a target-side embedding compression approach to compress the large target embedding matrix. We process Emb' through a linear layer to obtain a new target embedding Emb* ∈ R^{d×v*}:

Emb* = Emb' W_c^T,   (5)

where W_c ∈ R^{v*×v'} is trainable and the size of the compressed vocabulary v* is set manually. The resulting Emb* is still in the target representation space.
Since we can manually set v* to a relatively small number (e.g., 1000 or 2000), the computational cost of the attention mechanism can be greatly reduced.
We hypothesize that target-side embedding compression may also alleviate over-fitting on small datasets and confirm this hypothesis in Section 4.3.
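A small sketch of the compression step, under the same row-major toy setup as the IT sketch above (names and toy sizes are ours): W_c mixes the v' filtered target embeddings into v* compressed ones, so each compressed embedding remains a linear combination of real target embeddings.

    import torch
    import torch.nn as nn

    d, v_filtered, v_star = 8, 20, 5            # toy sizes; v* is set by hand (1000-2000 in the paper)
    emb_tgt = torch.randn(v_filtered, d)        # Emb' with target-token embeddings in rows

    # W_c mixes the v' filtered embeddings into v* compressed ones (Eq. 5 in row-major form),
    # so the attention in IT then costs O(T * v*) instead of O(T * v').
    w_c = nn.Linear(v_filtered, v_star, bias=False)
    emb_star = w_c(emb_tgt.t()).t()             # shape (v*, d), still in the target space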

Experimental Setup
Datasets We compare our methods with prior works on widely used MT benchmarks for evaluating NAT models: WMT14 EN↔DE (4.5M pairs) and WMT16 EN↔RO (610K pairs). We also use IWSLT16 DE-EN (196K pairs), IWSLT14 DE-EN (153K pairs), and SP EN-JA (50K pairs) for further analysis. For WMT16 EN↔RO and IWSLT16 DE-EN, we adopt the processed data from Lee et al. (2018). For WMT14 EN↔DE, we apply the same preprocessing and learn subwords as Gu and Kong (2021). For IWSLT14 DE-EN, we follow the preprocessing in Guo et al. (2019). For SP EN-JA, we use sentencepiece to tokenize the text into subword units following Chousa et al. (2019). Following prior works, we share the source and target vocabulary and embeddings of each language pair in Emb, except for EN-JA. Also following prior works (Gu et al., 2018; Qian et al., 2021), all NAT models in our experiments are trained on data generated from a pre-trained AT Transformer-base with sequence-level knowledge distillation (KD) for all datasets except EN-JA.

Baselines and Training
We implement the baseline models based on their released codebases: the representative vanilla NAT (Gu et al., 2018), the highly competitive GLAT (Qian et al., 2021), and the SOTA CTC-DSLP-MT (Huang et al., 2021). We measure inference latency on a single NVIDIA V100 GPU and compute the average time per sentence. We report Speed-up based on the inference latency of the Transformer-base AT model (teacher) and the fully NAT models.

Main Results
Table 2 shows the main results on the WMT benchmarks. For EN↔RO, we report the mean BLEU of 3 runs with different random seeds for Rows 12-13, all with quite small standard deviations (≤ 0.16). We apply our proposed DePA, which includes IT and FBD, to vanilla NAT, GLAT, and the current fully NAT SOTA CTC-DSLP-MT on the WMT, IWSLT, and EN-JA benchmarks. We use the same hyperparameters and random seeds to fairly compare each baseline with and without DePA. It is crucial to point out that the accuracies of the vanilla NAT, GLAT, and CTC-DSLP-MT models plateau after 300K training steps on the WMT datasets, hence the original papers of these three models set the maximum training steps to 300K. We verify this observation in our own experiments, as we also see no gains for these models after 300K training steps on the WMT datasets. Hence, although DePA trains 300K × 4 = 1200K steps on the WMT datasets due to FBF pre-training as in Section 4.3, all comparisons between baselines w/ DePA and w/o DePA are fair comparisons.

Table 2 shows that DePA consistently improves the translation accuracy of both vanilla NAT and GLAT on each benchmark, achieving mean +1.37 and max +1.88 BLEU gains on GLAT and mean +2.34 and max +2.46 BLEU gains on vanilla NAT. DePA also improves the SOTA CTC-DSLP-MT by mean +0.42 and max +0.49 BLEU on the WMT test sets (Table 2), +0.85 BLEU on the IWSLT16 DE-EN validation set, and +1.43 BLEU on the EN-JA test set (Table 3). All gains from DePA on vanilla NAT, GLAT, and CTC-DSLP-MT are statistically significant (p < 0.05) based on a paired bootstrap resampling test with 1K resampling trials using the SacreBLEU tool. Since DePA improves even the SOTA CTC-DSLP-MT (Table 2 Row 13 over Row 12, Table 3), we expect DePA to also improve DA-Transformer and DDRS w/ NMLA (Shao and Feng, 2022), and will verify this w/ and w/o KD in future work. Applying DePA to fully NAT models retains their inference speed-up advantages: applying DePA to vanilla NAT, GLAT, and the SOTA CTC-DSLP-MT obtains 15.4×, 15.1×, and 14.7× speed-up over the autoregressive Transformer-base teacher (Row 1), respectively. Overall, Table 2 shows that DePA achieves greater BLEU gains with less speed-up loss than DSLP on all baselines, demonstrating the superiority of DePA over DSLP in improving fully NAT models.

Analysis
Ablation Study We analyze the respective efficacy of IT and FBD in DePA on the IWSLT16 DE-EN validation set and the WMT and SP EN-JA test sets. Table 3 shows that FBD and IT improve GLAT by +1.26 BLEU/+1.5 ChrF and +0.34 BLEU/+1.0 ChrF on the IWSLT16 DE-EN validation set, respectively. Considering that GLAT w/ FBD has more training steps than GLAT, we also train GLAT (400K steps), which has the same number of training steps as GLAT w/ FBD, for a fair comparison. Similar to our findings on the WMT datasets, we observe that accuracy plateaus on the IWSLT and EN-JA datasets with more training steps than the original 100K. Just training more steps hardly improves the baseline (only +0.07 BLEU gain) on IWSLT16 DE-EN, whereas GLAT w/ FBD brings +1.19 BLEU/+1.2 ChrF gains over GLAT (400K steps).

Table 4 shows that our IT outperforms Linear Mapping (Guo et al., 2019) by +2.31 BLEU on the IWSLT14 DE-EN test set. IT has the same number of extra parameters as Linear Mapping; hence, the large gain proves that the improvements from IT are not just from additional layers. The number of extra parameters of IT, from W_q in Eq. 2, is quite small: 512*512 = 262144 for Transformer-base on the WMT datasets and 256*256 = 65536 for Transformer-small on the IWSLT datasets. The large +3.18 BLEU gain from applying IT to vanilla NAT shows that the vanilla Transformer decoder cannot achieve transformation effectiveness similar to IT. Table 3 shows that for language pairs with different levels of source-target vocabulary sharing, such as WMT EN-DE and DE-EN, IWSLT DE-EN, EN-RO, and EN-JA, our IT method achieves consistent improvements over GLAT and CTC-DSLP-MT. Applying IT consistently improves GLAT and CTC-DSLP-MT, although these gains are smaller than the gain on vanilla NAT. This is because the decoder input of vanilla NAT only replicates the source embedding, whereas GLAT and CTC-DSLP-MT already transform the decoder input by replacing selected positions with target embeddings, hence reducing the improvements from IT. Still, the gains from w/ IT+FBD over w/ FBD confirm our hypothesis that IT can enhance the effectiveness of FBD. On GLAT, IT+FBD yields a +1.4 BLEU gain.

To further analyze IT, we compare the cosine similarity of the target embedding with the original decoder input and with the transformed decoder input. For each sample in the IWSLT16 DE-EN validation set, we average all its token embeddings as the decoder input representation, do the same for the target representation, and then compute the cosine similarity. We average the similarities of all samples as the final similarity. We find that IT significantly improves the similarity between the decoder input and the target representation: 0.04951 → 0.14521 for GLAT and 0.04837 → 0.14314 for vanilla NAT.
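For clarity, this similarity analysis can be sketched as follows (the function name and toy tensors are ours): each sample's decoder input and target embeddings are mean-pooled, a per-sample cosine similarity is computed, and the scores are averaged over the validation set.

    import torch
    import torch.nn.functional as F

    def avg_cosine(dec_inputs, tgt_embeds):
        # dec_inputs / tgt_embeds: lists of (len_i, d) tensors, one pair per validation sample.
        # Mean-pool each sequence, compute cosine similarity per sample, then average.
        sims = [F.cosine_similarity(z.mean(0, keepdim=True),
                                    y.mean(0, keepdim=True)).item()
                for z, y in zip(dec_inputs, tgt_embeds)]
        return sum(sims) / len(sims)

    # Toy usage with two samples of hidden size 8.
    score = avg_cosine([torch.randn(5, 8), torch.randn(7, 8)],
                       [torch.randn(6, 8), torch.randn(7, 8)])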
Forward-Backward Dependency Modeling Curricula Table 5 presents results from applying different forward-backward dependency modeling curricula (Figure 1) on GLAT on the IWSLT16 DE-EN validation set and the SP EN-JA test set. In Figure 1 and Table 5, model names denote the curriculum; for example, F-NAT denotes forward dependency training followed by NAT training. Compared with modeling backward dependency in Phase 1 (B-NAT and BF-NAT), modeling forward dependency in Phase 1 (F-NAT, FB-NAT, and FBF-NAT) performs notably better. FB-NAT outperforms BF-NAT by +3.04 BLEU on IWSLT16 DE-EN and +2.08 BLEU on EN-JA. It seems that forward dependency modeling achieves a good initialization for subsequent training phases, while backward dependency modeling does not. We observe that the best curriculum is FBF-NAT, i.e., first learn forward dependency, next learn backward dependency, then run another round of forward dependency training before NAT training. Table 5 shows the same trend of curricula on SP EN-JA as on IWSLT16 DE-EN, with FBF-NAT performing best, demonstrating that this trend of forward-backward dependency modeling curricula is consistent for both right-branching (English) and left-branching (Japanese) target languages. All these observations confirm our hypotheses in Section 3.2. Our FBF-NAT consistently outperforms the baseline GLAT (denoted by NAT in Table 5) by +1.58 BLEU on IWSLT16 DE-EN and +1.56 BLEU on SP EN-JA, and outperforms prior work modeling forward dependency only (Guo et al., 2020a) on GLAT (denoted by F-NAT in Table 5).

Target-side Embedding Compression We analyze compressing the target-side embedding for IT (Section 3.3). We use dichotomy to determine the compression dimension interval [1000, 2000] and evaluate GLAT w/ IT using different dimensions with step size 200 in this interval on the IWSLT16 DE-EN validation set. As shown in Table 6, applying IT on GLAT improves BLEU by up to +0.78 (29.61 → 30.39) with compressed dimension 1800. We also experiment with target-side embedding compression on a larger model on WMT16 EN-RO but find no gains. We assume that for relatively small models and data, this approach helps filter out some redundant target information, hence refining the target representation space and improving translation accuracy.

Case Study and Visualization Table 1 shows that multi-modality errors are corrected by incorporating both forward and backward dependency modeling through FBD. For a more intuitive analysis of FBD, we present a visualization of the decoder self-attention distribution of different NAT models in Figure 3. All models are based on GLAT and model names conform to those in Table 5. In the baseline GLAT (Figure 3a), the self-attention distribution of each position is scattered over adjacent positions, indicating that the NAT model lacks dependency modeling and has high confusion during decoding, causing multi-modality errors. In the F-NAT and B-NAT models, significant forward and backward dependencies can be observed in Figures 3b and 3c, indicating that these two models can better use information in previous or future positions. Encouragingly, forward and backward dependencies are fused in the FB-NAT model (Figure 3d), which can focus on future information while modeling forward dependency, and is thus capable of alleviating the problems shown in Table 1.

Conclusion
We propose a novel and general Dependency-Aware Decoder (DePA) to enhance target dependency modeling for fully NAT models, with forward-backward dependency modeling and decoder input transformation. Extensive experiments show that DePA improves the translation accuracy of highly competitive and SOTA fully NAT models while preserving their inference latency. In future work, we will evaluate DePA on iterative NAT models such as Imputer, CMLM, and Levenshtein Transformer and incorporate ranking approaches into DePA.

Limitations
Apart from all the advantages that our work achieves, some limitations still exist. Firstly, in this work we investigate the efficacy of applying our proposed DePA approach to the representative vanilla NAT, the highly competitive fully NAT model GLAT, and the current SOTA for fully NAT models, CTC-DSLP-MT, but we have yet to apply DePA to iterative NAT models such as Imputer (Saharia et al., 2020), CMLM (Ghazvininejad et al., 2019), and Levenshtein Transformer (Gu et al., 2019). Hence, the effectiveness of DePA on iterative NAT models still needs to be verified. Secondly, we have not yet incorporated reranking approaches such as Noisy Parallel Decoding (NPD) (Gu et al., 2018) into DePA. Thirdly, our proposed FBD method requires multiple additional training phases before NAT training, resulting in longer training time and more GPU resources; reducing the computational cost of FBD training is future work that will benefit energy saving. Last but not least, NAT models have limitations in handling long text: they suffer from worse translation quality when translating relatively long text. We plan to investigate all these topics in future work.
Example 1
  Target Reference: even though they were caught , they were eventually released after heavy international pressure .
  Vanilla NAT: although they were caught , they were released released because because of huge drug .
  Vanilla NAT w/ IT: although they were caught , they were finally released because huge international pressure .
  GLAT: although they were caught , they finally were released because of of international printing .
  GLAT w/ IT: although they were caught , they were finally released after huge international pressure .

Example 2
  Target Reference: this is a blueprint for countries like China and Iran .
  Vanilla NAT: this is a blueprint plan for countries like China and and Iran .
  Vanilla NAT w/ IT: this is a blueprint for countries like China and Iran .
  GLAT: this is a blueprint plan for countries like China and Iran .
  GLAT w/ IT: this is a blueprint for countries like China and Iran .

Table 7: Case studies of our method IT on the IWSLT16 DE-EN validation set, comparing the translations from the two baseline models Vanilla NAT and GLAT with those after applying IT (models in bold). Repetitive tokens are in red. Source words that are not semantically translated are marked in bold and underlined (under-translation). Wrong lexical choices (incorrect translations caused by polysemy) and redundant words are in blue.