AligNART: Non-autoregressive Neural Machine Translation by Jointly Learning to Estimate Alignment and Translate

Non-autoregressive neural machine translation (NART) models suffer from the multi-modality problem which causes translation inconsistency such as token repetition. Most recent approaches have attempted to solve this problem by implicitly modeling dependencies between outputs. In this paper, we introduce AligNART, which leverages full alignment information to explicitly reduce the modality of the target distribution. AligNART divides the machine translation task into (i) alignment estimation and (ii) translation with aligned decoder inputs, guiding the decoder to focus on simplified one-to-one translation. To alleviate the alignment estimation problem, we further propose a novel alignment decomposition method. Our experiments show that AligNART outperforms previous non-iterative NART models that focus on explicit modality reduction on WMT14 En↔De and WMT16 Ro→En. Furthermore, AligNART achieves BLEU scores comparable to those of the state-of-the-art connectionist temporal classification based models on WMT14 En↔De. We also observe that AligNART effectively addresses the token repetition problem even without sequence-level knowledge distillation.


Introduction
In the neural machine translation (NMT) domain, non-autoregressive NMT (NART) models (Gu et al., 2018) have been proposed to alleviate the low translation speed of autoregressive NMT (ART) models. However, these models suffer from degraded translation quality (Gu et al., 2018; Sun et al., 2019). To improve the translation quality of NART, several studies iteratively refine decoded outputs with minimal iterations (Ghazvininejad et al., 2019; Kasai et al., 2020a; Guo et al., 2020; Saharia et al., 2020); other recent works aim to improve NART without iteration (Qian et al., 2021; Gu and Kong, 2021).

* This work was done during an internship at Kakao Enterprise.

One of the significant limitations of non-iterative NART models is the multi-modality problem. This problem originates from the fact that the models should maximize the probabilities of multiple targets without considering conditional dependencies between target tokens. For example, in English-to-German translation, the source sentence "Thank you very much." can be translated to "Danke schön." or "Vielen Dank.". Under the conditional independence assumption, non-iterative NART models are likely to generate improper translations such as "Danke Dank." or "Vielen schön." (Gu et al., 2018). For the same reason, other inconsistency problems such as token repetition or omission occur frequently in non-iterative NART (Gu and Kong, 2021).
There are two main methods for non-iterative NART to address the multi-modality problem. Some works focus on an implicit modeling of the dependencies between the target tokens (Gu and Kong, 2021). For example, Ghazvininejad et al. (2020), Saharia et al. (2020), and Gu and Kong (2021) modify the objective function based on dynamic programming, whereas Qian et al. (2021) provide target tokens to the decoder during training.
On the other hand, other works focus on an explicit reduction of the modality of the target distribution by utilizing external source or target sentence information rather than modifying the objective function. For example, Akoury et al. (2019) and Liu et al. (2021) use syntactic or semantic information; Gu et al. (2018), Zhou et al. (2020b), and Ran et al. (2021) use the alignment information between source and target tokens. However, previous explicit modality reduction methods show suboptimal performance. Zhou et al. (2020b) and Ran et al. (2021) extract fertility (Brown et al., 1993) and ordering information from word alignments, which enables the modeling of several types of mappings except for the many-to-one and many-to-many cases. We hypothesize that leveraging the entire set of mappings significantly reduces the modality and is the key to performance improvement.
In this work, we propose AligNART, a non-iterative NART model that mitigates the multi-modality problem by utilizing the complete information in word alignments. AligNART divides the machine translation task into (i) alignment estimation and (ii) non-autoregressive translation under the given alignments. Modeling all types of mappings guides (ii) closer to one-to-one translation. In AligNART, we simply augment NAT (Gu et al., 2018) with a module called Aligner, which estimates alignments to generate aligned decoder inputs.
However, it is challenging to estimate the complex alignment information using only the source sentence during inference. Specifically, Aligner should simultaneously predict the number of target tokens corresponding to each source token and their mapping. To overcome this problem, we further propose alignment decomposition, which factorizes the alignment process into three sub-processes: duplication, permutation, and grouping. Each sub-process corresponds to a more tractable sub-problem: one-to-many mapping, ordering, and many-to-one mapping, respectively.
Our experimental results show that AligNART outperforms previous non-iterative NART models with explicit modality reduction on WMT14 En↔De and WMT16 Ro→En. AligNART achieves performance comparable to that of the recent state-of-the-art non-iterative NART model on WMT14 En↔De. We observe that the modality reduction in AligNART addresses the token repetition issue even without sequence-level knowledge distillation (Kim and Rush, 2016). We also conduct quantitative and qualitative analyses of the effectiveness of alignment decomposition.

Background
Given a source sentence x = {x_1, x_2, ..., x_M} and its translation y = {y_1, y_2, ..., y_N}, ART models with an encoder-decoder architecture are trained with chained target distributions and infer the target sentence autoregressively:

p(y | x) = \prod_{n=1}^{N} p(y_n | y_{<n}, x).  (1)

At each decoding position n, the decoder of the model is conditioned on the previous target tokens y_{<n} = {y_1, ..., y_{n-1}}, which is the key factor of performance in ART models. Previous target tokens reduce the modality of the target distribution and provide information about the target sentence. However, the autoregressive decoding scheme forces the decoder to iterate N times to complete the translation and increases the translation time linearly with respect to the length of the target sentence. Non-iterative NART models (Gu et al., 2018; Sun et al., 2019; Sun and Yang, 2020) assume conditional independence between the target tokens to improve the translation speed:

p(y | x) = \prod_{n=1}^{N} p(y_n | x),  (2)

where N is the predicted target length, which allows the decoding process to be parallelized. Non-iterative NART models provide only the length information of the target sentence to the decoder, which is insufficient to address the multi-modality problem.

Model Overview
Given the word alignments between the source and target sentences A ∈ {0, 1}^{N×M}, we factorize the task into (i) alignment estimation and (ii) translation with aligned decoder inputs as follows:

p(y | x) = p(A | x) \prod_{n=1}^{N} p(y_n | A, x),  (3)

where M and N are the lengths of the source and target sentences, respectively. Although we could also modify the negative log-likelihood loss to model dependencies between outputs, such as the connectionist temporal classification (CTC) loss (Graves et al., 2006), we focus on the effect of introducing alignment as additional information. AligNART is based on the encoder-decoder architecture, with an alignment estimation module called Aligner, as depicted in Figure 1a. The encoder maps the embeddings of the source tokens into hidden representations h = {h_1, h_2, ..., h_M}.
Aligner constructs the aligned decoder inputs d = {d_1, d_2, ..., d_N} as follows:

d_n = \frac{1}{r_n} \sum_{m=1}^{M} A_{n,m} h_m,  (4)

where r_n is the number of non-zero elements in the n-th row of A. Given the aligned decoder inputs, the decoder is guided to focus on a one-to-one translation from d_n to y_n. One-to-one mapping significantly reduces the modality of the target distribution.
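The construction of the aligned decoder inputs above can be sketched in a few lines. This is a minimal illustration with toy dimensions and values, not the paper's implementation:

```python
# Building aligned decoder inputs d from a binary alignment matrix A (N x M)
# and encoder outputs h (M vectors): d_n averages the encoder outputs of the
# source tokens aligned to target position n.

def aligned_decoder_inputs(A, h):
    """d_n = (1 / r_n) * sum_m A[n][m] * h[m], r_n = #non-zeros in row n."""
    d = []
    dim = len(h[0])
    for row in A:
        r_n = sum(row)
        assert r_n > 0, "every target position must align to >= 1 source token"
        avg = [sum(row[m] * h[m][k] for m in range(len(h))) / r_n
               for k in range(dim)]
        d.append(avg)
    return d

# Source length M = 3, target length N = 2; target 0 aligns to sources 0 and 1.
A = [[1, 1, 0],
     [0, 0, 1]]
h = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(aligned_decoder_inputs(A, h))  # [[2.0, 1.0], [5.0, 4.0]]
```

Averaging (rather than summing) keeps the decoder-input scale independent of how many source tokens feed a target position.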
The key component of AligNART, Aligner, models a conditional distribution of the alignments A given the source sentence x during training, and aligns the encoder outputs using the estimated alignments during inference, as depicted in Figure 1b. The ground truth alignments are extracted using an external word alignment tool. However, alignment estimation given only the source sentence is challenging since the alignment consists of two components related to the target tokens: • The number of target tokens that correspond to each encoder output h_m.
• The positions of the target tokens to which h m corresponds.
The Aligner decomposes the alignment for effective estimation, which is described in Section 3.2.

Aligner
To alleviate the alignment estimation problem, we start by factorizing the alignment process as shown in Figure 1b. First, we copy each encoder output h_m by the number of target tokens mapped to h_m, denoted as c_m = \sum_n A_{n,m}. Given the duplicated encoder outputs h′, we have to predict the positions of the target tokens to which each element in h′ is mapped.
We further decompose the remaining prediction process into permutation and grouping, since non-iterative NART models have no information about the target length N during inference. In the permutation process, h′ is re-ordered into d′ such that elements corresponding to the same target token are placed adjacent to each other. In the grouping process, each element in d′ is clustered into N groups by predicting whether it is mapped to the same target token as the previous element. r_n = \sum_m A_{n,m} denotes the number of elements in the n-th group, which is equivalent to r_n in Equation 4. Finally, we can derive the decoder inputs d in Equation 4 by averaging the elements in each group of d′. In summary, we decompose the alignment estimation task into three sequential sub-tasks: duplication, permutation, and grouping.
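The three sub-processes can be sketched on toy encoder outputs. In this minimal illustration the copy counts, permutation order, and group sizes are given by hand; in AligNART they would be predicted by Aligner:

```python
# Illustrative duplicate -> permute -> group pipeline on toy encoder outputs.

def duplicate(h, c):
    """Copy each encoder output h[m] c[m] times (one-to-many mapping)."""
    out = []
    for vec, copies in zip(h, c):
        out.extend([vec] * copies)
    return out

def permute(h_dup, order):
    """Re-order duplicated outputs so same-group elements become adjacent."""
    return [h_dup[i] for i in order]

def group(d_perm, r):
    """Average each consecutive block of r[n] elements (many-to-one mapping)."""
    d, start = [], 0
    for size in r:
        block = d_perm[start:start + size]
        d.append([sum(col) / size for col in zip(*block)])
        start += size
    return d

h = [[1.0], [2.0], [3.0]]   # M = 3 encoder outputs
c = [1, 2, 1]               # h[1] maps to two target positions
order = [1, 0, 2, 3]        # swap the first two duplicated elements
r = [1, 2, 1]               # N = 3 target groups; middle group averages 2
print(group(permute(duplicate(h, c), order), r))  # [[2.0], [1.5], [3.0]]
```

Note how the target length N never has to be known in advance: it emerges as the number of groups produced by the final step.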

Alignment Decomposition
As shown in Figure 1b, we factorize the alignment matrix A into duplication, permutation, and grouping matrices that correspond to each process. h′ = {h′_{1,1}, ..., h′_{1,c_1}, ..., h′_{M,1}, ..., h′_{M,c_M}} denotes the duplicated encoder outputs, where h′_{i,j} is the j-th copied element of h_i. Similarly, d′ = {d′_{1,1}, ..., d′_{1,r_1}, ..., d′_{N,1}, ..., d′_{N,r_N}} denotes the permuted encoder outputs, where d′_{i,j} is the j-th element in the i-th group. The number of non-zero elements in the alignment matrix is defined as L = \sum_m c_m = \sum_n r_n.
Duplication Matrix  Aligner copies h_m c_m times to construct the duplicated encoder outputs h′ with a duplication matrix D ∈ {0, 1}^{L×M}. Let C_m = \sum_{i=1}^{m} c_i and C_0 = 0. Then, we can define D using c_m as follows:

D_{l,m} = 1 if C_{m-1} < l ≤ C_m, and D_{l,m} = 0 otherwise.  (5)

We index h′ by the following rule:

• For any h′_{m,i} and h′_{m,j} (i < j), which are matched to d′_{x_i,y_i} and d′_{x_j,y_j}, respectively, x_i ≤ x_j and y_i ≤ y_j.

The duplication matrix D contains similar information to fertility (Gu et al., 2018).

Permutation Matrix  Aligner re-orders h′ to construct d′ with a permutation matrix P ∈ {0, 1}^{L×L}. Since all the indexed elements in h′ and d′ are distinct, the permutation matrix P is uniquely defined.
Grouping Matrix  Aligner finally aggregates d′ to construct the aligned decoder inputs d with a grouping matrix G ∈ {0, 1}^{N×L}. Let R_n = \sum_{i=1}^{n} r_i and R_0 = 0. Then, G can be defined using r_n as follows:

G_{n,l} = 1 if R_{n-1} < l ≤ R_n, and G_{n,l} = 0 otherwise.  (6)

We index d′ by the following rule:

• For any d′_{n,i} and d′_{n,j} (i < j), which are matched to h′_{x_i,y_i} and h′_{x_j,y_j}, respectively, x_i ≤ x_j and y_i ≤ y_j.
We can derive the aligned decoder inputs by separately estimating the decomposed matrices D, P, and G, which approximately correspond to one-to-many mapping, ordering, and many-to-one mapping, respectively. The decomposed matrices have an easily predictable form while still recovering the complete alignment matrix.
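As a sanity check, the factorization can be exercised on a toy alignment matrix. This is an illustrative sketch rather than the paper's code; the index bookkeeping follows the ordering rules above, and the key property is that the binary product G · P · D recovers A exactly, because each aligned (n, m) pair contributes exactly one duplicated element:

```python
# Decompose a binary alignment A (N x M) into D (L x M), P (L x L), G (N x L)
# and verify G @ P @ D == A.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def decompose(A):
    N, M = len(A), len(A[0])
    c = [sum(A[n][m] for n in range(N)) for m in range(M)]  # copies per source
    r = [sum(row) for row in A]                             # group sizes
    L = sum(c)
    # Duplication matrix: rows C_{m-1}+1 .. C_m carry copies of source m.
    D, l = [[0] * M for _ in range(L)], 0
    for m in range(M):
        for _ in range(c[m]):
            D[l][m] = 1
            l += 1
    # Grouping matrix: columns R_{n-1}+1 .. R_n belong to target group n.
    G, l = [[0] * L for _ in range(N)], 0
    for n in range(N):
        for _ in range(r[n]):
            G[n][l] = 1
            l += 1
    # Permutation matrix: send the next unused copy of source m to the next
    # slot of target group n, scanning aligned pairs in (n, m) order.
    C = [0] * (M + 1)
    for m in range(M):
        C[m + 1] = C[m] + c[m]
    used, P, slot = [0] * M, [[0] * L for _ in range(L)], 0
    for n in range(N):
        for m in range(M):
            if A[n][m]:
                P[slot][C[m] + used[m]] = 1
                used[m] += 1
                slot += 1
    return D, P, G

A = [[0, 1, 1],   # target 0 aligns to sources 1 and 2 (a many-to-one group)
     [1, 0, 0],   # non-monotone ordering: target 1 comes from source 0
     [0, 1, 0]]   # source 1 also feeds target 2 (one-to-many)
D, P, G = decompose(A)
assert matmul(matmul(G, P), D) == A
```

The toy A above exercises all three sub-problems at once: duplication (source 1 is copied twice), non-trivial permutation, and grouping (target 0 merges two elements).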

Training
Aligner consists of three prediction sub-modules: the duplication, permutation, and grouping predictors. Each of them estimates the corresponding decomposed alignment matrix as follows:

p(A | x) = p(D | x) · p(P | D, x) · p(G | P, D, x).  (7)

The duplication predictor learns to classify the number of copies of h_m. The duplication loss is defined as follows:

L_dup = − \sum_{m=1}^{M} \log p_m(c_m),  (8)

where p_m is the predicted probability distribution of the duplication at position m. To discriminate the copied elements in h′, we add a copy position embedding to {h′_{m,1}, ..., h′_{m,c_m}} for the next two predictors.
The permutation predictor takes the duplicated encoder outputs h′ as inputs. We simplify the permutation prediction problem into a classification of the re-ordered position. For the permutation loss, we minimize the KL divergence between the prediction P_pred and the ground truth P_GT:

L_perm = \sum_{l=1}^{L} KL(P_{GT,l} ‖ P_{pred,l}),  (9)

where P_{·,l} denotes the l-th row of the corresponding matrix.
Given the permuted encoder outputs, the grouping predictor conducts a binary classification task of whether d′_l is assigned to the same group as d′_{l−1}. Let the label at position l be g_l. Then, we define g_l from G as follows:

g_l = 1 if G_{n,l−1} = G_{n,l} = 1 for some n, and g_l = 0 otherwise.  (10)

The grouping loss is defined as follows:

L_grp = − \sum_{l=1}^{L} \log p_l(g_l),  (11)

where p_l is the predicted probability distribution of the grouping predictor at position l.
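The binary grouping labels can be derived mechanically from a grouping matrix. A minimal sketch with a toy G (illustrative, not the paper's code):

```python
# g_l = 1 iff element l of d' falls in the same group as element l-1, i.e.
# some row of the grouping matrix G has ones at both columns l-1 and l.

def grouping_labels(G):
    L = len(G[0])
    g = [0] * L  # the first element always starts a new group
    for l in range(1, L):
        g[l] = int(any(row[l - 1] == 1 and row[l] == 1 for row in G))
    return g

# N = 2 groups over L = 4 elements with group sizes r = (3, 1).
G = [[1, 1, 1, 0],
     [0, 0, 0, 1]]
print(grouping_labels(G))  # [0, 1, 1, 0]
```

Since a 0 label simply opens a new group, the sequence of labels also determines the number of groups N, which is why no separate target-length prediction is needed.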
Our final loss function is defined as the sum of the negative log-likelihood based translation loss L_T and the alignment loss L_A:

L = L_T + L_A,   L_A = α L_dup + β L_perm + γ L_grp,  (12)

where we set α = β = γ = 0.5 for all the experiments.

Inference
During inference, Aligner sequentially predicts the duplication, permutation, and grouping matrices to compute the aligned decoder inputs d, as depicted in Figure 1b. The duplication predictor in Aligner infers ĉ_m at each position m; then, we can directly construct a duplication matrix D̂ using Equation 5. The permutation predictor predicts the distribution of the target position P_pred. We obtain a permutation matrix P̂ that minimizes the KL divergence as follows:

P̂ = argmax_P \sum_{l} \sum_{l'} P_{l,l'} \log P_{pred,l,l'},  (13)

where P ranges over all L×L permutation matrices. We utilize the linear sum assignment problem solver provided by Jones et al. (2001) to find P̂. The grouping predictor infers the binary predictions ĝ_l from the permuted encoder outputs. We construct a grouping matrix Ĝ using ĝ_l and Equations 6 and 10. With the predicted alignment matrix Â = Ĝ · P̂ · D̂, Aligner constructs the decoder inputs using Equation 4, and the decoder performs translation from the aligned inputs.
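The paper solves this assignment with the linear sum assignment solver from SciPy (`scipy.optimize.linear_sum_assignment`). A brute-force search over permutations illustrates the same objective for tiny L, as a sketch only; a real implementation would use the Hungarian-algorithm solver:

```python
# Recover a hard permutation from the predicted soft distribution P_pred by
# maximizing sum_l log P_pred[l][perm[l]] (equivalent to minimizing the KL
# divergence against one-hot rows).

import itertools
import math

def best_permutation(P_pred):
    """Return the permutation maximizing sum_l log P_pred[l][perm[l]]."""
    L = len(P_pred)
    best, best_score = None, -math.inf
    for perm in itertools.permutations(range(L)):
        score = sum(math.log(P_pred[l][perm[l]]) for l in range(L))
        if score > best_score:
            best, best_score = perm, score
    return best

# Soft predictions over target positions for L = 3 duplicated elements.
P_pred = [[0.1, 0.8, 0.1],
          [0.7, 0.2, 0.1],
          [0.2, 0.2, 0.6]]
print(best_permutation(P_pred))  # (1, 0, 2)
```

Brute force is O(L!) and only viable for toy sizes; the assignment solver finds the same optimum in polynomial time.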

Decoding Strategies
For the re-scoring based decoding method, we select candidates of alignments using the predicted distributions in the duplication and grouping predictors.
We identify m positions in the outputs of the duplication predictor where the probability of the predicted class is low. We then construct a 2^m-candidate pool in which the predictions at subsets of the m positions are replaced with the second most probable class. Next, we identify the top-a candidates with the highest joint probabilities. Similarly, we construct a 2^l-candidate pool and identify the top-b candidates in the grouping predictor for each of the a candidates. Finally, we rank the a · b translations for the alignment candidates using a teacher ART model and select the best translation among them.
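The candidate construction for the duplication predictor can be sketched as follows. The function name and the dict-based probability format are illustrative assumptions, not the paper's interface:

```python
# Pick the m lowest-confidence positions, enumerate the 2^m ways of swapping
# each to its second-best class, and keep the top-a candidates by joint
# probability.

import itertools

def top_a_candidates(probs, m, a):
    """probs[i] is a dict class -> probability at position i."""
    top1 = [max(p, key=p.get) for p in probs]
    top2 = [sorted(p, key=p.get)[-2] for p in probs]  # second-best class
    # The m positions where the predicted class has the lowest confidence.
    flip = sorted(range(len(probs)), key=lambda i: probs[i][top1[i]])[:m]
    pool = []
    for mask in itertools.product([0, 1], repeat=m):
        cand = list(top1)
        for bit, pos in zip(mask, flip):
            if bit:
                cand[pos] = top2[pos]
        joint = 1.0
        for i, cls in enumerate(cand):
            joint *= probs[i][cls]
        pool.append((joint, cand))
    pool.sort(key=lambda t: -t[0])
    return [cand for _, cand in pool[:a]]

probs = [{1: 0.9, 2: 0.1}, {1: 0.55, 2: 0.45}, {2: 0.6, 3: 0.4}]
print(top_a_candidates(probs, m=2, a=2))  # [[1, 1, 2], [1, 2, 2]]
```

The same enumeration applies to the grouping predictor, where each candidate flips a binary same-group decision instead of a copy count.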

Architecture of AligNART
We use the deep-shallow (12-1 for short) Transformer (Vaswani et al., 2017) architecture (i.e., a 12-layer encoder and a 1-layer decoder) proposed by Kasai et al. (2020b) for two reasons. First, a deeper encoder helps Aligner increase the estimation accuracy of the alignment matrix during inference. Second, the deep-shallow architecture improves the inference speed since an encoder layer has no cross-attention module, unlike a decoder layer. The architectures of the duplication, permutation, and grouping predictors are shown in the Appendix.

Alignment Score Filtering
Some alignment tools such as GIZA++ (Och and Ney, 2003) provide an alignment score for each sentence pair by default. Samples with low alignment scores are more likely to contain noise from the sentence pairs themselves or from the alignment tools. For GIZA++, we filter out a fixed portion of the samples with low alignment scores to ease the alignment estimation. Since pairs of long sentences tend to be aligned with low scores, we apply the same filtering portion for each target sentence length.
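The per-length filtering can be sketched as follows. This is a minimal illustration with made-up scores and record shapes; the actual pipeline operates on GIZA++ alignment scores:

```python
# Within each bucket of samples sharing a target length, drop the same fixed
# fraction of lowest-scoring pairs, so long (systematically low-scoring)
# sentences are not filtered out disproportionately.

from collections import defaultdict

def filter_by_score(samples, ratio):
    """samples: list of (target_length, alignment_score, pair_id) tuples."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s[0]].append(s)
    kept = []
    for length, group in buckets.items():
        group.sort(key=lambda s: s[1])   # ascending alignment score
        drop = int(len(group) * ratio)   # drop the lowest `ratio` fraction
        kept.extend(group[drop:])
    return sorted(kept, key=lambda s: s[2])

samples = [(5, 0.9, 0), (5, 0.1, 1), (20, 0.4, 2), (20, 0.3, 3)]
print(filter_by_score(samples, ratio=0.5))
# [(5, 0.9, 0), (20, 0.4, 2)]
```

Filtering globally instead of per length would discard almost only long pairs, since their scores are systematically lower.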
Datasets

For the WMT14 En-De dataset, we use the preprocessing pipelines provided by fairseq (Ott et al., 2019). For the WMT16 En-Ro dataset, we use the preprocessed corpus provided by Lee et al. (2018). The preprocessed datasets share a vocabulary dictionary between the source and target languages. We use fast align (FA) (Dyer et al., 2013) and GIZA++ (GZ), which is known to be more accurate than fast align, as word alignment tools. All corpora are passed to the alignment tools at the subword level. We filter out samples where the maximum number of duplications exceeds 16. We explain the details of the alignment processing in the Appendix. We use the sequence-level knowledge distillation method (KD) to construct the distillation set. Transformer ART models are trained to generate the distillation set for each translation direction.

Models and Baselines
We compare our model with several non-iterative NART baselines and divide the non-iterative NART models into two types, as aforementioned: implicit dependency modeling and explicit modality reduction (see Table 1). We also train ART models and deep-shallow NAT for the analysis. Our models are implemented based on fairseq. AligNART is implemented based on the deep-shallow Transformer architecture. We set d_model/d_hidden to 512/2048 and the dropout rate to 0.3. The number of heads in the multi-head attention modules is 8, except for the last attention module of the permutation predictor, which uses 1. We set the batch size to approximately 64K tokens for all the models we implement. All the models we implement are trained for 300K/50K steps on the En-De/En-Ro datasets, respectively. For AligNART, we average the 5 checkpoints with the highest validation BLEU scores among the 20 latest checkpoints.
For optimization, we use the Adam optimizer (Kingma and Ba, 2015) with β = (0.9, 0.98) and ε = 10^{-8}. The learning rate schedule follows that of Vaswani et al. (2017), starting from 10^{-7} and warming up to 5 × 10^{-4} in 10K steps. We use label smoothing with ε_ls = 0.1 for the target token distribution and for each row of the permutation matrix. The translation latency is measured on an NVIDIA Tesla V100 GPU.

Main Results

Table 1 shows the BLEU scores of the non-iterative NART models on WMT14 En↔De and WMT16 En↔Ro. In explicit modality reduction, AligNART (FA) achieves higher BLEU scores than Distortion and ReorderNAT, which utilize the same alignment tool, since we leverage the entire alignment information rather than partial information such as fertility or ordering. Moreover, AligNART (GZ) significantly outperforms previous models for explicit modality reduction except for SNAT on En→Ro. In implicit dependency modeling, AligNART (GZ) outperforms Imputer and shows performance comparable to that of the state-of-the-art CTC-based model on En↔De by simply augmenting deep-shallow NAT with the Aligner module. In this study, we focus on introducing complete information in word alignments; we do not modify the objective function, which can be explored in future work.

Table 2 shows the BLEU scores of the non-iterative NART models with re-scoring based decoding strategies. We set m = l = 4, a = 4, and b = 2 for 8 candidates. AligNART outperforms the baselines on En→De and Ro→En, and shows performance similar to that of GLAT on De→En. Among non-iterative NART models for explicit modality reduction, AligNART shows the best performance on En↔De and Ro→En.

Analysis of Aligner Components
In this section, we investigate the accuracy, examples, and ablation results of the Aligner components, as shown in Tables 3, 4, and 5, respectively. Note that we partially provide the ground truth D or P matrices during the accuracy measurement.

Knowledge Distillation  In Table 3, a comparison of accuracy between the raw and distilled datasets shows that KD significantly decreases the multi-modality of each component. After KD, AligNART shows marginally reduced accuracy on the raw dataset but high prediction accuracy for each component on the distillation set, resulting in increased BLEU scores.

Alignment Tool  Before KD, AligNART using fast align and GIZA++ has accuracy bottlenecks in the permutation and duplication predictors, respectively, as shown in Table 3. The results imply that the alignment tools have different degrees of multi-modality on the D, P, and G matrices, which can be explored in future work.

Table 4 shows an example of addressing the multi-modality problem. Deep-shallow NAT monotonically copies the encoder outputs and suffers from repetition and omission problems. AligNART (FA) does not show these inconsistency problems thanks to the well-aligned decoder inputs, which significantly reduce the modality of the target distribution. We also conducted a case study on predicted alignments and their translations during re-scoring, as shown in the Appendix.

Ablation Study
We conduct an analysis of alignment estimation by ablating one of the predictors during inference. We ablate each module in Aligner by replacing its predicted matrix with an identity matrix I. The results in Table 5 indicate that each module in Aligner properly estimates the decomposed information in word alignments. However, there is an exception for GIZA++, where many-to-one mappings do not exist, resulting in performance equal to that without the grouping predictor. We observe that AligNART achieves BLEU scores comparable to those of CTC-based models on En↔De even when only partial ground truth word alignment information is provided.

Analysis of Modality Reduction Effects
To evaluate the modality reduction effects of AligNART, we conduct experiments on two aspects: the BLEU score and the token repetition ratio. Table 6 shows the BLEU scores on WMT14 En-De. For En→De, AligNART using fast align without KD achieves higher BLEU scores than previous models without KD and than deep-shallow NAT with KD.
The results indicate that our method is effective even without KD, which is known to decrease data complexity (Zhou et al., 2020a). On the other hand, alignments from GIZA++ without KD are more complex for AligNART to learn, resulting in lower BLEU scores than deep-shallow NAT with KD. Ghazvininejad et al. (2020) measured the token repetition ratio as a proxy for multi-modality. The token repetition ratio represents the degree of the inconsistency problem. In Table 7, the token repetition ratio of AligNART is lower than those of CMLM-base (Ghazvininejad et al., 2019) with 5 iterations, AXE, and GLAT. We also observe that the decline in the token repetition ratio from Aligner is significantly larger than that from KD. Combined with the results in Table 6, this indicates that the alignment information adequately alleviates the token repetition issue even in the case where the BLEU score is lower than that of deep-shallow NAT with KD.

Ablation Study
We conduct several extensive experiments to further analyze our method, as shown in Tables 8 and 9. Each of our design choices consistently improves the performance of AligNART.

Cross Attention  As shown in Table 8, we ablate the cross attention module in the decoder to observe the relationship between aligned decoder inputs and the alignment learning of the cross attention module. We train AligNART and deep-shallow NAT without a cross attention module for comparison. Removing the cross attention module has a smaller impact on the BLEU score for AligNART than for deep-shallow NAT. The cross attention module is known to learn alignments between source and target tokens (Bahdanau et al., 2015), and the result implies that aligned decoder inputs significantly offload the role of the cross attention module.
Deep-shallow Architecture  The deep-shallow architecture heavily affects the BLEU scores of AligNART, as shown in Table 8. The results indicate that the deep encoder assists alignment estimation, whereas the shallow decoder with aligned inputs causes little performance degradation.
Alignment Score Filtering  We investigate the trade-off between the alignment score filtering ratio and the BLEU score using AligNART (GZ), as presented in Table 9. Samples with low alignment scores are more likely to contain noise caused by distilled targets or the alignment tool. We observe that filtering out 5% of the samples improves the BLEU score in both directions. Surprisingly, increasing the filtering ratio up to 20% preserves the performance thanks to the noise filtering capability.

Related Work

Implicit Dependency Modeling  Several previous works introduce latent variables to model the complex dependencies between target tokens. Saharia et al. (2020) and Gu and Kong (2021) apply the CTC loss to the NMT domain. Qian et al. (2021) provide target tokens to the decoder during training using the glancing sampling technique.

Alignment in Parallel Generative Models
In other domains, such as text-to-speech (Ren et al., 2019; Kim et al., 2020; Donahue et al., 2020), a common assumption is monotonicity in the alignments between text and speech. Under this assumption, only a duration predictor is required to alleviate the length-mismatch problem between text and speech. On the other hand, modeling the alignment in the NMT domain is challenging since the alignment contains additional ordering and grouping information. Our method estimates an arbitrary alignment matrix using alignment decomposition.

Improving NMT with Enhanced Information
To alleviate the multi-modality problem of NART models, Gu et al. (2018), Akoury et al. (2019), Zhou et al. (2020b), Ran et al. (2021), and Liu et al. (2021) provide additional sentence information to the decoder. Alignment is considered a major factor in machine translation (Li et al., 2007; Zhang et al., 2017). Alkhouli et al. (2018) decompose the ART model into alignment and lexical models. Song et al. (2020) use the predicted alignment in ART models to constrain vocabulary candidates during decoding. However, alignment estimation in NART is much more challenging since the information from decoding outputs is limited. In NART, Gu et al. (2018), Zhou et al. (2020b), and Ran et al. (2021) exploit partial information from the ground truth alignments. In contrast, we propose the alignment decomposition method for effective alignment estimation in NART, where we leverage the complete alignment information.

Conclusion and Future Work
In this study, we leverage full alignment information to directly reduce the degree of multi-modality in non-iterative NART and propose an alignment decomposition method for alignment estimation. AligNART with GIZA++ shows performance comparable to that of the recent CTC-based implicit dependency modeling approach on WMT14 En-De, together with a modality reduction capability. However, we observe that AligNART depends on the quality of the ground truth word alignments, which can be studied in future work. Furthermore, the combination of AligNART and implicit dependency modeling methods can be explored.

A Mappings in Alignment
In general, there are one-to-one, one-to-many, many-to-one, and many-to-many mappings excluding zero-fertility and spurious word cases (see Figure 2). Distortion and ReorderNAT cannot represent many-to-one, many-to-many, and spurious word cases. The grouping predictor in AligNART models many-to-one and many-to-many mappings. The addition of a spurious token, which is applied to AligNART (FA), enables us to address the spurious word case, which is explained in Section C.2.
During the experiments, we observe that the introduction of a spurious token degrades the performance for GIZA++. We conjecture that the degradation occurs because the alignment matrices from GIZA++ contain more than twice as many empty rows as those from fast align on WMT14 En-De.

B Architecture of Aligner
The duplication predictor and grouping predictor modules consist of a convolutional layer, ReLU activation, layer normalization, dropout, and a projection layer, the same as the phoneme duration predictor in FastSpeech (Ren et al., 2019), which is a parallel text-to-speech model.

Figure 2: Types of mappings in word alignments (one-to-one, one-to-many, many-to-one, many-to-many, zero-fertility, and spurious word). Rows and columns correspond to the target and source tokens, respectively.

The permutation predictor in Aligner consists of three encoder layers: a pre-network, query/key networks, and a single-head attention module for the outputs. Note that the outputs of the pre-network are passed to the query and key networks. To prevent the predicted permutation matrix from collapsing to an identity matrix, we apply a gate function to the last attention module in the permutation predictor to modulate the probabilities of the un-permuted and permuted cases. We formulate the output of the gated attention as follows:

P_pred = D_g I + (I − D_g) · softmax(M + Q K^T),

where Q/K is the output of the query/key network, respectively; g is the probability of an un-permuted case, computed with the sigmoid function σ; M is a diagonal mask matrix whose diagonal elements are −inf; I is an identity matrix; and D_g is a diagonal matrix with g as the main diagonal.
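Assuming the gated output takes the form P_pred = D_g I + (I − D_g) · softmax(M + Q K^T), a toy numeric sketch (with illustrative shapes and gate values) shows how the −inf diagonal mask and the per-row gate interact:

```python
# A gate g_i in (0, 1) mixes the un-permuted case (identity row) with the
# masked attention distribution, whose -inf diagonal forbids self-positions.

import math

def gated_permutation(scores, g):
    """scores: L x L attention logits Q K^T; g: per-row stay probabilities."""
    L = len(scores)
    out = [[0.0] * L for _ in range(L)]
    for i in range(L):
        # Mask the diagonal so the softmax mass goes only to moved positions.
        masked = [scores[i][j] if j != i else -math.inf for j in range(L)]
        mx = max(masked)
        exps = [math.exp(s - mx) for s in masked]
        z = sum(exps)
        for j in range(L):
            attn = exps[j] / z
            out[i][j] = g[i] * (1.0 if i == j else 0.0) + (1 - g[i]) * attn
    return out

P_pred = gated_permutation([[0.0, 1.0], [1.0, 0.0]], g=[0.9, 0.2])
print(P_pred)  # rows sum to 1; row 0 mostly stays, row 1 mostly moves
```

Each output row remains a valid probability distribution because it is a convex combination of two distributions.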
C Alignment Processing

C.1 Word-to-subword Alignment

To reduce the complexity of alignment, we further assume that the alignment process is conducted at the word level. We decompose the alignment matrix into the source subword to source word matrix S and the source word to target subword matrix A_ws, as depicted in Figure 3. Since S is always given, A_ws is the only target to be learned.

Figure 3: Example of the word-to-subword matrix decomposition technique. Rows and columns correspond to input and output tokens, respectively. y_i denotes the i-th subword of the target sentence. x_i denotes the i-th word of the source sentence, and x_i^j denotes the j-th subword of the i-th word of the source sentence.

First, we derive the source subword to target subword matrix A using the alignment tool. A_ws is then obtained by clipping the maximum value of A · S to 1. A_ws reduces the search space because of the assumption that source tokens duplicate, permute, and group at the word level. However, there is a trade-off between the simplicity and the resolution of information. The recovered source subword to target subword matrix A_ws · S loses the subword-level information, as shown in the rightmost matrix in Figure 3.
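A toy sketch of the word-to-subword decomposition follows. The matrix orientations here are assumptions made for this example (A is target-subword × source-subword, S is source-subword × source-word), so a transpose is used to recover the subword level; the paper's figure may orient the matrices differently:

```python
# Word-level alignment A_ws = clip(A @ S, 1); recovering the subword level
# via A_ws @ S^T loses within-word resolution.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def clip1(X):
    return [[min(1, v) for v in row] for row in X]

# Source word 0 = subwords 0 and 1; source word 1 = subword 2.
S = [[1, 0],
     [1, 0],
     [0, 1]]
# Subword-level alignment from the alignment tool (3 target subwords).
A = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
A_ws = clip1(matmul(A, S))              # the word-level learning target
recovered = matmul(A_ws, transpose(S))
print(A_ws)      # [[1, 0], [1, 0], [0, 1]]
print(recovered) # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  <- resolution lost
```

The recovered matrix aligns target subwords 0 and 1 to both subwords of source word 0, illustrating the simplicity-versus-resolution trade-off described above.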

C.2 Filling Null Rows in Alignment Matrix
The output of the alignment tool usually contains empty rows, meaning that no aligned source token exists for certain target tokens. We select between two strategies to fill the null rows: (i) copy the alignment from the previous target token, or (ii) introduce a special spurious token. For the second strategy, we concatenate a special spurious token at the end of the source sentence. If the current and previous target tokens belong to the same word, we follow (i); the remaining target tokens with null alignments are aligned to the spurious token.
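The two strategies can be sketched as follows. The list-of-source-indices representation and the `same_word` flags are assumptions of this illustration; the actual pipeline works on alignment matrices:

```python
# Fill null alignment rows: copy the previous target token's alignment when
# both target subwords belong to the same word, otherwise align to a special
# spurious token appended to the source sentence.

SPURIOUS = -1  # index of the appended spurious source token (illustrative)

def fill_null_rows(aligned_sources, same_word_as_prev):
    """aligned_sources[n]: source indices for target n (possibly empty)."""
    filled = []
    for n, srcs in enumerate(aligned_sources):
        if srcs:
            filled.append(list(srcs))
        elif n > 0 and same_word_as_prev[n]:
            filled.append(list(filled[n - 1]))  # strategy (i): copy previous
        else:
            filled.append([SPURIOUS])           # strategy (ii): spurious token
    return filled

aligned = [[0], [], [], [2]]
same_word = [False, True, False, False]
print(fill_null_rows(aligned, same_word))  # [[0], [0], [-1], [2]]
```

After filling, every row of the alignment matrix is non-empty, so Equation 4's averaging is always well defined.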

C.3 Details of Alignment Tool Configuration
For fast align, we follow the default setting for forward/backward directions and obtain symmetrized alignment with the grow-diag-final-and option. We apply the word-to-subword alignment technique and spurious token strategy for null alignments. For GIZA++, we apply the word-to-subword alignment technique and copy the alignment from the previous target token for null alignment. We set the alignment score filtering ratio to 5%.

D Case Study
To analyze various alignments and their translations during re-scoring decoding, we conduct a case study on the WMT14 De→En validation set, as shown in Figure 4. The two translations have different orderings: "the telescope's tasks" and "the tasks of the telescope". In this sample, we observe that AligNART (i) can capture non-diagonal alignments, (ii) models multiple alignments, and (iii) translates according to the given alignments.