Assessing Non-autoregressive Alignment in Neural Machine Translation via Word Reordering



Introduction
Non-autoregressive neural machine translation (NAT) (Gu et al., 2018) takes advantage of the parallel architecture of the transformer (Vaswani et al., 2017) to alleviate the translation latency issue in neural machine translation (NMT), achieving significant speed-up. Yet it suffers from the multimodality problem, where a target token could be the result of different possible translations. Word order errors often result compared to the autoregressive counterparts (Du et al., 2021), arising from the lack of dependency among target tokens in NAT models.
Some recently proposed NAT models can achieve performance comparable to autoregressive models. This can be attributed to various approaches that reduce the dependency in handling word order errors via word alignment mechanisms (Gu and Kong, 2021). In particular, latent variables and alignments have been adopted to implicitly model the dependencies among target tokens (Song et al., 2021). While the latent alignment approach assumes monotonic alignment between the source and target language pair when handling token shifts in the output space (Gu and Kong, 2021), explicit modality reduction methods (Zhou et al., 2020; Shu et al., 2020; Ran et al., 2021; Song et al., 2021) instead seek to directly align the source and target language pair. Although some earlier work in this direction was sub-optimal, recent work achieves state-of-the-art (sota) results rivaling those of implicit dependency modeling methods.
Establishing explicit alignment between tokens in parallel sentences of source and target languages typically involves fertility prediction and token reordering prediction. In this paper, we focus on the latter and argue that improving reordering performance can contribute greatly to the performance of NAT models. With the sole exception of Shu et al., 2020, the architectural design of the aforementioned NAT models includes a reordering sub-module as a key component. We therefore set out to review in detail the capabilities of the various reordering mechanisms proposed in these NAT models. We then propose a novel way to perform reordering prediction by learning a non-autoregressive language model (NALM) based on the transformer, combined with Viterbi decoding (Viterbi, 1967).
We evaluate the reordering sub-modules extracted from the various NAT models and variants of our proposed NALM using the PTB dataset (Marcus et al., 1993), where sentences with words permuted in different ways are expected to have their ordering recovered. In particular, we adopt different degrees of permutation to mimic various levels of monotonicity (or reordering difficulty) between the source and target sentences. Our experimental results show that the proposed NALM achieves significant and consistent improvements over the reordering sub-modules extracted from explicit modality reduction NAT models in all word permutation settings. Our experiments also advance the sota performance of the word reordering task in the low-beam setting and achieve performance comparable to autoregressive models even in the high-beam setting (b=64), while maintaining constant time complexity.

Non-autoregressive Language Modelling
In this section, we will first provide the formulation of the word reordering task and then present our proposed solution by taking a non-autoregressive language modelling approach.

Problem definition
The word reordering problem is formulated as:

P(Y | Y') = P(y_0, y_1, ..., y_T | y_π(0), y_π(1), ..., y_π(T))    (1)

where Y' = y_π(0), y_π(1), ..., y_π(T) is a permutation of Y. We first follow the previous word reordering work (Hasler et al., 2017), in which we remove the permutation information and learn to recover the order of sequence Y from the corresponding bag of words {Y}. The formulation is thus revised as:

P(Y | {Y}) = P(y_0, y_1, ..., y_T | {y_0, y_1, ..., y_T})    (2)

where {y_0, y_1, ..., y_T} denotes the bag of tokens of Y. This can be approximated as:

P(Y | {Y}) ≈ ∏_{t=0}^{T} P(y_t | y_{t-1}, {Y})    (3)

so that each token's probability is now conditioned on the token immediately preceding it as well as on the entire bag of tokens in the sequence.

NALM
The training setup of a standard transformer decoder in NMT naturally conforms to the above formulation, as it learns the conditional probability P(y_t | y_{t-1}, X). Since our model does not involve translating from X to Y, the inter-attention layer can be removed and the decoder becomes a standard transformer encoder. However, we still need to incorporate the bag of words {Y} into our modeling. This is achieved by replacing the causal attention with full attention and removing the position embedding of the input. We further replace the output layer with a pointer network to constrain the output space to only the tokens (including repetitions) within the concerned sequence. The entire model is thus formulated as:

P(Y | {Y}) ≈ ∏_{t=0}^{T+1} softmax(O_t)_{y_t},  with y_{-1} = bos and y_{T+1} = eos    (4)

where bos and eos refer to the beginning- and end-of-sentence tokens respectively, and O is the output of the pointer network. We train the model by minimizing the cross-entropy loss.
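To make the pointer-network output layer concrete, the sketch below scores each input token against the current output-step state with a dot product. This is one possible scoring choice, and the names `pointer_logits` and `softmax` are illustrative, not the paper's exact parameterization:

```python
import math

def pointer_logits(states, token_reprs):
    """Pointer-network output layer sketch.

    states:      T x d states, one per output step
    token_reprs: N x d representations of the input tokens (the "vocabulary")
    Returns a T x N unnormalized score matrix O: O[t][i] is the score of
    emitting input token i at output step t.
    """
    return [[sum(s * e for s, e in zip(st, tok)) for tok in token_reprs]
            for st in states]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]
```

Row-wise softmax of the resulting matrix then gives the per-step distribution over input tokens used in equation (4).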
In the pointer network, we use the input sequence as the vocabulary and output an unnormalized matrix, from which a probability matrix is obtained via softmax (see Figures 1(e) and 1(f)). This output probability matrix can be viewed as a trellis containing the transition probabilities from each input token to each of its neighbours. The optimal path traversing this trellis yields the most probable sequence of transitions and can be found with the well-known Viterbi algorithm (Viterbi, 1967).

NALM-pos
The NALM learns a probability distribution over sequences, which essentially allows a sequence to be reconstructed by treating the input as a bag of tokens. Yet modeling the underlying permutation mechanism between sequences and their permutations could also be useful for achieving better reordering. To capture this mechanism as well, we extend NALM with position information. Our extended model NALM-pos learns the following probabilities:

P(Y | Y') ≈ ∏_{t=0}^{T} P(y_t | y_{t-1}, Y')    (5)

The advantage of P(Y | Y') over P(Y | {Y}) is that it retains certain ordering information of the sequence, even though the input is a permutation of the ground-truth order. When much of the permuted order resembles the ground-truth order, retaining such position information significantly reduces the complexity of the learning task. In NAT, the reordering sub-module receives a transformed source sentence as input, which generally still follows the source word order, while the sub-module's target follows the target word order. These word orders are often partially shared between languages: the more similar the orders, the more monotonic the language pair. Even in a less monotonic language pair such as JA-EN, order is shared to some extent. This further accounts for the importance of incorporating position information.

Experimental Setup
We use the English Penn Treebank data (Marcus et al., 1993) in our evaluation, preprocessed (in various ways) as described in this section.

Data and evaluation
Following Hasler et al., 2017, we conduct experiments on the data preprocessed as in Schmaltz et al., 2016 for fair comparison.¹ This dataset is fully shuffled at the token level, and we refer to it as ptb2016. We further create 3 datasets based on the preprocessed data to simulate, at varying difficulty, reordering data that would more likely be encountered in NAT alignment. We start by n-gramizing the sentences to simulate phrases commonly found in phrase-based statistical machine translation. We argue that the different ordering between parallel sentences of two languages predominantly involves movement of such phrases (local orderings), so permuting these n-grams provides datasets that better resemble the challenges faced in real alignment during NAT. We employ two methods for permuting the n-grams: randomly permuting a percentage of them (0.4 and 0.6), or recursively displacing and combining adjacent pairs of n-grams with a preset probability (0.5). We refer to these as the r04, r06 and d05 datasets respectively. During n-gramization we use quadgrams with backoff and keep n-grams with count above 2.
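A minimal sketch of the random n-gram permutation used to build the r04/r06-style data. Fixed-length chunks stand in for the quadgram-with-backoff segmentation, and the function name and chunking are illustrative simplifications of the paper's procedure:

```python
import random

def permute_ngrams(tokens, n=4, ratio=0.4, seed=0):
    """Shuffle a fraction of a sentence's n-gram chunks.

    The sentence is split into contiguous chunks of up to n tokens
    (a simplification of quadgram-with-backoff segmentation); `ratio`
    of the chunks are then selected and randomly permuted among
    themselves, leaving the remaining chunks in place.
    """
    rng = random.Random(seed)
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    if len(chunks) < 2:
        return list(tokens)
    k = min(len(chunks), max(2, round(ratio * len(chunks))))
    chosen = sorted(rng.sample(range(len(chunks)), k))
    picked = [chunks[i] for i in chosen]
    rng.shuffle(picked)
    for i, c in zip(chosen, picked):
        chunks[i] = c
    return [t for c in chunks for t in c]
```

By construction the output is a bag-preserving permutation of the input, with local (intra-chunk) order intact.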

Model settings
For all the models, we follow the transformer_base_v3 hparams set as defined in tensor2tensor (Vaswani et al., 2018). We train each model for a total of 100k steps with a batch size of 65,536.

Benchmark models
We describe the reordering modules extracted from the existing NAT models involved in the evaluation as follows. For all the reordering modules, we apply the Hungarian algorithm (Kuhn, 1955) to the output matrix to obtain the final order of the permuted sequence.
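The assignment step above can be illustrated with the sketch below. For clarity it enumerates permutations by brute force; the Hungarian algorithm (e.g. SciPy's `linear_sum_assignment`) solves the same problem in O(n³), and the function name here is hypothetical:

```python
from itertools import permutations

def best_assignment(scores):
    """Maximum-score one-to-one assignment of input tokens to output slots.

    scores[t][i] is the model's score for placing input token i at output
    position t. Returns, for each output position, the index of the input
    token assigned to it. Brute force (n! permutations) for illustration
    only; the Hungarian algorithm computes the same result in O(n^3).
    """
    n = len(scores)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(scores[t][i] for t, i in enumerate(perm))
        if total > best:
            best, best_perm = total, perm
    return list(best_perm)
```

Unlike row-wise argmax decoding, the assignment guarantees each input token is placed exactly once, so the decoded output is always a valid permutation.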

ReorderNAT-r
The reordering module of ReorderNAT (Ran et al., 2021) is a transformer decoder, which we replace with an encoder given the monolingual setting of the word reordering task; the module is otherwise unchanged.

SNAT-r
SNAT (Shu et al., 2020) is an explicit modality reductionist model. Its alignment mechanism is achieved via latent regularization of the model's transformer decoder. We replace the decoder with an encoder, as suited to the word reordering task, and implement the regularization on top of it.

Distortion-r
The distortion model (Zhou et al., 2020) makes use of a distortion predictor that takes the encoder output as input to predict alignment. We retain their encoder and distortion predictor for the word reordering task, with the fertility predictor and the decoder removed. Their work tried both absolute and relative position information in the distortion predictor; we experiment only with the relative-position distortion predictor due to its reported superior performance.

AligNART-r
AligNART (Song et al., 2021) is currently the sota method among explicit NAT approaches. Its performance is also comparable to the sota in the field of NAT. For adaptation to the reordering task, we remove the decoder as well as the duplication predictor and the grouping predictor in the aligner, leaving only the permutation predictor. We use a 6-layer encoder to fit the task setting. We train the adapted model only by minimizing the KL divergence, i.e. the permutation predictor loss in the original work.²

Results

Word reordering on the Penn Treebank
Table 2 shows that NALM outperforms all other reordering mechanisms adapted from the existing NAT models by at least 8 BLEU. It furthermore surpasses the transformer baseline (b=5) by 1 BLEU. Since this paper aims to study reordering mechanisms in NAT, we do not include the baseline transformer's performance in higher beam settings, as the lengthy decoding time would defeat the purpose of fast and efficient NAT approaches. Even when pitted against past works with higher beam settings (e.g., b=64), however, NALM remains competitive.³ Among the benchmark models, Song et al., 2021 fails to converge during training, while Zhou et al., 2020 fails to recover any meaningful ordering even when fully trained. The disappointing performance of the adapted reordering mechanisms can be attributed to their deficiency in recovering sequences from random ordering, suggesting a heavy reliance on shared local orderings between the input and output sequences. Notably, NALM-pos does not perform as well as NALM. This illustrates that reordering models clearly expect the aforementioned shared local orderings: when such ordering information is removed, the positional information considered by NALM-pos and all adapted models only confuses learning and hampers performance.

On different degrees of permutation
The previous experiment assumes that input is shuffled at the token level. To evaluate on ordering tasks that better reflect real reordering situations in NAT alignment, we further test on the r04, r06 and d05 datasets, which permute the data at the n-gram level. Table 3 shows that NALM-pos leads all other adapted reordering mechanisms by at least 11 BLEU. According to the experimental results, the more permuted the data, the poorer the performance of all models, except NALM, which is invariant to input permutation. Interestingly, this simple design already outperforms all other adapted mechanisms on all datasets by 2-10 BLEU, showing great versatility in all settings of word permutation. Augmented with positional information, NALM-pos advances performance by a further 7-15 BLEU. We also report METEOR scores in this experiment, which largely reflect the same trend as BLEU.
We note that the adapted reordering mechanism of the sota NAT model does not perform well when it stands alone, suggesting the need for further investigation. As for Zhou et al., 2020, upon closer inspection we find that their reordering mechanism simply learns to copy. This also explains its poor performance in the first experiment: it simply copied the random input permutation, which scores terribly against the ground-truth sequence.

Conclusion
In this paper, we review reordering mechanisms of NAT models that directly model alignment, under various settings of word permutation. We propose a non-autoregressive language model which outperforms sota autoregressive models in low-beam settings and remains competitive in higher-beam settings. Our extended model further achieves significant improvements over all adapted NAT reordering mechanisms on datasets of varying difficulty that reasonably resemble the reordering task encountered in NAT alignment. The performance of existing reordering mechanisms in NAT models varies considerably in our experiments, implying that more effort is required in this area.

Limitations
We acknowledge that our experiment is a simplification of the real reordering problem in NAT alignment. The results can therefore only partially reflect the capability of the concerned models under optimal conditions. A better experiment would also include subword vocabularies in bilingual settings, with supervised reordering data. We leave this to future work.
Figure 1: Architectures of the reordering components adopted in the existing works and our proposed models. For example, ReorderNAT-r denotes the reordering mechanism in ReorderNAT.

Table 2: Hasler et al., 2017 word-ordering task on the ptb2016 dataset. Other than AttM, whose performance is reported from Tao et al., 2021, all previous works (indicated by *) are reported from Hasler et al., 2017. Autoregressive (AR) models are listed on the left and non-autoregressive (NAR) models on the right.

Table 3: BLEU and METEOR scores for the word reordering task on the r04, r06, and d05 datasets. Sample results are also provided in Tables 4 and 5 in Appendix A. Note that Distortion-r is removed from the table as it learns only to copy from the input permutation.