Viterbi Decoding of Directed Acyclic Transformer for Non-Autoregressive Machine Translation

Non-autoregressive models achieve significant decoding speedup in neural machine translation but lack the ability to capture sequential dependency. Directed Acyclic Transformer (DA-Transformer) was recently proposed to model sequential dependency with a directed acyclic graph. Consequently, it has to apply a sequential decision process at inference time, which harms the global translation accuracy. In this paper, we present a Viterbi decoding framework for DA-Transformer, which guarantees to find the joint optimal solution for the translation and decoding path under any length constraint. Experimental results demonstrate that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.


Introduction
Non-autoregressive translation (Gu et al., 2018) models achieve a significant decoding speedup but suffer from performance degradation, which is mainly attributed to the multi-modality problem. Multi-modality refers to the scenario where the same source sentence may have multiple translations with strong cross-correlations between target words. However, non-autoregressive models generally hold the conditional independence assumption on target words, which prevents them from capturing the multi-modal target distribution.
Recently, the Directed Acyclic Transformer (Huang et al., 2022) was proposed to model sequential dependency with a directed acyclic graph consisting of different decoding paths, which enables the model to capture multiple translation modalities. Although it has been proven effective, it cannot directly find the most probable translation with the argmax operation. Therefore, DA-Transformer has to apply a sequential decision process at inference time, which harms the global translation accuracy.
In this paper, we propose a Viterbi decoding (Viterbi, 1967) framework for DA-Transformer to improve the decoding accuracy. Using the Markov property of the decoding path, we can apply Viterbi decoding to find the most probable path, conditioned on which we can generate the translation with argmax decoding. Then, we further improve this decoding algorithm to perform a simultaneous search for decoding paths and translations, which guarantees to find the joint optimal solution under any length constraint. After Viterbi decoding, we obtain a set of translations with different lengths and rerank them to obtain the final translation. We apply a length penalty term in the reranking process, which prevents the generation of an empty translation (Stahlberg and Byrne, 2019) and enables us to control the translation length flexibly.
Experimental results on several machine translation benchmark tasks (WMT14 En↔De, WMT17 Zh↔En) show that our approach consistently improves the performance of DA-Transformer while maintaining a similar decoding speedup.
Preliminaries: DA-Transformer

Model Architecture
DA-Transformer is formed by a Transformer encoder and a directed acyclic decoder. The encoder and the decoder layers are the same as in the vanilla Transformer (Vaswani et al., 2017). On top of the decoder, the hidden states are organized as a directed acyclic graph, whose edges represent transition probabilities between hidden states.

Given a source sentence X = (x_1, ..., x_N), the decoder holds L hidden states, where the graph size L = λ·N and λ is a hyperparameter. The transitions between hidden states are normalized into a transition matrix E, where E_{a_i, a_{i+1}} is the transition probability from position a_i to position a_{i+1}. The translation probability from X to Y = (y_1, ..., y_M) marginalizes over all decoding paths A = (a_1, ..., a_M):

P_θ(Y|X) = Σ_A P_θ(Y|A, X) · P_θ(A|X), where P_θ(A|X) = Π_{i=1}^{M−1} E_{a_i, a_{i+1}} and P_θ(Y|A, X) = Π_{i=1}^{M} P_θ(y_i|a_i, X),

and P_θ(y_i|a_i, X) represents the translation probability of word y_i on the position a_i of the decoder.
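As a toy illustration of this factorization (a sketch with made-up numbers, not the authors' implementation), consider a hypothetical 4-state graph and a 2-word vocabulary:

```python
# E[s][t] is the transition probability from hidden state s to state t
# (nonzero only for t > s, so the graph is acyclic); P[t][y] is the
# prediction probability of word y at state t. States and words are
# 0-indexed in this sketch.

def path_prob(E, A):
    """P(A|X): product of transition probabilities along path A."""
    p = 1.0
    for s, t in zip(A, A[1:]):
        p *= E[s][t]
    return p

def word_prob(P, A, Y):
    """P(Y|A,X): product of word probabilities on the visited states."""
    p = 1.0
    for t, y in zip(A, Y):
        p *= P[t][y]
    return p

# Hypothetical 4-state graph (state 0 is the start, state 3 the end).
E = [[0.0, 0.6, 0.3, 0.1],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.0]]
P = [[0.9, 0.1],
     [0.2, 0.8],
     [0.7, 0.3],
     [0.4, 0.6]]

A = [0, 1, 3]   # one decoding path through the DAG
Y = [0, 1, 1]   # a translation of the same length
joint = path_prob(E, A) * word_prob(P, A, Y)   # P(Y|A,X) * P(A|X)
```

Summing this joint quantity over every path of length M from the first to the last state gives the marginal P_θ(Y|X) above.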

Training and Inference
The training objective of DA-Transformer is to maximize the log-likelihood log P_θ(Y|X), which requires marginalizing over all decoding paths A.
Using the Markov property of the decoding path, DA-Transformer employs dynamic programming to calculate the translation probability. Besides, it applies glancing training (Qian et al., 2021) with a hyper-parameter τ to promote the learning.
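The marginalization can be carried out with a forward-style dynamic program over the graph. A minimal sketch (toy numbers, assuming paths start at the first state and must end at the last):

```python
def marginal_prob(E, P, Y):
    # f[i][t]: probability of emitting the prefix Y[:i+1] along some path
    # ending at state t. Summing (instead of maximizing) over predecessor
    # states marginalizes over all decoding paths of length len(Y).
    L, M = len(E), len(Y)
    f = [[0.0] * L for _ in range(M)]
    f[0][0] = P[0][Y[0]]
    for i in range(1, M):
        for t in range(L):
            total = sum(f[i - 1][s] * E[s][t] for s in range(t))
            f[i][t] = total * P[t][Y[i]]
    return f[M - 1][L - 1]

# Hypothetical 4-state graph as a worked example.
E = [[0.0, 0.6, 0.3, 0.1],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.0]]
P = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
# Two length-3 paths reach the final state (0-1-3 and 0-2-3), so
# P(Y|X) sums both contributions: 0.3*0.432 + 0.3*0.162 = 0.1782.
prob = marginal_prob(E, P, [0, 1, 1])
```

The O(M·L²) recursion is why training with the exact marginal likelihood remains tractable.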
During inference, the objective is to find the most probable translation argmax_Y P_θ(Y|X). However, there is no known tractable decoding algorithm for this problem. Huang et al. (2022) proposed three approximate decoding strategies to find high-probability translations. The most intuitive strategy is greedy decoding, which sequentially takes the most probable transition as the decoding path and generates a translation according to the conditional probabilities. Lookahead decoding improves greedy decoding by taking the most probable combination of transition and prediction:

a_i, y_i = argmax_{a_i, y_i} P_θ(y_i|a_i, X) · P_θ(a_i|a_{i−1}, X).

Beam search decoding is a more accurate method that merges paths sharing the same prefix, which approximates the real translation probability and better represents the model's preference. Beam search can optionally be combined with an n-gram language model to further improve the performance. However, beam search is much slower than greedy and lookahead decoding.
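The difference between the two fast strategies can be sketched as a single decoding step on toy matrices (hypothetical numbers, chosen so the two strategies disagree):

```python
def greedy_step(E, P, prev):
    # Greedy: first pick the most probable transition from state `prev`,
    # then the best word at the chosen state.
    t = max(range(prev + 1, len(E)), key=lambda s: E[prev][s])
    y = max(range(len(P[t])), key=lambda w: P[t][w])
    return t, y

def lookahead_step(E, P, prev):
    # Lookahead: jointly pick the (state, word) pair maximizing the
    # product of transition and prediction probability.
    candidates = [(E[prev][s] * P[s][w], s, w)
                  for s in range(prev + 1, len(E))
                  for w in range(len(P[s]))]
    _, t, y = max(candidates)
    return t, y

E = [[0.0, 0.48, 0.47, 0.05],
     [0.0, 0.0,  0.4,  0.6],
     [0.0, 0.0,  0.0,  1.0],
     [0.0, 0.0,  0.0,  0.0]]
P = [[0.9, 0.1], [0.45, 0.55], [0.7, 0.3], [0.4, 0.6]]

# Greedy commits to the slightly stronger transition (state 1), while
# lookahead trades it for a much more confident word at state 2.
g = greedy_step(E, P, 0)      # (1, 1): 0.48 transition, 0.55 word
l = lookahead_step(E, P, 0)   # (2, 0): 0.47 * 0.7 = 0.329 > 0.264
```

Both remain one-step decisions; neither accounts for how the choice affects the rest of the path, which is the gap the Viterbi framework below addresses.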

Methodology
This section presents a Viterbi decoding framework for DA-Transformer to improve decoding accuracy.
We first develop a basic algorithm to find the optimal decoding path and then improve it to find the joint optimal solution of translations and decoding paths. Finally, we introduce a technique to rerank the Viterbi decoding outputs.

Optimal Decoding Path
Recall that the greedy decoding strategy sequentially takes the most probable transition as the decoding path, which may not be optimal since the greedy strategy does not consider long-term profits. In response to this problem, we propose a Viterbi decoding framework for DA-Transformer that guarantees to find the optimal decoding path argmax_A P_θ(A|X) under any length constraint. Specifically, we consider decoding paths of length i that end at position a_i = t, and use α(i, t) to represent the maximum probability of these paths. By definition, we set the initial state α(1, 1) = 1 and α(1, t) = 0 for t > 1. The Markov property of decoding paths enables us to sequentially calculate α(i, ·) from the previous step α(i − 1, ·):

α(i, t) = max_{t'} α(i − 1, t') · E_{t', t},  ψ(i, t) = argmax_{t'} α(i − 1, t') · E_{t', t},

where E is the transition matrix defined in Equation 2 and ψ(i, t) is the backtracking index pointing to the previous position. After L iterations, we obtain the score α(i, L) for every possible length i, and then we can find the optimal length with the argmax function:

M = argmax_i α(i, L).

After determining the length M, we can trace the best decoding path along the backtracking indices starting from a_M = L:

a_{i−1} = ψ(i, a_i), for i = M, ..., 2.

Finally, conditioning on the optimal path A, we can generate the translation with argmax decoding:

y_i = argmax_y P_θ(y|a_i, X).
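The recursion above can be sketched in a few lines of Python (0-indexed states and toy numbers; a sketch of the algorithm, not the authors' code):

```python
def viterbi(E):
    # alpha[i][t]: max probability of a path of length i+1 ending at
    # state t; psi[i][t]: backpointer to the previous state. Paths start
    # at state 0, and a path of length m must end at the last state.
    L = len(E)
    alpha = [[0.0] * L for _ in range(L)]
    psi = [[0] * L for _ in range(L)]
    alpha[0][0] = 1.0
    for i in range(1, L):
        for t in range(1, L):
            for s in range(t):
                p = alpha[i - 1][s] * E[s][t]
                if p > alpha[i][t]:
                    alpha[i][t], psi[i][t] = p, s
    return alpha, psi

def best_path(E):
    L = len(E)
    alpha, psi = viterbi(E)
    # The score of a candidate length m is alpha[m-1][L-1]; pick the best
    # length, then backtrack from the final state.
    m = max(range(2, L + 1), key=lambda m: alpha[m - 1][L - 1])
    path = [L - 1]
    for i in range(m - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return path[::-1]

# Hypothetical 4-state graph: the best path is 0 -> 1 -> 3 (0.7 * 0.6),
# beating both the shorter path 0 -> 3 (0.1) and the longer
# path 0 -> 1 -> 2 -> 3 (0.7 * 0.4 * 1.0 = 0.28).
E = [[0.0, 0.7, 0.2, 0.1],
     [0.0, 0.0, 0.4, 0.6],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.0]]
path = best_path(E)   # [0, 1, 3]
```

Unlike the greedy strategy, the maximization at each step keeps the best score for every (length, state) pair, so no long-term profit is discarded.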

Joint Optimal Solution
The decoding algorithm described above can be summarized as the following process:

A = argmax_A P_θ(A|X),  Y = argmax_Y P_θ(Y|A, X).

Even though the algorithm now finds the optimal decoding path, the translation on this path may have low confidence, resulting in a low joint probability P_θ(A, Y|X). We further improve the decoding algorithm to search for both decoding paths and translations, which guarantees to find the joint optimal solution:

A, Y = argmax_{A, Y} P_θ(A, Y|X).

Notice that when the path A is given, we can easily find the most probable translation Y with argmax decoding. Let Y_A denote the argmax decoding result under path A, where y_i^{a_i} = argmax_{y_i} P_θ(y_i|a_i, X) is the i-th word of Y_A. Then we can simplify our objective with Y_A:

max_{A, Y} P_θ(A, Y|X) = max_A P_θ(Y_A|A, X) · P_θ(A|X) = max_A P_θ(y_1^{a_1}|a_1, X) · Π_{i=1}^{M−1} E'_{a_i, a_{i+1}},

where we introduce a new transition matrix E' with E'_{a_i, a_{i+1}} = E_{a_i, a_{i+1}} · P_θ(y_{i+1}^{a_{i+1}}|a_{i+1}, X). Compared to max_A P_θ(A|X), the major difference is the transition matrix E', which considers both the transition probability and the prediction probability. Therefore, we can still apply the Viterbi decoding framework to find the joint optimal solution.
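Concretely, the modified transition matrix can be built from E and the per-state argmax word probabilities, after which the unchanged Viterbi recursion applies. A sketch with made-up numbers (the prediction probability of the start state is the constant prefactor above and can be folded into the initialization α(1, 1)):

```python
def joint_transitions(E, P):
    # E'[s][t] = E[s][t] * max_y P(y|t): fold the best word probability
    # at the target state into the transition weight, so maximizing the
    # product of E' entries maximizes the joint probability P(A, Y|X).
    best = [max(row) for row in P]
    n = len(E)
    return [[E[s][t] * best[t] for t in range(n)] for s in range(n)]

E = [[0.0, 0.7, 0.2, 0.1],
     [0.0, 0.0, 0.4, 0.6],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.0]]
P = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
E2 = joint_transitions(E, P)   # e.g. E2[0][1] = 0.7 * 0.8 = 0.56
```

Because only the matrix changes, Joint-Viterbi has the same asymptotic cost as the path-only Viterbi search.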
We use 'Viterbi' to denote the Viterbi decoding algorithm proposed in section 3.1, and 'Joint-Viterbi' to denote the improved algorithm in this section that finds the joint optimal solution. It is worth noting that Viterbi and Joint-Viterbi can be regarded as improvements to greedy decoding and lookahead decoding, respectively. Both greedy decoding and lookahead decoding consider only the one-step probability and find the next token with argmax_{a_i} P_θ(a_i|a_{i−1}, X) and argmax_{y_i, a_i} P_θ(y_i|a_i, X) · P_θ(a_i|a_{i−1}, X), respectively. In comparison, Viterbi and Joint-Viterbi consider the whole decoding path and guarantee to find the globally optimal solutions argmax_A P_θ(A|X) and argmax_{A,Y} P_θ(A, Y|X), respectively.

Reranking with Length Penalty
After Viterbi decoding, we have a set of translations of different lengths that can be ranked to obtain the most probable one. However, argmax decoding is biased toward short translations and may even degenerate to an empty translation, as also observed by Stahlberg and Byrne (2019).
To solve this problem, we introduce the hyperparameter β for length normalization (Wu et al., 2016) and modify Equation 6 to divide the score by a length penalty term. By setting the length penalty β to different values, we can flexibly control the translation length with little additional overhead, which is another appealing feature of our approach.
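As an illustration of the effect, with hypothetical scores and a simple m^β normalization assumed here (the exact form of Equation 6 is not reproduced above), dividing each candidate's log-probability by the penalty before the argmax shifts the choice away from degenerate short outputs:

```python
def rerank(logp_by_len, beta):
    # Pick the candidate length with the highest length-normalized score.
    return max(logp_by_len, key=lambda m: logp_by_len[m] / (m ** beta))

# Hypothetical Viterbi scores: log P(A, Y|X) for each candidate length.
scores = {1: -1.0, 3: -2.4, 5: -4.5}
no_penalty = rerank(scores, beta=0.0)    # raw argmax favors the shortest
with_penalty = rerank(scores, beta=1.0)  # normalization prefers length 3
```

Raising β further pushes the choice toward longer candidates, which is the length-control knob evaluated in section 4.5.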

Settings
We conduct experiments on WMT14 English↔German (En↔De, 4.5M pairs) and WMT17 Chinese↔English (Zh↔En, 20M pairs). These datasets are all encoded into subword units (Sennrich et al., 2016). We use the same preprocessed data and train/dev/test splits as Kasai et al. (2020). The translation quality is evaluated with sacreBLEU (Post, 2018) for WMT17 En-Zh and tokenized BLEU (Papineni et al., 2002) for the other benchmarks. We use a GeForce RTX 3090 GPU to train models and measure translation latency. Our models are implemented with the open-source toolkit fairseq (Ott et al., 2019).
We strictly follow the hyper-parameter settings of Huang et al. (2022) to reimplement DA-Transformer. We adopt Transformer-base (Vaswani et al., 2017) as the model architecture. We set dropout to 0.1, weight decay to 0.01, and label smoothing to 0.1 for regularization. We use λ = 8 for the graph size and linearly anneal τ from 0.5 to 0.1 for the glancing training. For fair comparisons, we tune the length penalty in [0.95, 1.05] to obtain a similar translation length as lookahead decoding. We train all models for 300K steps, where each batch contains approximately 64K source tokens. All models are optimized by Adam (Kingma and Ba, 2014) with β = (0.9, 0.999) and ε = 10^−8. The learning rate warms up to 5·10^−4 and then anneals after 10K steps with the inverse square-root schedule. We calculate the validation BLEU score every epoch and obtain the final model by averaging the best five checkpoints.

Main Results
As shown in Table 1, both Viterbi and Joint-Viterbi improve over their corresponding baselines. Joint-Viterbi achieves the best performance, outperforming the previous lookahead strategy by 0.33 BLEU. Besides, it is worth noting that the Viterbi decoding process is highly parallelizable, so it adds little overhead and reduces the decoding speedup by less than 1×.

Results with Knowledge Distillation
In this section, we evaluate the performance of our method with sequence-level knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016), where the target side of the training set is replaced by the output of an autoregressive teacher model. Experimental results in Table 2 show that the differences between decoding strategies become relatively small. Intuitively, we attribute this phenomenon to the improvement of model confidence. As knowledge distillation reduces the multi-modality of the dataset (Zhou et al., 2020; Sun and Yang, 2020), the model may become more confident in predicting target sentences, which makes the greedy strategy more likely to reach the optimum. To verify this, we measure the average entropy of the transition and prediction probabilities and evaluate the percentage of lookahead outputs that match the optimum argmax_{A,Y} P(A, Y|X) under their length. As Table 3 shows, DA-Transformer with distillation has smaller entropies and a larger percentage of optimal translations, which confirms our intuition.

Probability Analysis
Recall that the decoding objective is to find the most probable translation argmax_Y P(Y|X), while our approach finds the joint solution argmax_{A,Y} P(A, Y|X). Although there is a gap between them, we argue that optimizing the joint probability helps us achieve a higher translation probability. To verify this, we collect the outputs of lookahead decoding and Joint-Viterbi on the WMT14 En-De test set and compute their probabilities P(Y|X) by dynamic programming. We then calculate the average log-probability of each decoding strategy and evaluate the percentage of translations for which one strategy obtains a larger probability than the other.
As Table 4 shows, Joint-Viterbi outperforms lookahead decoding by a large margin, indicating that optimizing the joint probability also yields a higher average translation probability.

Effect of Length Penalty
Viterbi decoding can flexibly control the output length with the length penalty β. To show its effect, we vary β in Joint-Viterbi when decoding the WMT17 Zh-En test set and report the corresponding BLEU scores and average output lengths in Figure 1. The length penalty controls the output length almost linearly, which helps us obtain satisfactory translations. Generally, Viterbi decoding obtains better performance when the output length is closer to the reference length. Without a length penalty, simply finding the output with the maximum joint probability severely degrades translation quality due to extremely short outputs.
Figure 1: The effect of length penalty β measured on WMT17 Zh-En test set.

Related Works
Most non-autoregressive models can directly find the most probable output with argmax decoding, which is the fastest decoding algorithm. However, models of this type usually suffer from the multi-modality problem (Gu et al., 2018), leading to severe performance degradation. A relatively more accurate method is noisy parallel decoding, which requires generating multiple translation candidates and greatly increases the amount of computation. Many efforts have been made to address the multi-modality problem, including latent variable models (Kaiser et al., 2018; Ma et al., 2019; Shu et al., 2020; Bao et al., 2021, 2022), alignment-based models (Gu et al., 2018; Ran et al., 2021; Song et al., 2021), and better training objectives (Shao et al., 2019, 2020; Shan et al., 2021; Ghazvininejad et al., 2020; Du et al., 2021; Shao et al., 2021). However, these techniques are still not powerful enough and heavily rely on knowledge distillation (Kim and Rush, 2016).
Viterbi decoding has also been used in non-autoregressive models. In CRF-based NAT models, Viterbi decoding is applied to find the most probable output (Sun et al., 2019; Sun and Yang, 2020).

Conclusion
The current decoding strategies of DA-Transformer need to apply a sequential decision process, which harms the global translation accuracy.In this paper, we propose a Viterbi decoding framework for DA-Transformer to find the joint optimal solution of the translation and decoding path and further demonstrate its effectiveness on multiple benchmarks.

Limitations
Our method does not directly find the most probable translation argmax_Y P(Y|X) but instead finds the joint optimal solution argmax_{A,Y} P(A, Y|X). However, as we show in section 4.4, outputs with a higher joint probability usually also have a higher translation probability, suggesting that optimizing the joint probability is helpful.
Another limitation is that the improvements of our method are smaller in the knowledge distillation setting.However, the main advantage of DA-Transformer is that it does not heavily rely on knowledge distillation and achieves superior performance on raw data, which makes the impact of this limitation small.

Table 1 :
Results on WMT14 En↔De and WMT17 Zh↔En. M is the length of the target sentence. 'Iter' means the number of decoding iterations. The speedup is evaluated on the WMT14 En-De test set with a batch size of 1. † means significantly better than the baseline model (p < 0.05) under a statistical significance test with paired bootstrap resampling (Koehn, 2004).

Table 2 :
Results with knowledge distillation on WMT14 En-De test set.

Table 3 :
Statistics of DA-Transformer on the WMT14 En-De test set. 'kd' means knowledge distillation. 'T-' means transition and 'P-' means prediction.

Table 4 :
Probability analysis of lookahead and Joint-Viterbi decoding on the WMT14 En-De test set.