Hybrid-Regressive Paradigm for Accurate and Speed-Robust Neural Machine Translation



Introduction
Autoregressive translation (AT), exemplified by the Transformer, has been the de facto standard for neural machine translation (NMT) (Vaswani et al., 2017). However, AT predicts only one target word at a time, resulting in slow inference. To overcome this limitation, non-autoregressive translation (NAT) attempts to generate the entire target sequence in one step by assuming conditional independence among target tokens (Gu et al., 2018). While NAT offers efficiency, it often suffers from significant degradation in translation quality. Achieving a better trade-off between inference speed and translation quality remains an active area of research for NAT (Wang et al., 2018a; Ran et al., 2020; Qian et al., 2021; Huang et al., 2022b,a).
One of the most successful approaches to this issue is the iterative refinement mechanism (IR-NAT) proposed by Lee et al. (2018), which has been widely adopted by several leading systems (Ghazvininejad et al., 2019; Kasai et al., 2020a; Guo et al., 2020; Saharia et al., 2020; Geng et al., 2021; Huang et al., 2022b). Specifically, IR-NAT, also known as multi-shot NAT, takes the translation hypothesis from the previous iteration as a reference to refine the new translation, until the predefined iteration count I is reached or the translation no longer changes. Although a larger I can improve translation accuracy, it also erodes the speedup (Kasai et al., 2020b).
In this work, we build upon the findings of Kasai et al. (2020b) and examine the robustness of IR-NAT compared to AT. Our comprehensive examinations confirm that the inference speed of IR-NAT is consistently less robust than that of AT across decoding batch sizes and computing hardware. For example, when using a GPU, the ten-iteration non-autoregressive model has 1.7/1.2/0.7/0.4 times the inference speed of the AT model for decoding batch sizes of 1/8/16/32, respectively. However, when switching to CPU, the relative speed ratio drops to 0.8/0.4/0.3/0.3 times. Previous studies have highlighted the complementary nature of AT and NAT in terms of translation quality (AT being superior) and inference speed (NAT being superior) (Wang et al., 2018a; Ran et al., 2020). Our findings suggest that they are also complementary in the robustness of inference speed (AT being superior).
Taking a further step, we investigate through synthetic experiments how much target context (i.e., how many target tokens) is sufficient for one-shot NAT to rival multi-shot NAT. Our findings suggest that, given a well-trained CMLM model, even if 70% of AT translations are masked, the remaining target context can help CMLM_1 with greedy search compete with the standard CMLM_10 with beam search (see Figure 2). This could enable us to build the desired target context more cheaply, replacing expensive multiple iterations. To the best of our knowledge, this is the first study of the masking rate issue in the inference phase of NAT.

Figure 1: Relative speedup ratio (α) of CMLM compared to AT on GPU (solid) and CPU (dashed). α < 1 indicates that CMLM is slower than AT. CMLM_1 can be regarded as the representative of one-shot NAT in inference speed.
Based on the observations from these experiments, we propose a novel two-stage translation prototype called hybrid-regressive translation (HRT). This method combines the advantages of autoregressive translation (AT) and non-autoregressive translation (NAT) by first using an autoregressive decoder to generate a discontinuous target sequence with an interval of k (k > 1), and then filling the remaining slots in a lightweight non-autoregressive manner. We also design a multi-task learning framework, enhanced by curriculum learning, for effective and efficient training without adding any model parameters. Results on WMT En↔Ro, En↔De, and NIST Zh→En show that HRT outperforms prior work combining AT and NAT and is competitive with state-of-the-art IR-NAT models. Specifically, HRT achieves a BLEU score of 28.27 on the WMT En→De task and is 1.5x faster than AT regardless of batch size and device. Additionally, HRT equipped with a deep-encoder-shallow-decoder architecture achieves up to 4x/3x acceleration on GPU/CPU, respectively, without sacrificing BLEU.

Background
Given a source sentence $x = \{x_1, x_2, \ldots, x_M\}$ and a target sentence $y = \{y_1, y_2, \ldots, y_N\}$, there are several ways to model P(y|x):

Autoregressive Translation (AT) AT is the predominant technique in NMT, decomposing P(y|x) using the chain rule: $P(y|x) = \prod_{t=1}^{N} P(y_t \mid x, y_{<t})$, where $y_{<t}$ denotes the prefix translation generated before time step t. Nevertheless, autoregressive models must wait for $y_{t-1}$ to be generated before predicting $y_t$, thus hindering parallelism over the target sequence.
Non-Autoregressive Translation (NAT) NAT has been proposed to generate target tokens simultaneously (Gu et al., 2018). This approach replaces the traditional autoregressive input $y_{<t}$ with a target-independent input z, resulting in the following formulation: $P(y|x) = P(N \mid x) \times \prod_{t=1}^{N} P(y_t \mid x, z)$. Various approaches have been proposed for modeling z, such as using source embeddings (Gu et al., 2018; Guo et al., 2019), reordering the source sentence (Ran et al., 2019), or using a latent variable (Ma et al., 2019; Shu et al., 2019).
Iterative Refinement based Non-Autoregressive Translation (IR-NAT) IR-NAT extends the traditional one-shot NAT by introducing an iterative refinement mechanism (Lee et al., 2018). We choose CMLM (Ghazvininejad et al., 2019) as the representative of IR-NAT due to its excellent performance and simplicity. During training, CMLM randomly masks a fraction of the tokens in y as the alternative to z and is trained as a conditional masked language model (Devlin et al., 2019). Denoting $y^m$/$y^r$ as the masked/residual tokens of y, we have $P(y|x) = \prod_{t=1}^{|y^m|} P(y^m_t \mid x, y^r)$. At inference, CMLM deterministically masks tokens of the hypothesis from the previous iteration, $\hat{y}^{(i-1)}$, according to prediction confidences. This process is iterated until $\hat{y}^{(i)} = \hat{y}^{(i-1)}$ or i reaches the maximum iteration count.
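To make the refinement loop concrete, the following is a minimal sketch of mask-predict style inference. It assumes a hypothetical `model(x, y)` callable that returns a token and a confidence for every target position; it illustrates the procedure described above and is not the released CMLM implementation.

```python
MASK = "[MASK]"

def mask_predict(model, x, target_len, max_iter=10):
    """Sketch of CMLM iterative refinement (mask-predict)."""
    y = [MASK] * target_len                       # iteration 0: fully masked input
    prev = None
    for i in range(1, max_iter + 1):
        tokens, confidences = model(x, y)         # re-predict all positions in parallel
        if tokens == prev:                        # early stop: hypothesis unchanged
            return tokens
        prev = list(tokens)
        y = list(tokens)
        if i < max_iter:
            # deterministically re-mask the least confident tokens;
            # the number of masks decays with the iteration counter i
            n_mask = int(target_len * (1 - i / max_iter))
            worst = sorted(range(target_len), key=lambda j: confidences[j])[:n_mask]
            for j in worst:
                y[j] = MASK
    return prev
```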

Acceleration Robustness Problem
In this section, we comprehensively analyze the inference acceleration robustness problem of IR-NAT. Without loss of generality, we take CMLM as the agency of IR-NAT.

Problem Description The inference overhead of the autoregressive translation model mainly concentrates on the decoder side (Hu et al., 2020). Suppose that the decoder's computational cost is proportional to the size of its input tensor (B, N, H), where B is the batch size, N is the target sequence length, and H is the network dimension. We omit H for convenience due to its invariance in NAT and AT. Given a fixed test set, we use $T_D(\cdot)$ to denote the translation time on computing device D and define the relative speedup ratio between I-iteration NAT and AT as $\alpha = T_D(\text{AT}) / T_D(\text{NAT-}I)$ (Equation 1). Under the cost assumption above, α is roughly proportional to $1 / (I \times E(B, D))$, where $E(B, D) \le 1$ denotes the parallel computation efficiency over the sequence under batch size B and device D. When fixing N and I, α is completely determined by E(B, D). We note that most previous NAT studies only report the inference speed with D=GPU and B=1, without considering cases where B or D change.
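As a purely illustrative reading of this relation (not a measurement), the snippet below plugs made-up efficiency values into the approximation α ≈ 1 / (I × E(B, D)):

```python
# Illustrative only: hypothetical parallel-efficiency values, not measured numbers.
def relative_speedup(num_iterations, efficiency):
    """alpha ~ 1 / (I * E(B, D)) under the cost assumption above."""
    return 1.0 / (num_iterations * efficiency)

for efficiency in (0.05, 0.15, 0.30):
    print(f"E={efficiency:.2f}  I=1: {relative_speedup(1, efficiency):.1f}x  "
          f"I=10: {relative_speedup(10, efficiency):.1f}x")
# Small E (strong parallelism, e.g. GPU with batch 1) keeps multi-iteration NAT ahead of AT;
# as E grows (CPU, large batches), a 10-iteration model quickly falls behind (alpha < 1).
```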
Setup We systematically investigate the inference speed of CMLM and AT under varying environments, including batch size B ∈ {1, 8, 16, 32}, device D ∈ {GPU, CPU}, and the number of iterations I ∈ {1, 4, 10}, using a beam size of 5. We test inference speed on the widely used WMT En→De newstest2014 test set and report the average results over five runs (see Appendix A for details).

Results
We plot the curve of the relative speedup ratio (α) in Figure 1 and observe that: i. α decreases as the decoding batch size increases, regardless of the number of iterations, as noted by Kasai et al. (2020b).
ii. α on CPU is generally worse than on GPU, except when using one iteration.
iii. The benefit of non-autoregressive decoding is more likely to disappear as the number of iterations (I) increases.
For instance, when decoding a single sentence on the GPU, the inference speed of the ten-iteration non-autoregressive model is 170% that of the autoregressive model. However, when switching to batches of 32 on the CPU, the IR-NAT model reaches only 30% of the AT model's inference speed. These results demonstrate that AT and NAT possess different strengths, and combining the advantages of both models could be an effective way to achieve robust acceleration.

Synthetic Experiments
According to Equation 1, reducing the iteration count I helps to increase α. Recalling the refinement process of IR-NAT, we hypothesize that the essence of multiple iterations is to provide the decoder with a sufficiently good target context (deterministic target tokens). This raises the question: how many target tokens must be provided to make one-shot NAT competitive with IR-NAT? To answer this, we conduct synthetic experiments on WMT En→Ro and En→De, controlling the size of the target context by masking partial translations generated by a pre-trained AT model. We then use a pre-trained CMLM model to predict these masks and observe the BLEU score curves under different masking rates.

Models
We use the official CMLM models. Since the authors did not release the AT baselines, we retrained AT models on the same data with the standard Transformer-Base configuration (Vaswani et al., 2017), obtaining performance comparable to the official models (see Appendix B for details).
Decoding AT models decode with a beam size of 5 on both tasks. We then replace a certain percentage of AT tokens with [MASK] and feed the result to CMLM. The CMLM model iterates only once with beam size 1. We substitute all [MASK]s with CMLM's predictions to obtain the final translation. We report case-sensitive tokenized BLEU scores computed by multi-bleu.perl.
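A compact sketch of this pipeline, with hypothetical `at_translate`, `cmlm_fill`, and `mask_fn` callables standing in for the pre-trained models and the masking strategies of Section 4:

```python
MASK = "[MASK]"

def synthetic_run(at_translate, cmlm_fill, source, mask_fn, p_mask):
    """Mask part of an AT hypothesis and let a one-iteration CMLM restore it."""
    hypothesis = at_translate(source, beam=5)         # AT translation with beam 5
    masked = mask_fn(hypothesis, p_mask)              # e.g. HEAD / TAIL / RANDOM / CHUNK
    predictions = cmlm_fill(source, masked, beam=1)   # single CMLM iteration, greedy search
    # keep the exposed AT tokens, take CMLM's predictions only at the masked slots
    return [p if m == MASK else m for m, p in zip(masked, predictions)]
```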

Mask Strategies
We tested four strategies to mask AT results: HEAD, TAIL, RANDOM, and CHUNK.
Given the masking rate $p_{mask}$ and the translation length N, the number of masked tokens is $N_{mask} = \max(1, \lfloor N \times p_{mask} \rfloor)$. HEAD/TAIL always masks the first/last $N_{mask}$ tokens, while RANDOM masks tokens at random positions. CHUNK is slightly different from the above strategies: it first divides the target sentence into C chunks, where $C = \lceil N / k \rceil$ and k is the chunk size; then, in each chunk, we retain the first token and mask the other k−1 tokens. Thus, the actual masking rate in CHUNK is $1 - 1/k$ rather than $p_{mask}$. We ran RANDOM three times with different seeds to exclude randomness and report the average results.
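Below is a minimal, self-contained sketch of the four strategies applied to a token list; the helper names are ours and details such as tie-breaking are simplified.

```python
import math
import random

MASK = "[MASK]"

def head(tokens, p):
    """Mask the first N_mask tokens."""
    n = max(1, math.floor(len(tokens) * p))
    return [MASK] * n + tokens[n:]

def tail(tokens, p):
    """Mask the last N_mask tokens."""
    n = max(1, math.floor(len(tokens) * p))
    return tokens[:-n] + [MASK] * n

def random_mask(tokens, p, seed=0):
    """Mask N_mask tokens at random positions."""
    n = max(1, math.floor(len(tokens) * p))
    idx = set(random.Random(seed).sample(range(len(tokens)), n))
    return [MASK if i in idx else t for i, t in enumerate(tokens)]

def chunk(tokens, k):
    """Keep the first token of every chunk of size k, mask the other k-1."""
    return [t if i % k == 0 else MASK for i, t in enumerate(tokens)]

print(chunk(["the", "cat", "sat", "on", "the", "mat"], k=2))
# ['the', '[MASK]', 'sat', '[MASK]', 'the', '[MASK]']  -> actual masking rate 1 - 1/k
```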

Results
The experimental results in Figure 2 demonstrate that CHUNK is moderately but consistently superior to RANDOM, and both strategies significantly outperform HEAD and TAIL. We attribute this success to (1) the use of bidirectional context (Devlin et al., 2019) (vs. HEAD and TAIL) and (2) the uniform distribution of deterministic tokens (vs. RANDOM). Furthermore, when using the CHUNK strategy, we find that exposing 30% of the AT tokens as the decoder input is enough to make our CMLM_1 (beam=1) competitive with the official CMLM_10 (beam=5), which emphasizes the importance of a good partial target context.

Hybrid-Regressive Translation
We propose a novel two-stage translation paradigm, hybrid-regressive translation (HRT), which imitates the CHUNK process. In HRT, a discontinuous target sequence with chunk size k is generated autoregressively in stage I, followed by non-autoregressive filling of the skipped tokens in stage II.

Architecture
Overview HRT consists of three components: an encoder, a Skip-AT decoder (for stage I), and a Skip-CMLM decoder (for stage II). All components adopt the Transformer architecture (Vaswani et al., 2017). The two decoders have the same network structure, and we share their parameters so that the parameter size of HRT is the same as that of the vanilla Transformer. The only difference between the two decoders lies in the self-attention masking pattern: the Skip-AT decoder masks future tokens to guarantee strict left-to-right generation, like the autoregressive Transformer, whereas the Skip-CMLM decoder removes this mask to leverage bidirectional context, like CMLM (Ghazvininejad et al., 2019).
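The sketch below illustrates that single structural difference by printing the self-attention masks (1 = may attend, 0 = blocked) for a length-4 target; it is illustrative pseudocode rather than the actual implementation.

```python
def skip_at_mask(n):
    """Causal mask: position i may only attend to positions <= i (left-to-right)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def skip_cmlm_mask(n):
    """Full mask: every position may attend to every other (bidirectional context)."""
    return [[1] * n for _ in range(n)]

for row in skip_at_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```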
No Target Length Predictor Thanks to Skip-AT, we obtain the translation length as a by-product: $N = k \times N_{at}$, where $N_{at}$ is the sequence length produced by Skip-AT. Our approach has two major advantages over most NAT models, which jointly train a translation length predictor and the translation model. Firstly, there is no need to carefully tune the weighting coefficient between the sentence-level length prediction loss and the word-level target token prediction loss. Secondly, the length derived from Skip-AT is more accurate because it has access to the already-generated sequence information.

Training
Next, we elaborate on how to train the HRT model efficiently and effectively.Please refer to Appendix C for the entire training algorithm.
Multi-Task Framework We learn HRT by jointly training four tasks: two primary tasks (TASK-SKIP-AT, TASK-SKIP-CMLM) and two auxiliary tasks (TASK-AT, TASK-CMLM). All tasks use cross-entropy as the training objective. Figure 3 illustrates the differences in training samples among these tasks. Notably, TASK-SKIP-AT shrinks the sequence length from N to N/k while preserving the token positions of the original sequence. For example, in Figure 3(c), the positions of the TASK-SKIP-AT input ([B_2], y_2, y_4) are (0, 2, 4). The auxiliary tasks are necessary to leverage all tokens in the sequence, since the two primary tasks are limited by the fixed k. For example, in Figure 3(c) and (d), y_1 and y_3 cannot be learned as the decoder input of either TASK-SKIP-AT or TASK-SKIP-CMLM.
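As an illustration of how a TASK-SKIP-AT sample could be built, here is a sketch under our reading of Figure 3(c); the helper is hypothetical and ignores padding and sequences whose length is not a multiple of k.

```python
def skip_at_sample(y, k):
    """Build (decoder input, decoder output, positions) for TASK-SKIP-AT."""
    bos_k, eos = f"[BOS_{k}]", "[EOS]"
    dec_in = [bos_k] + y[k - 1::k]                 # chunk-level BOS + every k-th token
    dec_out = y[k - 1::k] + [eos]                  # shifted by one chunk
    positions = list(range(0, k * len(dec_in), k)) # 0, k, 2k, ... (original positions kept)
    return dec_in, dec_out, positions

print(skip_at_sample(["y1", "y2", "y3", "y4"], k=2))
# (['[BOS_2]', 'y2', 'y4'], ['y2', 'y4', '[EOS]'], [0, 2, 4])
```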
Curriculum Learning To ensure that the model is not overly biased towards the auxiliary tasks, we gradually transfer the training focus from the auxiliary tasks to the primary tasks through curriculum learning (Bengio et al., 2009). We start with a batch of original sentence pairs B and let the proportion of primary tasks in B be p_k=0, constructing training samples of TASK-AT and TASK-CMLM for all pairs; we then gradually increase p_k to introduce more learning signals for TASK-SKIP-AT and TASK-SKIP-CMLM, until p_k=1. We schedule p_k by $p_k = (t/T)^{\lambda}$, where t and T are the current and total training steps, and λ is a hyperparameter set to 1 for a linear increase.
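A minimal sketch of this schedule follows; sampling a task per sentence pair is our simplification of "the proportion of primary tasks in B".

```python
import random

def primary_task_prob(t, T, lam=1.0):
    """p_k = (t / T) ** lambda: share of a batch devoted to the primary tasks at step t."""
    return (t / T) ** lam

def assign_tasks(batch, t, T, lam=1.0, seed=0):
    """Route each sentence pair to the primary (Skip-AT/Skip-CMLM) or auxiliary (AT/CMLM) tasks."""
    rng = random.Random(seed)
    p_k = primary_task_prob(t, T, lam)
    return ["primary" if rng.random() < p_k else "auxiliary" for _ in batch]

for step in (0, 50_000, 100_000):
    print(step, primary_task_prob(step, 100_000))  # 0.0, 0.5, 1.0 with lambda=1 (linear)
```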
Figure 4: The decoding process of HRT with k=2 (Skip-AT decoder, then appended masks, then Skip-CMLM decoder). For the sake of clarity, we omit the source sentence.

Inference
As illustrated in Figure 4, HRT adopts a two-stage generation strategy. In the first stage, the Skip-AT decoder autoregressively generates a discontinuous target sequence ŷ_at = (z_1, z_2, ..., z_m) with chunk size k, where $z_i = \hat{y}_{i \times k}$, starting from [BOS_k] and ending with [EOS]. Then, the input of the Skip-CMLM decoder, y_nat, is constructed by appending k−1 [MASK]s before every z_i. The final translation is generated by replacing all [MASK]s with the predicted tokens after one iteration of the Skip-CMLM decoder. If multiple [EOS]s exist, we truncate at the first [EOS]. The beam sizes b_at and b_nat can differ from each other, as long as b_at ≥ b_nat. In our implementation, we use standard beam search in Skip-AT (b_at > 1) and greedy search in Skip-CMLM (b_nat = 1); Table 3 provides more details on the beam size settings in HRT. The translation hypothesis with the highest score S(ŷ) is chosen, where S(ŷ) sums the Skip-AT score $\sum_{i=1}^{m} \log P(z_i \mid x, z_{<i})$ and the Skip-CMLM score over the tokens filled in at the [MASK] positions.
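The sketch below outlines the two-stage procedure with hypothetical `skip_at` and `skip_cmlm` callables standing in for the two parameter-shared decoders; beam handling and batching are omitted.

```python
MASK, EOS = "[MASK]", "[EOS]"

def hrt_decode(skip_at, skip_cmlm, x, k=2):
    """Two-stage HRT inference sketch (single sentence, no beams)."""
    # Stage I: autoregressively generate every k-th target token (chunk tokens z_1..z_m)
    z = skip_at(x)                                   # e.g. ["cat", "on", "mat", "[EOS]"]
    if EOS in z:
        z = z[: z.index(EOS)]                        # drop the chunk-level [EOS]
    # Stage II: place k-1 [MASK]s before every chunk token, then fill them in one pass
    y_nat = []
    for z_i in z:
        y_nat.extend([MASK] * (k - 1) + [z_i])
    filled = skip_cmlm(x, y_nat)                     # one non-autoregressive iteration, greedy
    y = [f if t == MASK else t for t, f in zip(y_nat, filled)]
    return y[: y.index(EOS)] if EOS in y else y      # truncate at the first [EOS]
```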

Discussion
The basic idea of HRT is to apply autoregressive translation (AT) and non-autoregressive translation (NAT) in sequence. This concept has been investigated before by Kaiser et al. (2018), Ran et al. (2019), and Akoury et al. (2019). The main difference between these methods lies in the content of the AT output, such as latent variables (Kaiser et al., 2018), reordered source tokens (Ran et al., 2019), or syntactic labels (Akoury et al., 2019). In contrast, our approach uses deterministic target tokens, following Ghazvininejad et al. (2019).
HRT is also related to chunk-wise decoding, another line of work combining AT and NAT. Table 1 summarizes the differences between HRT and prior studies, including SAT (Wang et al., 2018a), RecoverSAT (Ran et al., 2020), and LAT (Kong et al., 2020). SAT and LAT follow a left-to-right generation order, behaving similarly to HEAD as described in Section 4. In contrast, RecoverSAT and HRT generate discontinuous target contexts, which perform better than HEAD according to our synthetic experiments. However, RecoverSAT cannot accurately generate the discontinuous context "a, c, e" via non-autoregression, resulting in error propagation when generating "b, d, f", whereas HRT produces "a, c, e" through accurate autoregression. Additionally, although HRT requires more decoding steps, its non-autoregressive stage is inexpensive thanks to greedy search, while the other methods require larger beams to explore translations of different lengths.

Experimental Results
Setup We conduct experiments on five tasks, including WMT'16 English↔Romanian (En↔Ro, 610k), WMT'14 English↔German (En↔De, 4.5M), and the long-distance language pair NIST Chinese→English (Zh→En, 1.8M). For fair comparisons, we replicate the same data processing as Ghazvininejad et al. (2019) on the four WMT tasks and follow the setup of Wang et al. (2018b) for Zh→En. Like previous work, we train HRT through sequence-level knowledge distillation (Kim and Rush, 2016). Specifically, we use the standard Transformer-Base as the teacher model for En↔Ro and Zh→En, and the deep PreNorm Transformer-Base with a 20-layer encoder for the harder En↔De. We run all experiments on four 2080Ti GPUs. Unless noted otherwise, we use chunk size k=2. We fine-tune HRT models on pre-trained AT models and take 100k/300k/100k training steps for En↔Ro/En↔De/Zh→En, respectively. Other training hyperparameters are the same as in Vaswani et al. (2017) or Wang et al. (2019) (deep encoder). We report both case-sensitive tokenized BLEU scores and SacreBLEU. We also report COMET, as suggested by Helcl et al. (2022).

We first verify the influence of the two beam sizes of HRT (b_at and b_nat) on the BLEU score and relative speedup ratio by testing three different setups. The results are listed in Table 3. Consistent with our observations in the synthetic experiments, using b_nat=1 only slightly reduces BLEU but significantly improves decoding efficiency. Considering the trade-off between translation quality and speed, and for a fair comparison with other baselines (most prior related work uses a beam size of 5), we use b_at=5 and b_nat=1 unless otherwise stated.

Table 2: Results of BLEU, SacreBLEU (denoted by †), and COMET on the four WMT tasks. By default, the models have a 6-layer encoder and a 6-layer decoder; the subscript X−Y denotes an X-layer encoder and a Y-layer decoder. Boldface results are significantly better (p<0.01) than those of the autoregressive counterparts with the same network capacity trained on raw data, as indicated by paired bootstrap resampling (Koehn, 2004).

Main Results
We compare the performance of HRT with existing systems from different translation paradigms on the four WMT tasks, as shown in Table 2. HRT trained on distillation data consistently outperforms HRT trained on raw data and most existing NAT, IR-NAT, and semi-NAT models, obtaining a BLEU score of 28.27 on the widely used En→De task. Compared to the re-implemented typical semi-autoregressive model (SAT) and one-shot non-autoregressive model (GLAT-CTC), HRT obtains an improvement of approximately 1.7 BLEU points, with a more significant margin in COMET score. Moreover, HRT_20-6 improves by a further 0.7 BLEU and 0.02 COMET when using a deeper encoder. Interestingly, the BLEU and COMET results are not always consistent, as also observed by Helcl et al. (2022): for instance, HRT_20-6 has a higher BLEU score than AT_20-6 on En→De, but its COMET score is still lower. Furthermore, the results on the Zh→En task, reported in Table 4, demonstrate that the effectiveness of HRT is agnostic to the language pair, as it is close or superior to the original AT and CMLM models. We attribute this to two reasons: (1) HRT is fine-tuned on a well-trained AT model; (2) multi-task learning over autoregressive and non-autoregressive tasks provides better regularization than training each task alone.

Analysis
Impact of Chunk Size We test the chunk size k on the En→De task, as shown in Table 5. We observe that larger values of k yield a more significant speedup on the GPU, as fewer autoregressive steps are required. However, as k increases, the performance of HRT drops sharply; for example, k=4 is about 1.24 BLEU points lower than k=2 on the test set. This suggests that the training difficulty of Skip-AT increases as k becomes larger. We leave the investigation of more sophisticated training algorithms to address this for future work.

Which decoding stage is more important? To understand the importance of the two decoding stages of HRT, we exchange the intermediate results of two HRT models (A and B). Specifically, we use the Skip-AT decoder of A to generate its discontinuous target sequence, which is then force-decoded by B's Skip-AT decoder to obtain the corresponding representations and autoregressive model scores. Finally, B's Skip-CMLM decoder generates the complete translation based on these. The roles of A and B can also be reversed. We use two models with a large performance gap (HRT and HRT_20-6) as A and B, respectively. As shown in Table 6, using the stage-I result of the strong model brings a greater improvement (+0.61 BLEU) than using its stage-II result (+0.15 BLEU). This supports our hypothesis that a good partial target context is essential.
Deep-Encoder-Shallow-Decoder Architecture Kasai et al. (2020b) showed that AT with a deep-encoder-shallow-decoder architecture can speed up translation without sacrificing accuracy, whereas CMLM fails to do so. To validate whether HRT inherits this property, we compare HRT and AT with a 12-layer encoder and a 1-layer decoder (HRT_12-1 and AT_12-1), using the same distillation data. As Table 7 shows, both AT_12-1 and HRT_12-1 benefit from this layer allocation, achieving BLEU scores comparable to the vanilla models with roughly double the decoding speed. Specifically, HRT_12-1 achieves an average acceleration of 4.2x/3.1x over the AT baselines. This suggests that HRT's success here stems from Skip-AT rather than Skip-CMLM. However, its COMET scores are lower than those of the 6-6 architecture; we leave further investigation of the correlation between decoder depth and COMET for future work.

Ablation Study
In Table 8, we conduct an ablation study on the En→De task to investigate the contributions of fine-tuning from a pre-trained AT model (FT), training steps (TS), and curriculum learning (CL). We test two settings for CL: fixing p_k=1 is equivalent to removing the auxiliary tasks, while fixing p_k=0.5 assigns the same probability to the primary and auxiliary tasks. The results show that all components contribute to the performance, but CL and TS are the most critical, with reductions of 0.74 and 0.45 BLEU points, respectively. Excluding all components from the vanilla HRT (-ALL) leads to a total reduction of 1.68 BLEU points.
Case Study Table 9 presents a translation example from the En→De validation set. Comparing CMLM_5 and HRT, both having the same masking rate (50%), two main distinctions can be observed: (1) the distribution of masked tokens in CMLM is more discontinuous than in HRT (blue marks); (2) the decoder input of HRT contains more accurate target tokens than that of CMLM, thanks to the Skip-AT decoder (wavy marks). These differences make our model more effective at producing high-quality translations than CMLM and suggest that our model can generate appropriate discontinuous sequences.

Conclusion
We noted that IR-NAT has robustness issues with inference acceleration. Inspired by our findings in the synthetic experiments, we proposed HRT to take advantage of the strengths of both AT and NAT.
Our experiments demonstrated that our approach surpasses existing semi-autoregressive and IR-NAT methods, providing competitive performance and consistent speedup, making it a viable alternative to autoregressive translation.

Limitations
The main limitation of HRT is that its upper bound on inference speedup is lower than that of single-iteration NAT under the same network architecture. As demonstrated in Appendix A, the average speedup of single-iteration NAT (e.g., CMLM_1) is 4.7x/3.2x on GPU/CPU, respectively, while that of HRT is 1.7x/1.6x. To achieve higher acceleration, HRT needs to employ the deep-encoder-shallow-decoder architecture. Increasing the chunk size is a simple way to reduce the autoregressive cost, yet it results in severe BLEU degradation (see Table 5). Further research should be conducted to maintain high translation performance with fewer autoregressive prompts.

Figure 2: Comparison of four masking strategies {HEAD, TAIL, RANDOM, CHUNK} in synthetic experiments on WMT En→Ro (left) and En→De (right) test sets. For CHUNK, we test chunk sizes from {2, 3, 4}. Dashed lines are the official CMLM scores. b stands for "beam size" and I stands for "the number of iterations".

Figure 3: Examples of training samples for the four tasks, where (a) and (b) are auxiliary tasks and (c) and (d) are primary tasks. To make the explanation clearer, the source sequence has been omitted. The special tokens [BOS], [EOS], [PAD], and [MASK] are represented by [B], [E], [P], and [M], respectively. Additionally, [B_2] is the [BOS] for k=2, and the loss at [P] is not taken into account.
Table 1: Generation processes of HRT and related chunk-wise decoding methods (→ denotes a decoding step).
Method | Generation
SAT | a, b → c, d → e, f
RecoverSAT | a, c, e → b, d, f
LAT | {a → b → c, d → e → f}
HRT (ours) | a → c → e, then b, d, f (non-autoregressive)

Table 4: BLEU scores on the NIST Zh→En task.

Table 5: Effects of chunk size (k) on BLEU and α.

Table 8: Ablation study on the En→De task.

Table 9: A case study from the En→De validation set. Blue denotes that the original input is [MASK]. We add a wavy line under the target context tokens (black) that hit the reference translation. We also report the 5th iteration of CMLM_10, which has a masking rate close to HRT's.

Table 10: Comparison of BLEU score, elapsed time, and relative speedup ratio (α) when decoding En→De newstest2014 under different settings. We use b_at=5, b_nat=1, and k=2 for HRT unless otherwise stated. HRT (b_at=5, b_nat=5) cannot decode with batch size 32 on GPU (denoted by N/A) due to insufficient GPU memory. We bold the best results. Green denotes results worse than the AT baseline.

Table 11: The performance of the autoregressive models in the synthetic experiment.