Helping the Weak Makes You Strong: Simple Multi-Task Learning Improves Non-Autoregressive Translators

Recently, non-autoregressive (NAR) neural machine translation models have received increasing attention for their efficient parallel decoding. However, the probabilistic framework of NAR models requires a conditional independence assumption on target sequences, falling short of characterizing human language data. This drawback results in less informative learning signals for NAR models under conventional MLE training, and thus in unsatisfactory accuracy compared to their autoregressive (AR) counterparts. In this paper, we propose a simple and model-agnostic multi-task learning framework that provides more informative learning signals. During the training stage, we introduce a set of sufficiently weak AR decoders that rely solely on the information provided by the NAR decoder to make predictions, forcing the NAR decoder to become stronger or else it will be unable to support its weak AR partners. Experiments on WMT and IWSLT datasets show that our approach consistently improves the accuracy of multiple NAR baselines without adding any decoding overhead.


Introduction
State-of-the-art neural machine translation (NMT) systems are mainly autoregressive (AR) models (Bahdanau et al., 2015; Vaswani et al., 2017), which decompose the joint probability of a token sequence in left-to-right order, modeling the dependency of each token on its preceding ones. Despite their strong performance, such sequential decoding incurs considerable latency and thus unsatisfactory efficiency.
In contrast, non-autoregressive (NAR) translation models (Gu et al., 2018) permit potentially more efficient parallel decoding. To do so, NAR models must adopt a notorious conditional independence assumption on target sequences as a trade-off. This assumption, however, is probabilistically insufficient to describe the highly multi-modal nature of human language data, posing severe challenges for NAR models by yielding less informative learning signals and gradients under conventional MLE training. As a result, NAR models often exhibit implausible neural representations, especially in the decoder, which governs generation, resulting in a significant performance sacrifice. To close the accuracy gap, most previous studies aim to improve the modeling of dependencies with more conditional information (Qian et al., 2021; Ghazvininejad et al., 2019). We argue that these research efforts are equivalent to providing better alternative learning signals without changing the NAR models' probabilistic framework. However, most of these methods require specific modifications to the commonly used Transformer architecture.

Figure 1: Illustration of our approach. We introduce a set of auxiliary weak AR decoders, each of which must make its predictions relying solely on the information contained in the NAR decoder hidden states. Because the AR decoders are parameterized as weakly as possible, the information provided by the NAR decoder must be sufficiently useful for them to predict the target sequence, which in turn forces the NAR decoder to become stronger.
A natural question thus arises: can we encourage the NAR decoder to learn from signal sources that are more informative than those permitted by the conditional independence assumption, in order to better capture target dependencies? It would be even more advantageous if such a method were also modification-free with respect to model architectures and could be used with all current NAR systems.
In this paper, we propose a simple multi-task learning framework that introduces auxiliary weak AR decoders to make NAR models stronger. The key idea is to parameterize the auxiliary AR decoders as weakly as possible and force them to predict target sequences based solely on the NAR decoder's hidden representations, such that they can no longer model the target sequence on their own unless the knowledge provided by the NAR decoder is sufficiently useful. As a result, the NAR decoder has no choice but to become stronger in order to support its poorly parameterized AR partners. Our approach is plug-and-play and model-agnostic, and the weak AR decoders are discarded at inference time, resulting in no additional decoding overhead.
We empirically evaluate our approach on several classes of NAR models, including the vanilla NAR Transformer (Gu et al., 2018) and its CTC-based variant (Libovický and Helcl, 2018; Saharia et al., 2020). Experiments on the widely used WMT14 English-to-German, WMT16 English-to-Romanian, and IWSLT14 German-to-English benchmarks show that our approach consistently helps build more accurate NAR models over strong baselines.

Preliminary
Neural machine translation (NMT) is formally defined as a conditional probability model p(y|x; θ) parameterized by a deep neural network. Given an input sequence x = (x_1, x_2, ..., x_m), a neural autoregressive model (Bahdanau et al., 2015; Vaswani et al., 2017) predicts the target sequence y = (y_1, y_2, ..., y_n) sequentially, decomposing p(y|x; θ) with the autoregressive factorization:

p(y|x; θ) = ∏_{t=1}^{n} p(y_t | y_{<t}, x; θ),

where θ is the set of model parameters. Although this factorization has achieved great success, its sequential prediction may cause high decoding latency and error accumulation during inference, especially for long sentences.
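As a hedged toy illustration of why the chain-rule factorization forces sequential decoding, the sketch below uses invented conditional distributions over a three-token vocabulary (the distributions and function names are ours, purely for illustration):

```python
# Invented conditional distributions p(y_t | y_<t, x) over a toy
# vocabulary; each one depends on the prefix generated so far.
def cond_prob(prefix):
    if prefix == ():
        return {"a": 0.8, "b": 0.15, "<eos>": 0.05}
    if prefix == ("a",):
        return {"a": 0.1, "b": 0.7, "<eos>": 0.2}
    return {"a": 0.1, "b": 0.1, "<eos>": 0.8}

def greedy_decode(max_len=5):
    """Left-to-right greedy decoding: n tokens require n sequential
    steps, because each conditional needs the prefix emitted so far."""
    y = []
    for _ in range(max_len):
        dist = cond_prob(tuple(y))
        tok = max(dist, key=dist.get)
        if tok == "<eos>":
            break
        y.append(tok)
    return y
```

The loop body cannot be parallelized across positions, which is exactly the latency bottleneck discussed above.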

Non-autoregressive Translation.
To address these problems, Gu et al. (2018) proposed the non-autoregressive Transformer, which builds on a conditional independence assumption among target tokens and models p(y|x; θ) with a per-token factorization:

p(y|x; θ) = ∏_{t=1}^{n} p(y_t | x; θ).

As a result, NAR models can speed up inference by predicting all target words simultaneously, improving efficiency significantly.
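Under the independence assumption, every position can be scored and decoded in one parallel step. A minimal sketch with invented per-position marginals (all numbers are illustrative):

```python
import math

# Hypothetical per-position marginals p(y_t | x) for a length-2 target
# over a toy vocabulary; under the independence assumption these
# marginals are all the model provides.
marginals = [
    {"a": 0.7, "b": 0.2, "c": 0.1},  # position 1
    {"a": 0.1, "b": 0.6, "c": 0.3},  # position 2
]

def nar_log_prob(y):
    """log p(y|x) = sum_t log p(y_t | x): no term sees other positions."""
    return sum(math.log(marginals[t][tok]) for t, tok in enumerate(y))

# Argmax decoding is a single parallel step: every position picks its
# best token independently of the others.
decoded = [max(dist, key=dist.get) for dist in marginals]
```

Because no term conditions on the other positions, the model cannot express multi-modal targets (e.g., two equally good translations), which is the root of the accuracy gap discussed next.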
However, as noted in Gu et al. (2018), the target-side conditional independence assumption prevents NAR models from capturing complex dependencies among target tokens, thereby significantly hurting accuracy. To mitigate this, one line of work modifies the training objective (Libovický and Helcl, 2018; Wang et al., 2019; Shao et al., 2020; Ghazvininejad et al., 2020; Qian et al., 2021; Du et al., 2021), while another uses latent variables to enhance modeling (Kaiser et al., 2018; Shu et al., 2020; Bao et al., 2021, 2022). In addition, several studies propose iterative models, which perform iterative refinement of translations based on previous predictions (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Kasai et al., 2020). The work most related to this paper is Hao et al. (2021), which shows that an additional AR decoder can help the encoder of NAR models capture more linguistic knowledge.

Methodology
In this section, we describe our simple yet effective multi-task learning framework, including the model architecture and the training scheme.
Model Architecture. The overall illustration of our approach is depicted in Figure 1. Specifically, for every NAR decoder layer, we introduce an auxiliary weak AR decoder, each parameterized by a single Transformer layer, i.e., as weak as possible. These AR decoders cannot capture the underlying structure of target sequences on their own unless their paired NAR decoder layers provide useful neural representations. As a result, the NAR decoder layers additionally learn from these informative task signals and are forced to capture sufficient context and dependency information in order to support their weak AR partners.
Training Objective. Our training objective consists of two parts, for the NAR model of interest and for the auxiliary weak AR decoders, respectively. For the NAR part, we keep the original model-specific training objective unaltered; for instance, we apply the CTC loss for CTC-based NAR models (Saharia et al., 2020). For the AR decoders, we apply the cross-entropy loss. The final loss is a weighted sum of the two components:

L = L_NAR + λ ∑_{i=1}^{N} L_AR^{(i)},

where N is the number of NAR decoder layers, L_NAR and L_AR^{(i)} denote the NAR loss and the AR loss of the i-th weak decoder, respectively, and λ is a predefined weight.
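As a minimal sketch, the weighted combination can be computed as follows (the function name and the exact weighting scheme are illustrative; the paper only specifies a weighted sum with a predefined λ):

```python
def total_loss(nar_loss, ar_losses, lam=0.5):
    """Weighted sum: the unaltered model-specific NAR objective plus
    lambda times the summed cross-entropy losses of the per-layer
    weak AR decoders."""
    return nar_loss + lam * sum(ar_losses)
```

At inference time only the NAR term's model is kept, so the AR losses influence training gradients but add nothing to decoding cost.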
Glancing Training. Previous studies show that glancing training (Qian et al., 2021) can considerably improve the translation quality of non-iterative NAR models, so we apply this technique to our method. Specifically, we first randomly sample reference tokens as NAR decoder inputs, following Qian et al. (2021), and then let the weak AR decoders make predictions based on the NAR decoder hidden states.
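A rough word-level sketch of the glancing-style sampling step (the helper name and the linear glancing ratio are our own illustration, not the exact recipe of Qian et al. (2021)):

```python
import random

def glance(decoder_inputs, reference, first_pass_pred, ratio=0.5):
    """Replace a random subset of NAR decoder inputs with reference
    tokens; the number revealed scales with how many tokens the
    first-pass prediction got wrong, so fewer tokens are revealed
    as the model improves."""
    wrong = sum(p != r for p, r in zip(first_pass_pred, reference))
    n_glance = int(ratio * wrong)          # tokens to reveal this step
    positions = random.sample(range(len(reference)), n_glance)
    mixed = list(decoder_inputs)
    for i in positions:
        mixed[i] = reference[i]            # reveal the reference token
    return mixed
```

In our framework, the mixed inputs feed the NAR decoder, and the weak AR decoders then predict from the resulting hidden states.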
Minimizing Training Cost. The main cost of our method is additional training-time computation and memory. We employ two techniques to reduce this cost: (a) Parameter sharing across AR decoders. As all AR decoders are homogeneous, we tie their parameters to reduce the total number of parameters.
(b) Layer dropout for AR decoders. Pairing every NAR decoder layer with its AR decoder partner at every step is fairly inefficient, so we randomly select half of the AR decoders, instead of all of them, for multi-task learning at each training step.
Both strategies help make the training cost affordable without losing accuracy gains.
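The two strategies can be sketched as follows (illustrative Python; the names are ours, and the real implementation would operate on Transformer modules rather than placeholders):

```python
import random

# (a) Parameter sharing: instantiate ONE weak AR decoder and reuse it
# for every NAR layer, instead of N independent copies.
shared_ar_decoder = {"params": "single tied parameter set"}

def sample_active_layers(num_nar_layers):
    """(b) Layer dropout: each training step pairs only a random half
    of the NAR decoder layers with the shared weak AR decoder."""
    half = num_nar_layers // 2
    return sorted(random.sample(range(num_nar_layers), half))
```

Together, (a) keeps the parameter count flat in the number of NAR layers, and (b) halves the extra forward/backward cost per step.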
Inference. We use only the NAR decoder for inference; the AR decoders are used only during training and are discarded afterward. Therefore, our approach incurs no additional decoding overhead.

Experiments
Experimental Settings. We conduct experiments on the most widely used machine translation benchmarks: WMT14 English-German (WMT14 EN-DE, 4.5M translation pairs), WMT16 English-Romanian (WMT16 EN-RO, 610K pairs), and IWSLT14 German-English (IWSLT14 DE-EN, 160K pairs). We follow Gu and Kong (2021) for data preprocessing and use BLEU (Papineni et al., 2002) as the evaluation metric. To alleviate the multi-modality problem, we train on sequence-level knowledge distillation data (Hinton et al., 2015) for all datasets, as in Gu et al. (2018).

Main Results
Our approach achieves superior results compared to existing strong NAR systems. Table 1 presents our main results on the benchmarks. Our method significantly improves translation quality and outperforms other strong baseline models, and applying the glancing training technique yields further gains. Compared with CMLM, which employs iterative decoding, our model achieves higher performance while using single-step generation. Hao et al. (2021)'s work is related to ours in that it also uses a multi-task framework; we reproduce their method on the CTC-based NAR model, and the results show that our method achieves larger improvements. Compared with the strong autoregressive Transformer teacher (Vaswani et al., 2017), our model further closes the performance gap, and when decoding with beam search, it outperforms the Transformer on every dataset.
Our model-agnostic approach boosts several classes of NAR models. We use the vanilla NAR model (Gu et al., 2018) and CTC (Saharia et al., 2020) as baselines and apply our multi-task learning approach to each. The results are shown in Table 2: our method consistently and significantly improves translation quality for each baseline model and each language pair, illustrating its generality.

Analysis
Does the weakness of the AR decoders really matter? Recall that we make the AR decoders sufficiently weak to force the NAR decoder to be strong. But how does the capacity of the AR decoders affect the efficacy of our approach? We conduct experiments with different numbers of AR decoder layers, e.g., 1, 3, and 6. As demonstrated in Figure 2, an AR decoder of any depth brings improvement, but as the number of AR decoder layers increases, the gains for the NAR model gradually diminish. This verifies our motivation: a weaker AR decoder forces the NAR decoder to contain more useful information, in turn helping the NAR model.

Our approach helps on long sentences. As the target sentence length increases, the performance gap between our model and the Transformer decreases; remarkably, our model outperforms the Transformer when the target sentence length is greater than 60. Longer sentences require the model to handle more complex contextual associations. We conjecture that our multi-task training substantially enriches the contextual information contained in the NAR hidden states, leading to better performance on long-sentence translation.
Our approach reduces token repetitions. We also study the rate of repeated tokens, as in Saharia et al. (2020), to see to what extent our approach mitigates the multi-modality problem. Table 5 shows repetition rates before and after applying our approach, demonstrating that our method consistently reduces the occurrence of repeated words by a significant margin. Even though CTC alone can alleviate the repetition issue, our approach yields further improvements.
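The repetition statistic is a simple fraction of adjacent duplicate tokens; a sketch of our own formulation, which may differ in detail from the exact metric of Saharia et al. (2020):

```python
def repetition_rate(tokens):
    """Fraction of tokens identical to their immediate predecessor."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return repeats / (len(tokens) - 1)
```

Consecutive duplicates are the characteristic failure mode of independent per-position decoding, so a drop in this rate is direct evidence of better-captured target dependencies.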
Performance without knowledge distillation. Although knowledge distillation is a commonly used workaround, it bounds the performance of NAR models by their AR teacher and requires building teacher models in the first place. To validate the effectiveness of our method on raw data, we conduct experiments on the WMT14 and IWSLT14 datasets without knowledge distillation.
As shown in Table 6, the baseline CTC model can be significantly enhanced by our approach, further closing the performance gap with the AR model.
Advantages of our method over other multi-task frameworks. Hao et al. (2021)'s work also utilizes a multi-task framework, yet our method achieves larger improvements. We attribute this to the location and capacity of our multi-task learning module, i.e., the weak AR decoder. Regarding location, we argue that the decoder governs generation, so placing the AR decoders on top of the NAR decoder should more directly and explicitly improve NAR generation, whereas Hao et al. (2021) build on the NAR encoder output.
Regarding capacity, we contend that the AR decoders should be as weak as possible, such that they can no longer model the target sequence on their own unless their NAR decoder layers provide useful neural representations.
In contrast, Hao et al. (2021) do not discuss parameterization capacity and use a standard AR decoder.

Conclusion
In this paper, we propose a multi-task learning framework for NAR translation. Trained jointly with the weak AR decoders, the NAR hidden states come to contain more contextual information, resulting in performance improvements. Experiments on WMT and IWSLT benchmarks show that our method significantly and consistently improves translation quality. With beam-search decoding, our CTC-based variant outperforms a strong Transformer on all benchmarks while introducing no additional decoding overhead.

Limitations
A potential drawback of our approach is the added training burden. To tackle this problem, we introduce two techniques that greatly reduce the number of trained parameters and the training time without sacrificing performance. Notably, our method introduces no additional overhead at inference, so we achieve a large performance improvement while maintaining the original fast decoding speed.

Figure 2 :
Figure 2: Results on the IWSLT14 test set analyzing the effect of the number of AR decoder layers. We use the CTC-based model as the baseline; "w/ 1 layer" means the AR decoder has one layer.

Table 1 :
Results of NAR models trained with knowledge distillation on the test sets of WMT14, WMT16, and IWSLT14. CMLM_k refers to k iterations of decoding.

Table 2 :
Results of applying our method to different NAR models, showing the generality of our method.
Ablation study on training cost optimization. We evaluate the impact of the proposed training cost reduction techniques; the results are reported in Table 3.

Table 3 :
Study on training cost reduction.

Table 5 :
Results of repeated token percentage.