Progressive Multi-Granularity Training for Non-Autoregressive Translation

Non-autoregressive translation (NAT) significantly accelerates inference by predicting the entire target sequence in parallel. However, recent studies show that NAT is weak at learning high-mode knowledge such as one-to-many translations. We argue that modes can be divided into various granularities, which can be learned from easy to hard. In this study, we empirically show that NAT models are prone to learn fine-grained lower-mode knowledge, such as words and phrases, rather than sentences. Based on this observation, we propose progressive multi-granularity training for NAT. More specifically, to make the most of the training data, we break down sentence-level examples into three granularities, i.e. words, phrases and sentences, and progressively increase the granularity as training proceeds. Experiments on Romanian-English, English-German, Chinese-English and Japanese-English demonstrate that our approach improves phrase translation accuracy and model reordering ability, resulting in better translation quality than strong NAT baselines. We also show that more deterministic fine-grained knowledge can further enhance performance.


Introduction
Non-autoregressive translation (NAT, Gu et al., 2018) has been proposed to improve decoding efficiency by predicting all tokens independently and simultaneously. Different from autoregressive translation (AT, Vaswani et al., 2017) models, which generate each target word conditioned on previously generated ones, NAT models suffer from the multimodality problem (i.e. multiple valid translations for a single input): the conditional independence assumption prevents a model from properly capturing the highly multimodal distribution of target translations. To reduce the modes of the training data, sequence-level knowledge distillation (KD) (Kim and Rush, 2016) is widely employed, replacing the original target samples with sentences generated by an AT teacher (Gu et al., 2018; Ren et al., 2020).

* Liang Ding and Longyue Wang contributed equally to this work. Work was done when Liang Ding and Xuebo Liu were interning at Tencent AI Lab.

Table 1: Translation performance at different granularities on the WMT14 English⇒German dataset. "∆" indicates the performance gap between the NAT and AT.
Although KD reduces the learning difficulty for NAT, complicated word orders and structures (Gell-Mann and Ruhlen, 2011) remain in the synthetic sentences, making NAT performance sub-optimal. To address this challenge, Saharia et al. (2020) and Ran et al. (2021) propose to lower the bilingual modeling difficulty under a monotonicity assumption, where bilingual sentences share the same word order. However, they make extensive modifications to model structures or objectives, limiting the applicability of their methods to a broader range of tasks and languages.
Table 2: English⇒Chinese multimodality example for the sentence "He is very good at English."

Accordingly, we turn to breaking down the sentence-level high modes into finer granularities, i.e. bilingual words and phrases, which we assume are easier for NAT to learn. As shown in Table 1, we analyzed the translation accuracy at three linguistic levels (i.e. word, phrase and sentence) and found that although KD brings promising improvements at all three granularities, gaps with the AT teacher remain. We also showed that finer granularities are easier to learn, that is, the accuracy gap "∆" of WORD is smaller than that of PHRASE and SENTENCE (0.8 < 1.8 < 2.2). Thus, we propose a simple and effective training strategy to enhance the ability to handle sentence-level high modes. More specifically, we generate bilingual lexicons from parallel data by leveraging word alignment and phrase extraction from statistical machine translation (SMT, Zens et al., 2002). Then we guide the NAT model to progressively learn the bilingual knowledge from low to high granularity. Experimental results on four commonly-cited translation benchmarks show that our proposed PROGRESSIVE MULTI-GRANULARITY (PMG) training strategy consistently improves translation performance. The main contributions are: • Our study reveals that NAT is better at learning fine-grained knowledge; training merely with sentences may be sub-optimal.
• We propose PMG training to encourage NAT models to learn from easy to hard. The fine-grained knowledge distilled by SMT is dynamically transferred during training.
• Experiments across language pairs and model structures show the effectiveness and universality of PMG training.

Motivation
We draw on theories of second-language acquisition: one usually learns a foreign language from word-to-word translation to sentence-to-sentence translation, namely from local to global (Onnis et al., 2008). Bilingual knowledge is at the core of adequacy modeling (Tu et al., 2016), which is a major weakness of NAT models due to the lack of autoregressive factorization. Table 2 demonstrates English⇒Chinese multimodality at different granularities (i.e. word, phrase and sentence levels). As seen, the sentence level exhibits various kinds of modes, including word alignment ("English" vs. "英语"/"英文"), phrase translation ("be good at" vs. "...非常 擅长..."/"...水平 很 高"), and even reordering ("英语" can be subject or object). In contrast, phrase-level modes are less complex, with similar structures, and word-level modes are simple token-to-token mappings. Generally, the lower the level of bilingual knowledge, the easier it is for NAT to learn. This example explains why the sentence-level performance gap between NAT and AT in Table 1 is more significant than those at the word and phrase levels. Based on the above evidence, it is natural to suspect that existing sentence-level NAT training is sub-optimal.

Fine-grained Bilingual Knowledge
The phrase table is an essential component of SMT systems, recording the correspondence between bilingual lexicons (Koehn and Callison-Burch, 2009). For each training example in the original training set, we sample all its possible bilingual phrase pairs from the phrase table obtained with a phrase-based statistical machine translation (PBSMT) model (Koehn et al., 2003). GIZA++ (Och and Ney, 2003) was employed to build word alignments for the training datasets. We leave the exploitation of more advanced forms of bilingual knowledge, such as syntax rules (Liu et al., 2006) and discontinuous phrases (Galley and Manning, 2010), for future work. Taking the sentence pair in Table 2 as an example, we can obtain the bilingual En-Zh phrase pairs "very good ||| 很好" and "good at English ||| 擅长英语" from the original sentence pair, informing the NAT model of explicit phrase boundaries.
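The extraction step above can be sketched with the classic consistency criterion of PBSMT: enumerate source spans and keep the target span whose alignment links do not cross the span boundary. This is a minimal illustrative reconstruction (toy English-German example, variable names ours), not the exact Moses/GIZA++ pipeline.

```python
# Minimal sketch of consistent phrase-pair extraction from word alignments,
# in the spirit of PBSMT (Koehn et al., 2003). Illustrative only.

def extract_phrases(src, tgt, alignment, max_len=3):
    """Enumerate source spans; keep target spans consistent with the alignment."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions aligned to the source span [i1, i2]
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Consistency: no alignment link may leave the box
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "he is very good at english".split()
tgt = "er ist sehr gut in englisch".split()
align = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
pairs = extract_phrases(src, tgt, align)
```

With this monotone alignment, spans such as ("very good", "sehr gut") survive the consistency check and become phrase-pair candidates.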

Setup
Data Experiments were conducted on four widely-used translation datasets: WMT14 English-German (En-De), WMT16 Romanian-English (Ro-En), WMT17 Chinese-English (Zh-En) and WAT17 Japanese-English (Ja-En), which consist of 4.5M, 0.6M, 20M and 2M sentence pairs, respectively. It is worth noting that Ro-En, En-De and Zh-En are low-, medium- and high-resource language pairs, respectively, and Ja-En is a word-order-divergent language direction. We use the same validation and test sets as previous works for fair comparison. To avoid unknown words, we preprocessed the data via byte-pair encoding (BPE) (Sennrich et al., 2016) with 32K merge operations. We evaluated translation quality with BLEU (Papineni et al., 2002) and a statistical significance test (Collins et al., 2005). For the fine-grained bilingual knowledge, i.e. word alignments and phrase tables, we set a probability threshold of 0.05 to make the source-to-target mapping more deterministic. Taking WMT14 En-De as an example, the original phrase table extracted by SMT contains 3M words and 156M phrases. We then filter out items whose translation probability is lower than 0.05, obtaining 0.3M words and 56.5M phrases as the final data.
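The threshold filtering described above amounts to a single pass over the phrase table. The sketch below assumes a Moses-style "src ||| tgt ||| prob" line format for illustration; the real table carries more fields.

```python
# Minimal sketch of probability-threshold filtering of a phrase table,
# assuming an illustrative "src ||| tgt ||| prob" line format.

def filter_phrase_table(lines, threshold=0.05):
    """Keep only entries whose translation probability passes the threshold."""
    kept = []
    for line in lines:
        src, tgt, prob = [field.strip() for field in line.split("|||")]
        if float(prob) >= threshold:
            kept.append((src, tgt, float(prob)))
    return kept

table = [
    "very good ||| sehr gut ||| 0.62",
    "good at ||| gut in ||| 0.03",   # below threshold, dropped
]
kept = filter_phrase_table(table)
```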
Non-Autoregressive Models We validated our progressive multi-granularity training strategy on two state-of-the-art NAT model structures: • Mask-Predict (MaskT, Ghazvininejad et al., 2019), which uses a conditional masked LM (Devlin et al., 2019) to iteratively generate the target sequence from the masked input; • Levenshtein Transformer (LevT, Gu et al., 2019), which introduces three steps: deletion, placeholder prediction and token prediction.
For regularization, we empirically set the dropout rate to 0.2, and apply weight decay of 0.01 and label smoothing with ε = 0.1. We train with batches of approximately 128K tokens using Adam (Kingma and Ba, 2015). The learning rate warms up to 5 × 10⁻⁴ in the first 10K steps, and then decays with the inverse square-root schedule. We train 50K steps on word-level data and 50K steps on phrase-level data, respectively, and then train the remaining 200K steps on sentence-level data. Following common practice (Ghazvininejad et al., 2019; Kasai et al., 2020), we evaluate performance on an ensemble of the 5 best checkpoints (ranked by validation BLEU) to avoid stochasticity.
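The learning-rate schedule and the 50K/50K/200K granularity switch described above can be sketched as two small step-indexed functions. This is an illustrative reconstruction of the stated hyperparameters, not the actual training code.

```python
# Sketch of the stated schedules: linear warmup to 5e-4 over 10K steps with
# inverse square-root decay, and a step-indexed granularity switch
# (50K word-level, 50K phrase-level, then 200K sentence-level).

def learning_rate(step, peak=5e-4, warmup=10_000):
    if step < warmup:
        return peak * step / warmup        # linear warmup
    return peak * (warmup / step) ** 0.5   # inverse square-root decay

def granularity(step):
    if step < 50_000:
        return "word"
    if step < 100_000:
        return "phrase"
    return "sentence"                      # remaining 200K steps
```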

Autoregressive Teachers
We closely followed previous works to apply sequence-level KD. More precisely, we trained two kinds of Transformer (Vaswani et al., 2017) models as AT teachers.

Main Results

Table 3 lists the results of previous competitive NAT models (Gu et al., 2018; Kasai et al., 2020; Gu et al., 2019; Ghazvininejad et al., 2019). Clearly, our approach "+PMG Training" consistently improves translation performance (BLEU↑) over the four language pairs. Specifically, our PMG training strategy achieves an average improvement of +0.53 BLEU across the four language pairs and the two NAT model structures. Note that our approach introduces no extra parameters and thus does not increase latency ("Speed").
Comparison to Curriculum Learning Existing CL methods can be divided into two categories: "Discretized CL (DCL)" (Zhang et al., 2019) and "Continuous CL (CCL)" (Platanios et al., 2019). Sentence length is the most significant variable in our multi-granularity data, so we implemented discretized and continuous CL with a (source-side) sentence-length criterion. Our DCL setting explicitly predefines the number of data bins, while the CCL method continuously samples shorter examples first as training progresses. For DCL, we split the training samples into a predefined number of bins (5, in our case). For CCL, we employ their length curriculum and square-root competence function. On the WMT14 En-De dataset with the MaskT model, DCL performs worse than the KD baseline (-0.6 BLEU), while CCL outperforms the KD baseline by +0.3 BLEU. Our approach (+0.6 BLEU) is the most effective.
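The square-root competence function used by the CCL baseline can be sketched as follows: at step t, the model may only sample examples whose difficulty percentile (here, length-based) falls below the competence c(t). The initial competence c0 and curriculum length T below are illustrative values, not the ones from our experiments.

```python
# Sketch of the square-root competence function of Platanios et al. (2019):
# c(t) = min(1, sqrt(t * (1 - c0^2) / T + c0^2)).
import math

def competence(t, T=50_000, c0=0.01):
    """Fraction of the (difficulty-sorted) data available at step t."""
    return min(1.0, math.sqrt(t * (1 - c0 ** 2) / T + c0 ** 2))
```

The function starts at c0, grows with the square root of training progress, and saturates at 1.0 once the curriculum length T is reached, after which the full training set is sampled.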

Analysis
In this section, we conduct analytical experiments to better understand what contributes to the translation performance gains. Specifically, we investigate whether PMG 1) enhances phrasal pattern modeling, 2) improves reordering, and 3) gains better performance with higher-quality fine-grained knowledge.
Better Phrasal Pattern Modelling Our method is expected to pay more attention to bilingual phrases, leading to better phrase translation accuracy. To evaluate the accuracy of phrase translations, we calculate the improvement over multiple granularities of n-grams. As shown in Table 4, our PMG training ("NAT w/ PMG") consistently outperforms the baseline, indicating that the proposed multi-granularity training indeed improves the NAT model's ability to capture phrasal patterns.
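One simple way to measure per-granularity n-gram accuracy is clipped n-gram matching against the reference. The matching scheme below is our illustrative assumption, not necessarily the exact metric behind Table 4.

```python
# Sketch of n-gram accuracy at a given granularity n: the fraction of
# reference n-grams that are recovered by the hypothesis (clipped counts).
from collections import Counter

def ngram_recall(hyp, ref, n):
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum((hyp_ngrams & ref_ngrams).values())
    total = sum(ref_ngrams.values())
    return matched / total if total else 0.0

hyp = "he is very good at english".split()
ref = "he is good at english very".split()
```

Here all unigrams match while only some bigrams do, so the score drops as n grows, mirroring how coarser granularities are harder to get right.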
Better Reordering Ability The SMT-distilled bilingual phrasal information intuitively informs the NAT model of bilingual phrase boundaries, which should lead to better reordering ability. We compare the reordering ability of NAT models w/ and w/o PMG training using RIBES (Isozaki et al., 2010), which is designed to measure reordering performance for distant language pairs. We categorize the test set into several bins according to sentence length and report BLEU and RIBES scores simultaneously in Figure 1. As seen, the proposed PMG training strategy improves both translation (BLEU↑) and reordering (RIBES↑) performance, confirming our claim. Our finding is consistent with Ding et al. (2020a), who explicitly injected SMT-guided alignment information into MT models, achieving better performance.

Effect of Fine-Grained Text Quality
The acquired fine-grained bilingual knowledge, i.e. word alignments and phrase tables, still has an extremely large volume after filtering. Taking WMT14 En-De as an example, there are over 56M phrase pairs after filtering with a translation probability threshold of 0.05. To make the knowledge more deterministic, we further control the quality of the fine-grained text with a third-party scorer, BERTScore (Zhang et al., 2020). As illustrated in Table 5, keeping only the high-quality bilingual knowledge (e.g. the top 50%) achieves further improvements, showing the great potential of our approach. We leave the exploration of high-quality bilingual knowledge for NAT as future work.
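The quality control step reduces to scoring each pair and keeping the top fraction. In the sketch below, `scorer` is a stand-in callable for any pair scorer (we used BERTScore); the toy length-based scorer is purely illustrative, not the real BERTScore API.

```python
# Sketch of quality-based filtering: score each phrase pair with an external
# scorer and keep the top fraction (e.g. 50%). `scorer` is a stand-in for
# BERTScore or any other pair-quality function.

def keep_top_fraction(pairs, scorer, fraction=0.5):
    """Sort pairs by score (descending) and keep the top `fraction`."""
    scored = sorted(pairs, key=lambda p: scorer(*p), reverse=True)
    return scored[:max(1, int(len(pairs) * fraction))]

toy_pairs = [("very good", "sehr gut"), ("good", "x y z")]
# Illustrative stand-in scorer: prefer pairs of similar token length.
length_scorer = lambda s, t: -abs(len(s.split()) - len(t.split()))
kept = keep_top_fraction(toy_pairs, length_scorer, fraction=0.5)
```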

Related Works
Non-Autoregressive Translation There still exists a performance gap between an AT teacher and its NAT student. To bridge this gap, many studies have been proposed. Ghazvininejad [...] most of the parallel data. Differently, we break the sentences into fine-grained granularities to fully exploit the parallel data. Note that our model-agnostic method can be applied to any NAT structure.
Curriculum Learning Our proposed training strategy is a novel technique for NAT that exploits curriculum learning (CL). Recent works have shown that CL helps autoregressive translation (AT) models achieve faster convergence and better results (Platanios et al., 2019; Liu et al., 2020b; Zhan et al., 2021). However, CL for non-autoregressive translation (NAT) models has not been well studied. Among the few attempts, Guo et al. (2020a) [...] respectively investigated parameter- and task-level curriculum learning approaches, while we propose progressive multi-granularity training for NAT at the data level. To the best of our knowledge, this is the first work to investigate the effects of different data granularities on NAT models.

Conclusion
In this paper, we investigated the translation accuracy of different granularities in NAT and found that NAT models are better at handling fine-grained bilingual knowledge (e.g. words and phrases). Based on this finding, we proposed a simple progressive multi-granularity training strategy. Experiments show that our approach consistently and significantly improves translation performance across language pairs and model architectures. In-depth analyses indicate that our approach generates better word order and phrase patterns, outperforming typical curriculum learning methods.