Revisiting Non-Autoregressive Translation at Scale

In real-world systems, scaling has been critical for improving translation quality in autoregressive translation (AT), but it has not been well studied for non-autoregressive translation (NAT). In this work, we bridge this gap by systematically studying the impact of scaling on NAT behaviors. Extensive experiments on six WMT benchmarks over two advanced NAT models show that scaling can alleviate the commonly-cited weaknesses of NAT models, resulting in better translation performance. To reduce the side effect of scaling on decoding speed, we empirically investigate the impact of the NAT encoder and decoder on translation performance. Experimental results on the large-scale WMT20 En-De task show that an asymmetric architecture (e.g., a bigger encoder and a smaller decoder) achieves comparable performance to the fully scaled model while retaining the decoding speed of standard NAT models. Based on these findings, we establish a new benchmark by validating scaled NAT models on scaled datasets, which can serve as a strong baseline for future work. We release code and system outputs at https://github.com/DeepLearnXMU/Scaling4NAT.


Introduction
Recent years have seen a surge of interest in non-autoregressive translation (NAT) (Gu et al., 2018), which improves decoding efficiency by predicting all tokens independently and simultaneously. The majority of studies on NAT focus on base models trained on medium-scale datasets (e.g., Mask-Predict: 69M parameters; WMT14 En-De: 4.5M sentence pairs) (Ghazvininejad et al., 2019), while scaled models and datasets have become the de facto standard for autoregressive translation (AT) models (e.g., Transformer-Big: 226M parameters; WMT20 En-De: 45.1M sentence pairs) (Ott et al., 2018). These model- and data-level gaps make the progress of NAT lag behind that of AT, which limits the applicability of NAT models to practical scenarios.
This general tendency motivates us to boost NAT models from the scaling perspective, including the amount of training data and the model size. In this paper, we aim to provide empirical answers to the following research questions:
• RQ1: How does scaling affect NAT behaviors in terms of translation quality and decoding speed? Scaling neural networks brings dramatic quality gains on translation tasks with AT models (Arivazhagan et al., 2019), and revisiting existing methods on large-scale data can yield more reliable conclusions (Edunov et al., 2018).
• RQ2: Are the performance improvements of scaling accompanied by alleviation of the commonly-cited weaknesses of NAT? Several weaknesses exist in NAT, including the multimodality problem (Gu et al., 2018), non-fluent outputs (Du et al., 2021), and inadequate translations (Ding et al., 2021c).
• RQ3: Can we establish a new NAT benchmark that reliably translates leaderboard scores into improvements in real-world use of the models? Although previous studies of NAT have achieved comparable performance with AT models, they are still validated on small-scale datasets and model sizes using inconsistent evaluation criteria.
To answer these research questions, we investigate the effects of different scaling methods on two advanced NAT models. Experimental results show that scaling works well with knowledge distillation to alleviate the commonly-cited weaknesses of NAT. The scaled NAT models achieve better translation quality at the expense of decoding speed. To balance effectiveness and efficiency, we compare various component-scaled NAT models and find that scaling in NAT is more asymmetric than in AT. Accordingly, we introduce a cone architecture for NAT with a deeper and wider encoder and a shallower and narrower decoder, which boosts translation performance while maintaining decoding speed. Our main contributions are as follows:
• We demonstrate the necessity of scaling model and data for NAT models, which narrows the progress gap between NAT and AT models.
• Our study reveals positive effects of scaling on the commonly-cited weaknesses that make the standard NAT model sub-optimal.
• We establish a new benchmark, where we evaluate competing scaled NAT models on large-scale datasets in terms of both effectiveness and efficiency.
• We provide a better understanding of NAT at scale to help prioritize future exploration towards making NAT a common translation framework.

Non-Autoregressive Translation
Given a source sentence $\mathbf{x} = \{x_1, x_2, \ldots, x_{T_X}\}$, an AT model generates each target word $y_t$ conditioned on the previously generated words $\mathbf{y}_{<t}$, leading to high latency at the decoding stage. In contrast, NAT models break this autoregressive factorization by producing all target words independently and simultaneously. Formally, the probability of generating $\mathbf{y} = \{y_1, y_2, \ldots, y_{T_Y}\}$ is computed as
$$p(\mathbf{y}|\mathbf{x}) = \prod_{t=1}^{T_Y} p(y_t|\mathbf{x};\theta),$$
where $T_Y$ is the length of the target sequence, which is usually predicted by a separate conditional distribution. The parameters $\theta$ are trained to maximize the likelihood of a set of training examples according to $\mathcal{L}(\theta) = \arg\max_{\theta} \log p(\mathbf{y}|\mathbf{x};\theta)$.
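To make the factorization concrete, the following is a minimal sketch of one-pass NAT decoding, assuming hypothetical `encoder`, `decoder`, and `length_predictor` modules (these names are illustrative, not the paper's implementation):

```python
import torch

@torch.no_grad()
def nat_decode(encoder, decoder, length_predictor, src_tokens, mask_id):
    enc_out = encoder(src_tokens)              # encode the source once
    tgt_len = int(length_predictor(enc_out))   # target length from a separate predictor
    batch = src_tokens.size(0)
    tgt_in = torch.full((batch, tgt_len), mask_id, dtype=torch.long)
    logits = decoder(tgt_in, enc_out)          # a single parallel forward pass
    return logits.argmax(dim=-1)               # all target tokens predicted independently
```

In contrast, an AT decoder would call the decoder $T_Y$ times, feeding back the token produced at each step.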

Knowledge Distillation
Training NAT suffers from the multimodality problem, where the conditional independence assumption prevents a model from properly capturing the highly multimodal distribution of target translations (Gu et al., 2018). Accordingly, sequence-level knowledge distillation (Kim and Rush, 2016) is introduced to reduce the modes of the training data by replacing the original target-side samples with sentences generated by an AT teacher (Gu et al., 2018; Zhou et al., 2020). Formally, the original parallel data $D_{\mathrm{Raw}}$ and the distilled data $D_{\mathrm{KD}}$ can be defined as $D_{\mathrm{Raw}} = \{(\mathbf{x}^n, \mathbf{y}^n)\}_{n=1}^{N}$ and $D_{\mathrm{KD}} = \{(\mathbf{x}^n, f_{s\rightarrow t}(\mathbf{x}^n))\}_{n=1}^{N}$, where $f_{s\rightarrow t}$ represents an AT model trained on $D_{\mathrm{Raw}}$ for translating sentences from the source to the target language, and $N$ is the total number of sentence pairs in the training data.
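As an illustration, sequence-level KD amounts to re-translating the source side of the training data with the AT teacher; the sketch below assumes a hypothetical `at_teacher.translate` interface rather than any specific toolkit API:

```python
def build_kd_data(at_teacher, raw_pairs):
    """Replace each reference with the AT teacher's beam-search output."""
    kd_pairs = []
    for src, _ref in raw_pairs:          # original references are discarded
        kd_pairs.append((src, at_teacher.translate(src)))
    return kd_pairs
```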

Advanced Models
The conditional independence assumption results in a performance gap between the NAT model and its AT teacher. A number of recent efforts have explored ways to bridge this gap with advanced architectures (Ghazvininejad et al., 2019; Gu et al., 2019; Ding et al., 2020) or training objectives (Shao et al., 2019; Ghazvininejad et al., 2020; Du et al., 2021). Another thread of work focuses on understanding and improving distillation training (Zhou et al., 2020; Ding et al., 2021a,b,c, 2022; Huang et al., 2022). Generally, NAT models can be divided into two categories:
Iterative NAT refines previously generated words in each iteration, which allows NAT models to generate target words by capturing partial and noisy dependencies. Mask-Predict (MaskT) (Ghazvininejad et al., 2019) uses the conditional masked language model (Devlin et al., 2019) to iteratively generate the target sequence from the masked input. Levenshtein Transformer (Gu et al., 2019) introduces three steps: deletion, placeholder prediction, and token prediction, where the number of decoding iterations adaptively depends on certain conditions.
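A rough sketch of the Mask-Predict refinement loop is shown below; it re-masks a linearly decaying number of the least-confident tokens each iteration. The module names and the exact masking schedule are simplifications of the published algorithm, not the released code:

```python
import torch

@torch.no_grad()
def mask_predict(decoder, enc_out, tgt_tokens, mask_id, iterations=10):
    # tgt_tokens starts as a fully masked sequence of the predicted length
    total_len = tgt_tokens.size(1)
    for it in range(iterations):
        logits = decoder(tgt_tokens, enc_out)
        probs, preds = logits.softmax(dim=-1).max(dim=-1)
        tgt_tokens = preds
        # re-mask the least-confident tokens; the masked fraction decays linearly
        n_mask = int(total_len * (1 - (it + 1) / iterations))
        if n_mask > 0:
            worst = probs.topk(n_mask, dim=-1, largest=False).indices
            tgt_tokens = tgt_tokens.scatter(-1, worst, mask_id)
    return tgt_tokens
```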
Fully NAT is trained to produce translations in a single decoding pass without sacrificing the speed-up. Several studies have been proposed to improve fully NAT models (Qian et al., 2021; Gu and Kong, 2021). GLAT (Qian et al., 2021) adopts an adaptive glancing sampling strategy for training, which can be seen as a form of curriculum learning. Furthermore, Gu and Kong (2021) build a new SOTA fully NAT model by combining useful techniques from four perspectives: training data, model architecture, training objective, and learning strategy.

Experimental Setup
Datasets We not only experiment on the widely-used WMT16 English-Romanian (0.6M) and WMT14 English-German (4.5M) benchmarks, but also broaden the investigation to a large-scale dataset, WMT20 English-German (45.1M). We tokenize the data using the Moses toolkit and then split it into subwords using joint BPE (Sennrich et al., 2016) with 32K merge operations. This forms shared vocabularies of 32K, 37K, and 49K for WMT16 En-Ro, WMT14 En-De, and WMT20 En-De, respectively. Both AT and NAT models are trained on KD data, except as otherwise noted. To generate the KD data, we employ Transformer-Big and Transformer-Base as teachers to distill the En-De and En-Ro datasets, respectively.
NAT Models We validate two advanced models, representing iterative and fully NAT respectively:
• MaskT (Ghazvininejad et al., 2019), where we follow its optimal settings to set the iteration number to 10 and the length beam to 5.
• GLAT (Qian et al., 2021), where we follow the reported configuration to set both the iteration number and the length beam to 1.
Models are re-implemented on top of the Fairseq framework (Ott et al., 2019), which supports training on multiple GPU instances. We employ large-batch training (i.e., 480K tokens/batch) to optimize performance (Ott et al., 2018). We train all NAT models for 300K steps to ensure adequate training, apart from WMT16 En-Ro (30K steps). Following common practice (Ghazvininejad et al., 2019; Kasai et al., 2020), we evaluate the performance of an ensemble of the 5 best checkpoints (ranked by validation BLEU) to avoid stochasticity. More details about NAT training are presented in Appendix A.6.
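For context, a 480K-token batch is typically realized by combining per-GPU token budgets with gradient accumulation; the split below is our assumption (only the product matches the reported setting):

```python
tokens_per_gpu = 4000        # assumed max tokens per GPU per step
num_gpus = 8                 # assumed number of GPUs
accumulation_steps = 15      # assumed gradient-accumulation factor
effective_batch = tokens_per_gpu * num_gpus * accumulation_steps
assert effective_batch == 480_000  # matches the 480K tokens/batch setting
```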

AT Teachers
We closely follow previous works on NAT and apply sequence-level knowledge distillation to reduce the modes of the training data. We train Base and Big Transformer (Vaswani et al., 2017) models as the AT teachers for the En↔Ro and En↔De tasks, respectively. We adopt large-batch training (i.e., 458K tokens/batch) to optimize the performance of the AT teachers (Ott et al., 2018). Note that the AT teachers are trained on raw data.
Evaluation For fair comparison, we use case-insensitive tokenized BLEU (Papineni et al., 2002) to measure translation quality on WMT16 En-Ro and WMT14 En-De. We use SacreBLEU (Post, 2018) for the new WMT20 En-De benchmark.
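For reference, WMT20 En-De scores can be computed with the sacrebleu package; the variable names below are placeholders:

```python
import sacrebleu

# hypotheses: list of detokenized system outputs; references: list of reference strings
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)  # corpus-level BLEU
```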
• Model Scaling: We mainly investigate the behavior of NAT-Big, as it has similar performance to NAT-Deep while its training is more stable.
• Data Scaling: The commonly-used datasets for NAT are WMT16 En-Ro and WMT14 En-De, whose sizes are smaller than current AT benchmarks. We mainly experiment with NAT models on the WMT20 En-De dataset, which is about 10 times larger than the previous ones (i.e., WMT16: 0.6M; WMT14: 4.5M; WMT20: 45.1M).

Translation Quality
Results on Benchmarks Table 1 lists the results on the six benchmarks: WMT16 En↔Ro, WMT14 En↔De, and WMT20 En↔De, which are small-, medium-, and large-scale datasets, respectively. We experiment with the MaskT and GLAT models, whose configurations are detailed in Section 2.3. Compared with standard NAT models ("+ Knowledge Distillation"), the scaling method ("+ Both") significantly and consistently improves translation performance (BLEU↑) on the medium- and large-scale datasets. However, the improvement is not robust on the small-scale dataset. An interesting finding is that both model scaling and data scaling are able to narrow the performance gap between fully and iterative NAT models. After model scaling, the average difference between MaskT and GLAT drops from +1.2 ("+ Knowledge Distillation" lines) to +0.5 ("+ Both" lines). Encouragingly, advanced NAT models with model scaling can perform better than their strong AT teachers on larger-scale data.
As seen, the performance of "MaskT + Both" is +0.5 BLEU higher than the Transformer-Big models on WMT20 En↔De. This confirms the necessity of scaling model size and data for building practical and robust NAT systems.
Complementarity between Scaling and KD KD is a commonly-used training recipe to boost NAT performance. As shown in Table 1, KD ("+ Knowledge Distillation") benefits fully NAT more than iterative NAT models compared with Raw (+4.1 vs. +2.3 BLEU on average). We also find that KD is more effective on large-scale datasets, where the average improvements are +4.7 and +2.5 on WMT20 and WMT16+14, respectively. Model scaling ("+ Width Scaling") can also improve NAT models by enhancing their ability to learn difficult data. The conclusions for model scaling are similar to those for KD: 1) it benefits fully NAT more (+1.0 vs. +2.2 BLEU); 2) it is more effective on large-scale datasets (+3.0 vs. +0.9 BLEU). Combining scaling with KD ("+ Both") can further improve standard MaskT and GLAT ("+ Knowledge Distillation") by +0.7 and +1.3 BLEU, which illustrates that they exhibit complementary properties for NAT models.
We extensively analyze the reasons behind this in Section 3.3. Scaling and KD are related to and complement one another for NAT models. The conclusion on the complementarity between scaling and KD also holds for depth scaling (detailed in Appendix §A.1). Deep models perform similarly to big ones, but depth scaling is more difficult to train and has side effects on inference speed. Therefore, we employ NAT-Big as our testbed in the following experiments, unless otherwise specified.

Difference between NAT and AT Scaling
The scaling behavior of AT models has been studied (Wang et al., 2019a) and appears similar to that of NAT in terms of BLEU score. However, unlike the autoregressive Transformer, NAT predicts target tokens independently and simultaneously, which may lead to different scaling behaviors for NAT models.
Starting from this intuition, we further compare NAT and AT scaling from the perspective of linguistic properties. Probing tasks (Conneau et al., 2018) can quantitatively measure the linguistic knowledge embedded in encoder representations. We follow Hao et al. (2021) to analyze the Base and Big models trained on WMT20 En→De KD data. The experimental results on WMT20 En→De raw data are also provided in Appendix §A.3.

Analysis on NAT Weaknesses
We analyze the effects of scaling on commonly-cited weaknesses: 1) multimodality, indicated by the token repetition ratio (Gu et al., 2018); 2) generation fluency, calculated by language model (LM) perplexity (Du et al., 2021); and 3) translation adequacy, measured by word translation accuracy (Ding et al., 2021c). Table 3 shows the results. Examples of NAT weaknesses are listed in Appendix A.5.
Scaling Alleviates the Multimodality Problem The repeated token percentage is a commonly-used metric for measuring multimodality in a NAT model (Saharia et al., 2020). A NAT model may consider many possible translations at the same time due to the independent prediction of target tokens. Accordingly, the NAT output typically contains repetitive tokens, especially for fully NAT (1.1% vs. 2.7%). Similar to KD, scaling is an alternative method to significantly reduce the repetition percentage for NAT models (-0.5% and -1.0%). In addition, combining KD and scaling can further alleviate the repetition problem, which is consistent with the translation quality in Table 1.
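A simple version of this metric counts tokens that repeat their immediate predecessor; the paper's exact computation may differ, so the sketch below is only illustrative:

```python
def repetition_percentage(sentences):
    """Percentage of tokens identical to the preceding token."""
    repeats, total = 0, 0
    for sent in sentences:
        tokens = sent.split()
        repeats += sum(1 for prev, cur in zip(tokens, tokens[1:]) if prev == cur)
        total += len(tokens)
    return 100.0 * repeats / max(total, 1)
```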
Scaling Improves Generation Fluency NAT models typically suffer from fluency problems because they have only limited capability to model dependencies between target tokens (Kasner et al., 2020; Gu and Kong, 2021). We measure the fluency of the output with a publicly released LM trained on the News Crawl corpus. We also list the results of the reference ("Golden").
The results show that either KD or scaling consistently decreases the PPL in all cases (-7 to -34). We attribute the improvement in fluency to the fact that KD reduces the learning difficulty by simplifying the training data, while scaling enhances the model's ability by introducing more parameters. Besides, the complementarity between KD and scaling still holds in terms of the fluency measurement. Encouragingly, the scaled model without KD performs close to the standard NAT models, showing that scaling has the potential to learn directly from raw data with complex modes.
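As an illustration of the fluency metric, sentence-level perplexity under an n-gram LM can be computed with the kenlm bindings; the paper's released LM is not necessarily a kenlm model, so this is only a stand-in, and the model path is assumed:

```python
import kenlm

lm = kenlm.Model("news_crawl.arpa")  # path to an LM trained on News Crawl (assumed)

def corpus_ppl(sentences):
    """Average sentence-level perplexity under the language model."""
    ppls = [lm.perplexity(sent) for sent in sentences]
    return sum(ppls) / max(len(ppls), 1)
```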
Scaling Enhances Translation Adequacy NAT often suffers from two kinds of adequacy errors, which were empirically observed by previous studies: 1) incomplete translation, due to incomplete transfer of source-side information (Wang et al., 2019b); and 2) lexical choice errors, due to choosing a target lexeme that inadequately expresses the source meaning (Ding et al., 2021c). Following Neubig et al. (2019), we measure word accuracy, defined as the F-measure of system outputs with respect to the reference. It can also demonstrate how much a system over- or under-produces words of a specific type. As expected, NAT models with KD or scaling have higher word accuracy (+0.9%∼+1.9%), resulting in better translation performance (BLEU↑).
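The adequacy metric can be approximated by a corpus-level bag-of-words F-measure between system output and reference, in the spirit of compare-mt; the paper's exact settings (e.g., per-frequency bucketing) are not reproduced in this sketch:

```python
from collections import Counter

def word_fmeasure(hypotheses, references):
    """Corpus-level F-measure over word counts of outputs vs. references."""
    hyp_counts, ref_counts = Counter(), Counter()
    for hyp in hypotheses:
        hyp_counts.update(hyp.split())
    for ref in references:
        ref_counts.update(ref.split())
    matched = sum(min(hyp_counts[w], c) for w, c in ref_counts.items())
    precision = matched / max(sum(hyp_counts.values()), 1)
    recall = matched / max(sum(ref_counts.values()), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```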

Discussion on Decoding Efficiency
Although scaling produces significant performance gains, one may argue that model scaling introduces more parameters, which increases latency at the decoding stage. Following previous studies, we carefully investigate the effects of scaling on decoding efficiency for NAT. We employ two metrics:
• Speed_1, which measures the speed when translating one sentence at a time (Gu et al., 2018). This is the standard practice and aligns with applications such as instantaneous MT that translate user input immediately.
• Speed_max, which measures the speed when translating in mini-batches as large as the hardware allows (Kasai et al., 2021). This corresponds to scenarios where one wants to translate a large amount of text given in advance.
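The two metrics can be measured as in the sketch below, where `translate` is a hypothetical function that decodes a batch of sentences; the actual measurement protocol (warm-up, hardware, batch-size search) follows the cited works:

```python
import time

def speed_1(translate, sentences):
    """Sentences per second when decoding one sentence at a time."""
    start = time.perf_counter()
    for sent in sentences:
        translate([sent])
    return len(sentences) / (time.perf_counter() - start)

def speed_max(translate, sentences, max_batch):
    """Sentences per second with the largest mini-batch the hardware allows."""
    start = time.perf_counter()
    for i in range(0, len(sentences), max_batch):
        translate(sentences[i:i + max_batch])
    return len(sentences) / (time.perf_counter() - start)
```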
As illustrated in Table 4, adding 3× parameters clearly decreases the decoding speed (Speed_1: 0.93×∼0.94× and Speed_max: 0.55×∼0.82×). In terms of Speed_max, scaling harms iterative NAT more than fully NAT models (0.55× vs. 0.82×). Besides, we test the decoding speed of the MaskT-Deep model ((24, 24)×512) and find that Speed_1 rapidly declines to 0.28×. These results suggest that scaling increases translation quality (BLEU↑) at the expense of decoding speed (Speed↓), especially Speed_max.
These findings motivate us to design a better scaling architecture for NAT, taking both performance and time cost into consideration. Kasai et al. (2021) pointed out that some NAT models have little advantage when translating a large amount of text given in advance. Accordingly, we use Speed_max as the default when discussing translation speed.


New NAT Benchmark
Most NAT models are implemented upon the encoder-decoder framework, where the encoder summarizes the source sentence and the decoder learns to generate the target words. We ask: how should we scale this framework? In this section, we empirically search for a better NAT architecture by considering both effectiveness and efficiency.

Discussion on Architecture Symmetry
Previous studies usually propose asymmetric architectures for AT, such as one with a deep encoder and a shallow decoder (Kasai et al., 2021). The main reason is that increasing the number of layers, especially in the decoder, deteriorates the translation latency and memory cost. We verify the architecture symmetry of NAT models by investigating the impact of component-level scaling on translation quality and decoding speed. More specifically, we enlarge the layer dimensions in the encoder, the decoder, or both components. Table 5 shows the results of component-level width scaling on the WMT20 En-De dataset. Results of component-level depth scaling are shown in Appendix §A.1.
Translation Performance Clearly, the scaling approach improves translation quality in all cases, although there are still considerable differences among the variants ("Component Scaling" vs. "No Scaling"). Introducing encoder- and decoder-scaling individually improves translation performance over the standard MaskT by +1.8 and +1.2 BLEU points, respectively. As seen, scaling the encoder and scaling the decoder are not equivalent in terms of translation performance. This asymmetric phenomenon is more severe than that in AT models. The possible reason is that the NAT model needs to spend a substantial amount of its capacity on disambiguating source and target words under the conditional independence assumption. However, scaling both encoder and decoder cannot always achieve better performance than individual scaling. This is the opposite of AT models, which can gain a further +0.5 BLEU point. To sum up: 1) scaling NAT is more asymmetric than scaling AT; 2) the complementarity between encoder and decoder in NAT is weaker than that in AT.
Decoding Speed Scaling the encoder has only a minor impact on decoding speed, and this conclusion still holds for AT models (0.99×). However, scaling the decoder has a large impact on decoding speed (MaskT: 1.00× → 0.56×; GLAT: 1.00× → 0.85×). It is worth noting that iterative NAT is more sensitive to decoder scaling than fully NAT. The main reason is that the iterative mechanism occupies several times more GPU memory, resulting in smaller mini-batches when calculating Speed_max. Furthermore, there is no further speed decrease when scaling both encoder and decoder components (MaskT: 0.56× → 0.55×; GLAT: 0.85× → 0.82×). To sum up: 1) the decoding latency is mainly attributed to scaling the decoder; 2) scaling the decoder of iterative NAT comes at a much larger time cost than for fully NAT.
Linguistic Probing As discussed in Section 3.3, NAT and AT models have different scaling behaviors when learning word-content linguistics. We further investigate the effects at the component level in Table 6. To sum up, asymmetric scaling can enhance the capability of NAT to learn word-content knowledge. The conclusion still holds for AT.

Asymmetric Scaling Method
To find a better scaling architecture, we conduct an ablation study on a variety of scaled NAT models.
Ablation Study Seven MaskT models with different architectures are investigated on the WMT20 En→De dataset. These models vary in scaling method (i.e., depth and width) and scaled component (i.e., encoder and decoder). Table 7 shows the variant configurations and the corresponding performance in terms of decoding speedup and translation quality. Model #1 is the NAT-Base model, which contains 6 encoder layers and 6 decoder layers with a hidden dimension of 512 (i.e., (6, 6)×512). As shown in #2∼4, widening or deepening the encoder boosts translation quality (BLEU↑) while only slightly decreasing decoding efficiency (Speed↓). Compared with the best encoder-scaling architecture (#4), further widening the decoder (#5) fails to increase the BLEU score (43.1 vs. 43.1) but decreases the decoding speed (0.95× vs. 1.32×).
To better trade off efficiency and effectiveness, we make the decoder shallower and smaller based on model #4. Encouragingly, the #6 and #7 models still achieve comparable translation quality while increasing the decoding speed to some extent (42.7 vs. 43.0 BLEU and 1.29× vs. 1.58× Speed). This confirms our hypothesis that NAT models need an asymmetric framework when considering both translation quality and decoding speed.
Cone Scaling Motivated by the ablation study, we propose a "Cone" architecture for NAT, whose encoder is deep and big while the decoder is shallow and small (i.e., (12×1024, 3×256)). As shown in Table 8, we adapt cone scaling to the MaskT and GLAT models and evaluate them on the six benchmarks. In general, our method achieves comparable performance with the big models while retaining low latency during inference. As seen, cone scaling improves the standard MaskT model by +0.9 BLEU on average with a 1.58× decoding speedup (over MaskT-Big: +0.2 BLEU and 2.87× Speed).
Besides, cone scaling improves the standard GLAT model by +1.5 BLEU at 0.90× decoding speed (over GLAT-Big: +0.1 BLEU and 1.10× Speed). Surprisingly, our method can further improve translation quality, leading to much better performance than the AT teachers (MaskT: +0.2 BLEU on average). This emphasizes the need for scaling NAT as a standard procedure, and the results can serve as a new benchmark that conveys the extent of the challenges NAT models pose. We also measure translation quality with METEOR (Banerjee and Lavie, 2005), which incorporates semantic information by considering exact, stem, and synonym matches. As shown in Table 9, cone scaling consistently achieves the best performance. Results on more datasets are listed in Appendix §A.4.
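For concreteness, the cone configuration used above can be summarized as follows; the key names mirror common fairseq-style settings and are our own shorthand, not the authors' released config:

```python
cone_config = {
    "encoder_layers": 12, "encoder_embed_dim": 1024,  # deep and wide encoder
    "decoder_layers": 3,  "decoder_embed_dim": 256,   # shallow and narrow decoder
}
base_config = {
    "encoder_layers": 6, "encoder_embed_dim": 512,
    "decoder_layers": 6, "decoder_embed_dim": 512,
}
```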

Conclusion and Future Work
In this study, we target bridging the model- and data-scale gap between NAT and AT models by investigating the scaling behaviors of NAT models. We find that simply scaling NAT models (NAT-Big) can significantly improve translation performance, especially on large-scale training data. To better balance effectiveness and efficiency, we empirically study the contributions of scaling the encoder and scaling the decoder, and find that scaling NAT is more asymmetric than scaling AT. Based on these observations, we design a new scaling architecture with a deeper and wider encoder and a shallower and narrower decoder (NAT-Cone), which achieves comparable performance with NAT-Big without sacrificing decoding speed. Our study empirically indicates the potential to make NAT a practical translation system like its AT counterpart. However, SOTA NAT models (including scaled NAT) still rely on distillation from an AT teacher. Future work will investigate better techniques to train scaled NAT models from scratch (i.e., without distillation). We additionally experiment with larger NAT models in Appendix §A.2, which can be regarded as a preliminary experiment in this direction. We will also explore scaling non-autoregressive generation (NAG) models for other NLP tasks, such as keyphrase generation (Xie et al., 2022) and text-to-table generation (Li et al., 2023). The advent of large language models (LLMs) like GPT-4 has ushered in a new era in MT (Lyu et al., 2023; Jiao et al., 2023a,b; Wang et al., 2023; He et al., 2023). This innovation is causing us to reconsider conventional paradigms, especially with regard to NAT models.

Limitations
We list the main limitations of this work as follows:
• Limited NAT Models. The conclusions in this paper are drawn from two representative NAT models and may not necessarily carry over to other NAT models. The main reason is that the experiments on six WMT benchmarks already cost a large amount of GPU resources. We therefore encourage future work to compare more NAT models using the new benchmark.

Ethics Statement
We take ethical considerations very seriously and strictly adhere to the ACL Ethics Policy. This paper focuses on empirical evaluations of large-scale datasets and scaled NAT models, which can be seen as a reality check. Both the datasets and the models used in this paper are publicly available and have been widely adopted by machine translation studies. We ensure that the findings and conclusions of this paper are reported accurately and objectively.

A Appendix
A.1 Results of Depth Scaling

Main Results

We also exploit the impact of depth scaling on NAT performance.

To illustrate the effect of scaling on the commonly-cited weaknesses of NAT, examples are listed in Table 16, Table 17, and Table 18, respectively. Two representative GLAT cases are excerpted below:

Source: 国庆 长假 临近，人们的 假期 计划 也 逐渐 敲定。
Reference: As the National Day holiday approaches, people's holiday plans are gradually being finalized.
GLAT-Base: The National Day long holiday near, people people's plans plans gradually gradually gradually.
GLAT-Big: The National Day holiday is approaching, people's holiday plans are gradually worked out.

Reference: Although Mancinelli entered elementary school, he did not graduate.
GLAT-Base: Manthinelli attended primary school at the time but but did not graduate.
GLAT-Big: Mancinelli went attended primary school at the time but did not not graduate.

A.6 Training of NAT models
We adopt Transformer-Base/Big configurations for all NAT models: both the encoder and the decoder contain 6 layers with 8/16 attention heads, the hidden dimension is 512/1024, and the feed-forward dimension is 2048/4096. We train all NAT models with a large batch size of 480K tokens. We train the MaskT and GLAT models for 300K steps. We list the training budget in Table 19. More details about the training hyper-parameters can be found in the training scripts of the different NAT models.
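The Base/Big settings above can be summarized as follows (key names are our shorthand):

```python
nat_configs = {
    "base": {"layers": 6, "attention_heads": 8,  "hidden_dim": 512,  "ffn_dim": 2048},
    "big":  {"layers": 6, "attention_heads": 16, "hidden_dim": 1024, "ffn_dim": 4096},
}
```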


Table 2 :
Performance (accuracy ↑) of probing tasks for evaluating linguistic properties embedded in the learned representations of AT and NAT models (Width Scaling).

Table 3:
Analyses of translation outputs generated by NAT models on the WMT20 De→En test set. Lower repeated token percentages ("Repetition") represent lower multimodality in a model. Lower perplexities ("PPL") denote better fluency, while higher word translation accuracy ("WA") denotes better adequacy. # is the absolute value and ∆ is the difference over NAT-Base models.

Table 4:
Decoding speed (sentences/s ↑) of scaled NAT models (i.e., (6, 6)×1024) on the WMT20 En→De task. We test the decoding speed when translating one sentence (Speed_1) or hardware-maximum mini-batches (Speed_max). # is the absolute value and ∆ is the speedup ratio over NAT-Base.

Table 8:
Translation performance (BLEU ↑) of the proposed NAT models on translation tasks with different data sizes. "Cone" denotes scaling the NAT architecture to (12×1024, 3×256). "Speed" shows the speedup ratio over NAT-Base, where we measure the decoding speed in terms of Speed_max.

Table 9 :
Translation quality of proposed NAT models in terms of METEOR (↑) on WMT20 En↔De tasks.

Table 10:
Translation performance of larger NAT models with both depth and width scaling.

A.2 Results of Larger NAT Models

In order to explore the upper bound of translation performance for NAT, we enlarge the models with both depth and width scaling. The model sizes are increased to 831M (MaskT) and 835M (GLAT). Results are shown in Table 10.

Table 17 :
Examples about fluency for NAT models. The key spans are highlighted in red color.

Table 18 :
Examples about word accuracy for NAT models. The key tokens are highlighted in red color.

Table 19:
Training budget of the NAT models (model size and GPU hours).