GLAT: Glancing at Latent Variables for Parallel Text Generation

Recently, parallel text generation has received widespread attention due to its success in generation efficiency. Although many advanced techniques have been proposed to improve its generation quality, they still need the help of an autoregressive model for training to overcome the one-to-many multi-modal phenomenon in the dataset, which limits their applications. In this paper, we propose latent-GLAT, which employs discrete latent variables to capture word categorical information and invokes an advanced curriculum learning technique, alleviating the multi-modality problem. Experimental results show that our method outperforms strong baselines without the help of an autoregressive model, which further broadens the application scenarios of the parallel decoding paradigm.

* Shujian Huang is the corresponding author. † Work is done while at ByteDance AI Lab. ‡ The implementation of latent-GLAT will be released at https://github.com/baoy-nlp/Latent-GLAT.
Many researchers have devoted themselves to improving the inferior generation quality of non-autoregressive Transformers (NATs). Their approaches include modeling word inter-dependencies by curriculum learning (Guo et al., 2020a; Liu et al., 2020) or iterative refinement mechanisms (Ghazvininejad et al., 2019; Guo et al., 2020b), introducing latent variables to decompose target sentences and serve as the springboard for decoding (Shu et al., 2019; Ma et al., 2019; Bao et al., 2021), and introducing inductive bias for model training (Wei et al., 2019). The most successful method is the glancing transformer (GLAT, Qian et al., 2021a), which trains the NAT model by sampling partial target words as inputs to predict the remaining target words, explicitly building dependencies between the observed and unobserved words. Qian et al. (2021b) employ GLAT to achieve impressive results on the translation task of WMT21, even outperforming many strong autoregressive translation systems in BLEU score (Papineni et al., 2002).
Although existing NAT models achieve competitive results compared to autoregressive models in translation tasks, it is not negligible that they still need the help of an autoregressive Transformer (AT, Vaswani et al., 2017) as a teacher for training, i.e., sequence-level knowledge distillation (Kim and Rush, 2016). A well-recognized explanation is the multi-modality problem (Sun and Yang, 2020): each input may have multiple valid outputs in datasets, which prevents NAT models from learning to organize consistent outputs. Training with the outputs of an AT can directly bypass the multi-modal phenomenon in the dataset, effectively improving the models' performance.
However, training NAT models by knowledge distillation has limitations. First, it requires training an extra AT model, which inevitably increases the training cost. Second, it is hard to guarantee that the teacher (AT) model is accurate enough in all text generation settings, so the teacher can become the bottleneck for its student NAT model. Therefore, training a model from scratch without the help of an AT model is still an open and interesting problem.
In this paper, we propose latent-GLAT, which can directly learn from the raw dataset. It alleviates the multi-modality problem following a divide-and-conquer spirit, introducing a small set of discrete latent variables to capture the target word categorical information and dividing the original goal into latent variable modeling and sentence reconstruction. First, the categorical information may have fewer multi-modality phenomena than the original words, and thus can be learned directly without the help of knowledge distillation. Second, the word categorical information is informative for sentence reconstruction. We can extend glancing training with these discrete latent variables for modeling the sentence, encouraging the model to build dependencies on word categorical information rather than on words, which works more robustly.
Experimental results on the WMT14, Quora, and DailyDialog datasets show that latent-GLAT achieves remarkable improvements over several strong baselines, verifying its effectiveness. More impressively, latent-GLAT even outperforms autoregressive models on the Quora and DailyDialog datasets, further validating our motivation for removing knowledge distillation. In-depth analyses indicate that the introduced discrete latent variables help alleviate the multi-modality problem and are necessary for performance improvement.

Background
For a sequence-to-sequence task of predicting a sequence $Y = (y_1, y_2, \cdots, y_m)$ given its input sequence $X = (x_1, x_2, \cdots, x_n)$, the classical autoregressive factorization decomposes $p(Y|X)$ into a series of conditional probabilities:
$$p_{\mathrm{AT}}(Y|X) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, X),$$
where $y_{<t} = (y_1, y_2, \cdots, y_{t-1})$. Although such a factorization has achieved great success in previous studies (Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017), it predicts each word based on the prefix words, which may suffer from error accumulation and slow decoding during inference.
Non-autoregressive Transformer. To tackle the above problems, Gu et al. (2018) first propose the non-autoregressive Transformer (NAT), introducing a non-autoregressive factorization as:
$$p_{\mathrm{NAT}}(Y|X) = \prod_{t=1}^{m} p(y_t \mid X),$$
where each word $y_t$ is modeled independently. During inference, the NAT model can decode all words simultaneously by $\arg\max_{y_t} p(y_t \mid X)$ for each $y_t$, remarkably improving the efficiency (15× speedup over an autoregressive Transformer). However, the independence assumption may prevent the NAT model from leveraging the inherent word dependencies to organize consistent outputs. As a result, the efficiency improvements of NAT come at the cost of quality, e.g., a performance degradation of more than 10.0 BLEU (Papineni et al., 2002) points in machine translation tasks (Gu et al., 2018). Besides, recent studies (Sun and Yang, 2020) point out that the multi-modality phenomenon in the dataset aggravates the challenge for NAT models.
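To make the effect of the independence assumption concrete, the following is a minimal sketch in plain Python. The three two-word outputs and their probabilities are invented for illustration and are not taken from any dataset; the helper names `position_marginals` and `nat_decode` are likewise hypothetical. The sketch shows that taking an independent arg max at each position can combine words from different valid outputs into a sequence that never occurs in the data.

```python
from collections import defaultdict

# Hypothetical toy distribution p(Y | X) over two-word outputs for one input.
# The modes and probabilities are invented purely for illustration.
modes = {
    ("thank", "you"): 0.4,
    ("many", "thanks"): 0.3,
    ("much", "thanks"): 0.3,
}

def position_marginals(joint, length):
    """Marginal distribution p(y_t | X) for every position t."""
    marginals = [defaultdict(float) for _ in range(length)]
    for sentence, prob in joint.items():
        for t, word in enumerate(sentence):
            marginals[t][word] += prob
    return marginals

def nat_decode(joint, length):
    """Parallel decoding: an independent arg max at every position."""
    return tuple(max(m, key=m.get) for m in position_marginals(joint, length))

output = nat_decode(modes, 2)
joint_prob = modes.get(output, 0.0)  # probability of the decoded pair as a whole
print(output, joint_prob)
```

Here the first position prefers "thank" (0.4) while the second prefers "thanks" (0.3 + 0.3), so the decoded pair mixes two modes and has zero probability under the toy joint distribution.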
Glancing Transformer. To mitigate the issue of missing word dependency in NAT models, Qian et al. (2021a) propose the Glancing Transformer (GLAT), introducing glancing training (GLT) and sampling partial target tokens for training NAT:
$$\mathcal{L}_{\mathrm{GLAT}} = -\log p(\overline{Y_{\mathrm{obs}}} \mid Y_{\mathrm{obs}}, X),$$
where $Y_{\mathrm{obs}}$ is the set of sampled partial target tokens and $\overline{Y_{\mathrm{obs}}}$ is its complement. It progressively decreases the sampling ratio and obtains better performance in machine translation tasks. Nevertheless, we find that GLAT in experiments still has a multi-modality problem: First, its sampling rate cannot be decreased to zero during training, which leads to exposure bias. Second, it still relies heavily on a teacher model for further improvements (Qian et al., 2021a).
Latent Transformer. Latent Transformer models introduce latent variables $z$ and factorize the output distribution as:
$$p_{\mathrm{LT}}(Y|X) = \int_{z} p_{\mathrm{LT}}(z \mid X) \prod_{t=1}^{m} p_{\mathrm{LT}}(y_t \mid z, X)\, dz,$$
where $p_{\mathrm{LT}}(Y|X)$ is typically trained by variational inference (Ma et al., 2019) or discretization techniques (Kaiser et al., 2018). Such latent variables are decomposed from the target sentence; they are informative for determining the mode of the sentence and alleviate the multi-modality problem.
Although Latent Transformer models improve performance in terms of BLEU score, the autoregressive predictor (Kaiser et al., 2018; Bao et al., 2021) or deep iterative transformation (Shu et al., 2019; Ma et al., 2019) that they use for predicting latent variables unavoidably sacrifices the overall decoding efficiency. Besides, they do not explicitly build the interdependencies among the outputs.

Proposed Method: latent-GLAT
In this section, we present latent-GLAT. latent-GLAT follows Latent Transformer models (Kaiser et al., 2018; Bao et al., 2021) but introduces glancing training (Qian et al., 2021a) with the discrete latent variables. Our intuitions are as follows: First, the introduced discrete latent variables may have fewer modes than words and are informative for determining the modes of the sentences. In such a case, we can directly learn the discrete latent variables with the Glancing Transformer (Qian et al., 2021a), keeping competitive inference efficiency. More importantly, we can employ the latent variables to invoke glancing training for modeling the target sentences; the latent variables are informative enough to reduce the multi-modality problem of the original sentences. Besides, glancing at latent variables also works robustly because we can obtain the latent variables during inference.

Introducing Discrete Latent Variables for Modeling Target Categorical Information
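In this part, we describe the structure of latent-GLAT, which introduces a small set of discrete latent variables for a NAT model. Let $K$ be the size of the discrete latent space and let $[K]$ denote the set $\{1, 2, \cdots, K\}$. For each target sentence $Y = (y_1, y_2, \cdots, y_m)$, we use a same-length latent variable sequence to model it as:
$$p(Y|X) = \sum_{z} p_\theta(z \mid X) \cdot \prod_{t=1}^{m} p_\theta(y_t \mid z, X),$$
where $z = (z_1, z_2, \cdots, z_m)$ with $z_i \in [K]$, and $\theta$ denotes the model parameters.
Discretization. To discretize target sentences into latent variables, we use vector quantization, which works by dividing a large set of original vector representations into small groups. We assign each token $y_i$ to the group $j \in [K]$ whose maintained representation is nearest to the token's representation:
$$z_i = \arg\min_{j \in [K]} \lVert \mathrm{repr}(y_i) - q_j \rVert_2,$$
where $q \in \mathbb{R}^{K \times d_{\mathrm{model}}}$ denotes the maintained representations and $d_{\mathrm{model}}$ is their dimension. Following Bao et al. (2021), we use the embedding as $\mathrm{repr}(y_i)$. Finally, the model is trained to minimize
$$\mathcal{L}_{\mathrm{WP}} + \mathcal{L}_{\mathrm{LP}} = -\log p_\theta(Y \mid z, X) - \log p_\theta(z \mid X),$$
where $\mathcal{L}_{\mathrm{WP}}$ and $\mathcal{L}_{\mathrm{LP}}$ are the prediction losses for words $Y$ and latent variables $z$, respectively. The maintained representations $q$ are updated with an exponential moving average over a mini-batch of target tokens $\{y_1, \cdots, y_i, \cdots\}$:
$$c_j \leftarrow \lambda c_j + (1 - \lambda) \sum_{i} \mathbb{1}[z_i = j], \qquad q_j \leftarrow \lambda q_j + (1 - \lambda) \sum_{i} \frac{\mathbb{1}[z_i = j]\, \mathrm{repr}(y_i)}{c_j},$$
where $c_j$ is the assigned count for group $j$, and we set the decay parameter $\lambda = 0.999$ in our experiments.
latent-GLAT consists of an encoder $F_{\mathrm{ENC}}$ (NAT Encoder), a latent predictor $F_{\mathrm{LP}}$ (NAT Predictor), and a decoder $F_{\mathrm{DEC}}$ (Mix. Decoder). We parameterize them with multi-head attention-based encoders or decoders, similar to the Transformer (Vaswani et al., 2017). Their functions can be formalized as:
$$(e_1, e_2, \cdots, e_n) \leftarrow F_{\mathrm{ENC}}(x_1, x_2, \cdots, x_n),$$
$$(h_1, h_2, \cdots, h_m) \leftarrow \mathrm{softcopy}(e_{1:n}),$$
$$p_\theta(z \mid X) \leftarrow F_{\mathrm{LP}}(h_{1:m}, e_{1:n}),$$
$$p_\theta(Y \mid z, X) \leftarrow F_{\mathrm{DEC}}(z_{1:m}, h_{1:m}, e_{1:n}),$$
where we use an extra module $F_{\mathrm{LEN}}$ to predict the target length $m$ and initialize the decoder inputs $H = (h_1, h_2, \cdots, h_m)$ with the softcopy (Wei et al., 2019) mechanism.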
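The discretization step can be illustrated with a minimal NumPy sketch, assuming token representations are given as rows of an array. The function names `assign_codes` and `ema_update`, the toy sizes, and the small epsilon that guards against empty groups are assumptions made for illustration, not part of the released implementation. Codes are 0-based here, whereas the text indexes groups from 1 to K.

```python
import numpy as np

def assign_codes(reprs, codebook):
    """Assign each token representation to its nearest codebook row.

    reprs:    (num_tokens, d_model) token representations repr(y_i)
    codebook: (K, d_model) maintained representations q
    returns:  (num_tokens,) integer codes in [0, K)
    """
    # Squared Euclidean distance between every token and every code.
    dists = ((reprs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def ema_update(codebook, counts, reprs, codes, decay=0.999):
    """One exponential-moving-average update of the counts c and codebook q."""
    num_codes = codebook.shape[0]
    one_hot = np.eye(num_codes)[codes]          # (num_tokens, K)
    batch_counts = one_hot.sum(axis=0)          # tokens assigned to each code
    batch_sums = one_hot.T @ reprs              # summed representations per code
    new_counts = decay * counts + (1.0 - decay) * batch_counts
    # Assumption: guard against division by zero for codes that are never used.
    safe_counts = np.maximum(new_counts, 1e-8)[:, None]
    new_codebook = decay * codebook + (1.0 - decay) * batch_sums / safe_counts
    return new_codebook, new_counts

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))              # toy sizes: K = 4, d_model = 8
reprs = rng.normal(size=(10, 8))                # ten token representations
codes = assign_codes(reprs, codebook)
new_codebook, new_counts = ema_update(codebook, np.ones(4), reprs, codes)
```

With a decay of 0.999 each mini-batch moves the codebook only slightly, which keeps the group assignments stable while the embeddings are still changing during training.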

Glancing at Discrete Latent Variables for Parallel Sequence Decoding
The small number (K < 128) of discrete latent variables can capture high-level categorical information of the target words, supporting a better learning design for parallel sequence decoding. Our first insight is that we can learn to non-autoregressively predict the discretized latent variables directly without the help of distillation. Specifically, we parameterize $F_{\mathrm{LP}}$ in a non-autoregressive fashion and use the glancing training technique (GLT, Qian et al., 2021a) to optimize it, as shown in Figure 2a:
$$\mathcal{L}_{\mathrm{LP}} = -\log p_\theta(\overline{z_{\mathrm{obs}}} \mid z_{\mathrm{obs}}, X),$$
where $z_{\mathrm{obs}}$ is uniformly sampled from $z$, following Qian et al. (2021a), and $\overline{z_{\mathrm{obs}}}$ is its complement. We provide more training details of latent-GLAT in Appendix B. Our next insight is to model the sentence based on the sampled latent variables $z_{\mathrm{obs}}$ rather than $z$, namely, glancing at $z_{\mathrm{obs}}$ for optimizing $F_{\mathrm{DEC}}$:
$$\mathcal{L}_{\mathrm{WP}} = -\log p_\theta(Y \mid z_{\mathrm{obs}}, X). \tag{10}$$
We find Eqn. (10) works robustly in experiments and analyze it in Section 4.3. As shown in Figure 2b, we eventually also employ words to invoke glancing training for minimizing $\mathcal{L}_{\mathrm{WP}}$, namely, we optimize $F_{\mathrm{DEC}}$ by minimizing
$$\mathcal{L}_{\mathrm{WP}} = -\log p_\theta(\overline{Y_{\mathrm{obs}}} \mid z_{\mathrm{obs}}, Y_{\mathrm{obs}}, X),$$
where $Y_{\mathrm{obs}}$ and $z_{\mathrm{obs}}$ are the sampled target tokens and discrete latent variables.
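The glancing objective over latent variables can be sketched as follows, under simplifying assumptions: a fixed fraction of positions is revealed uniformly at random, and the loss is the negative log-likelihood summed over the remaining positions. The adaptive choice of how many positions to reveal, as used in glancing training (Qian et al., 2021a), is omitted here. The names `glancing_mask` and `masked_nll` are hypothetical, and the uniform predictor only stands in for the output of the latent predictor.

```python
import numpy as np

def glancing_mask(length, ratio, rng):
    """Uniformly pick round(ratio * length) positions to reveal as z_obs."""
    num_obs = int(round(ratio * length))
    observed = np.zeros(length, dtype=bool)
    observed[rng.choice(length, size=num_obs, replace=False)] = True
    return observed

def masked_nll(log_probs, targets, observed):
    """Negative log-likelihood summed over the unobserved positions only.

    log_probs: (length, K) log-probabilities from a predictor
    targets:   (length,) integer codes z
    observed:  (length,) boolean mask, True where the code was revealed
    """
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked[~observed].sum()

rng = np.random.default_rng(0)
z = np.array([2, 0, 1, 1, 3, 0, 2, 2, 1, 0])          # toy latent codes, K = 4
observed = glancing_mask(len(z), ratio=0.3, rng=rng)
# Stand-in predictor that assigns uniform probability to all four codes.
uniform_log_probs = np.log(np.full((len(z), 4), 0.25))
loss = masked_nll(uniform_log_probs, z, observed)
```

In this toy case three of the ten codes are revealed, so the loss covers the other seven positions and equals 7 · log 4 for the uniform predictor.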
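Overall Training Loss. Our full-fledged loss includes latent variable prediction, sentence reconstruction, and length prediction losses:
$$\mathcal{L} = \mathcal{L}_{\mathrm{LP}} + \mathcal{L}_{\mathrm{WP}} + \alpha \mathcal{L}_{\mathrm{LEN}},$$
where $\alpha = 0.1$ is the hyperparameter that adjusts the importance of the length prediction loss $\mathcal{L}_{\mathrm{LEN}}$.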

Inference
In the inference phase, latent-GLAT predicts the target length, latent variables, and sentence in turn. It first predicts the target length $m$ with the length predictor $F_{\mathrm{LEN}}$. To mitigate length prediction errors during inference, latent-GLAT expands the length $m$ to a range (we use $[m - 3, \cdots, m + 2]$, six candidates in total, in our experiments).
Then, latent-GLAT predicts the latent variables $\hat{z}$ with $\arg\max_{z} p_\theta(z \mid X)$ and the sentence $\hat{Y}$ with $\arg\max_{Y} p_\theta(Y \mid \hat{z}, X)$ for each candidate.
Similar to Ma et al. (2019), latent-GLAT also ranks the candidates by itself (self-reranking) and chooses the highest-scoring output with:
$$\hat{Y} = \arg\max_{Y} \frac{1}{|Y|^{\gamma}} \log p_\theta(Y \mid \hat{z}, X),$$
where $\gamma$ is the length penalty ratio used to avoid length bias, and $|Y|$ denotes the length of $Y$.
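The inference procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names `length_candidates` and `rerank` are hypothetical, the candidate sentences and their log-probabilities are made up, and the score is taken to be the log-probability divided by the length raised to the power γ.

```python
def length_candidates(m, below=3, above=2):
    """Candidate lengths [m - below, ..., m + above], keeping only lengths >= 1."""
    return [n for n in range(m - below, m + above + 1) if n >= 1]

def rerank(candidates, gamma=1.1):
    """Pick the candidate with the highest length-normalised log-probability.

    candidates: list of (tokens, log_prob) pairs, where log_prob stands in for
    the model's own score log p(Y | z_hat, X) of that candidate.
    """
    return max(candidates, key=lambda c: c[1] / (len(c[0]) ** gamma))

lengths = length_candidates(10)
# Made-up candidates: a shorter output with a higher raw log-probability and
# a longer output with a slightly lower one.
toy = [(["a", "b"], -4.0), (["a", "b", "c"], -4.5)]
best_with_penalty = rerank(toy, gamma=1.1)
best_without_penalty = rerank(toy, gamma=0.0)
```

Without normalisation (γ = 0) the shorter candidate wins on raw log-probability; with γ = 1.1 the longer candidate is preferred, which illustrates how the length penalty counteracts the bias toward short outputs.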
We conduct experiments on several generation tasks, including machine translation, paraphrase generation, and dialog generation.

Experimental Setup
Dataset. We chose the most popular benchmarks for each task:
• Machine Translation (MT): We follow previous practices in NAT models and use the WMT14 English (EN) ↔ German (DE) corpus (4.5M sentence pairs) and the IWSLT14 German (DE) → English (EN) corpus (160K sentence pairs) to validate our proposed model. We obtain the datasets following the instructions open-sourced in fairseq. In detail, we first tokenize the datasets with the Moses script. Then, we use 37,000 and 10,000 operations to split the words into byte-pair encodings (BPE, Sennrich et al., 2016) in the WMT14 and IWSLT14 datasets, respectively. We also share subword embeddings between the source and target language for each dataset.
• Paraphrase Generation (PG): We use the Quora dataset to evaluate the paraphrase generation task. The Quora dataset contains around 135K labeled paraphrase pairs. Following the standard dataset split, we sample 100K sentence pairs from the labeled paraphrases as training data, hold out 30K pairs for testing, and use the remaining 5K pairs or so for validation. As in the MT tasks, we tokenize the corpus with Moses scripts and split the words into BPE units with 32K operations in total.
• Dialog Generation (DG): We conduct the dialog generation experiments on the DailyDialog dataset (Li et al., 2017). We obtain the processed DailyDialog dataset from Bao et al. (2020). The training set contains 87,170 sentence pairs (11,118 dialogues). The validation and test sets contain 8,069 pairs (1,000 dialogues) and 7,740 pairs (1,000 dialogues), respectively.
Note that these tasks emphasize different aspects. The MT task aims to transfer sentences between languages while keeping the semantics invariant. The PG task differs from machine translation and works on mode transformation within the same language; its goal is to synthesize a sentence that differs from the original input but conveys the same meaning. The DG task is the most challenging due to its complex generation goal.
For machine translation tasks, we use the base setting ($d_{\mathrm{model}} = 512$, $d_{\mathrm{hidden}} = 2048$, dropout = 0.1, $n_{\mathrm{head}} = 8$, and $n_{\mathrm{layer}} = 6$) of the Transformer (Vaswani et al., 2017) for the WMT14 dataset and a smaller setting ($d_{\mathrm{model}} = 512$, $d_{\mathrm{hidden}} = 1024$, dropout = 0.3, $n_{\mathrm{head}} = 4$, and $n_{\mathrm{layer}} = 6$) for the IWSLT14 dataset. The numbers of layers in the latent-GLAT decoder and latent predictor are both set to 4 in experiments. We use inverse square root learning rate scheduling for WMT14 and a linearly annealed learning rate from $3.0 \times 10^{-4}$ to $1.0 \times 10^{-5}$ in 250K steps for IWSLT14. The models are optimized with the Adam (Kingma and Ba, 2015) optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) for 300K steps on WMT14 and 250K steps on IWSLT14. As for the ratio $\tau$ used in glancing sampling, we linearly anneal it from 0.5 to 0.3 over the whole training process. The mini-batch in each step consists of 2K tokens for IWSLT14 and 64K tokens for WMT14.
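The two linear schedules described above can be written down directly. The following is a minimal sketch; the function names are hypothetical, and holding the final value after the last step is an assumption, since the text only specifies the start and end points.

```python
def linear_anneal(step, total_steps, start, end):
    """Linearly interpolate from start to end over total_steps, then hold end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def glancing_ratio(step, total_steps=300_000):
    """Glancing sampling ratio tau, annealed from 0.5 to 0.3 (WMT14 step count)."""
    return linear_anneal(step, total_steps, 0.5, 0.3)

def iwslt_learning_rate(step, total_steps=250_000):
    """IWSLT14 learning rate, annealed from 3.0e-4 to 1.0e-5 in 250K steps."""
    return linear_anneal(step, total_steps, 3.0e-4, 1.0e-5)

halfway_ratio = glancing_ratio(150_000)      # midpoint of the WMT14 schedule
final_lr = iwslt_learning_rate(250_000)      # value at the last IWSLT14 step
```

Halfway through a 300K-step run the glancing ratio is 0.4, the midpoint between the two endpoints.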
Since the scale of the Quora and DailyDialog datasets is close to that of IWSLT14, we keep the same settings as for IWSLT14, such as the Adam optimizer, the learning rate (linear annealing from $3.0 \times 10^{-4}$ to $1.0 \times 10^{-5}$), and the batch size (2K tokens).
Evaluation. To validate the effectiveness of our proposed method, we evaluate it in terms of quality and efficiency. We use tokenized and cased BLEU scores (Papineni et al., 2002) to evaluate the generation quality of the MT and PG tasks. For dialog generation, we also include BLEU-1 and BLEU-2 scores for analysis. Following common practice (Gu et al., 2018; Qian et al., 2021a), we measure the decoding latency of each model by decoding sentence by sentence and compute the speedup relative to the autoregressive Transformer (AT) model to reflect its decoding efficiency. We highlight the best NAT result.

Main Results
We can see from Table 1 that our latent-GLAT outperforms almost all the NAT baselines (NAT and GLAT) in generation quality on all tasks while keeping a competitive decoding speedup over the autoregressive counterpart.
Machine Translation. As seen, without the help of an AT model for training, the vanilla NAT and the advanced GLAT model obtain only inferior generation quality. In contrast, latent-GLAT achieves competitive generation quality in machine translation tasks, indicating that the introduced latent variables effectively reduce the multi-modality issue and support glancing training well. It narrows the performance gap between non-autoregressive decoding and autoregressive decoding from 11.46 (GLAT vs. AT) to 2.34 (latent-GLAT vs. AT) BLEU points on the WMT14 EN→DE task while keeping high decoding efficiency.
Paraphrasing. Unlike the translation task, the performance gap between non-autoregressive and autoregressive decoding on the paraphrase generation task is minor (NAT vs. AT, −3.32 BLEU points; GLAT vs. AT, −0.96 BLEU points). Nevertheless, introducing discrete latent variables still helps to obtain better performance. latent-GLAT yields a non-autoregressive model that performs better than the autoregressive model on Quora (latent-GLAT vs. AT, +1.14 points).
Dialog Generation. We can see a different trend on the DailyDialog dataset: the AT model performs worse than the NAT models. Both GLAT and latent-GLAT outperform the AT model in BLEU-1, BLEU-2, and BLEU scores, indicating that these models recall more reference tokens and organize the tokens well. We conjecture that the weak and indirect association between the inputs and outputs of the dialogue results in this unusual phenomenon. Specifically, the weak connection may encourage the AT model to predict tokens by paying more attention to its own history outputs, so that it degenerates into a target-side language model. In contrast, the NAT models do not have this shortcut, which pushes them to pay more attention to the inputs and recall more target tokens. We further find that there are so-called safe responses (Li et al., 2016) in the AT model's outputs, which supports our conjecture.
More Comparisons. We further compare with advanced NAT models that build upon latent variables or iterative refinement in machine translation tasks:
• NATs w/ latent variables: LV-NAR (Shu et al., 2019), SynST (Akoury et al., 2019), Flowseq (Ma et al., 2019), and CNAT (Bao et al., 2021).
• Iterative NATs: CMLM (Ghazvininejad et al., 2019) and LevT (Gu et al., 2019).
Table 2 shows that introducing latent variables (LV-NAR, Flowseq, and CNAT) and decoding with multiple iterations (CMLM and LevT) both improve the translation quality of non-autoregressive decoding. However, iterative refinements or deep transformations always sacrifice decoding efficiency. In contrast, the proposed latent-GLAT outperforms all NAT models at a relatively low cost, keeping a competitive speedup over the autoregressive Transformer (AT). Specifically, latent-GLAT with one-pass decoding narrows the performance gap to the AT from 5.87 BLEU points to 2.34 BLEU points on the WMT14 EN→DE test set.
Decoding efficiency. We can see there is a tradeoff between the translation quality and decoding efficiency in Table 2. We thus present the scatter plot of different models in Figure 3, showing the trend of translation quality and decoding efficiency.
As seen, latent-GLAT is located on the top-right of the baselines. It outperforms the baselines in the BLEU score if decoding speedup is fixed and in decoding speedup if the BLEU score is fixed.

Analysis
We now turn to verify our intuition that latent-GLAT can alleviate the multi-modality problem.
latent-GLAT largely alleviates the sentence-level multi-modal problem. Previous studies (Gu et al., 2018; Ma et al., 2019; Qian et al., 2021a; Bao et al., 2021) commonly utilize a Transformer model as a teacher for training NAT models, namely sequence-level knowledge distillation (Kim and Rush, 2016), which can directly reduce the sentence-level multi-modal phenomenon in datasets. Therefore, we use the average gains from knowledge distillation to reflect the ability of the NAT models to overcome this issue. As seen in Table 3, the pure NAT models rely heavily on knowledge distillation. By introducing the target information with latent variables (Flowseq and CNAT) or sampled tokens (GLAT), the NAT models improve their ability to overcome the multi-modality issue. Our proposed latent-GLAT combines the above two techniques well. It obtains an average gain of only 0.95 BLEU points from distillation, which validates our motivation.
Discrete latent variables have fewer modes than raw sentences. To validate our intuition that the introduced latent variables are easier to predict than tokens, we compute complexity metrics on each dataset according to alignment relations. Specifically, we use the fast_align toolkit to align the source input $X$ with the target outputs $Y$ or the discretized latent variable sequences $z$. Then, we compute the token-level complexity $C_{\mathrm{TOK}}(d)$ and the sentence-level complexity $C_{\mathrm{SEN}}(d)$. These metrics can be loosely understood as the number of valid candidates for each input.
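The exact definitions of the two complexity metrics are not given in this text. As an illustration only, one plausible token-level formalization is the conditional entropy of the aligned target symbol given the source token, estimated from aligned pairs; this choice, the function name `conditional_entropy`, and the toy pairs below are all assumptions. The sketch shows how mapping several target words to one latent code lowers such a measure.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(aligned_pairs):
    """Estimate H(target | source) in bits from aligned (source, target) pairs."""
    by_source = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        by_source[src][tgt] += 1
    total = sum(sum(counts.values()) for counts in by_source.values())
    entropy = 0.0
    for counts in by_source.values():
        n = sum(counts.values())
        for k in counts.values():
            # p(src, tgt) = k / total and p(tgt | src) = k / n
            entropy -= (k / total) * math.log2(k / n)
    return entropy

# Made-up alignments: the source token "danke" is aligned to two target words.
word_pairs = [("danke", "thanks"), ("danke", "thank"), ("hund", "dog"), ("hund", "dog")]
# Hypothetical discretization in which both target words share latent code 7.
code_pairs = [("danke", 7), ("danke", 7), ("hund", 3), ("hund", 3)]

word_complexity = conditional_entropy(word_pairs)
code_complexity = conditional_entropy(code_pairs)
```

In this toy case the word-level value is 0.5 bits while the code-level value is 0 bits, because the two competing target words collapse into a single latent code.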
As shown in Table 4, the latent variables have the lowest complexity at both the token level and the sentence level. In other words, predicting the latent variable sequences is easier than predicting the others, which is consistent with our intuition. Although we can obtain a lower-complexity dataset by filtering the datasets with an autoregressive model (AT outputs versus raw outputs), this may introduce model errors and requires extra training for the AT model. In contrast, the discrete latent variables are simple and informative enough to serve as a springboard for modeling target sentences.
Glancing with latent variables improves the performance by a large margin. We can see in Table 5 that introducing latent variables yields performance gains over the respective counterparts (L#2 vs. L#1, +0.83 points, and L#4 vs. L#3, +1.69 points). As expected, the gains grow substantially when adopting glancing training with discrete latent variables (L#5 vs. L#1, +9.75 points), which already outperforms glancing training with the reference tokens (L#5 vs. L#4, +3.55 points). Finally, we jointly perform glancing training with the reference tokens and discrete latent variables, achieving the best result (L#6 vs. L#1, +11.04 points).
Figure 4: BLEU scores of latent-GLAT using different length penalty ratios on the WMT14 EN→DE valid set. We search the length penalty ratio γ for latent-GLAT while fixing K = 64.
Effects of K and γ. As shown in Figure 4 and Table 6, we search the hyper-parameters of latent-GLAT, namely the number of discrete latent variables K and the length penalty ratio γ, according to the validation performance. We notice that using more latent codes causes performance degradation during inference: the latent variables may degenerate to tokens and contain more prediction errors. latent-GLAT implemented with 64 latent variables and γ = 1.1 obtains the best result on the WMT14 EN→DE valid set.

Related Work
Gu et al. (2018) first propose a non-autoregressive Transformer (NAT) model for neural machine translation (NMT) and begin to explore parallel decoding. It abandons explicit modeling of word inter-dependencies to decode the tokens in parallel, significantly improving the inference speed. However, its translation quality is inferior to that of the Transformer (Vaswani et al., 2017).
To alleviate this performance degradation, many researchers work to enhance word dependency modeling, including imitation learning (Wei et al., 2019), curriculum learning (Guo et al., 2020a; Liu et al., 2020), iterative refinements (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019; Guo et al., 2020b; Huang et al., 2022), and a simplified autoregressive process. The most representative method is the glancing transformer model (Qian et al., 2021a), which adaptively and progressively samples partial tokens as inputs and predicts the remaining tokens, effectively establishing the dependencies between the sampled tokens and the remaining tokens. However, these models still rely on a teacher for training and cannot directly learn from a raw dataset that contains the one-to-many multi-modality phenomenon.
Introducing latent variables (Bao et al., 2019, 2021) to organize the target sentence is also a helpful route. Among these approaches, our method is close to Kaiser et al. (2018) and Bao et al. (2021). These methods decompose the latent variables (hints) from the target sentence and divide the original goal into two parts: modeling the latent variables and modeling the target sentences based on the latent variables. This implicitly overcomes the multi-modality phenomenon of target sentences because the latent variables can largely determine the mode of the sentence. However, these methods always model the latent variables with an autoregressive predictor, which naturally sacrifices decoding efficiency.
Unlike them, our approach models the discrete latent variables in a non-autoregressive fashion and extends glancing training with the discrete latent variables. As a result, latent-GLAT achieves competitive performance in both decoding efficiency and quality.

Conclusion
We propose latent-GLAT, which can be directly trained without the help of knowledge distillation. Specifically, we employ discrete latent variables to capture the word categorical information and divide the original goal into latent variable modeling and word prediction tasks. Then, we learn each task with glancing training and encourage the model to build dependencies on the latent variables, which have fewer modes than the words and are also informative for modeling the target sentences. Experimental results on machine translation, paraphrase generation, and dialogue generation tasks validate the effectiveness of latent-GLAT.