POS-Constrained Parallel Decoding for Non-autoregressive Generation

The multimodality problem has become a major challenge for existing non-autoregressive generation (NAG) systems. A common solution resorts to sequence-level knowledge distillation, rebuilding the training dataset through autoregressive generation (hereinafter "teacher AG"). The success of such methods largely depends on a latent assumption, i.e., that the teacher AG is superior to the NAG model. However, in this work, we experimentally reveal that this assumption does not always hold for text generation tasks such as text summarization and story ending generation. To provide a feasible solution to the multimodality problem of NAG, we propose incorporating linguistic structure (the Part-of-Speech sequence in particular) into NAG inference instead of relying on teacher AG. More specifically, the proposed POS-constrained Parallel Decoding (POSPD) method provides a specific POS sequence to constrain the NAG model during decoding. Our experiments demonstrate that POSPD consistently improves NAG models on four text generation tasks, and to a greater extent than knowledge distillation. This observation validates the necessity of exploring alternatives to sequence-level knowledge distillation.


Introduction
Unlike autoregressive generation (AG), which generates tokens step-by-step, non-autoregressive generation (NAG) generates all tokens in parallel in a single step, so inference can be significantly sped up (Ma et al., 2019; Ran et al., 2020; Susanto et al., 2020). Despite this computational advantage, NAG faces the multimodality problem (Gu et al., 2018) caused by conditionally independent decoding. A typical example is illustrated in Figure 1: either "Thank you." or "Many Thanks." is a correct translation (i.e., a generation mode). In this example, a mixed mode such as "Many you." / "Thank Thanks." may be generated by NAG, because the conditional dependence among target words is broken in parallel decoding. A typical manifestation is that words are missing (e.g., "Many you.") or repeated (e.g., "Thank Thanks.") in NAG's output sentences. The key to solving this problem is helping NAG models deal with the various generation modes.

* Correspondence to Wenqiang Lei.

Figure 1: An example of the multimodality problem. The German sentence "Vielen Dank." can be translated into "Many Thanks." and "Thank you.".
To date, one of the most widely used solutions is sequence-level knowledge distillation (Kim and Rush, 2016), which aims to reduce the generation modes of the raw data. Taking machine translation as an example, knowledge distillation based methods rebuild the target sequences in the training set by employing an AG model to translate the training samples. The assumption is that target sentences generated by a single AG model tend to have fewer modes. Despite the success of these studies, two major limitations remain: (1) Most existing works focus on machine translation, where the performance of AG is generally assumed to be better than that of NAG. Such a solution will degrade the performance of NAG on tasks where the AG model cannot obtain a better result. As demonstrated in our experiments (see § 4.5), a number of tasks fall outside this assumption, such as text summarization and story ending generation.
(2) Knowledge distillation based methods may cost a tremendous amount of time to rebuild a large-scale training set with AG, which runs counter to NAG's original goal of improving speed.
To overcome the aforementioned limitations, we explore alleviating the multimodality problem in a different manner. In short, we aim to constrain NAG generation modes in the inference stage, rather than directly reducing generation modes in the training stage. More specifically, our basic idea is that the linguistic structure of the target sentence can help alleviate the multimodality problem. In this paper, we show that the Part-of-Speech (POS) sequence, one of the simplest ways to model linguistic structure (Cutting et al., 1992), effectively verifies this idea and shows promising performance on four different tasks. In more detail, the proposed POS-constrained Parallel Decoding (POSPD) trains a POS predictor to obtain POS tags of target sequences. In the inference stage, POSPD constrains NAG models to choose final outputs that satisfy the pre-specified POS sequence. As the POS predictor with a shallow decoder is trained separately, POSPD can act as a plug-and-play method to assist NAG models with negligible extra time. Our method also retains its speed advantage even when the time cost of building the POS dataset is counted, since POS tagging is much faster than sentence generation thanks to the small POS dictionary.
To conduct a comprehensive empirical evaluation, we examine the generalizability of POSPD by applying it to two widely-used NAG models (i.e., CMLM and DisCo) on four text generation tasks: text summarization, story ending generation, question generation, and machine translation. Experiments demonstrate that POSPD significantly and consistently improves the two NAG models and beats sequence-level knowledge distillation by a considerable margin. The main contributions of this work can be summarized as follows: • For the first time, we experimentally reveal that the implicit assumption of knowledge distillation does not always hold across tasks (e.g., text summarization and story ending generation, as demonstrated in our experiments). In other words, AG cannot guarantee better performance than NAG, so using knowledge distillation to alleviate the multimodality problem can yield undesirable NAG performance. This empirical result provides novel insight for revisiting the role of knowledge distillation in NAG.
• To alleviate the multimodality problem across various tasks, we propose POSPD, which employs POS sequences to constrain NAG generation modes in the inference stage. It is simple but effective, and can act as a plug-and-play assistant for NAG models. Such a linguistic-structure-based solution offers an effective and efficient alternative to the knowledge distillation paradigm for alleviating the multimodality problem.

Related Works
In this section, we first analyze related work on alleviating the multimodality problem. Then, we review representative works that introduce linguistic structure into text generation scenarios.

The Multimodality Problem in NAG
Recently, various attempts have been made to alleviate the multimodality problem; they can be roughly divided into two types: (1) reducing the diversity of generation modes in training; (2) helping models select one generation mode in inference. The first type usually trains the NAG model under the guidance of an AG model (called the teacher AG), e.g., sequence-level knowledge distillation (Kim and Rush, 2016), learning from the AG model's hidden states, and curriculum learning with the AG model (Liu et al., 2020d; Guo et al., 2020a). However, these methods implicitly assume that the teacher AG achieves better performance than the NAG model; otherwise they may degrade the NAG model's performance. As two typical methods of the second type, iterative and dynamic programming methods have achieved promising performance. In short, iterative models generate the target sentence by iteratively refining the latest output (Ghazvininejad et al., 2019; Kasai et al., 2020a; Guo et al., 2020b). Alternatively, dynamic programming methods use a heuristic search strategy to select a better output from multiple decoded candidates (Sun et al., 2019; Saharia et al., 2020). The biggest difference between our method and these approaches is that we prespecify the linguistic structure to constrain NAG generation in a plug-and-play way. Extensive experiments verify the effectiveness and efficiency of this idea.

Figure 2: An overview of the POS-constrained Parallel Decoding.

Leveraging the Linguistic Structure
Text generation involves multiple tasks, such as style transfer (Liu et al., 2020a) and text infilling. Dating back to the period of statistical machine translation (Liu et al., 2006; Galley et al., 2006), linguistic structure prediction has long been investigated for text generation. Previous works often model and leverage syntactic structures on the decoder side, e.g., modeling long-distance word correspondence with syntactic dependency trees (Wu et al., 2017), implicitly incorporating linguistic priors in the decoder (Eriguchi et al., 2017), and joint decoding with syntactic structure (Feng et al., 2020). Linguistic structures can also be helpful in NAG. As a global pattern of the target sentence, they can complement parallel decoding by helping models capture word dependencies. However, directly incorporating the aforementioned methods into NAG is less portable for current NAG models, since they were originally designed for AG. In comparison, POSPD acts as a plug-and-play component that uses a separate POS predictor to constrain NAG models during inference. Therefore, the NAG model can enjoy the benefits of syntactic structure constraints while retaining its original model structure.

Methodology
In this section, we elaborate on POSPD for NAG models. For ease of presentation, we start with a toy example to give an overview of POSPD in § 3.1, then give a detailed explanation of the implementation in § 3.2. After that, we present the training details of POSPD in § 3.3.

Overview
An overview of our POSPD method is shown in Figure 2, where a toy example from machine translation is used as a showcase. The German sentence "Vielen Dank." is fed simultaneously into both the POS predictor and the NAG model. The POS predictor generates the POS sequence JJ NNS PCT, which is then converted into a binarized mask matrix through a conversion dictionary. Meanwhile, the NAG model generates preliminary probability distributions through a softmax layer. As in Figure 1, the words "Many" and "you" get the highest probabilities, which would result in the mixed mode "Many you." if the preliminary distribution were followed. To avoid such an undesirable result, POSPD automatically adjusts the probabilities according to the binarized mask matrix. For example, the probability of "you" is set to 0, since the POS tag of "you" is PRP rather than NNS. As a result, "Many Thanks." gets the highest probability and is generated as the output.

POSPD in Details
In this part, we detail POSPD by introducing the conversion dictionary building, the workflow of POSPD, and its core module, the POS predictor.

Building a Conversion Dictionary
The key idea of POSPD is to filter out words that do not satisfy the prespecified POS sequence from the preliminary results of NAG. To implement this idea, we need a conversion dictionary D_c that contains the mapping from POS tags to words. Given a target vocabulary V_w of size |V_w| and a POS tag set V_s, each key of D_c is a POS tag in V_s and the value is the set of words that can be assigned that POS tag. Note that a word may have multiple POS tags; therefore, one word may appear in multiple sets in D_c.

The POSPD Workflow

The workflow of POSPD is as follows: given a source sentence x, POSPD feeds it into both the NAG model's encoder and the POS predictor. The POS predictor then outputs a POS sequence s = (s_1, s_2, ..., s_L) for the target sentence. Meanwhile, the decoder of the NAG model generates a preliminary distribution matrix D = (d_1, d_2, ..., d_L), where d_i (a vector of length |V_w|) represents the distribution over all words at the i-th position. Note that the sentence length follows the length L of the predicted POS sequence.
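As an illustration of the dictionary described above, D_c could be built from any POS-tagged corpus. This is a minimal sketch; the function name and the toy corpus are our own and not from the paper:

```python
from collections import defaultdict

def build_conversion_dict(tagged_corpus):
    """Map each POS tag to the set of vocabulary words that can carry it.
    A word with several possible tags appears under several keys."""
    d_c = defaultdict(set)
    for tokens, tags in tagged_corpus:
        for word, tag in zip(tokens, tags):
            d_c[tag].add(word)
    return dict(d_c)

# Toy tagged corpus for illustration only.
corpus = [(["Many", "Thanks", "."], ["JJ", "NNS", "PCT"]),
          (["Thank", "you", "."], ["VB", "PRP", "PCT"])]
d_c = build_conversion_dict(corpus)
print(d_c["NNS"])  # {'Thanks'}
```

Because the mapping is many-to-many, a word such as "run" would appear under both its noun and verb keys in a realistic corpus.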
For ease of implementation, the POS sequence s is converted into a binarized mask matrix M = (m_1, m_2, ..., m_L). In detail, for each POS tag s_i, the corresponding binarized vector is m_i = (m_i^1, m_i^2, ..., m_i^{|V_w|}), and its j-th position m_i^j is defined as:

m_i^j = 1 if w_j ∈ D_c[s_i], and m_i^j = 0 otherwise,

where w_j is the j-th word token in V_w. As a result, the POS sequence s is replaced by M. Finally, we obtain the new generation result at each position by:

ŷ_i = argmax_j (d_i ⊙ m_i)_j, for i = 1, ..., L,

where ⊙ denotes element-wise multiplication.

The POS Predictor

As the core module of POSPD, our POS predictor outputs the POS tag sequence of the target sentence when given the source sentence as input.
To train the POS predictor, we need to create a POS dataset where each sample is a pair consisting of a source sentence and the POS sequence of the target sentence. As shown in Figure 3, the architecture of our POS predictor is a variant of the standard Transformer (Vaswani et al., 2017). As indicated by the gray arrow flow, the main difference between our POS predictor and the vanilla Transformer is the number of encoder and decoder layers. Unlike the vanilla Transformer, which contains six layers each for the encoder and decoder, we use a multi-layer encoder and a one-layer decoder to reduce inference time, because decoding a POS sequence is far less complex than decoding the original sentence.
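Putting § 3.2 together, the mask construction and constrained argmax could be sketched with NumPy as below. The vocabulary, distributions, and dictionary are toy values mirroring the Figure 2 example, not the paper's implementation:

```python
import numpy as np

def pos_constrained_decode(dist, pos_seq, d_c, vocab):
    """Zero out words whose POS tag differs from the predicted tag
    at each position (the mask M), then take the per-position argmax."""
    word2idx = {w: j for j, w in enumerate(vocab)}
    mask = np.zeros_like(dist)
    for i, tag in enumerate(pos_seq):
        for w in d_c.get(tag, ()):
            mask[i, word2idx[w]] = 1.0  # m_i^j = 1 iff w_j in D_c[s_i]
    return [vocab[j] for j in (dist * mask).argmax(axis=1)]

vocab = ["Many", "Thank", "Thanks", "you", "."]
d_c = {"JJ": {"Many"}, "NNS": {"Thanks"}, "PRP": {"you"}, "PCT": {"."}}
# Preliminary NAG distributions: "you" wrongly dominates position 2.
dist = np.array([[0.50, 0.40, 0.05, 0.03, 0.02],
                 [0.10, 0.10, 0.30, 0.45, 0.05],
                 [0.00, 0.00, 0.00, 0.10, 0.90]])
print(pos_constrained_decode(dist, ["JJ", "NNS", "PCT"], d_c, vocab))
# → ['Many', 'Thanks', '.']  ("you" is masked out at the NNS position)
```

Without the mask, the argmax would yield the mixed mode "Many you ." from these distributions; the POS constraint steers the output back to a single mode.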

POS Predictor Optimization
To optimize the POS predictor, we adopt a multi-task learning (Evgeniou and Pontil, 2004) paradigm that jointly decodes the word sequence and the POS sequence on the target side. The underlying hypothesis is that the target sentence is highly related to its POS sequence. Given a source sentence x, a POS sequence s, and a target sentence y = (y_1, y_2, ..., y_L), the learning objective is defined as the sum of the POS tagging loss (the first term) and the sentence prediction loss (the second term):

L = L_pos + L_sent,

where the POS sequence prediction loss can be written as:

L_pos = − Σ_{t=1}^{L} log P(s_t | s_<t, x),

and the target sentence prediction loss is:

L_sent = − Σ_{t=1}^{L} log P(y_t | s_<t, x).
In our method, the POS predictor uses an extra linear layer after the decoder to generate the target sentence, as shown in Figure 3. After training, only the POS-predicting linear layer is needed for inference, so the improved POS sequence prediction comes at no extra inference cost.
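A minimal PyTorch sketch of the joint objective above. The tensor shapes and function name are our own assumptions; the paper's exact implementation may differ:

```python
import torch
import torch.nn.functional as F

def multitask_loss(pos_logits, word_logits, pos_targets, word_targets):
    """Sum of the POS tagging loss and the auxiliary sentence loss.
    pos_logits:  (batch, L, |V_s|), word_logits: (batch, L, |V_w|),
    targets:     (batch, L) integer class indices."""
    # cross_entropy expects the class dimension second: (batch, C, L).
    l_pos = F.cross_entropy(pos_logits.transpose(1, 2), pos_targets)
    l_sent = F.cross_entropy(word_logits.transpose(1, 2), word_targets)
    return l_pos + l_sent

# Smoke test with random tensors (|V_s| = 40, |V_w| = 100, L = 5).
loss = multitask_loss(torch.randn(2, 5, 40), torch.randn(2, 5, 100),
                      torch.randint(0, 40, (2, 5)), torch.randint(0, 100, (2, 5)))
```

Since the word-prediction head is dropped after training, l_sent acts purely as an auxiliary regularizer on the shared encoder/decoder.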

Training under the BPE Condition
Almost all NAG models use the Byte Pair Encoding (BPE) (Sennrich et al., 2016) technique to build the word vocabulary from subword-level tokens. However, these tokens cannot be tagged by mainstream POS taggers (Yarowsky and Ngai, 2001), which complicates building the POS dataset. To address this issue, we propose a simple but effective subword-level POS tagging method for our POS predictor. An example is shown in Table 1: the NLTK toolkit tags the word "gutacht" as NN in the original sentence but cannot handle the BPE form "gut ##ach ##t". Intuitively, we could assign the BPE form the same POS tag as "gutacht" (i.e., NN NN NN). However, this increases the number of repeated tokens in the sentences generated by NAG models and can even worsen performance. The likely reason is that this naive method cannot explicitly distinguish whether a POS tag is associated with a BPE token or a complete word. In contrast, our method tags the BPE form as NN1 NN2 NN3. As a result, the conversion dictionary becomes sparser while the mapping between POS tags and the corresponding words improves. In addition, the word "question" is tagged as NN, since it is not split into subword tokens by BPE.

Table 1: An example of the subword-level POS tagging method, where "WP" denotes the POS sequence generated by NLTK and "SWP" is the subword-level version of "WP". "##" denotes the subword token marker.
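The subword-level tagging scheme could be implemented as follows. The function name is our own; the `##` continuation-marker convention follows the Table 1 example:

```python
def subword_pos_tags(bpe_tokens, word_tags):
    """Expand word-level POS tags to BPE tokens: the k-th piece of a
    split word tagged T gets T{k} (e.g. NN1 NN2 NN3); unsplit words
    keep the plain tag (e.g. NN)."""
    # Group subword pieces back into their source words.
    groups = []
    for tok in bpe_tokens:
        if tok.startswith("##") and groups:
            groups[-1].append(tok)
        else:
            groups.append([tok])
    assert len(groups) == len(word_tags), "one tag per source word"
    tags = []
    for pieces, tag in zip(groups, word_tags):
        if len(pieces) == 1:
            tags.append(tag)  # whole word: plain tag
        else:
            tags.extend(f"{tag}{k}" for k in range(1, len(pieces) + 1))
    return tags

print(subword_pos_tags(["gut", "##ach", "##t", "question"], ["NN", "NN"]))
# → ['NN1', 'NN2', 'NN3', 'NN']
```

Note how the numbered variants make the conversion dictionary sparser: NN1 maps only to word-initial pieces, so a complete word can never be selected where a continuation piece is required.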

Experiments
In this section, we use multiple text generation datasets to comprehensively evaluate the effectiveness and efficiency of the proposed POSPD. For an extensive comparison, we compare POSPD with sequence-level knowledge distillation, and provide detailed analyses of how each alleviates the multimodality problem and of the time cost of dataset building.

Datasets
We conduct experiments on four widely-used benchmark datasets to evaluate POSPD: XSUM for text summarization, the ROCStories corpus for story ending generation, SQuAD 1.1 for question generation, and WMT14 (DE-EN) for machine translation.

Baselines and Comparison
In this work, we focus on using iteration-based NAG models as backbones, because they are among the mainstream NAG structures in current work and perform competitively with AG models without any external system (Kasai et al., 2020b). Specifically, we use two representative iteration-based NAG models from recent work, i.e., CMLM (Ghazvininejad et al., 2019) and DisCo (Kasai et al., 2020a). The details are as follows:

CMLM The conditional masked language model randomly masks some target tokens and predicts them from the remaining ones. In inference, it masks the tokens with lower "confidence" and retains the tokens with higher "confidence" across iterations, which is called mask-predict inference. Following Ghazvininejad et al. (2019), we use the same settings for all generation tasks (https://github.com/facebookresearch/Mask-Predict).

DisCo The disentangled context transformer aims to use different context information when predicting each token, and is regarded as an effective improvement over CMLM. For a fair comparison, we use the same mask-predict inference as CMLM, with the model settings described in Kasai et al. (2020a) for all generation tasks (https://github.com/facebookresearch/DisCo).

Knowledge Distillation Following Gu et al. (2018), which uses a standard Transformer (Vaswani et al., 2017) as the teacher model to regenerate the training set with greedy decoding for NAG models (hereinafter "Transformer-1 (6-6)"), we report the NAG models' performance on all text generation tasks when using the distilled training dataset. In the following discussion, "Transformer-1" and "Transformer-4" denote beam sizes of 1 and 4 in beam search, respectively. We also report results for different Transformer structures, where "(6-6)" and "(12-1)" denote the version with six encoder and six decoder layers and the version with twelve encoder layers and one decoder layer, respectively.

Experimental Settings
We follow the hyperparameters of the standard Transformer (Vaswani et al., 2017) for our POS predictor, except that the numbers of encoder and decoder layers are set to 12 and 1, respectively, for a fair comparison with AG models. All models are implemented in Fairseq (Ott et al., 2019), and we follow the other parameter settings for both AG and NAG models in Kasai et al. (2020b). In inference, the length beam, length penalty, and batch size are all set to 1 to compute the main results (without any postprocessing) and latency. Latency is measured with Fairseq's built-in time statistics on a single NVIDIA Tesla P100 GPU, in line with previous work (Gu et al., 2018). The beam size of our POS predictor is set to 5. For the number of iterations, we report the iterations at which the NAG model results converge. In practice, the iterations of the two NAG models are 4, 3, 3, and 10 on XSUM, SQuAD 1.1, ROCStories, and WMT14 (DE-EN), respectively.

Main Results
We evaluate the performance of two NAG models (CMLM and DisCo) on four text generation datasets, and further provide results when using sequence-level knowledge distillation (i.e., "+Distill") and POSPD (i.e., "+POSPD"), respectively. We report the main results in Table 2 and the inference time comparison in Table 3, from which we draw the following conclusions: 1. POSPD consistently improves NAG models on the four text generation datasets, and to a greater extent than knowledge distillation. POSPD consistently improves NAG models on all four tasks, while knowledge distillation may even degrade the NAG models' performance, e.g., on XSUM (row 5 vs. row 6) and SQuAD 1.1 (row 8 vs. row 9). More importantly, although knowledge distillation improves NAG models by 1.04/1.56 BLEU-4 (row 5 vs. row 6, row 8 vs. row 9) on WMT14 (DE-EN), POSPD still beats the knowledge distillation version by 0.24/0.19 BLEU-4 (row 6 vs. row 7, row 9 vs. row 10). 2. Knowledge distillation does not always improve the NAG model, as the AG model may perform worse than NAG. On both text summarization (XSUM) and story ending generation (ROCStories), the two original NAG models, CMLM and DisCo, outperform the AG model. The adoption of sequence-level knowledge distillation clearly limits the performance of NAG models in these cases. More interestingly, on question generation, the AG model outperforms the NAG models by 0.4/0.5 BLEU-4 (row 3 vs. row 5/row 8), yet knowledge distillation still degrades the NAG models' performance by 0.46/0.13 BLEU-4 (row 5 vs. row 6, row 8 vs. row 9). 3. POSPD does not add significant extra time when constraining NAG models' generation during decoding. POSPD maintains a high-speed inference advantage across all datasets. For example, on SQuAD 1.1, the inference latency with POSPD remains much lower than that of the AG baseline (1.00× vs. 0.62×/0.66×).
Meanwhile, on WMT14 (DE-EN), which has the longest average target sentence length, POSPD still maintains its inference speed advantage. POSPD can therefore constrain the NAG model with negligible extra time, since POSPD and the NAG model predict their sequences (i.e., the POS sequence and the target sentence) in parallel.

Further Discussions
The discussion of our POSPD solution leaves a few loose ends. In this section, we conduct further discussions to shed light on other interesting properties of POSPD, guided by three research questions:
Q1: How does POSPD alleviate the multimodality problem?
Q2: Is it time-consuming to build the POS dataset for a new task?
Q3: Does the multi-task learning objective help POS tag prediction?

Discussion on Generated Results (Q1)
To further analyze the roles of POSPD and sequence-level knowledge distillation in alleviating the multimodality problem, we conduct further statistical analyses on the generated results for the four datasets. Since the multimodality problem usually manifests as repeated or missing tokens in generated sentences, we use two indicators, the repetition rate and the total number of tokens, to quantify them separately. Concretely, following the "single-token repeat" metric (Welleck et al., 2020), we define the repetition rate as the percentage of repeats between two adjacent tokens over the total number of tokens in a sentence, averaged over the dataset. The results are shown in Table 4: both knowledge distillation and POSPD reduce the repetition rate of NAG models on all four datasets, and they are more effective on XSUM, which has longer sentences. In terms of token numbers, knowledge distillation significantly reduces the number of tokens generated by NAG models on XSUM. In contrast, POSPD brings the length of the sentences generated by NAG models remarkably close to the reference without increasing the repetition rate.

Table 4: Statistical analysis of NAG models' generations. "Reference" denotes the target sentence's reference. "Repetition" and "Tokens" represent the repetition rate and the token-number gap between the reference and model outputs, respectively.
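The repetition-rate indicator can be computed per sentence as below; this is a sketch of our reading of the adjacent-repeat definition, with illustrative function names:

```python
def repetition_rate(tokens):
    """Share of adjacent-token repeats over the total token count."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return repeats / len(tokens)

print(repetition_rate(["Thank", "Thank", "Thanks", "."]))  # 0.25

def corpus_repetition_rate(sentences):
    """Average the per-sentence rate over a dataset."""
    return sum(map(repetition_rate, sentences)) / len(sentences)
```

In the "Thank Thank Thanks ." example, one of four tokens repeats its predecessor, giving a rate of 0.25; averaging this over all generated sentences yields the dataset-level figure reported in Table 4.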

Time Cost in Building Datasets (Q2)
Since both POSPD and knowledge distillation require processing the training dataset for a new task/dataset (building the POS dataset for POSPD, regenerating the training set for knowledge distillation), we further analyze the time consumption of the two processing steps. As shown in Table 5, POSPD has a significant advantage over knowledge distillation in the time cost of dataset building. The savings are especially large on the bigger WMT14 (DE-EN) dataset, which is beneficial for rapid deployment on new tasks.

Table 6: The ablation study on using the multi-task learning strategy in POSPD's training stage. "MT w/o" and "MT w/" denote training the POS predictor in POSPD without/with multi-task learning, respectively.

Multi-task Learning Strategy (Q3)
In this part, we analyze the impact of the multi-task learning strategy in POSPD's training stage. Due to space limitations, we conduct the ablation study on two datasets of different sizes, SQuAD 1.1 and XSUM. The results are shown in Table 6. Interestingly, predicting the POS sequence directly from the source sentence alone (i.e., "MT w/o") also improves the performance of the NAG models. More importantly, the multi-task learning strategy improves the performance of POSPD on both datasets with only a tiny increase in model parameters (a single linear layer). Moreover, it is used only during POSPD's training stage and does not affect POSPD's inference time.

Conclusion
In this paper, we revisit the role of knowledge distillation in alleviating the multimodality problem of NAG. In brief, we experimentally show that the basic assumption of knowledge distillation methods, that the AG model is superior to the NAG model, does not hold for all text generation tasks. To alleviate the multimodality problem, we present a different solution that incorporates linguistic structure into NAG. Extensive experiments demonstrate that our POSPD significantly and consistently improves NAG models in both effectiveness and computational efficiency. As we have only given an initial, successful implementation that leverages one of the simplest linguistic structures to benefit NAG models at inference, this paradigm deserves closer and more detailed exploration. In future work, we will investigate how NAG models can enjoy the benefits of more diverse and abundant linguistic structures. In addition, our experimental results suggest that future work should consider a wider range of generation tasks, instead of only machine translation, when assessing the performance of NAG models.