Thinking Clearly, Talking Fast: Concept-Guided Non-Autoregressive Generation for Open-Domain Dialogue Systems

Human dialogue contains evolving concepts, and speakers naturally associate multiple concepts to compose a response. However, current dialogue models with the seq2seq framework lack the ability to effectively manage concept transitions and can hardly introduce multiple concepts to responses in a sequential decoding manner. To facilitate a controllable and coherent dialogue, in this work, we devise a concept-guided non-autoregressive model (CG-nAR) for open-domain dialogue generation. The proposed model comprises a multi-concept planning module that learns to identify multiple associated concepts from a concept graph and a customized Insertion Transformer that performs concept-guided non-autoregressive generation to complete a response. The experimental results on two public datasets show that CG-nAR can produce diverse and coherent responses, outperforming state-of-the-art baselines in both automatic and human evaluations with substantially faster inference speed.


Introduction
Creating a "human-like" dialogue system is one of the important goals of artificial intelligence. Recently, due to the rapid advancements in natural language generation (NLG) techniques, data-driven approaches have attracted much research interest and have achieved impressive progress in producing fluent dialogue responses (Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016; Li et al., 2016). However, such seq2seq models tend to degenerate into generic or off-topic responses (Tang et al., 2019; Welleck et al., 2020). An effective way to address this issue is to leverage external knowledge (Zhou et al., 2018a,b) or topic information (Xing et al., 2017), which are integrated as additional semantic representations to improve dialogue informativeness.
Although promising results have been obtained by equipping dialogue models with external knowledge, the development of dialogue discourse still has its own challenge: human dialogue generally evolves around a number of concepts that might frequently shift in a dialogue flow. The lack of concept management strategies might lead to incoherent dialogue due to the loosely connected concepts. To address this problem, recent studies have combined concept planning with response generation to form a more coherent and controllable dialogue (Xu et al., 2020a,b).

[Figure 1: Example dialogue. "I like going shopping and watching tv. What are your hobbies?" / "Same. I like to sit on my couch and watch anime." / "That's cool. I'm a big fan of japanese anime. I really like hearing its music."]
Most of these approaches incorporate concepts into responses in an implicit manner, which cannot guarantee the appearance of a concept in a response. Compared with dialogue concepts, a large proportion of chit-chat words are common, have a high word frequency, and are relatively over-optimized in language models (Gong et al., 2018; Khassanov et al., 2019). Consequently, conventional seq2seq generators are more "familiar" with these generic words than with those requiring concept management, which prevents introducing certain concepts into the response with sequential decoding (either greedily or with beam search) (Mou et al., 2016). Moreover, speakers naturally associate multiple concepts to proactively convey diverse information, e.g., action, entity, and emotion (see Figure 1). Unfortunately, most existing methods can only retrieve one concept for each utterance (Tang et al., 2019; Qin et al., 2020). Another line of approaches attempts to explicitly integrate concepts into responses and generate the remaining words in both directions (Mou et al., 2016; Xu et al., 2020a), but these approaches also fail to deal with multiple concepts.
In this paper, we devise a concept-guided non-autoregressive model (CG-nAR) to facilitate dialogue coherence by explicitly introducing multiple concepts into dialogue responses. Specifically, following Xu et al. (2020a), a concept graph is constructed based on the dialogue data, where the vertices represent concepts, and edges represent concept transitions between utterances. Based on the concept graph, we introduce a novel multi-concept planning module that learns to manage concept transitions in a dialogue flow. It recurrently reads historical concepts and dialogue context to attentively select multiple concepts in the proper order, which reflects the transition and arrangement of target concepts. Then, we customize an Insertion Transformer (Stern et al., 2019) by initializing the selected concepts as a partial response for subsequent non-autoregressive generation. The remaining words of a response are generated in parallel, aiming to foster a fast and controllable decoding process.
We conducted experiments on Persona-Chat (Zhang et al., 2018) and Weibo (Shang et al., 2015). The results of automatic and human evaluations show that CG-nAR achieves better performance in terms of response diversity and dialogue coherence. We also show that inference with our model is much faster than with conventional seq2seq models. All our code and datasets are publicly available (see footnote 1). Our contributions are three-fold: 1) We design a concept-guided non-autoregressive strategy that can successfully integrate multiple concepts into responses for a controllable decoding process. 2) The proposed multi-concept planning module effectively manages multi-concept transitions and remedies the problem of dialogue incoherence. 3) Comprehensive studies on two datasets show the effectiveness of our method in terms of response quality and decoding efficiency.

Open-Domain Dialogue Generation
Neural seq2seq models (Sutskever et al., 2014) have achieved remarkable success in dialogue systems (Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016; Xing et al., 2017), but they tend to produce generic and off-topic responses (Tang et al., 2019; Welleck et al., 2020). Dozens of works have attempted to incorporate external knowledge into dialogue systems to improve informativeness and diversity (Zhou et al., 2018a; Zhang et al., 2018; Dinan et al., 2019; Ren et al., 2020). Beyond the progress on response quality, a couple of works focus on goal planning or concept transition for a controllable and coherent dialogue (Yao et al., 2018; Moon et al., 2019; Xu et al., 2020a,b). Most of these works mainly explore how to effectively leverage external knowledge graphs and extract concepts from them. Nevertheless, they generally introduce concepts into the response implicitly with gated controlling or a copy mechanism, which cannot ensure the success of concept integration because seq2seq models prefer generic words. Some works (Mou et al., 2016; Xu et al., 2020a) try to produce concept words first and generate the remaining words in both directions to complete a response, but they cannot handle multiple concepts. By contrast, we focus on how to effectively integrate multiple extracted concepts into dialogue responses. The proposed CG-nAR applies the non-autoregressive mechanism, which can explicitly introduce multiple concepts simultaneously into responses to enhance coherence and diversity.

Footnote 1: https://github.com/RowitZou/CG-nAR

Non-Autoregressive Generation
Compared with traditional sequential generators that condition each output word on previously generated outputs, non-autoregressive (non-AR) generation removes this dependency to speed up decoding and has recently attracted much attention (Gu et al., 2018, 2019; Ma et al., 2019; Stern et al., 2019). Another relevant line of research is refinement-based generation (Lee et al., 2018; Kasai et al., 2020; Hua and Wang, 2020; Tan et al., 2021), which gradually improves generation quality by iterative refinement of a draft instead of one-pass generation. For dialogue systems, there have been prior works that attempt to improve on traditional autoregressive generation. Mou et al. (2016) explore generating words in both directions, but still in an autoregressive manner. Song et al. (2020) [...] consistency of dialogue generation, but it requires a specialized consistency matching model for inference. Han et al. (2020) apply the non-AR mechanism to dialogue generation, aiming to alleviate the non-globally-optimal issue and produce more diverse responses. In this work, we further use dialogue concepts to guide response generation. We customize an Insertion Transformer (Stern et al., 2019) and arrange dialogue concepts as a partial input sequence, which differs from the original setting where texts are generated from scratch. In this way, multiple concepts can be naturally introduced into the response as guidance to foster a more controllable non-AR generation.

[Figure 2: Overall framework of CG-nAR. (a) Multi-Concept Planning. (b) Concept-Guided Non-AR Generation, shown for Step 0 through Step 3.]

Methodology
The overall framework of CG-nAR is shown in Figure 2. Based on a concept graph that represents candidate concept transitions, a multi-concept planning module selects and arranges appropriate target concepts from the contextually related subgraphs, conditioned on the previous concept flow and the dialogue context. Then, we input the selected concepts as a partial response into an Insertion Transformer (Stern et al., 2019), which generates the remaining words in parallel.

Concept Graph Construction
Inspired by Xu et al. (2020a), we build a concept graph in two steps: vertex construction (see footnote 2) and edge construction. Given a dialogue corpus $S$, we exploit a rule-based keyword extraction method to identify salient keywords from utterances in $S$ (Tang et al., 2019). All extracted keywords are collected as dialogue concepts that represent vertices in the concept graph. For edge construction, we use pointwise mutual information (PMI) (Church and Hanks, 1989) to construct a concept pairwise matrix that characterizes the association between concepts in the observed dialogue data (Mou et al., 2016; Tang et al., 2019), where each concept pair consists of two concepts that are extracted from the context and the response, respectively. For each head vertex $v_h$, we select the concepts with top PMI scores as tail vertices $v_t$ and build edges by connecting $v_h$ with each $v_t$. In this way, we filter out low-frequency edges to narrow the search space for downstream concept planning.

Footnote 2: The original constructed vertices in Xu et al. (2020a) involve what-vertices and how-vertices, where how-vertices represent different ways of expressing response content with a multi-mapping model (Chen et al., 2019). Here, we only collect what-vertices as dialogue concepts.
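As a rough illustration of the edge-construction step, the sketch below computes PMI over (context concept, response concept) pairs and keeps the top-scoring tails for each head. The function name `build_edges`, the `top_k` parameter, and the toy pairs are illustrative assumptions and are not taken from the released code.

```python
import math
from collections import Counter, defaultdict

def build_edges(pairs, top_k=2):
    """pairs: list of (head_concept, tail_concept) co-occurrences, where the
    head comes from the context and the tail from the response.
    Returns {head: [tails with the highest PMI]}."""
    pair_counts = Counter(pairs)
    head_counts = Counter(h for h, _ in pairs)
    tail_counts = Counter(t for _, t in pairs)
    total = len(pairs)

    scored = defaultdict(list)
    for (h, t), n_ht in pair_counts.items():
        # PMI(h, t) = log( p(h, t) / (p(h) * p(t)) )
        p_ht = n_ht / total
        p_h = head_counts[h] / total
        p_t = tail_counts[t] / total
        scored[h].append((math.log(p_ht / (p_h * p_t)), t))

    # Keep only the top_k tails per head; ties are broken alphabetically.
    return {h: [t for _, t in sorted(c, key=lambda x: (-x[0], x[1]))[:top_k]]
            for h, c in scored.items()}

# Hypothetical toy co-occurrence pairs.
pairs = [("anime", "music"), ("anime", "music"), ("anime", "tv"),
         ("garden", "vegetable"), ("garden", "tv")]
edges = build_edges(pairs, top_k=1)
```

With `top_k=1`, each head keeps only its single strongest association, which mirrors how restricting tails to top PMI scores narrows the search space for planning.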

Multi-Concept Planning Module
Given the dialogue context $D$, the historical concept flow $F$, and a concept graph $G$, the goal of multi-concept planning is to predict a sequence of target concepts $C$, namely to model $P(C \mid D, F, G)$. All target concepts are extracted from $G$ and arranged in a sequence $C = \{c_1, c_2, \ldots, c_t\}$, which reflects the order of target concepts in the final response.
Hierarchical Dialogue Encoder. To facilitate the understanding of the dialogue context $D$, we employ Transformer blocks (Vaswani et al., 2017) to hierarchically encode the dialogue context, aiming to capture the global semantic dependency between utterances. Formally, given the dialogue context $D = \{u_1, u_2, \ldots, u_N\}$, where $u_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$ is the word sequence of the $i$-th utterance, we transform $u_i$ into a sequence of hidden vectors with a Transformer encoder:

$$[\tilde{h}^{cls}_i, \tilde{h}^{w}_{i1}, \ldots, \tilde{h}^{w}_{in}] = \mathrm{TF}_{w}([e^{cls}_i, e^{w}_{i1}, \ldots, e^{w}_{in}]) \qquad (1)$$

Here, $e^{w}_{ij}$ is the embedding of the $j$-th word in $u_i$. $\tilde{h}^{cls}_i$ and $e^{cls}_i$ represent a special token [CLS] that is used to aggregate sequence representations, which is inspired by Devlin et al. (2019). Then, we collect the utterance representations derived from [CLS] and input them into another Transformer encoder to hierarchically fuse context information:

$$H^{cls} = [h^{cls}_1, \ldots, h^{cls}_N] = \mathrm{TF}_{u}([\tilde{h}^{cls}_1, \ldots, \tilde{h}^{cls}_N]) \qquad (2)$$

Here, $h^{cls}_i$ is a context-aware utterance representation that can be used to guide concept selection in the following steps.
Concept Flow Encoder. Formally, a concept flow $F = \{f_1, f_2, \ldots, f_N\}$ represents the observed concepts in the dialogue context, where $f_i$ is the concept set of the $i$-th utterance, which collects all the concept words in $u_i$, namely $f_i = \{c_{i1}, c_{i2}, \ldots\}$. For an empty set $f_i = \emptyset$, a special NULL token serves as the concept word.
To capture information about historical concept transitions, we exploit a vanilla GRU unit (Cho et al., 2014) to recursively read concept words in the flow:

$$s_i = \mathrm{GRU}(s_{i-1}, \mathbf{f}_i)$$

Here $\mathbf{f}_i$ denotes the representation of the concept set $f_i$, which is calculated as a weighted sum of concept word embeddings $e^{c}_{ij}$:

$$\mathbf{f}_i = \sum_{j} \beta^{f}_{ij}\, e^{c}_{ij}$$

where the attention scores $\beta^{f}_{ij}$ are computed with trainable parameters $W_f$. The output state $s_{i-1}$ at step $i-1$ is used as the query to compute the $\beta^{f}_{ij}$ scores, which measure the preference of transitions to associated concepts. Empirically, $s_0$ is a zero vector that initializes the recurrent process, and the final output $s_N$ can serve as a memory to enable history-aware concept planning.
Multi-Concept Extractor. Recall that our goal is to produce a concept sequence $C$, which is a subsequence of the target response. Inspired by pointer networks, we design a multi-concept extractor to achieve this goal, which attentively reads the dialogue context and the concept flow to sequentially extract target concepts from the contextually related subgraphs in $G$.
To implement concept extraction in a sequential decoding manner, we use a Transformer decoder and compute its decoding states as follows:

$$m_t = \mathrm{TFDec}(H^{cls}, [e^{c}_{1:t-1}])$$

The utterance representations $H^{cls}$ serve as memories for decoder-encoder attention, $[e^{c}_{1:t-1}]$ denotes the embeddings of previously decoded concepts, and $m_t$ is the output state at step $t$, conditioned on the dialogue context and the partially decoded outputs.
Given the decoder state $m_t$ and the concept flow memory $s_N$, the next step is to select target concepts from $G$. We first retrieve a group of subgraphs corresponding to the concept set $f_N$ of the last utterance $u_N$ to prepare for the next round of concept transition. Here, each subgraph $g_j$ consists of a hit concept $c_{Nj} \in f_N$ and its concept neighbours, $g_j = \{(c^{head}_j, c^{tail}_{jk})\}_{k=1}^{N_{g_j}}$, where $c^{head}_j$ and $c^{tail}_{jk}$ represent the head concept vertex and a tail concept vertex, respectively, and $N_{g_j}$ is the number of vertex pairs in $g_j$. Then, we employ a dynamic graph attention mechanism to calculate the subgraph vectors $\mathbf{g}_j$ at each decoding step $t$, fusing the information of all concept neighbours through attention weights $\alpha^{g}_{jk}$ over the embeddings $e^{head}_j$ and $e^{tail}_{jk}$ of the head and tail concepts in $g_j$. Here, $\alpha^{g}_{jk}$ is the probability of choosing $c^{tail}_{jk}$ from all concept neighbours in $g_j$ at step $t$, conditioned on the dialogue context and the concept flow. We then compute the probability of choosing $g_j$ at step $t$ as a top-level concept selection, denoted as $\alpha^{t}_{j}$, using trainable parameters $W^{t}_{q}$ and $W^{t}_{k}$. Finally, the selection probability of target concepts at step $t$ can be derived as:

$$P(c_t = c^{tail}_{jk} \mid c_{1:t-1}, D, F, G) = \alpha^{t}_{j} \cdot \alpha^{g}_{jk}$$

The multi-concept extractor has two stop conditions: 1) We add a special token $c_{stop}$ to the concept neighbour set of $g_j$. The extractor treats $c_{stop}$ as a legal candidate target, and selecting $c_{stop}$ results in a stop action. 2) The number of target concepts exceeds $N_{max}$. Furthermore, for all the concepts extracted at any step $k < t$, we set their probabilities to 0 to avoid duplicate extraction.
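The two-level selection and the stop conditions can be sketched as follows. This is a minimal illustration under stated assumptions: the probabilities are fixed toy numbers rather than attention outputs recomputed at each step, the token name `<stop>` and the function `extract_concepts` are hypothetical, and summing the probability of a concept that appears in several subgraphs is an assumption of this sketch.

```python
STOP = "<stop>"  # hypothetical name for the special stop concept

def extract_concepts(subgraph_probs, neighbour_probs, n_max=5):
    """Greedy multi-concept extraction over fixed toy probabilities.

    subgraph_probs:  {head: probability of choosing that head's subgraph}
    neighbour_probs: {head: {tail: probability of the tail (or STOP)
                      within that subgraph}}
    In the model both distributions are recomputed at every step; they are
    held fixed here only to keep the sketch short."""
    chosen = []
    while len(chosen) < n_max:
        # Score of a concept = subgraph probability * neighbour probability,
        # summed over subgraphs that contain it (assumption of this sketch).
        scores = {}
        for head, a_t in subgraph_probs.items():
            for tail, a_g in neighbour_probs[head].items():
                scores[tail] = scores.get(tail, 0.0) + a_t * a_g
        # Already extracted concepts get probability 0 to avoid duplicates.
        for c in chosen:
            scores[c] = 0.0
        best = max(scores, key=scores.get)
        if best == STOP:
            break
        chosen.append(best)
    return chosen

# Hypothetical toy distributions.
subgraph_probs = {"anime": 0.7, "tv": 0.3}
neighbour_probs = {
    "anime": {"music": 0.5, "japanese": 0.4, STOP: 0.1},
    "tv": {"movie": 0.8, STOP: 0.2},
}
concepts = extract_concepts(subgraph_probs, neighbour_probs)
```

Here extraction ends once the stop token outscores every remaining concept, and `n_max` caps the sequence length even if the stop token is never selected.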

Concept-Guided Insertion Transformer
After obtaining the target concept sequence $C$, the next step is to generate a response that covers $C$. General autoregressive approaches cannot ensure the success of introducing certain contents because they prefer to generate generic words (Mou et al., 2016). Given a substantially larger language model, the problem might be alleviated but still cannot be completely solved. To address the issue, we use an Insertion Transformer (Stern et al., 2019) to generate a response based on $C$, which ensures the appearance of target concepts. In addition, the explicitly planned concepts can be regarded as a prompt or a signal to guide the generation process. Generation is accomplished by repeatedly making insertions into a sequence initialized by $C$ until a termination condition is met. At each decoding step $t$, the Insertion Transformer produces a joint distribution over the choice of words $w_t$ and all available insertion locations $l_t \in [0, |\hat{y}_{t-1}|]$ in the previously decoded response $\hat{y}_{t-1}$:

$$p(w_t, l_t \mid D, \hat{y}_{t-1}) = \mathrm{InsTF}(H^{cls}, \hat{E}_{t-1})$$

where $\hat{E}_t$ is the word embedding list of $\hat{y}_t$. Notably, $\hat{y}_t$ has multiple available insertion locations, and we can perform parallel decoding by applying insertions at multiple locations simultaneously. For more details of the Insertion Transformer, please refer to Stern et al. (2019).
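To make the parallel insertion process concrete, the toy sketch below starts from the planned concepts and inserts one word into every unfinished slot at each step. It is only an illustration: an oracle that already knows the target response stands in for the trained Insertion Transformer, and the function name and the choice of always inserting the centre word of a span are assumptions of this sketch.

```python
def parallel_insertion_decode(concepts, target):
    """Toy oracle decoder: start from the planned concepts and, at every
    step, insert one word into each slot that still misses words.
    An oracle that knows `target` stands in for the trained model."""
    # Positions in `target` of the words produced so far (concepts first).
    positions, start = [], 0
    for c in concepts:
        idx = target.index(c, start)
        positions.append(idx)
        start = idx + 1

    steps = 0
    while len(positions) < len(target):
        bounds = [-1] + positions + [len(target)]
        new_positions = []
        for left, right in zip(bounds, bounds[1:]):
            if right - left > 1:
                # Insert the centre word of the span missing in this slot.
                new_positions.append((left + right) // 2)
        positions = sorted(positions + new_positions)
        steps += 1
    return [target[i] for i in positions], steps

target = "i like to sit on my couch and watch anime".split()
response, steps = parallel_insertion_decode(["couch", "anime"], target)
```

In this example the ten-word response is completed in three parallel steps, whereas left-to-right decoding would need one step per word, and the two concept words are present from the first step onward.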

Training and Loss Functions
Given the list of ground-truth concepts $C$ in the target response $y$, the concept extractor is trained as a usual sequence generation model to minimize the negative log-likelihood (NLL) loss:

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{t} \log p(c_t \mid c_{1:t-1}, D, F, G) \qquad (10)$$

To train the Insertion Transformer, we first sample a subsequence $\hat{y}$ containing all the target concepts from the target response $y$. Then, for each of the $k + 1$ locations $l = 0, 1, \ldots, k$ in $\hat{y}$, let $(w_{i_l}, w_{i_l+1}, \ldots, w_{j_l})$ be the span of words from the target response yet to be produced at location $l$. The loss function is finally defined as follows:

$$\mathcal{L}_{\mathrm{ins}} = \frac{1}{k+1} \sum_{l=0}^{k} \sum_{i=i_l}^{j_l} -\log p(w_i, l \mid D, \hat{y}) \cdot w_l(i) \qquad (11)$$

Here $w_l(i)$ is a softmax weighting policy (Stern et al., 2019) that performs a weighted sum of the negative log-likelihoods of the words in the span. It encourages the generator to produce the central words of the span for a faster decoding process.
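The softmax weighting policy can be illustrated with a short sketch. It follows the centre-favouring weighting described by Stern et al. (2019), where each position in a span is weighted by a softmax over its negative distance from the span centre; the temperature name `tau` and the function names are assumptions of this sketch.

```python
import math

def slot_weights(span_len, tau=1.0):
    """Softmax weights over a span of `span_len` missing words, favouring
    the centre: the weight of position i is proportional to
    exp(-distance_from_centre(i) / tau)."""
    centre = (span_len - 1) / 2
    logits = [-abs(centre - i) / tau for i in range(span_len)]
    z = sum(math.exp(x) for x in logits)
    return [math.exp(x) / z for x in logits]

def slot_loss(log_probs, tau=1.0):
    """Weighted negative log-likelihood of the words in one slot."""
    weights = slot_weights(len(log_probs), tau)
    return sum(-lp * w for lp, w in zip(log_probs, weights))

weights = slot_weights(5)
```

A smaller `tau` concentrates the weight on the centre word, which pushes the generator toward a balanced, tree-like insertion order and therefore fewer decoding steps.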

Datasets
Experiments are conducted on two public open-domain dialogue datasets, Persona-Chat (Zhang et al., 2018) and Weibo (Shang et al., 2015). For Persona-Chat, the associated persona information is discarded so that the model can focus on the development of dialogues. Following previous works (Tang et al., 2019; Xu et al., 2020a), we employ a rule-based method to automatically extract the concept words of each utterance, which combines tf-idf and POS features for scoring word salience. After dataset cleaning, we re-split the Persona-Chat dataset into train/valid/test sets as done in Tang [...] random. After constructing the graph of Persona-Chat, we randomly sample 100 concept vertices and 200 edges and ask three human annotators to evaluate their appropriateness. About 93% of the vertices and 72% of the edges are accepted by the annotators. For the graph of Weibo, we use the graph released by Xu et al. (2020a). Statistics of the two dialogue datasets along with the constructed graphs are shown in Table 1.

Comparison Methods
We compare CG-nAR with two groups of baselines: general seq2seq models and concept-guided systems. General seq2seq models produce responses conditioned on the dialogue messages without concept planning, including: Seq2seq+Att (Sutskever et al., 2014), a standard RNN model with an attention mechanism; Transformer (Vaswani et al., 2017), a seq2seq model with a multi-head attention mechanism; HRED (Serban et al., 2016), a hierarchical encoder-decoder framework to model context utterances; ReCoSa, a state-of-the-art model using the self-attention mechanism to measure the relevance of response and context. Concept-guided dialogue systems leverage concept information to control response generation, including: Seq2BF (Mou et al., 2016), a non-left-to-right generation model that explicitly incorporates a keyword into the response; CCM (Zhou et al., 2018a), a model that uses the graph attention mechanism to choose graph entities and introduces them into the response implicitly by a copy mechanism; ConceptFlow, a state-of-the-art model that grounds each dialogue in the concept graph and traverses to distant concepts, which also generates concept words implicitly in an autoregressive manner; CG-nAR (our model), a model that explicitly introduces multiple concepts into responses with non-autoregressive generation.

Implementation Details
We used VGAE (Kipf and Welling, 2016) to initialize the representation of concept vertices in the concept graph, and used Word2Vec (Mikolov et al., 2013) to initialize word embeddings. The embedding size of vertices and words was set to 128 and 300, respectively. We employed Adam (Kingma and Ba, 2015) with learning rate 1e-3 to train the concept extractor and the Insertion Transformer. All Transformer blocks have 3 layers, 768 hidden units, and 8 heads, and the hidden size for all feed-forward layers is 2,048. The hidden size of GRU cells is 768. At inference time, the multi-concept extractor produces concepts greedily, and the maximum number of allowed concepts $N_{max}$ was set to 5. For the Insertion Transformer, we used the configuration that achieved the best results reported in Stern et al. (2019). The whole model was trained for 100,000 steps with 8,000 warm-up steps on a 3090 GPU. Checkpoints were saved and evaluated on the validation set every 2,000 steps. Checkpoints with the top performance were finally evaluated on the test set to report final results.

Automatic Evaluation
We adopt the widely used BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) metrics to measure the relevance between the generation and the ground truth. We report averaged BLEU scores with 4-grams at most and ROUGE-1/L (RG-1/L) F-scores. To measure the diversity of generated responses, we report the ratio of distinct uni/bi-grams (Dist-1/2) in all generated responses. Table 2 shows the results of automatic evaluation for CG-nAR and baseline methods. All methods can be categorized into two groups: traditional seq2seq-based generators and concept-grounded methods. CG-nAR outperforms all other baselines significantly on BLEU and ROUGE scores (using the Wilcoxon signed-rank test, p < 0.05), which indicates that the responses generated by CG-nAR match the ground-truth responses better. This means CG-nAR can keep the dialogue flow on-topic through the multi-concept planning mechanism. In terms of Dist-1/2, which measures response diversity, all methods with concept planning produce more diverse responses than those without, which indicates that the problem of generic responses is alleviated by integrating concept information. Compared to the baselines with concept planning, CG-nAR performs better on response diversity. This verifies the effectiveness of our multi-concept planning module and the concept-guided non-autoregressive strategy, which can produce and combine multiple context-related concepts to compose diverse responses and keep concept words in the output response in an explicit manner.

Table 3: Results of manual evaluation with appropriateness (App.) and informativeness (Inf.). The score is the percentage of times CG-nAR is chosen as the better in pairwise comparison with its competitor. Results marked with † are significant (using sign test, p < 0.05).
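For reference, the Dist-1/2 metric described above can be computed as in the following sketch. Whitespace tokenization and the function name `distinct_n` are assumptions; the toy responses are illustrative only.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams over all generated responses."""
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()  # whitespace tokenization (assumption)
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Hypothetical generated responses.
responses = ["i like anime", "i like music too"]
dist1 = distinct_n(responses, 1)  # 5 distinct unigrams out of 7
dist2 = distinct_n(responses, 2)  # 4 distinct bigrams out of 5
```

Because the counts are pooled over the whole set of responses, a model that repeats the same generic phrases across dialogues receives a low score even if each individual response is fluent.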

Manual Evaluation
Considering that automatic metrics may not suitably reflect the content to be evaluated, we further performed manual evaluation following previous works (Zhou et al., 2018a). Specifically, we randomly sampled 200 testing pairs from each test set and employed three annotators with professional background knowledge to evaluate the responses. Given a dialogue message, annotators were required to conduct pair-wise comparison between the response generated by CG-nAR and the one generated by a baseline (1,600 comparisons with four baselines on two datasets in total). For each comparison, annotators decided which response is better in terms of appropriateness (the model's ability to produce a fluent, coherent, and context-relevant response) and informativeness (whether the response provides diverse information). For appropriateness, the percentage of pairs on which at least 2 annotators gave the same judgment (2/3 agreement) is 95.8%, and the percentage for 3/3 agreement is 62.7%. For informativeness, the 2/3 agreement is 89.0% and the 3/3 agreement is 56.2%. We compare CG-nAR against four baselines on Persona-Chat and Weibo (see Table 3). The score represents the percentage of times CG-nAR is chosen as the better in pair-wise comparisons. For appropriateness, CG-nAR significantly outperforms all other baselines on both datasets (using sign test, p < 0.05). This means that CG-nAR can generate more context-relevant and coherent responses accepted by annotators, which validates the effectiveness of our multi-concept planning module. In terms of informativeness, the percentages with which CG-nAR wins against ReCoSa are noticeably higher than those against other baselines. This indicates that systems with a concept planning mechanism can produce more informative responses by introducing content.
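The significance check on pairwise preferences can be sketched with an exact sign test. This is a generic illustration, not the authors' evaluation script: the two-sided form, the discarding of ties, and the win/loss counts below are assumptions chosen for the example.

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test under the null hypothesis that each
    comparison is a fair coin flip; ties are assumed to be discarded."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: the system is preferred in 130 of 200 comparisons.
p_value = sign_test_p(130, 70)
```

A preference rate close to 50% yields a large p-value, so a reported win percentage only counts as significant when it is far enough from an even split for the given number of comparisons.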

Analysis of Multi-Concept Planning
To validate whether the multi-concept planning module can extract context-relevant concepts and form a coherent dialogue, we calculate the precision, recall, and F1 score of predicted concepts against the gold ones in responses (Concept-P/R/F1). We also record the average number of predicted concepts to measure the model's ability to introduce multiple concepts. From Table 4 we can observe that CG-nAR achieves a higher recall and F1 score than all baselines by a large margin, especially compared with ReCoSa and Seq2BF. This shows that our concept planning module can successfully extract more concepts relevant to the dialogue. This is also reflected in the number of predicted concepts, where CG-nAR produces more concept words than the methods with autoregressive generators, e.g., CCM and ConceptFlow. It indicates that the concept-guided generator can effectively keep the concept information in output responses using a non-autoregressive generation mechanism.
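The Concept-P/R/F1 scores can be computed along the lines of the sketch below. Micro-averaging over all responses and the function name `concept_prf` are assumptions of this sketch, and the toy concept lists are illustrative only.

```python
def concept_prf(predicted, gold):
    """Micro-averaged precision, recall and F1 of predicted concepts
    against gold concepts, pooled over all responses."""
    tp = n_pred = n_gold = 0
    for pred, ref in zip(predicted, gold):
        pred, ref = set(pred), set(ref)
        tp += len(pred & ref)
        n_pred += len(pred)
        n_gold += len(ref)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical predictions and references for two responses.
predicted = [["anime", "music"], ["garden"]]
gold = [["anime", "music", "japanese"], ["garden", "vegetable"]]
precision, recall, f1 = concept_prf(predicted, gold)
```

In this toy case every predicted concept is correct but two gold concepts are missed, so precision is high while recall is lower, which is the pattern expected from a model that introduces too few concepts per response.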

Ablation Study
We perform ablation studies to validate the effectiveness of each part of CG-nAR. Table 5 shows the results. One of the variants is a vanilla Insertion Transformer where the concept planning module is removed. The model performance unsurprisingly degrades by a large margin, because the model might produce generic responses without concept planning. After removing the concept flow encoder, the information of historical concept transitions is missing, which also leads to a performance drop. We further replace the hierarchical dialogue encoder with a vanilla Transformer encoder. The resulting performance drop shown in Table 5 indicates that it is necessary to capture context dependency information when performing dialogue modeling.
To probe the effectiveness of the concept-guided non-autoregressive strategy, we replace the Insertion Transformer with a universal Transformer framework equipped with a gated controller, where the generation probabilities are calculated over the word vocabulary and the set of selected concept words. Table 5 shows that with the autoregressive decoding strategy, the performance drop is significant. A possible explanation is that the appearance of some key concepts cannot be guaranteed by such an implicit concept-oriented generator, especially when the generator encounters concepts that are not frequently seen in the training set.

Speed Comparison
Our concept-guided non-autoregressive generation model not only shows superior response quality, but also gives a significant speed-up at test time over the methods equipped with autoregressive generators. The results of the speed comparison are shown in Table 6. For a fair comparison, we choose the baselines with a Transformer encoder-decoder framework, since our customized Insertion Transformer uses the same model components. The main advantage of the insertion-based generator at inference time is that we can predict words at different insertion locations simultaneously. From Table 6 we can see that CG-nAR achieves a substantial test-time speed-up compared to the two autoregressive generators (up to 2.7x in total time and 1.6x in word generation rate), even though CG-nAR has more parameters.

Case Study
To compare different models intuitively, we show two dialogue cases from the Persona-Chat dataset with output responses in Table 7. We observe that CG-nAR can successfully output context-associated concepts, e.g., grow vegetable, which is related to garden, and singer, which is related to country music. Compared to other baselines, CG-nAR produces a response that is more coherent and relevant to the dialogue context, and shows a more natural transition of concepts, which again supports the effectiveness of our concept-guided non-AR strategy for controllable dialogue generation.

Conclusion
In this work, we propose a novel concept-guided non-autoregressive approach for open-domain dialogue generation. It consists of a multi-concept planning module that selects multiple context-relevant concepts to facilitate a coherent dialogue, and a customized Insertion Transformer that produces a response based on the selected concepts to control the generation process. The experimental results show that our method can not only produce high-quality responses, but can also significantly speed up inference.