Syntactically Diverse Adversarial Network for Knowledge-Grounded Conversation Generation

Generative conversation systems tend to produce meaningless and generic responses, which significantly degrade the user experience. To generate informative and diverse responses, recent studies fuse knowledge to improve informativeness and adopt latent variables to enhance diversity. However, latent variables can make the knowledge in the responses inaccurate, and disseminating wrong knowledge misleads the communicators. To address this problem, we propose a Syntactically Diverse Adversarial Network (SDAN) for knowledge-grounded conversation generation. SDAN contains an adversarial hierarchical semantic network that preserves semantic coherence, a knowledge-aware network that attends to the most relevant knowledge to improve informativeness, and a syntactic latent variable network that generates syntactically diverse responses. Additionally, to increase the controllability of syntax, we adopt adversarial learning to decouple semantic and syntactic representations. Experimental results show that our model not only generates syntactically diverse and knowledge-accurate responses but also achieves a balance between improving syntactic diversity and maintaining knowledge accuracy.


Introduction
Conversation generation has become a research hotspot because of its wide applications, such as voice assistants, customer service assistants and chat robots. The goal of a conversation model is to generate diverse and informative responses like a human. Although existing models have achieved promising performance, they still tend to generate generic and meaningless responses (Wu et al., 2020), which significantly disrupt the user experience. Consequently, generating high-quality responses is crucial and urgent.
* Jinan Xu is the corresponding author.
To generate high-quality responses, much research has sought to improve the informativeness or diversity of responses. For informative responses, some early studies incorporate context information into the decoding process (Sordoni et al., 2015; Yao et al., 2015). Later, researchers extract topic information from the context (Hedayatnia et al., 2020) or add external topics to the decoder (Xing et al., 2016, 2017). More recently, researchers focus on fusing knowledge into the conversation model (Ghazvininejad et al., 2018; Zhou et al., 2018; Wu et al., 2020). Although knowledge-grounded models can generate informative responses with accurate knowledge, their responses may lack diversity. For diverse responses, previous studies generally adopt the beam search algorithm (Li et al., 2016b) and its variants (Vijayakumar et al., 2016). In recent years, latent variables have been widely used in conversation models and can significantly enhance diversity (Serban et al., 2017; Zhao et al., 2017; Park et al., 2018; Shen et al., 2019; Ruan et al., 2019), and generative adversarial networks (GAN) and reinforcement learning (RL) (Sankar and Ravi, 2019) are also adopted to generate diverse responses.
Although introducing latent variables can increase diversity while maintaining semantic consistency, it may lead to inaccuracy when decoding specific knowledge, because the latent variables may, with a certain probability, generate responses that are only semantically similar. For example, as shown in Table 1, the query contains a song name (Be Your Girl All Your Life), and the variational latent model generates the response R1. In R1, the song name may be decoded as Be Your Woman in The Next Life, which is the name of another song, so the response is wrong. How to improve the diversity of responses while preserving the accuracy of knowledge is a major challenge in knowledge-grounded conversation generation.
To tackle this challenge, we propose a Syntactically Diverse Adversarial Network (SDAN) for knowledge-grounded conversation generation. First, we use a hierarchical network to model the semantic information of the context and an adversarial network to prevent semantic information from affecting syntactic information. Next, we adopt a knowledge-aware network to represent the knowledge related to the query, which uses an attention mechanism to capture the more important knowledge. Then, we design a syntax encoder to model syntactic information and use a latent variable to maintain syntactic diversity. Finally, the encoded knowledge, syntax and context are concatenated to initialize the decoder. Additionally, we employ the adversarial network to keep syntax and semantics separate and prevent their mutual influence. Experiments on the KdConv dataset show that our model achieves a better trade-off between improving diversity and maintaining knowledge accuracy than the baselines.
Our main contributions are as follows:
• To the best of our knowledge, we are the first to adopt a syntactic latent variable to simultaneously improve diversity and maintain knowledge accuracy in knowledge-grounded conversation generation, and we propose a novel Syntactically Diverse Adversarial Network.
• Our model achieves competitive diversity scores and the best knowledge accuracy scores among the compared models.
• We further conduct extensive ablation studies on the proposed components. These analyses give an intuitive interpretation of why the adversarial network, the knowledge and the syntactic latent variable affect our model, and provide a reference for future model design.

Variational Autoencoder
Since our model adopts latent variables, we briefly review the architecture of the Variational Autoencoder (VAE) (Kingma and Welling, 2014), a generative model that uses a latent variable $z$ to encode the information of the utterance $x$ and then decodes the original $x$ from $z$. The probability of $x$ can be computed as follows:
$$p(x) = \int p(z)\, p(x|z)\, dz$$
where $p(z)$ is the prior distribution and $p(x|z)$ is given by the decoder. Since the integral is unavailable in closed form (Blei et al., 2017), the VAE is trained by maximizing the evidence lower bound (ELBO), which is defined as follows:
$$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z))$$
where $q(z|x)$ is the posterior distribution obtained by the encoder, $\mathbb{E}$ is the mathematical expectation, and $D_{KL}(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence, which measures the difference between two distributions.
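The KL term of the ELBO has a closed form when both distributions are diagonal Gaussians. The following is a minimal sketch under that assumption; the function names `kl_diag_gaussians` and `elbo` are illustrative and do not come from the paper, and the reconstruction log-likelihood is passed in as a plain number rather than computed by a decoder.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) between two diagonal Gaussians.

    Each argument is a list of floats with one entry per latent dimension;
    var_q and var_p hold variances (not standard deviations).
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


def elbo(log_likelihood, mu_q, var_q, mu_p, var_p):
    """ELBO = expected reconstruction log-likelihood minus the KL term."""
    return log_likelihood - kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
```

When the posterior equals the prior the KL term vanishes, so the ELBO reduces to the reconstruction term alone.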

Generative Adversarial Learning
The Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is widely used in image and text generation and consists of a Generator (G) and a Discriminator (D). The training objective of GAN is defined as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
where G maps the noise distribution $p_z(z)$ to a generated distribution $p_g(x)$ that approximates the true distribution $p_{data}(x)$, and D distinguishes $p_g(x)$ from $p_{data}(x)$. G tries to reduce the value of V so that the generated distribution cannot be recognized, whereas D tries to increase the value of V to identify real and generated data effectively. During training, G and D are optimized alternately, and the optimal solution is approached over many iterations.
Figure 1 (overview of SDAN): [...] represents the context information. $K_i$ is the relevant knowledge. $s_i$ is the syntax of $u_i$ obtained by the Parser Toolkit. $z_i^s$ denotes the syntactic latent variable. More details of SDAN are given in Section 3.
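To make the roles of G and D concrete, the value V can be estimated from the discriminator's outputs on a batch of real and generated samples. This is only an illustrative sketch: `gan_value` is a hypothetical helper, and the discriminator outputs are supplied as probabilities in (0, 1) instead of being produced by a network.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: D(x) for samples x drawn from the data distribution.
    d_fake: D(G(z)) for samples generated from noise z.
    All values must lie strictly between 0 and 1.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A discriminator that cannot tell the two distributions apart outputs 0.5 everywhere, giving V = -log 4. A discriminator that separates them well pushes V toward 0, which is the direction D optimizes and G opposes.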

Task Formulation and Model Overview
Formally, we assume the training data $D$ consists of $N$ conversations $\{c_1, c_2, ..., c_N\}$, where each $c_i$ is a sequence of utterances $\{u_1, u_2, ..., u_n\}$, written as $\{u_t\}_{t=1}^{n}$. We treat $\{u_t\}_{t=1}^{n-1}$ as the queries and $\{u_t\}_{t=2}^{n}$ as the responses. Each query has $m$ related knowledge items $(k_1, ..., k_m)$, where each $k_i$ is a triplet $(h_i, r_i, t_i)$, and $h_i$, $r_i$ and $t_i$ are the head entity, the relation and the tail entity, respectively. Each utterance $u_i$ has a syntax $s_i$. The goal of our method is to generate informative and diverse responses, so we fuse knowledge and syntax into the generative model.
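The query/response split can be illustrated with a small sketch. It assumes, as is usual for hierarchical models but not stated explicitly here, that the utterances preceding the response are kept as context; `build_training_pairs` is a hypothetical helper name.

```python
def build_training_pairs(conversation):
    """Turn one conversation [u_1, ..., u_n] into (context, response) pairs.

    For t = 1 .. n-1 the context is [u_1, ..., u_t] (its last element is the
    query u_t) and the response is u_{t+1}.
    """
    pairs = []
    for t in range(1, len(conversation)):
        context = conversation[:t]
        response = conversation[t]
        pairs.append((context, response))
    return pairs
```

A conversation with n utterances therefore yields n-1 training pairs.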
The overview of SDAN is shown in Figure 1. The Adversarial Hierarchical Semantic Network consists of an encoder layer and a context layer and models the semantic information. The Knowledge-Aware Network adopts an attention mechanism to focus on the more important knowledge. The Syntactically Latent Variable Network adopts a latent variable to generate responses with diverse syntax. Finally, the semantic information, knowledge and syntax from these three networks are concatenated and passed to the Decoder.

Adversarial Hierarchical Semantic Network
The Hierarchical Semantic Network consists of two neural network layers. Each input utterance $u_t$ is encoded into a vector $h_t^{enc}$ by the encoder RNN:
$$h_t^{enc} = f_\theta^{enc}(u_t)$$
where $f_\theta^{enc}(\cdot)$ is a bidirectional gated recurrent unit (BiGRU).
The context vector $h_t^{ctx}$ represents the historical information. It updates its hidden state using the encoder vector $h_t^{enc}$, and its initial value is 0. The semantic information from the hierarchical semantic network may contain syntactic information, which can lead to poor syntactic controllability. To solve this problem, we introduce an adversarial network that prevents the semantic information from containing syntactic information. Specifically, we introduce a discriminator that predicts the syntax tree sequence $s_t$ from the semantic information of the context $h_t^{ctx}$. The context layer and the encoder layer can be regarded as the generator. The generator is trained to learn semantic information from which the discriminator cannot predict the syntax; it deceives the discriminator by maximizing the adversarial loss or, equivalently, by minimizing its negation.
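The opposing goals of the discriminator and the generator can be sketched as follows. This is an assumption-laden illustration, not the paper's formula: it supposes the discriminator emits one vector of logits over syntax tokens per step and is trained with cross-entropy, while the generator (encoder and context layers) receives the negated loss. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_nll(logits_per_step, syntax_token_ids):
    """Average negative log-likelihood of the gold syntax tokens."""
    nll = 0.0
    for logits, gold in zip(logits_per_step, syntax_token_ids):
        nll -= math.log(softmax(logits)[gold])
    return nll / len(syntax_token_ids)


def adversarial_losses(logits_per_step, syntax_token_ids):
    """Return (discriminator loss, generator loss).

    The discriminator minimizes its prediction loss; the generator minimizes
    the negation, i.e. it tries to make syntax unpredictable from semantics.
    """
    d_loss = discriminator_nll(logits_per_step, syntax_token_ids)
    g_loss = -d_loss
    return d_loss, g_loss
```

In an alternating schedule, the discriminator parameters would be updated with `d_loss` and the encoder and context parameters with `g_loss`.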

Knowledge-Aware Network
Related knowledge can be retrieved from a knowledge base. The knowledge used in this paper is provided with the dataset, and one query may have multiple knowledge items, so we employ an attention mechanism to pay more attention to the important knowledge, similar to prior work. We assume that $m$ related knowledge items $(k_1, ..., k_m)$ are given for a query $u_t$, and each knowledge item $k_i$ is a triplet $(h_i, r_i, t_i)$. First, we treat the average of the word embeddings of $h_i$ and $r_i$ as the key vector $kv_i$ $(i = 1, ..., m)$. Then, we use the word embedding of the query $u$ to attend to $kv_i$ and obtain attention weights, where emb(·) is the embedding vector and softmax(·) is a generalization of the logistic function that normalizes all values to between 0 and 1 so that they sum to 1. After that, we obtain the knowledge $k_t$ by summing all the weighted tail entities $t_i$. Finally, we encode the knowledge into the knowledge vector $h_t^{kno}$:
$$h_t^{kno} = f_\theta^{kno}(k_t)$$
where $f_\theta^{kno}(\cdot)$ is a BiGRU.
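The attention step can be sketched in a few lines. Two simplifications are assumed here and are not taken from the paper: the query, head, relation and tail are each already reduced to a single embedding vector, and the attention score is a plain dot product. `attend_knowledge` and its helpers are hypothetical names.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_knowledge(query_emb, triples):
    """triples: list of (head_emb, relation_emb, tail_emb) vectors.

    Keys are the averages of head and relation embeddings; the output is the
    attention-weighted sum of the tail embeddings.
    """
    keys = [mean_vec([h, r]) for h, r, _ in triples]
    weights = softmax([dot(query_emb, k) for k in keys])
    dim = len(triples[0][2])
    knowledge = [
        sum(w * tail[i] for w, (_, _, tail) in zip(weights, triples))
        for i in range(dim)
    ]
    return weights, knowledge
```

The triple whose head and relation are closest to the query receives the largest weight, so its tail entity dominates the resulting knowledge vector.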

Syntactically Latent Variable Network
Each utterance contains syntactic information, which is usually represented by a syntactic tree. The syntactic tree can be modeled by a neural network or obtained with a parser toolkit. In this paper, we first use the Stanford Parser toolkit to process all the utterances in the dataset and obtain their syntactic tree sequences, which contain the syntactic tokens and the brackets (the brackets represent the syntactic structure). Then, a SynEncoder is employed to produce the syntactic vector $h_t^{syn}$:
$$h_t^{syn} = f_\theta^{syn}(s_t)$$
where $f_\theta^{syn}(\cdot)$ is a BiGRU and $s_t$ is the syntactic tree sequence.
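A syntactic tree sequence of this kind can be produced by linearizing a constituency tree into labels and brackets. The sketch below assumes the tree is already available as nested tuples of the form `(label, child, ...)` and that the words at the leaves are dropped so that only syntax remains; both choices are illustrative assumptions, and no parser is invoked.

```python
def linearize(tree):
    """Flatten a nested-tuple constituency tree into syntax tokens.

    tree: (label, child_1, child_2, ...), where each child is either another
    tuple (a subtree) or a string (a word, which is skipped).
    """
    label, *children = tree
    tokens = ["(", label]
    for child in children:
        if isinstance(child, tuple):
            tokens.extend(linearize(child))
        # String children are words; only syntactic labels and brackets are kept.
    tokens.append(")")
    return tokens


example = ("S", ("NP", ("PRP", "I")), ("VP", ("VBP", "like"), ("NP", ("NN", "music"))))
print(" ".join(linearize(example)))
# prints: ( S ( NP ( PRP ) ) ( VP ( VBP ) ( NP ( NN ) ) ) )
```

The resulting token sequence is what a sequence encoder such as a BiGRU would consume.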
Finally, to generate syntactically diverse responses, we adopt a syntactic latent variable $z_t^s$ to control the syntactic information. We define the prior distribution of $z_t^s$ as a Gaussian distribution $\mathcal{N}(\cdot)$ whose mean $\mu_s$ and diagonal variance $\sigma_s$ are calculated with $\mathrm{MLP}_\theta(\cdot)$, a feed-forward neural network, and Softplus(·), an activation function that keeps the result positive.
For the posterior distribution of $z_t^s$, we use $h_t^s$ and $h_{t+1}^s$ to calculate it on the training set (only $h_t^s$ on the test set).
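How a feed-forward layer and Softplus yield Gaussian parameters, and how a latent sample is then drawn, can be sketched as below. The single linear layer standing in for the MLP, the reparameterized sampling step, and all names (`gaussian_params`, `sample_latent`) are assumptions for illustration only.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def linear(x, weights, bias):
    """One dense layer: weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def gaussian_params(h, w_mu, b_mu, w_var, b_var):
    """Map a hidden vector h to the mean and diagonal variance of a Gaussian."""
    mu = linear(h, w_mu, b_mu)
    var = [softplus(v) for v in linear(h, w_var, b_var)]
    return mu, var


def sample_latent(mu, var, rng):
    """Reparameterized draw: z = mu + sqrt(var) * eps, with eps ~ N(0, 1)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
```

Under the description above, the prior would call `gaussian_params` on the current syntactic state only, while the posterior used in training would also see the next state.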

Decoder
From the three networks described above, we obtain the representations of semantics, knowledge and syntax. We concatenate them to form the initial state of the decoder. Finally, the decoder $f_\theta^{dec}(\cdot)$, which is a GRU, outputs the response $u_{t+1}$.

Training Objective
Because our model contains latent variables, the training objective for the latent variables is to maximize the ELBO, which consists of $loss_{rec}$, the reconstruction loss, and $loss_{KL}$, the KL divergence that measures the difference between the posterior distribution and the prior distribution of the latent variable $z_t^s$. The final objective is then minimized, with the two losses optimized iteratively.

Evaluation Design
We evaluate the generated responses with both automatic and manual evaluation metrics. For automatic evaluation, we use five classes of metrics:
• Token-level Metrics: Perplexity (PPL) evaluates whether the generated response is grammatical and fluent.
• Overlapping-based Metrics: We adopt BLEU-2/3 (Papineni et al., 2002) to evaluate reconstruction performance, which reflects how well the model preserves information from the knowledge and the ground-truth response.
• Embedding-based Metrics: Average, Greedy and Extrema measure the semantic similarity between the words in the generated response and the ground truth.
• Diversity: We employ Dist-1/2 (Li et al., 2016a) to measure the diversity of the responses, defined as the ratio of distinct uni/bi-grams.
• Knowledge Utilization: $E_{match}$ is the average number of entities in the responses that match the related knowledge triplets (Wu et al., 2020).
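The diversity and knowledge-utilization metrics are simple enough to sketch directly. The version below assumes corpus-level Dist-n over pre-tokenized responses and a substring test for entity matching over head and tail entities; the exact matching rule used in the evaluation is not specified here, so `entity_match` should be read as one plausible interpretation rather than the reference implementation.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to all n-grams over tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def entity_match(responses, triples_per_response):
    """Average number of knowledge entities mentioned per response.

    responses: list of response strings.
    triples_per_response: for each response, a list of (head, relation, tail)
    string triples; heads and tails are treated as the entities.
    """
    if not responses:
        return 0.0
    matched = 0
    for text, triples in zip(responses, triples_per_response):
        entities = {h for h, _, _ in triples} | {t for _, _, t in triples}
        matched += sum(1 for entity in entities if entity in text)
    return matched / len(responses)
```

Higher Dist-n indicates less repetition across responses, while a higher match count indicates that more of the supplied knowledge surfaces in the output.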
For manual evaluation, three metrics are adopted, each scored from 1 to 5.
Coherence (Cohe) denotes the semantic similarity of the response and the query:
① score 1: The response and query are completely different and semantically different.
② score 2: The response and query are completely different, but slightly similar semantically.
③ score 3: The response and query are partly the same and semantically similar.
④ score 4: The response and query are mostly the same and semantically very similar.
⑤ score 5: The response and query are exactly the same.
Fluency (Flu) reflects grammatical problems:
① score 1: The response cannot be understood.
② score 2: The response has more than four grammatical errors and is difficult to understand.
③ score 3: The response has three or four grammatical errors and is not fluent.
④ score 4: The response has one or two grammatical errors and is fluent.
⑤ score 5: The response has no grammatical errors and is fluent.
Informativeness (Info) measures whether the response is relevant to the knowledge information:
① score 1: The response does not contain the relevant knowledge and is not relevant to the context.
② score 2: The response does not contain the relevant knowledge, but is relevant to the context.
③ score 3: The response contains only one piece of relevant knowledge.
④ score 4: The response contains part of the relevant knowledge.
⑤ score 5: The response contains all the relevant knowledge.

Results of Automatic Evaluation
The results of the automatic evaluation metrics are shown in Table 2. We analyze the results from the following perspectives:
The influence of semantic and syntactic latent variables:
1) Although our improvement in some domains is limited, we achieve a balance between syntactic diversity and knowledge accuracy.
2) In terms of embedding-based metrics (Average, Extrema and Greedy), there is little difference among the three models. We can therefore conclude that adopting semantic or syntactic latent variables has little effect on the semantics of responses.
3) Compared with HRED+know, VHRED+know obtains lower BLEU-k scores and higher Dist-k scores, while SDAN performs better in both aspects. Semantic latent variables significantly improve diversity but also greatly reduce the accuracy of responses, whereas syntactic latent variables improve both diversity and accuracy. The reason is that semantic latent variables may select other words with similar semantics, which makes the knowledge inaccurate, while syntactic latent variables only change the syntax of responses, which does not affect the accuracy of knowledge.
4) The Dist-k scores of VHRED+know are higher than those of SDAN, which indicates that semantic latent variables are more effective than syntactic latent variables in improving diversity. The reason may be that the semantic vocabulary is much larger than the syntactic vocabulary.
5) For PPL, VHRED+know obtains the best results and SDAN performs better than HRED+know, which indicates that both semantic and syntactic latent variables help generate fluent responses and that the former works better.
Table 4: Ablation study on KdConv Corpus. The "-adv", "-know" and "-syn" mean that we eliminate the adversarial network (discriminator), knowledge-aware network and syntactically latent variable network, respectively.

Comparison between domains:
The performance on BLEU-k improves from the film domain to the travel domain, because there are 1,837 entities and 318 relations in the film domain but only 699 entities and 7 relations in the travel domain. More diverse knowledge makes knowledge selection harder for the knowledge-aware network.

Results of Manual Evaluation
The results of the manual evaluation metrics are shown in Table 3. The scores of the three metrics range from 1 to 5. We randomly select 50 conversations from the test set and ask 3 annotators to evaluate the responses generated by the above models.
For Coherence, the three models are similar in maintaining semantic consistency, which agrees with the results of the automatic evaluation. VHRED+know achieves the best Fluency scores and the worst Informativeness scores, which again indicates that the semantic latent variable can make knowledge inaccurate while improving the fluency of responses. Our model obtains competitive Coherence and Fluency scores and the best Informativeness scores, which indicates that it can generate informative responses while keeping semantic coherence.

Ablation Study
To analyze which components drive the improvements, we present an ablation study in Table 4. We eliminate the adversarial network (discriminator), the knowledge-aware network and the syntactically latent variable network one by one, which results in four models, denoted "-adv", "-adv-know", "-adv-syn" and "-adv-know-syn". By comparing these four models with SDAN, we draw the following conclusions:
1) After eliminating the adversarial network (comparing SDAN with "-adv"), "-adv" performs worse than SDAN, which indicates that the adversarial network effectively improves semantics, knowledge accuracy, diversity and fluency, and that it is necessary to decouple semantics from syntax.
2) When the knowledge-aware network is further removed (comparing "-adv" with "-adv-know"), all results become worse again, and the decline in BLEU-k scores is especially obvious, which indicates that introducing knowledge is essential for conversation generation.
3) When the syntactically latent variable is eliminated (comparing "-adv" with "-adv-syn", or "-adv-know" with "-adv-know-syn"), the scores of Average, Extrema, Greedy and BLEU-k improve slightly and the Dist-k scores drop slightly, which shows that adopting the syntactically latent variable slightly reduces semantic consistency and knowledge accuracy but improves diversity. Moreover, when syntactic information and semantic representation exist simultaneously, they need to be decoupled with the adversarial network to prevent them from influencing each other.

Case Study
The responses generated by HRED, all baselines and our model, sampled from the test set in the film domain, are shown in Table 5. As can be seen, HRED tends to generate generic or irrelevant responses. After introducing knowledge, HRED+know can generate coherent and informative responses related to the given knowledge. When adopting a semantic latent variable, VHRED+know prefers to generate responses relevant to the context. By utilizing knowledge and the syntactically latent variable, our model can generate knowledge-coherent and diverse responses.

Related Work
Sequence-to-sequence (Seq2Seq) models (Sutskever et al., 2014; Shang et al., 2015) with attention (Bahdanau et al., 2015) have been widely used in conversation generation. However, these models tend to generate meaningless and generic responses (Serban et al., 2017). To alleviate this issue, researchers have utilized context (Sordoni et al., 2015; Yao et al., 2015), topic information (Xing et al., 2016, 2017) or knowledge (Wu et al., 2020) to enhance response quality. Studies of knowledge-grounded conversation generation mainly focus on methods of knowledge retrieval or knowledge fusion (Wu et al., 2020; Ye et al., 2020; Liang et al., 2021) with a copy mechanism. Knowledge-grounded models can improve the accuracy of knowledge, but the responses generated by some of them may lack diversity, which is also a significant cause of generic responses.
Recently, to tackle the lack of diversity, researchers have begun to introduce the beam search algorithm (Li et al., 2016b; Vijayakumar et al., 2016) into the decoder or to adopt latent variables (Serban et al., 2017; Park et al., 2018; Shen et al., 2019). Adopting latent variables can significantly improve the diversity of responses, but it can lead to inaccurate knowledge. To the best of our knowledge, this problem has not been investigated in conversation generation so far.
Different from all the models mentioned above, our approach introduces syntax to conversation generation. We propose a syntactically diverse adversarial network, which utilizes latent variables to control the syntactic diversity. Additionally, we utilize adversarial learning to preserve the disentanglement of syntax and semantics for preventing them from influencing each other. Our model can not only generate sentences with diverse syntax but also keep the accuracy of knowledge.

Conclusion
In this paper, we propose a Syntactically Diverse Adversarial Network for knowledge-grounded conversation generation, which uses an adversarial hierarchical semantic network, a knowledge-aware network and a syntactic latent variable network to model semantics, knowledge and diverse syntactic information. Moreover, our model adopts adversarial learning to enhance the controllability of syntax. According to automatic and manual evaluation, our model improves the quality of generated responses competitively and obtains a better trade-off between improving diversity and preserving knowledge accuracy.