Discovering Dialog Structure Graph for Coherent Dialog Generation

Learning a discrete dialog structure graph from human-human dialogs yields basic insights into the structure of conversation, and also provides background knowledge to facilitate dialog generation. However, this problem is less studied in open-domain dialogue. In this paper, we conduct unsupervised discovery of discrete dialog structure from chitchat corpora, and then leverage it to facilitate coherent dialog generation in downstream systems. To this end, we present an unsupervised model, Discrete Variational Auto-Encoder with Graph Neural Network (DVAE-GNN), to discover discrete hierarchical latent dialog states (at the level of both session and utterance) and their transitions from a corpus as a dialog structure graph. Then we leverage it as background knowledge to facilitate dialog management in an RL-based dialog system. Experimental results on two benchmark corpora confirm that DVAE-GNN can discover a meaningful dialog structure graph, and that the use of dialog structure as background knowledge can significantly improve multi-turn coherence.


Introduction
With the aim of building a machine that converses with humans naturally, some work investigates neural generative models (Shang et al., 2015; Serban et al., 2017). While these models can generate locally relevant dialogs, they struggle to organize individual utterances into a globally coherent flow (Yu et al., 2016; Xu et al., 2020b). A possible reason is that it is difficult to control the overall dialog flow without background knowledge about dialog structure. However, due to the complexity of open-domain conversation, it is laborious and costly to annotate dialog structure manually. Therefore, it is of great importance to discover open-domain dialog structure from a corpus in an unsupervised way for coherent dialog generation.
Some studies tried to discover dialog structure from task-oriented dialogs. However, the number of their dialog states is limited to only dozens or hundreds, which cannot cover fine-grained semantics in open-domain dialogs. Furthermore, the dialog structures they discovered generally contain only utterance-level semantics (non-hierarchical), without the session-level semantics (chatting topics) that are essential in open-domain dialogs (Kang et al., 2019; Xu et al., 2020c). Thus, in order to provide a full picture of open-domain dialog structure, it is desirable to discover a two-layer directed graph that contains session-level semantics in the upper-layer vertices, utterance-level semantics in the lower-layer vertices, and edges among these vertices.
In this paper, we propose a novel discrete variational auto-encoder with graph neural network (DVAE-GNN) to discover a two-layer dialog structure from a chitchat corpus. Intuitively, since transitions between discrete dialog states are easier to capture for dialog coherence, we use discrete variables to represent dialog states (or vertices in the graph) rather than the dense continuous ones used in most VAE-based dialog models (Serban et al., 2017; Zhao et al., 2017). Specifically, we employ an RNN encoder with a softmax function as the vertex recognition module in DVAE, and an RNN decoder as the reconstruction module in DVAE, as shown in Figure 3. Furthermore, we integrate GNN into DVAE to model complex relations among discrete variables for more effective discovery. The parameters of DVAE-GNN can be optimized by minimizing a reconstruction loss, without the requirement of any annotated datasets.
As shown in Figure 1, with a well-trained DVAE-GNN, we build the dialog structure graph in three steps. First, we map all dialog sessions to utterance-level and session-level vertices, as shown in Figure 1 (b). Second, we calculate co-occurrence statistics of mapped vertices for all dialog sessions, as shown in Figure 1 (c). Finally, we build edges among vertices based on all collected co-occurrence statistics to form the dialog structure graph, as shown in Figure 1 (d).
To demonstrate the effectiveness of the discovered structure, we propose a hierarchical reinforcement learning (RL) based graph grounded conversational system (GCS) to leverage it for conversation generation. As shown in Figure 2, given a dialog context, GCS first maps it to an utterance-level vertex, then learns to walk over graph edges, and finally selects a contextually appropriate utterance-level vertex to guide response generation at each turn.
Our contributions include: (1) we identify the task of unsupervised dialog structure graph discovery in open-domain dialogs; (2) we propose a novel model, DVAE-GNN, for hierarchical dialog structure graph discovery. Experimental results on two benchmark corpora demonstrate that we can discover meaningful dialog structure, that the use of GNN is crucial to dialog structure discovery, and that the graph can improve dialog coherence significantly.

Related Work

Dialog structure learning for task-oriented dialogs
There is previous work on discovering human-readable dialog structure for task-oriented dialogs via hidden Markov models (Chotimongkol, 2008; Ritter et al., 2010; Zhai and Williams, 2014) or variational auto-encoders. However, the number of their dialog states is limited to only dozens or hundreds, which cannot cover fine-grained semantics in chitchat. Moreover, our method can discover a hierarchical dialog structure, which is different from the non-hierarchical dialog structures in most previous work.

Knowledge aware conversation generation
There is growing interest in leveraging knowledge bases to generate more informative responses (Moghe et al., 2018; Dinan et al., 2019; Liu et al., 2019; Xu et al., 2020c,a). In this work, we employ a dialog-modeling oriented graph built from dialog corpora, instead of an external knowledge base, in order to facilitate multi-turn dialog modeling.

Latent variable models for chitchat
Recently, latent variables are utilized to improve diversity (Serban et al., 2017; Zhao et al., 2017; Gu et al., 2019; Gao et al., 2019; Ghandeharioun et al., 2019), control responding styles (Li et al., 2020) and incorporate knowledge (Kim et al., 2020). Our work differs from theirs in that: (1) we focus on open-domain dialog structure discovery; (2) we use discrete latent variables to model dialog states instead of the dense continuous ones in most previous work.
Figure 3: Overview of our algorithm "DVAE-GNN" for discovering a dialog structure graph from a dialog dataset. FFN denotes feed-forward neural networks and Emb refers to embedding layers.
Algorithm 1 Phrase extraction
Input: An utterance U
Output: A set of phrases E extracted from U
1: Obtain a dependency parse tree T for U;
2: Get all the head words HED that are connected to the ROOT node, and all the leaf nodes in T (denoted as L);
3: for each leaf node in L do
4:   Extract a phrase consisting of words along the tree from HED to the current leaf node, denoted as e_i;
5:   If e_i is a verb phrase, then append it into E;
6: end for
7: return E
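As a rough illustration of Algorithm 1, the sketch below walks from each leaf of a dependency tree up to the head word attached to ROOT and keeps the resulting word path as a phrase. It is a minimal sketch under stated assumptions, not the authors' implementation: the parse is supplied by hand as a list of head indices (a real parser would provide it), the function name `extract_phrases` is hypothetical, the "verb phrase" test is approximated by checking that the head word carries a `VERB` tag, and words are emitted in surface order.

```python
from typing import List

ROOT = 0  # index of the artificial ROOT node; tokens are numbered from 1


def extract_phrases(words: List[str], heads: List[int], pos: List[str]) -> List[str]:
    """words[i-1] is token i, heads[i-1] is the index of its head (0 = ROOT)."""
    n = len(words)
    has_child = set(heads)
    # Leaf nodes are tokens that never appear as the head of another token.
    leaves = [i for i in range(1, n + 1) if i not in has_child]
    phrases: List[str] = []
    for leaf in leaves:
        # Climb from the leaf to the head word that is attached to ROOT.
        path = [leaf]
        while heads[path[-1] - 1] != ROOT:
            path.append(heads[path[-1] - 1])
        head_word = path[-1]
        # Assumption: a path counts as a verb phrase if its head word is a verb
        # and the path contains more than one word.
        if pos[head_word - 1] == "VERB" and len(path) > 1:
            phrase = " ".join(words[i - 1] for i in sorted(path))
            if phrase not in phrases:
                phrases.append(phrase)
    return phrases


words = ["I", "want", "to", "eat", "noodles"]
heads = [2, 0, 4, 2, 4]  # hand-written parse: "want" is attached to ROOT
pos = ["PRON", "VERB", "PART", "VERB", "NOUN"]
print(extract_phrases(words, heads, pos))
```

In this toy parse the leaves are "I", "to" and "noodles", so three root-to-leaf paths are extracted; how strictly to filter them as verb phrases is a design choice the paper leaves to the parser and tag set in use.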

Problem Definition
We are given a corpus $D$ that contains $|D|$ dialog sessions $\{X_1, X_2, ..., X_{|D|}\}$, where each dialog session $X$ consists of a sequence of $c$ utterances, $X = [x_1, ..., x_c]$. The objective is to discover a two-layer dialog structure graph $G = \{V, E\}$ from all dialog sessions in $D$, where $V$ is the vertex set and $E$ is the edge set. Specifically, $V$ consists of two types of vertices: $v^s_m$ ($1 \le m \le M$) for session-level vertices (topics) and $v^u_n$ ($1 \le n \le N$) for utterance-level vertices. $E$ contains three types of edges: edges between two session-level vertices (denoted as Sess-Sess edges), edges between two utterance-level vertices (denoted as Utter-Utter edges), and edges between an utterance-level vertex and its parent session-level vertices (denoted as Sess-Utter edges).
Figure 3 shows the proposed DVAE-GNN framework. It contains two procedures: vertex recognition, which maps utterances and sessions to vertices (playing the role of the recognition module in VAE (Kingma and Welling, 2014)), and utterance reconstruction, which regenerates all utterances in sessions (playing the role of the decoding module in VAE).

Graph Initialization
Vertex Initialization. Theoretically, we can cold start the representation learning of vertices in the dialog structure graph. In practice, to accelerate the learning procedure, we warm start each utterance-level vertex representation with the combination of two parts: one discrete latent variable and one distinct phrase. The phrase associated with each utterance-level vertex provides prior semantic knowledge for the utterance-level vertex representation, which reduces the learning difficulty. Specifically, we first extract distinct phrases from all dialog utterances with Algorithm 1. Then we choose the top-N most frequent extracted phrases (the same number as utterance-level vertices), and randomly match utterance-level vertices and phrases in pairs during initialization. Notice that these associations are not changed afterwards.
Formally, we use $\Lambda_s$ and $\Lambda_x$ to represent the hidden representation matrices of discrete session-level and utterance-level vertices respectively. The calculation is as follows:
$$\Lambda_s[m] = W^s v^s_m, \tag{1}$$
$$\Lambda_x[n] = W^u [v^u_n ; e(ph_n)], \tag{2}$$
where $\Lambda_s[m]$ denotes the representation vector of the $m$-th session-level vertex, $\Lambda_x[n]$ denotes the representation vector of the $n$-th utterance-level vertex, $v^s_m$ and $v^u_n$ are one-hot vectors of discrete vertices, $e(ph_n)$ denotes the representation vector of the phrase $ph_n$ associated with $v^u_n$, $W^u$ and $W^s$ are parameters, and ";" denotes the concatenation operation. Specifically, for phrase representation, we first feed the word sequence of the phrase to an RNN encoder and obtain the hidden vectors. Then we take the average pooling of these hidden vectors as $e(ph_n)$.

Edge Initialization
We build an initial Utter-Utter edge between two utterance-level vertices when their associated phrases can be extracted sequentially from two adjacent utterances in the same dialog session.

Vertex Recognition
Utterance-level Vertex Recognition. For each utterance $x_i$ in a dialog session, we map it to an utterance-level vertex. Specifically, we first encode the utterance $x_i$ with an RNN encoder to obtain its representation vector $e(x_i)$. Then, we calculate the posterior distribution of the mapped utterance-level vertex, $z_i$, by a feed-forward neural network (FFN):
$$z_i \sim q(z \mid x_i) = \mathrm{softmax}(\Lambda_x e(x_i)). \tag{3}$$
Finally, we obtain the mapped utterance-level vertex, $z_i$, by sampling from the posterior distribution with Gumbel-Softmax (Jang et al., 2017). After mapping each utterance in one dialog session, we obtain an utterance-level vertex sequence, which is then utilized for session-level vertex recognition.
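The utterance-level recognition step can be sketched in plain Python as follows. This is a minimal illustration, assuming the posterior logits are dot products between the utterance encoding and each vertex representation; the names `gumbel_softmax` and `recognize_vertex`, the temperature `tau`, and the toy vectors are hypothetical and not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: add Gumbel noise to the logits, then softmax."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    return softmax(noisy)


def recognize_vertex(e_x, vertex_matrix, tau=1.0, rng=random):
    """Map an utterance encoding e_x to a vertex index (assumed dot-product logits)."""
    logits = [sum(w * x for w, x in zip(row, e_x)) for row in vertex_matrix]
    relaxed = gumbel_softmax(logits, tau, rng)
    index = max(range(len(relaxed)), key=relaxed.__getitem__)
    return index, relaxed


rng = random.Random(0)
vertex_matrix = [[100.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy vertices
index, relaxed = recognize_vertex([10.0, 0.0], vertex_matrix, rng=rng)
print(index, relaxed)
```

During training, the relaxed vector rather than the hard index would typically be used so that gradients can flow through the sampling step; the hard index is what the graph construction stage needs.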
Session-level Vertex Recognition. We assume that each session-level vertex corresponds to a group of similar utterance-level vertex sequences that are mapped from different dialog sessions, and that these similar sequences might share overlapping utterance-level vertices. To leverage this locally overlapping vertex information for encouraging the mapping of similar utterance-level vertex sequences to similar session-level vertices, we employ a graph neural network to model complex relations among vertices for session-level vertex recognition. Specifically, we utilize a three-layer graph convolution network (GCN) over Utter-Utter edges to calculate structure-aware utterance-level semantics. The calculation is defined by:
$$h^j_{v^u_n} = \sigma^j\Big(\sum_{v^u_{n'} \in N(v^u_n)} h^{j-1}_{v^u_{n'}}\Big), \tag{4}$$
where $h^j_{v^u_n}$ denotes the $j$-th layer structure-aware representation for the $n$-th utterance-level vertex $v^u_n$, $\sigma^j$ is the sigmoid activation function for the $j$-th layer, and $N(v^u_n)$ is the set of utterance-level neighbors of $v^u_n$ in the graph. Here, we can obtain a structure-aware semantic sequence $[h^3_{v^u_{z_1}}, ..., h^3_{v^u_{z_c}}]$, where $h^3_{v^u_{z_i}}$ represents the final structure-aware representation of the $i$-th mapped utterance-level vertex, $v^u_{z_i}$. Then, we feed the structure-aware semantic sequence to an RNN encoder, denoted as the vertex-sequence encoder, to obtain the structure-aware session representation $e(z_{1,...,c})$. We calculate the posterior distribution of the mapped session-level vertex, $g$, as follows:
$$g \sim q(g \mid z_{1,...,c}) = \mathrm{softmax}(\Lambda_s e(z_{1,...,c})). \tag{5}$$
Then, we obtain the mapped session-level vertex, $g$, by sampling from the session-level posterior distribution with Gumbel-Softmax.
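To make the neighbor aggregation concrete, here is a minimal sketch of a three-layer propagation over Utter-Utter edges. It assumes the simplest form compatible with the description (each layer sums the previous-layer vectors of a vertex's neighbors and applies a sigmoid, with no learned weight matrices) and assumes the initial vertex vectors serve as layer 0; the function names and the toy graph are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gcn_layer(h, neighbors):
    """One propagation step: sum neighbor vectors, then apply the sigmoid."""
    dim = len(next(iter(h.values())))
    out = {}
    for vertex in h:
        agg = [0.0] * dim
        for nb in neighbors.get(vertex, []):
            for k in range(dim):
                agg[k] += h[nb][k]
        out[vertex] = [sigmoid(a) for a in agg]
    return out


def structure_aware(h0, neighbors, num_layers=3):
    """Apply num_layers propagation steps starting from the initial vectors h0."""
    h = h0
    for _ in range(num_layers):
        h = gcn_layer(h, neighbors)
    return h


# Toy graph: "a" and "b" are neighbors of each other, "c" is isolated.
h0 = {"a": [0.0, 1.0], "b": [2.0, -1.0], "c": [3.0, 3.0]}
neighbors = {"a": ["b"], "b": ["a"]}
print(structure_aware(h0, neighbors))
```

An isolated vertex ends up at 0.5 in every dimension under this form, which illustrates why the initial Utter-Utter edges matter for the structure-aware representations.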

Utterance Reconstruction
We reconstruct all utterances in the dialog session by feeding the mapped vertices into an RNN decoder (denoted as the reconstruction decoder). Specifically, to regenerate utterance $x_i$, we concatenate the representation vector of the mapped utterance-level vertex, $\Lambda_x[z_i]$, and the representation vector of the mapped session-level vertex, $\Lambda_s[g]$, and use the result as the initial hidden state of the reconstruction decoder.
Finally, we optimize the DVAE-GNN model by maximizing the variational lower-bound (ELBO) (Kingma and Welling, 2014). Please refer to Appendix D for more details.

Graph Construction
After training DVAE-GNN, we construct the dialog structure graph with the well-trained DVAE-GNN in three steps, as shown in Figure 1. Specifically, we first map all dialog sessions in the corpus to vertices by Equations 3 and 5.
Then, we collect co-occurrence statistics of these mapped vertices. Specifically, we count the total mapped times for each session-level vertex, denoted as $\#(v^s_i)$, and those for each utterance-level vertex, denoted as $\#(v^u_j)$. Furthermore, we collect the co-occurrence frequency of a session-level vertex and an utterance-level vertex that are mapped by a dialog session and an utterance in it respectively, denoted as $\#(v^s_i, v^u_j)$. Moreover, we collect the co-occurrence frequency of two utterance-level vertices that are sequentially mapped by two adjacent utterances in a dialog session, denoted as $\#(v^u_j, v^u_k)$.
Finally, we build edges between vertices based on these co-occurrence statistics. We first build a directed Utter-Utter edge from $v^u_j$ to $v^u_k$ if the transition probability $\#(v^u_j, v^u_k)/\#(v^u_j)$ is above a threshold $\alpha_{uu}$. Then, we build a bidirectional Sess-Utter edge between $v^u_j$ and $v^s_i$ if the probability $\#(v^s_i, v^u_j)/\#(v^u_j)$ is above a threshold $\alpha_{su}$. Moreover, we build a bidirectional Sess-Sess edge between two session-level vertices if a probability computed from the number of utterance-level vertices that are connected to both session-level vertices is above a threshold $\alpha_{ss}$. Here, Sess-Sess edges are dependent on Sess-Utter edges.
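The counting and thresholding procedure can be sketched in Python as below. This is an illustrative sketch under stated assumptions: each session is assumed to be already mapped to one session-level vertex and a sequence of utterance-level vertices, the Utter-Utter score is assumed to be the ratio of adjacent-pair count to the count of the source vertex, the Sess-Utter score is the ratio of the pair count to the count of the utterance-level vertex, and Sess-Sess edges are omitted. The function name `build_edges` and the toy data are hypothetical.

```python
from collections import Counter


def build_edges(sessions, alpha_uu=0.05, alpha_su=0.05):
    """sessions: list of (session_vertex, [utterance_vertex, ...]) pairs."""
    utt_count = Counter()       # number of times each utterance-level vertex is mapped
    pair_count = Counter()      # adjacent utterance-level vertex pairs (ordered)
    sess_utt_count = Counter()  # (session-level vertex, utterance-level vertex) pairs
    for sess_vertex, utt_vertices in sessions:
        for z in utt_vertices:
            utt_count[z] += 1
            sess_utt_count[(sess_vertex, z)] += 1
        for a, b in zip(utt_vertices, utt_vertices[1:]):
            pair_count[(a, b)] += 1
    # Directed Utter-Utter edges: keep a -> b when count(a, b) / count(a) > alpha_uu.
    uu_edges = {(a, b) for (a, b), c in pair_count.items()
                if c / utt_count[a] > alpha_uu}
    # Sess-Utter edges: keep (s, u) when count(s, u) / count(u) > alpha_su.
    su_edges = {(s, u) for (s, u), c in sess_utt_count.items()
                if c / utt_count[u] > alpha_su}
    return uu_edges, su_edges


sessions = [
    ("s1", ["u1", "u2", "u3"]),
    ("s1", ["u1", "u2"]),
    ("s2", ["u1", "u3"]),
]
uu, su = build_edges(sessions, alpha_uu=0.4, alpha_su=0.4)
print(sorted(uu))
print(sorted(su))
```

With the thresholds raised to 0.4 for this tiny example, the rare transition from "u1" to "u3" and the weak link between "s2" and "u1" are filtered out, which is the pruning role the thresholds play on a real corpus.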

Graph Grounded Dialog Generation
To demonstrate the effectiveness of the discovered structure for coherent dialog generation, we utilize a graph grounded conversation system (GCS) following Xu et al. (2020a). Different from the single-layer policy in Xu et al. (2020a), we present a hierarchical policy for two-level vertex selection. The GCS contains three modules: (1) a dialog context understanding module that maps a given dialog context (the previous two utterances) to an utterance-level vertex (called the hit utterance-level vertex) in the graph with the well-trained DVAE-GNN, (2) a hierarchical policy that learns to walk over one-hop graph edges (for dialog coherence) to select an utterance-level vertex to serve as response content, and (3) a response generator that generates an appropriate response based on the selected utterance-level vertex. Specifically, a session-level sub-policy first selects a session-level vertex as the current dialog topic. Then, an utterance-level sub-policy selects an utterance-level vertex from the current dialog topic's child utterance-level vertices.
Session-level sub-policy. Let $A^g_{s_l}$ denote the set of session-level candidate actions at time step $l$. It consists of all parent session-level vertices of the hit utterance-level vertex. Given the current RL state $s_l$ at time step $l$, the session-level sub-policy $\mu^g$ selects an appropriate session-level vertex from $A^g_{s_l}$ as the current dialog topic. Specifically, $\mu^g$ is formalized as follows:
$$\mu^g(e_{s_l}, c^g_j) = \frac{\exp(e_{s_l}^\top \Lambda_s[c^g_j])}{\sum_{k=1}^{N^g_l} \exp(e_{s_l}^\top \Lambda_s[c^g_k])},$$
where $e_{s_l}$ is the RL state representation (see Appendix C.1), $c^g_j$ is the $j$-th session-level vertex in $A^g_{s_l}$, and $N^g_l$ is the number of session-level vertices in $A^g_{s_l}$.
Utterance-level sub-policy. Let $A^u_{s_l}$ denote the set of utterance-level candidate actions at time step $l$. It consists of utterance-level vertices that are connected to the vertex of the current dialog topic. Given the current state $s_l$ at time step $l$, the utterance-level sub-policy $\mu^u$ selects an optimal utterance-level vertex from $A^u_{s_l}$. Specifically, $\mu^u$ is defined as follows:
$$\mu^u(e_{s_l}, c^u_j) = \frac{\exp(e_{s_l}^\top \Lambda_x[c^u_j])}{\sum_{k=1}^{N^u_l} \exp(e_{s_l}^\top \Lambda_x[c^u_k])}.$$
Here, $e_{s_l}$ is the RL state representation, $c^u_j$ is the $j$-th utterance-level vertex in $A^u_{s_l}$, and $N^u_l$ is the number of utterance-level candidate vertices in $A^u_{s_l}$. With the distribution calculated by the above equation, we utilize Gumbel-Softmax to sample an utterance-level vertex from $A^u_{s_l}$, which provides response content for the response generator, a Seq2Seq model with attention mechanism.
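The two-level selection can be illustrated with the following sketch. It assumes each sub-policy scores a candidate by the dot product between the RL state vector and the candidate's vertex representation and normalizes with a softmax; for determinism it picks the highest-probability candidate instead of sampling with Gumbel-Softmax. All names (`policy`, `select_vertex`, the dictionaries) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy(state, candidates, embeddings):
    """Distribution over candidate vertices (assumed dot-product scoring)."""
    scores = [sum(a * b for a, b in zip(state, embeddings[c])) for c in candidates]
    return softmax(scores)


def select_vertex(state, hit_vertex, parents, children, sess_emb, utt_emb):
    """Pick a topic among the hit vertex's parents, then an utterance vertex under it."""
    topics = parents[hit_vertex]
    topic_probs = policy(state, topics, sess_emb)
    topic = topics[topic_probs.index(max(topic_probs))]  # greedy, not sampled
    utterances = children[topic]
    utt_probs = policy(state, utterances, utt_emb)
    return topic, utterances[utt_probs.index(max(utt_probs))]


parents = {"u0": ["s1", "s2"]}                      # Sess-Utter edges of the hit vertex
children = {"s1": ["u1", "u2"], "s2": ["u3"]}       # child utterance-level vertices
sess_emb = {"s1": [2.0, 0.0], "s2": [0.0, 5.0]}
utt_emb = {"u1": [0.1, 0.0], "u2": [0.9, 0.0], "u3": [5.0, 5.0]}
print(select_vertex([1.0, 0.0], "u0", parents, children, sess_emb, utt_emb))
```

Restricting the utterance-level candidates to the children of the chosen topic is what keeps consecutive selections on one topic until the session-level sub-policy moves on.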
To train the policy with RL, we use a set of rewards including utterance relevance, utter-topic closeness, and repetition penalty. For the session-level sub-policy, its reward $r^g$ is the average reward from the utterance-level sub-policy during the current dialog topic. The reward for the utterance-level sub-policy, $r^u$, is a weighted sum of the factors listed below. The default weights are set as [60, 0.5, -0.5].
i) Utterance relevance We choose the classical multi-turn response selection model, DAM (Zhou et al., 2018), to calculate utterance relevance. We expect the generated response to be coherent with the dialog context.
ii) Utter-topic closeness The selected utterance-level vertex $v^u_j$ should be closely related to the current topic $v^s_i$. We use $\#(v^s_i, v^u_j)/\#(v^u_j)$ from Section 3.5 as the utter-topic closeness score.
iii) Repetition penalty This factor is 1 when the selected utterance-level vertex shares more than 60% of its words with one of the contextual utterances, otherwise 0. We expect that the selected utterance-level vertices are not only coherent, but also diverse.
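The weighted reward can be sketched as follows. This is a minimal illustration with assumptions stated openly: the relevance score is taken as a given number (standing in for the DAM model's output), the overlap ratio is assumed to be measured against the word count of the selected vertex's text, and whitespace tokenization is used for simplicity. The function names are hypothetical; the weights [60, 0.5, -0.5] and the 60% threshold are the defaults quoted above.

```python
def repetition_penalty(candidate, context, threshold=0.6):
    """Return 1.0 if the candidate shares more than `threshold` of its words
    with any single context utterance, otherwise 0.0."""
    cand_words = candidate.split()
    if not cand_words:
        return 0.0
    for utterance in context:
        ctx_words = set(utterance.split())
        shared = sum(1 for w in cand_words if w in ctx_words)
        if shared / len(cand_words) > threshold:
            return 1.0
    return 0.0


def utterance_reward(relevance, closeness, repetition, weights=(60, 0.5, -0.5)):
    """Weighted sum of the three reward factors for the utterance-level sub-policy."""
    w_rel, w_close, w_rep = weights
    return w_rel * relevance + w_close * closeness + w_rep * repetition


context = ["a b c d x"]
penalty = repetition_penalty("a b c d e", context)  # 4 of 5 words are shared
print(utterance_reward(relevance=0.5, closeness=0.4, repetition=penalty))
```

Because the repetition weight is negative, a vertex that mostly repeats the context lowers the reward even when its relevance score is high.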
Further implementation details can be found in Appendix C.

Experiments for Dialog Structure Graph Discovery

Datasets and Baselines
We evaluate the quality of the dialog structure graph discovered by our method and baselines on two benchmark corpora.
In this work, we select DVRNN (Shi et al., 2019) as a baseline, since there are few previous studies on unsupervised open-domain dialog structure discovery. DVRNN is the SOTA unsupervised method for discovering dialog structure in task-oriented dialogs, and it outperforms other hidden Markov based methods by a large margin. We rerun the original source code (github.com/wyshi/Unsupervised-Structure-Learning). Notice that, to suit the setting of open-domain dialog and also respect the limit of our 16G GPU memory (we set batch size as 32 to ensure training efficiency), we set the number of dialog states as 50 (originally it is 10). We also evaluate the quality of the initialized graph (denoted as Phrase Graph) that consists of only phrases (as vertices) and initial edges (between phrases) in Section 3.2. For more details, please refer to Appendix A.1.

Evaluation Metrics
We evaluate the discovered dialog structure graph with both automatic evaluation and human evaluation. For automatic evaluation, we use two metrics to evaluate the performance of reconstruction: (1) NLL is the negative log likelihood of dialog utterances; (2) BLEU-1/2 measures how many 1/2-gram overlaps the reconstructed sentences have with the input sentences (Papineni et al., 2002). The two metrics indicate how well the learned dialog structure graph can capture important semantic information in the dialog dataset.
Further, we manually evaluate the quality of edges and vertices in the graph. For edges, we use two metrics. (1) S-U Appr. for multi-turn dialog coherence: It measures the appropriateness of Sess-Utter edges, where these edges provide crucial prior information to ensure multi-turn dialog coherence (see results in Section 5.4). It is "1" if an utterance-level vertex is relevant to its session-level vertex (topic), otherwise "0". (2) U-U Appr. for single-turn dialog coherence: It measures the quality of Utter-Utter edges between two utterance-level vertices, where these edges provide crucial prior information to ensure single-turn dialog coherence. It is "1" if an Utter-Utter edge is suitable for responding, otherwise "0". Notice that we don't evaluate the quality of Sess-Sess edges because Sess-Sess edges are dependent on the statistics of Sess-Utter edges.
Meanwhile, for vertices, we evaluate Session-level Vertex Quality (Sess.V.-Qual.). Ideally, a session-level vertex (topic) should be mapped by dialog sessions that share high similarity. In other words, we can measure the quality of a session-level vertex by evaluating the semantic similarity between two sessions that are mapped to it. It is "2" if the two sessions mapped to the same session-level vertex are about the same or highly similar topics, "0" if the two sessions contain different topics, otherwise "1". Specifically, during evaluation, we provide typical words of each topic by calculating TF-IDF on utterances that are mapped to it. High "Sess.V.-Qual." is beneficial for topic management in coherent multi-turn dialogs. Note that we don't evaluate utterance-level vertex quality since it is too fine-grained for annotators to determine whether two utterances that are mapped to an utterance-level vertex are "highly similar".
For human evaluation, we sample 300 cases and invite three annotators from a crowd-sourcing platform (test.baidu.com) to evaluate each case. Notice that all system identifiers are masked during human evaluation.

Experiment Results
As shown in Table 1, DVAE-GNN significantly outperforms DVRNN in terms of all the metrics (sign test, p-value < 0.01) on the two datasets. It demonstrates that DVAE-GNN can better discover a meaningful dialog structure graph. Specifically, DVAE-GNN obtains the best results in terms of NLL and BLEU-1/2, which shows that DVAE-GNN can better capture important semantic information in comparison with DVRNN. Meanwhile, DVAE-GNN also surpasses all baselines in terms of "U-U Appr." and "S-U Appr.". It indicates that our discovered dialog structure graph has higher-quality edges and can better facilitate coherent dialog generation.
Furthermore, we conduct an ablation study. Specifically, to evaluate the contribution of GNN, we remove GNN from DVAE-GNN, denoted as DVAE-GNN w/o GNN. We see that its performance drops sharply in terms of "S-U Appr." and "Sess.V.-Qual.". It demonstrates that GNN can better incorporate the structure information (complex relations among vertices) into session-level vertex representation learning. Moreover, to evaluate the contribution of phrases to utterance-level vertex representation, we remove phrases, denoted as DVAE-GNN w/o phrase. We see that its scores in terms of all the metrics drop sharply, especially the three human evaluation metrics. The reason is that it is difficult to learn high-quality utterance-level vertex representations from a large amount of fine-grained semantic content in open-domain dialogs without any prior information. The Kappa value is above 0.4, showing moderate agreement among annotators.
Two sample parts of the discovered dialog structure graph can be found in Appendix B.

Experiments for Graph Grounded Dialog Generation
To confirm the benefits of discovered dialog structure graph for coherent conversation generation, we conduct experiments on the graph discovered from Weibo corpus. All the systems (including baselines) are trained on Weibo corpus.

Models
We carefully select the following six baselines.

MMPMS It is the multi-mapping based neural open-domain conversational model with posterior mapping selection mechanism (Chen et al., 2019), which is a SOTA model on the Weibo Corpus.
MemGM It is the memory-augmented open-domain dialog model (Tian et al., 2019), which learns to cluster U-R pairs for response generation.
CVAE It is the Conditional Variational Auto-Encoder based neural open-domain conversational model (Zhao et al., 2017).
DVRNN-RL It discovers a dialog structure graph for task-oriented dialog modeling.
VHCR-EI It is a hierarchical latent variable based dialog model.
GCS It is our proposed dialog structure graph grounded dialog system with hierarchical RL.
GCS w/ UtterG It is a simplified version of GCS that just uses the utterance-level graph and utterance-level sub-policy.
GCS w/ Phrase Graph It is a simplified version of GCS that just uses the phrase graph and utterance-level sub-policy.
We use the same user simulator for RL training of DVRNN-RL, GCS and GCS w/ UtterG. Here, we use the original MMPMS as the user simulator because it achieves the best result on the Weibo Corpus. The user simulator is pre-trained on the dialog corpus and not updated during policy training. We use the original source code for all the baselines and the simulator. Further details about baselines and GCS can be found in Appendix A.2.
We conduct model-human dialogs for evaluation. Given a model, we first randomly select an utterance (the first utterance in a session) from test set for the model side to start the conversations with a human turker. Then the human is asked to converse with the selected model till 8 turns are reached. Finally, we obtain 50 model-human dialogs for multi-turn evaluation. Then we randomly sample 200 U-R pairs from the above dialogs for single-turn evaluation.

Evaluation Metrics
Since the proposed system does not aim at predicting the highest-probability response at each turn, but rather at the long-term success of a dialog (e.g., coherence), we do not employ BLEU (Papineni et al., 2002) or perplexity for evaluation. We use three multi-turn evaluation metrics and three single-turn metrics. For human evaluation, we invite three annotators to conduct evaluation on each case, and we ask them to provide 1/0 (Yes or No) scores for most of the metrics. Moreover, for multi-turn coherence, we first ask the annotators to manually segment a dialog by topics and then conduct evaluation on each session. A session refers to a dialog fragment about one topic. Notice that system identifiers are masked during human evaluation.
Multi-turn Metrics. We use the following metrics: (1) Multi-turn Coherence (Multi.T.-Coh.) It measures the coherence within a session. Common incoherence errors in a session include anaphora errors across utterances and information inconsistency. "0" means that there are more than two incoherence errors in a session. "1" means that there is only one error. "2" means that there are no errors. Finally, we compute the average score of all the sessions. (2) Dialog engagement (Enga.) This metric measures how interesting a dialog is. It is "1" if a dialog is interesting and the human is willing to continue the conversation, otherwise "0". (3) Length of high-quality dialog (Length) A high-quality dialog ends if the model tends to produce dull responses or two consecutive utterances are highly overlapping (Li et al., 2016b).

Experiment Results
As shown in Table 2, GCS significantly outperforms all the baselines in terms of all the metrics except "Length-of-dialog" (sign test, p-value < 0.01). It indicates that GCS can generate more coherent, informative and engaging dialogs. Specifically, our system's strategy of two sub-policies on the dialog structure graph enables more coherent dialog flow control than the hierarchical latent variable based VHCR-EI model, which performs the best among baselines, as indicated by "Multi.T.-Coh.". Moreover, our high-quality edges between utterance-level vertices (measured by the metric "U-U Appr." in Table 1) help GCS to achieve a higher single-turn coherence score than DVRNN-RL, as indicated by "Single.T.-Coh.". In addition, GCS, VHCR-EI, MMPMS and CVAE obtain better performance in terms of "Info.", indicating that latent variables can effectively improve response informativeness. The Kappa value is above 0.4, showing moderate agreement among annotators.
Ablation Study. In order to evaluate the contribution of session-level vertices, we run GCS with an utterance-level dialog structure graph, denoted as "GCS w/ UtterG". Results in Table 2 show that its performance in terms of "Multi.T.-Coh." and "Enga." drops sharply. It demonstrates the contribution of our hierarchical dialog structure graph to enhancing dialog coherence and dialog engagement. A possible reason for the inferior performance of "GCS w/ UtterG" is that the removal of session-level vertices harms the capability of selecting a coherent utterance-level vertex sequence.

Case Study of Conversation Generation
Figure 4 shows a sample dialog between our system "GCS" and a human. We see that our system can generate a coherent, engaging and informative multi-turn dialog. For an in-depth analysis, we manually segment the whole dialog into two sessions. It can be seen that the first session is about "meeting appointment", and it contains a reasonable dialog logic: I will have a holiday → I will arrive → wait for you at home → look forward to a big meal. The second session is about "joking between friends", and it also contains a reasonable logic: you are beautiful → flattering me → I am sorry.
Figure 4: A sample dialog between our dialog system GCS and a human, where "Bot" is our system and "User" is the human. This dialog contains three dialog topics. We translate the original Chinese texts into English.

Conclusion
In this paper, we conduct unsupervised discovery of discrete dialog structure from chitchat corpora. Further, we formalize the structure as a two-layer directed graph. To discover the dialog structure, we present an unsupervised model, DVAE-GNN, which integrates GNN into DVAE to model complex relations among dialog states for more effective dialog structure discovery. Experimental results demonstrate that DVAE-GNN can discover meaningful dialog structure, and that the use of dialog structure as background knowledge can significantly improve multi-turn dialog coherence.

A Implementation Details
A.1 Implementation Details about DVAE-GNN
For all models, we share the same vocabulary (maximum size is 50000) and initialize word embeddings (dimension is 200) with the pre-trained Tencent AI Lab Embedding. Meanwhile, we randomly initialize the embedding space of session-level vertices and the latent vectors for utterance-level vertices (dimensions are 200). The hidden sizes of all RNN encoders and RNN decoders are set as 512.
The three threshold variables for co-occurrence statistics, $\alpha_{uu}$, $\alpha_{su}$ and $\alpha_{ss}$, are all set as 0.05.
We use the PaddlePaddle framework for the development of our systems.
Notice that it is costly to calculate Equation 3 in Section 3.3 since the total number of utterance-level vertices, $N$, is very large (more than one million). In practice, for each utterance, we first retrieve the top-50 most related utterance-level vertices according to Okapi BM25 (Robertson and Zaragoza, 2009) similarity between the utterance and the associated phrases of all candidate vertices, and then calculate Equation 3 only with these retrieved vertices. Thus, only a part of the vectors in $\Lambda_x$ will be dynamically updated for each training sample when training DVAE-GNN.
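The candidate retrieval step can be illustrated with a small self-contained BM25 scorer. This is a sketch under stated assumptions, not the retrieval code used in the experiments: it uses a common BM25 variant whose IDF is log(1 + (n - df + 0.5) / (df + 0.5)), the parameter values `k1=1.5` and `b=0.75` are conventional defaults rather than values from the paper, inputs are assumed to be pre-tokenized and non-empty, and the function name `bm25_top_k` is hypothetical.

```python
import math
from collections import Counter


def bm25_top_k(query, docs, k=50, k1=1.5, b=0.75):
    """Return indices of the k documents (token lists) that best match the query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1.0 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scored.append((score, idx))
    # Highest score first; ties are broken by the original index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Each "document" is the tokenized phrase associated with one utterance-level vertex.
phrases = [["go", "travel"], ["want", "boyfriend"], ["eat", "noodles"]]
print(bm25_top_k(["want", "to", "travel"], phrases, k=2))
```

Only the returned candidates would then enter the softmax of Equation 3, so the cost per utterance depends on the retrieval size rather than on the total number of vertices.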
Hyper-parameter Setting for Training. In our experiments, all the models share the same vocabulary (maximum size is 50000 for both the Weibo corpus and the Douban corpus) and initialize word embeddings (dimension is 200) with the Tencent AI Lab Embedding. Moreover, a bidirectional one-layer GRU-RNN (hidden size is 512) is utilized for all the RNN encoders and RNN decoders. In addition, the dropout rate is 0.3, the batch size is 32 and the optimizer is Adam(lr=2le-3) for all models. During RL training, the discounting weight for rewards is set as 0.95. The MMPMS model for the user simulator employs 10 responding mechanisms. We utilize dependency parsing for phrase extraction. We pretrain the response generator on the Weibo Corpus.
Rewards and Training Procedure for the Graph Grounded Conversational System. We use the PaddlePaddle framework for the development of our systems. We hot-start the response generator by pre-training it before the training of the policy module. Meanwhile, to make the RL based training process more stable, we employ the A2C method (Sutton and Barto, 2018) for model optimization rather than the original policy gradient used in previous work (Li et al., 2016b). Moreover, during RL training, the parameters of the policy module are updated, and the parameters of the response generator and the representations of semantic vertices stay intact.
Figure 5 shows a part of the unified dialog structure graph that is discovered from the Weibo corpus. Each yellow-colored circle in this figure represents a session-level vertex with expert interpreted meanings based on the information of top words (from phrases of utterance-level vertices belonging to this session-level vertex) ranked by TF-IDF. Each green-colored rectangle represents an utterance-level vertex. The directed arrows between utterance-level vertices represent dialog transitions between states, and the utterance-level vertices within blue dotted lines are about the same session-level vertex (topic).
Figure 6: A part of the unified dialog structure graph that is extracted from the Weibo corpus. Here, we interpret session-level semantics based on their child utterance-level vertices. We translate the original Chinese texts into English.
go to Huangshan → comments about Huangshan. Furthermore, it also captures the major logic in dialogs about the topic "want a boyfriend": need a boyfriend → why? → he can accompany me to celebrate the festival. Moreover, it captures a dialog topic transition between the topic "go traveling" and another topic "want a boyfriend". Figure 6 shows another part of the unified dialog structure graph that is discovered from the Weibo corpus.

C GCS with RL
In the following, we elaborate on the details of GCS.

C.1 Dialog Context Understanding
Given a dialog context (the last two utterances), we first map it to the graph by recognizing the most related utterance-level vertex with the well-trained DVAE-GNN. Here, the recognized utterance-level vertex is denoted as the hit utterance-level vertex.
For policy learning, we build the current RL state $s_l$ at time step $l$ by collecting the dialog context (the last two utterances), the previously selected session-level vertex sequence, and the previously selected utterance-level vertex sequence. Here, we first utilize three independent RNN encoders to encode them respectively, and then concatenate the three obtained representation vectors to obtain the representation of the RL state, $e_{s_l}$.