Semantic Representation for Dialogue Modeling

Although neural models have achieved competitive results in dialogue systems, they show limited ability to represent core semantics, for example by ignoring important entities. To address this, we exploit Abstract Meaning Representation (AMR) to help dialogue modeling. Compared with the textual input, AMR explicitly provides core semantic knowledge and reduces data sparsity. We develop an algorithm to construct dialogue-level AMR graphs from sentence-level AMRs and explore two ways to incorporate AMRs into dialogue systems. Experimental results on both dialogue understanding and response generation tasks show the superiority of our model. To our knowledge, we are the first to leverage a formal semantic representation in neural dialogue modeling.


Introduction
Dialogue systems have received increasing research attention (Wen et al., 2015; Serban et al., 2017; Bao et al., 2020), with much recent work focusing on social chat (Ritter et al., 2011; Li et al., 2017) and task-oriented dialogue (Wen et al., 2017; Dinan et al., 2019). There are two salient subtasks in dialogue modeling, namely dialogue understanding (Choi et al., 2018; Reddy et al., 2019) and response generation (Li et al., 2017; Budzianowski et al., 2018). The former refers to understanding the semantic and discourse details of a dialogue history, while the latter concerns producing a fluent, novel and coherent utterance.
The current state-of-the-art methods employ neural networks and end-to-end training (Sutskever et al., 2014; Bahdanau et al., 2015) for dialogue modeling. For instance, sequence-to-sequence models have been used to encode a dialogue history before directly synthesizing the next utterance (Vinyals and Le, 2015; Wen et al., 2017; Bao et al., 2020). Despite giving strong empirical results, neural models can suffer from spurious feature associations in their neural semantic representation (Poliak et al., 2018; Kaushik et al., 2020), which can lead to weak robustness, inducing irrelevant dialogue states (Xu and Sarikaya, 2014; Sharma et al., 2019; Rastogi et al., 2019) and generating unfaithful or irrelevant text (Maynez et al., 2020; Niu and Bansal, 2020). As shown in Figure 1, the baseline Transformer model pays attention to the word "lamb" but ignores its surrounding context, which contains important content (marked with squares) that indicates its true meaning, and thereby gives an irrelevant response related to food. Intuitively, such issues can be alleviated by a structural representation of semantic information, which treats entities as nodes and builds structural relations between nodes, making it easy to find the most salient context. Explicit structures are also more interpretable than neural representations and have been shown useful for information extraction (Strubell et al., 2018; Sun et al., 2019; Bai et al., 2021; Sachan et al., 2021), summarization (Liu et al., 2015; Hardy and Vlachos, 2018; Liao et al., 2018) and machine translation (Marcheggiani et al., 2018; Song et al., 2019a).

Figure 1. Dialogue History: … SPEAKER-1: Recently, I've been obsessed with horror films. SPEAKER-2: Oh, how can you be infatuated with horror films? They're so scary. SPEAKER-1: Yeah, you are right. I used to not watch horror films, but after seeing Silence of the Lamb with Mike last month, I fell in love with them. SPEAKER-2: It's amazing. But if I were you, I wouldn't have the courage to watch the first one. SPEAKER-1: But it's really exciting.
Ground-Truth: Maybe, but I would rather watch romance, science fiction, crime or even disaster movie instead of a horror picture…
Transformer: Great. I'm looking forward to it. I just can't keep away from the food that I saw.
We explore AMR (Banarescu et al., 2013) as a semantic representation for dialogue histories in order to better represent conversations. As shown in the central block of Figure 2, AMR is a type of sentential semantic representation, which models a sentence as a rooted directed acyclic graph, highlighting its main concepts (e.g., "mistake") and semantic relations (e.g., "ARG0" 1), while abstracting away function words. It can thus potentially offer the core concepts and explicit structures needed for aggregating the main content of a dialogue. In addition, AMR can be useful for reducing the negative influence of variance in surface forms with the same meaning, which contributes to data sparsity.
Existing work on AMR parsing focuses on the sentence level. However, as the left block of Figure 2 shows, the semantic structure of a dialogue history can contain rich cross-utterance co-reference links (marked with squares) and multiple speaker interactions. To this end, we propose an algorithm to automatically derive dialogue-level AMRs from utterance-level AMRs by adding cross-utterance links that indicate speakers, identical mentions and co-reference links. One example is shown in the right block of Figure 2, where newly added edges are in color. We consider two main approaches to making use of such dialogue-level AMR structures. For the first method, we merge an AMR with the tokens in its corresponding sentence via AMR-to-text alignments, before encoding the resulting structure with a graph Transformer (Zhu et al., 2019). For the second method, we separately encode an AMR and its corresponding sentence, before leveraging both representations via feature fusion (Mangai et al., 2010) or dual attention (Calixto et al., 2017).
We verify the effectiveness of the proposed framework on a dialogue relation extraction task and a response generation task (Li et al., 2017). Experimental results show that the proposed framework outperforms previous methods (Vaswani et al., 2017; Bao et al., 2020), achieving new state-of-the-art results on both benchmarks. Deep analysis and human evaluation suggest that the semantic information introduced by AMR helps our model better understand long dialogues and improves the coherence of dialogue generation. A further advantage is that AMR enhances robustness and has the potential to improve the interpretability of neural models. To our knowledge, this is the first attempt to leverage the AMR semantic representation in neural networks for dialogue understanding and generation. Our code is available at https://github.com/muyeby/AMR-Dialogue.

1 Please refer to PropBank (Kingsbury and Palmer, 2002; Palmer et al., 2005) for more details.

Constructing Dialogue AMRs
Figure 2 illustrates our method for constructing a dialogue-level AMR graph from multiple utterance-level AMRs. Given a dialogue consisting of multiple utterances, we adopt a pretrained AMR parser (Cai and Lam, 2020) to obtain an AMR graph for each utterance. For utterances containing multiple sentences, we parse them into multiple AMR graphs and mark these graphs as belonging to the same utterance. We construct each dialogue AMR graph by making connections between utterance AMRs. In particular, we take three strategies, based on speaker, identical concept and co-reference information.
Speaker We add a dummy node and connect it to all root nodes of utterance AMRs. We add speaker tags (e.g., SPEAKER1 and SPEAKER2) to the edges to distinguish different speakers. The dummy node ensures that all utterance AMRs are connected so that information can be exchanged during graph encoding. Besides, it serves as the global root node to represent the whole dialogue.
Identical Concept There can be identical mentions in different utterances (e.g. "possible" in the first and the fourth utterances in Figure 2), resulting in repeated concept nodes in utterance AMRs. We connect nodes corresponding to the same non-pronoun concept with edges labeled SAME 2. This type of connection can further enhance cross-sentence information exchange.
Inter-sentence Co-reference A major challenge for dialogue understanding is posed by pronouns, which are frequent in conversations (Grosz et al., 1995; Newman et al., 2008; Quan et al., 2019). We conduct co-reference resolution on the dialogue text using an off-the-shelf model 3 in order to identify concept nodes in utterance AMRs that refer to the same entity. For example, in Figure 2, "I" in the first utterance and "sir" in the second utterance refer to the same entity, SPEAKER1. We add edges labeled COREF between them, pointing from later nodes to earlier nodes (later and earlier refer to the temporal order of the conversation), to indicate their relation 4.

Figure 2 (example utterances: S1: "I'm afraid there has been a mistake." S2: "What could it be?"). Caption: Step 1: parse raw-text utterances into utterance AMR graphs; Step 2: connect utterance AMR graphs into a dialogue AMR graph.
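As a concrete illustration, the three connection strategies can be sketched as a small graph-merging routine. The dictionary-based graph format, function name, and pronoun list below are illustrative assumptions, not the authors' actual implementation:

```python
# Sketch of merging utterance-level AMR graphs into one dialogue-level graph.
# Each utterance AMR is a dict: {"root": node_id, "nodes": {id: concept},
# "edges": [(src, label, tgt)]}; node ids are assumed globally unique.

def build_dialogue_amr(utterance_amrs, speakers, coref_pairs):
    """speakers[i] is the speaker tag of utterance i (e.g. "SPEAKER1");
    coref_pairs lists (later_node, earlier_node) pairs from a coreference resolver."""
    nodes = {"dummy": "dialogue-root"}
    edges = []
    # 1) Speaker: connect a dummy global root to every utterance root,
    #    labeling each edge with the utterance's speaker tag.
    for amr, spk in zip(utterance_amrs, speakers):
        nodes.update(amr["nodes"])
        edges.extend(amr["edges"])
        edges.append(("dummy", spk, amr["root"]))
    # 2) Identical concept: link repeated non-pronoun concepts with SAME edges.
    seen = {}
    pronouns = {"i", "you", "he", "she", "it", "we", "they"}
    for nid, concept in nodes.items():
        if nid == "dummy" or concept in pronouns:
            continue
        if concept in seen:
            edges.append((nid, "SAME", seen[concept]))
        else:
            seen[concept] = nid
    # 3) Co-reference: a later mention points to its earlier mention.
    for later, earlier in coref_pairs:
        edges.append((later, "COREF", earlier))
    return {"nodes": nodes, "edges": edges}
```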

Baseline System
We adopt a standard Transformer (Vaswani et al., 2017) for dialogue history encoding. A Transformer encoder consists of L layers, taking a sequence of tokens (i.e., a dialogue history) S = {w_1, w_2, ..., w_N} as input, where w_i is the i-th token and N is the sequence length, and iteratively produces vectorized word representations {h^l_1, h^l_2, ..., h^l_N} for l ∈ [1, ..., L]. Overall, a Transformer encoder can be written as:

H = Transformer(emb(S)),    (1)

where H = {h^L_1, h^L_2, ..., h^L_N} and emb denotes a function that maps a sequence of tokens into the corresponding embeddings. Each Transformer layer consists of two sub-layers: a self-attention sub-layer and a position-wise feed-forward network. The former calculates a set of attention scores:

α_ij = softmax((h_i W^Q)(h_j W^K)^T / √d),    (2)

which are used to update the hidden state of w_i:

ĥ_i = Σ_j α_ij (h_j W^V),    (3)

where W^Q, W^K and W^V are parameter matrices and d is the hidden state size. The position-wise feed-forward (FFN) sub-layer consists of two linear transformations:

FFN(ĥ_i) = max(0, ĥ_i W_1 + b_1) W_2 + b_2,    (4)

where W_1, W_2, b_1, b_2 are model parameters.
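For illustration, the two sub-layers can be sketched in NumPy (single attention head, omitting the residual connections and layer normalization that the full model includes):

```python
import numpy as np

def self_attention(H, WQ, WK, WV):
    """Single-head self-attention: alpha_ij = softmax_j((h_i WQ)(h_j WK)^T / sqrt(d))."""
    d = WQ.shape[1]
    scores = (H @ WQ) @ (H @ WK).T / np.sqrt(d)         # (N, N) attention logits
    alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)          # row-wise softmax
    return alpha @ (H @ WV)                             # updated hidden states

def ffn(H, W1, b1, W2, b2):
    """Position-wise feed-forward: FFN(h) = max(0, h W1 + b1) W2 + b2."""
    return np.maximum(0, H @ W1 + b1) @ W2 + b2
```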

Dialogue Understanding Task
We take the dialogue relation extraction task as an example. Given a dialogue history S and an argument (or entity) pair (a_1, a_2), the goal is to predict the corresponding relation type r ∈ R between a_1 and a_2. We follow a previous dialogue relation extraction model (Chen et al., 2020) and feed the hidden states of a_1 and a_2 (denoted h_a1, h_a2) into a classifier to obtain the probability of each relation type:

P_rel = softmax([h_a1; h_a2] W_3 + b_3),    (5)

where W_3 and b_3 are model parameters. The k-th value of P_rel is the conditional probability of the k-th relation in R. Given a training instance ⟨S, a_1, a_2, r⟩, the local loss is:

L_rel(θ) = −log P_rel[r],    (6)

where θ denotes the set of model parameters. In practice, we use BERT (Devlin et al., 2019) for calculating h_a1 and h_a2, which can be regarded as a pre-trained initialization of the Transformer encoder.
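A minimal NumPy sketch of the classifier and its loss; concatenating h_a1 and h_a2 before the linear layer is an assumption about the exact classifier input:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relation_probs(h_a1, h_a2, W3, b3):
    """P_rel = softmax([h_a1; h_a2] W3 + b3) over the relation set R."""
    return softmax(np.concatenate([h_a1, h_a2]) @ W3 + b3)

def relation_loss(h_a1, h_a2, W3, b3, r):
    """Negative log-likelihood of the gold relation index r."""
    return -np.log(relation_probs(h_a1, h_a2, W3, b3)[r])
```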

Dialogue Response Generation Task
Given a dialogue history S, we use a standard autoregressive Transformer decoder (Vaswani et al., 2017) to generate a response Y = {y_1, y_2, ..., y_|Y|}. At time step t, the previous output word y_{t−1} is first transformed into a hidden state s_t by a self-attention layer, as in Equations 2 and 3. Then an encoder-decoder attention mechanism is applied to obtain a context vector from the encoder output hidden states H:

c_t = Attn(s_t, H, H),    (7)

The obtained context vector c_t is then used to calculate the output probability distribution of the next word y_t over the target vocabulary 5:

P_voc = softmax([s_t; c_t] W_4 + b_4),    (8)

where W_4, b_4 are trainable model parameters. The k-th value of P_voc is the conditional probability of the k-th word in the vocabulary given the dialogue history. Given a dialogue history-response pair {S, Y}, the model minimizes a cross-entropy loss:

L_gen(θ) = −Σ_{t=1}^{|Y|} log P_voc[y_t],    (9)

where θ denotes all model parameters.
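One decoding step can be sketched as follows; the unscaled single-head attention and the concatenation [s_t; c_t] before the output projection are simplifying assumptions:

```python
import numpy as np

def decode_step(s_t, H_enc, W4, b4):
    """One step of encoder-decoder attention plus output projection:
    attend over encoder states H_enc with query s_t to get context c_t,
    then P_voc = softmax([s_t; c_t] W4 + b4)."""
    scores = H_enc @ s_t                                  # (N,) attention logits
    a = np.exp(scores - scores.max()); a /= a.sum()       # attention weights
    c_t = a @ H_enc                                       # context vector
    logits = np.concatenate([s_t, c_t]) @ W4 + b4
    p = np.exp(logits - logits.max()); p /= p.sum()
    return p                                              # distribution over vocab
```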

Proposed Model
Our model takes a dialogue history S and the corresponding dialogue AMR as input. Formally, an AMR is a directed acyclic graph G = ⟨V, E⟩, where V denotes the set of nodes (i.e., AMR concepts) and E denotes the set of labeled edges (i.e., AMR relations). An edge can be further represented by a triple ⟨n_i, r_ij, n_j⟩, meaning that the edge goes from node n_i to node n_j with label r_ij.
We consider two main ways of making use of dialogue-level AMRs. The first method (Figure 3(a)) uses AMR semantic relations to enrich a textual representation of the dialogue history. We project AMR nodes onto the corresponding tokens, extending Transformer by encoding semantic relations between words. For the second approach, we separately encode an AMR and its sentence, and use either feature fusion (Figure 3(b)) or dual attention (Figure 3(c)) to incorporate their embeddings.

Graph Encoding
We adopt a Graph Transformer (Zhu et al., 2019) to encode an AMR graph, which extends the standard Transformer (Vaswani et al., 2017) for modeling structural input. An L-layer graph Transformer takes a set of node embeddings {n_1, n_2, ..., n_M} and a set of edge embeddings {r_ij | i ∈ [1, ..., M], j ∈ [1, ..., M]} as input 6 and produces increasingly abstract node representations layer by layer. The key difference between a graph Transformer and a standard Transformer is the graph attention layer. Compared with the self-attention layer (Equation 2), the graph attention layer explicitly considers graph edges when updating node hidden states. For example, given an edge ⟨n_i, r_ij, n_j⟩, the attention score α_ij is calculated as:

α_ij = softmax((h_i W^Q)(h_j W^K + r_ij W^R)^T / √d),    (10)

where W^R is a transformation matrix, r_ij is the embedding of relation r_ij, and d is the hidden state size. The hidden state of n_i is then updated as:

ĥ_i = Σ_j α_ij (h_j W^V + r_ij W^R),    (11)

where W^V is a parameter matrix. Overall, given an input AMR graph G = ⟨V, E⟩, the graph Transformer encoder can be written as:

H^G = GraphTransformer(emb(V), emb(E)),    (12)

where H^G = {h^L_1, h^L_2, ..., h^L_M} denotes the top-layer graph encoder hidden states.
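The relation-aware attention can be sketched as below (single head, loop-based for clarity); the exact multi-head parameterization of Zhu et al. (2019) is simplified here:

```python
import numpy as np

def graph_attention(N_h, R, WQ, WK, WV, WR):
    """Relation-aware attention over node states.
    N_h: (M, d) node hidden states; R: (M, M, dr) relation embeddings.
    The relation embedding r_ij enters both the keys and the values."""
    M, d = N_h.shape[0], WQ.shape[1]
    q = N_h @ WQ                                   # (M, d) queries
    out = np.zeros_like(N_h @ WV)
    for i in range(M):
        keys = N_h @ WK + R[i] @ WR                # (M, d) relation-aware keys
        scores = keys @ q[i] / np.sqrt(d)
        a = np.exp(scores - scores.max()); a /= a.sum()
        vals = N_h @ WV + R[i] @ WR                # relation-aware values
        out[i] = a @ vals                          # updated state of node i
    return out
```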

Enriching Text Representation
We first use the JAMR aligner (Flanigan et al., 2014) to obtain a node-to-word alignment A, then use the alignment to project the AMR edges onto the text with the following rule: each AMR edge ⟨n_i, r_ij, n_j⟩ ∈ E induces a word-level edge ⟨w_a, r_ij, w_b⟩ for every w_a ∈ A(n_i) and w_b ∈ A(n_j), where A is a one-to-K alignment (K ∈ [0, ..., N]). In this way, we obtain a projected graph G′ = ⟨V′, E′⟩, where V′ represents the set of input words {w_1, w_2, ..., w_N} and E′ denotes a set of word-to-word semantic relations.
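The projection rule amounts to expanding each AMR edge into word-to-word edges through the alignment; a sketch, with the alignment represented as a node-to-word-indices mapping:

```python
def project_edges(amr_edges, alignment):
    """amr_edges: [(src_node, label, tgt_node)]; alignment: {node: [word indices]}.
    Each AMR edge (ni, rij, nj) induces an edge (wa, rij, wb) for every
    aligned word pair; nodes with no aligned word (K = 0) are dropped."""
    word_edges = set()
    for ni, rel, nj in amr_edges:
        for wa in alignment.get(ni, []):
            for wb in alignment.get(nj, []):
                word_edges.add((wa, rel, wb))
    return word_edges
```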
Inspired by previous work on AMR graph modeling (Guo et al., 2019; Song et al., 2019b; Sun et al., 2019), we adopt a hierarchical encoder that stacks a graph encoder on top of a sequence encoder. The sequence encoder (SeqEncoder) transforms a dialogue history into a set of hidden states:

H^S = SeqEncoder(emb(S)),    (13)

and the graph encoder incorporates the projected relation features into H^S:

H^G = GraphEncoder(H^S, emb(E′)),    (14)

In addition, we add a residual connection between the graph encoder and the sequence encoder to fuse word representations before and after refinement (as shown in Figure 3(a)):

H = LayerNorm(H^S + H^G),    (15)

where LayerNorm denotes layer normalization (Ba et al., 2016). We name this hierarchical encoder Hier; it can be used for both dialogue understanding and dialogue response generation.

Leveraging both Text and Structure Cues
We consider integrating both text cues and AMR structure cues for dialogue understanding and response generation, using a dual-encoder network. First, a sequence encoder is used to transform a dialogue history S into a text memory H^S, and a graph Transformer encoder is used to transform the dialogue AMR into a graph memory H^G (Equation 12). For dialogue understanding (Figure 3(b)) and dialogue response generation (Figure 3(c)), slightly different methods of feature integration are used due to the different nature of their outputs.

Dialogue Understanding. Similar to Section 4.2, we first use the JAMR aligner to obtain a node-to-word alignment A. Then we fuse the word and AMR node representations: each word representation h^S_i is combined by a fusion function f with the representation h^G_j of its aligned AMR node, or with h_∅, the vector representation of the dummy node (see Figure 2), when w_i is not aligned to any node. The fused word representations are then fed into a classifier for relation prediction (Equation 5).

Dialogue Response Generation. We replace the standard encoder-decoder attention (Equation 7) with a dual-attention mechanism (Song et al., 2019a). In particular, given a decoder hidden state s_t at time step t, the dual-attention mechanism calculates a text context vector c^S_t and a graph context vector c^G_t simultaneously:

c^S_t = Attn(s_t, H^S, H^S),  c^G_t = Attn(s_t, H^G, H^G),

and the two context vectors are then merged:

c_t = [c^S_t; c^G_t] W_c + b_c,

where W_c and b_c are model parameters.
We name the dual-encoder model as Dual.
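The dual-attention step can be sketched as follows; the unscaled dot-product attention and the concatenation-then-linear fusion are simplifying assumptions:

```python
import numpy as np

def attend(s, M):
    """Dot-product attention of query s over memory M (rows are vectors)."""
    scores = M @ s
    a = np.exp(scores - scores.max()); a /= a.sum()
    return a @ M

def dual_attention(s_t, H_text, H_graph, Wc, bc):
    """Compute the text context c^S_t and graph context c^G_t from the two
    memories, then fuse them: c_t = [c^S_t; c^G_t] Wc + bc."""
    c_s = attend(s_t, H_text)
    c_g = attend(s_t, H_graph)
    return np.concatenate([c_s, c_g]) @ Wc + bc
```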

Dialogue Understanding Experiments
We evaluate our model on DialogRE, which contains 1,788 dialogues, 10,168 relational triples and 36 relation types in total. On average, a dialogue in DialogRE contains 4.5 relational triples and 12.9 turns. We report experimental results on both the original (v1) and updated (v2) English versions. 7

Settings
We adopt the same input format and hyperparameter settings as previous work on DialogRE, where the hidden state of a special classification token is fed into a classifier for prediction; our baseline (BERT_c) additionally takes the hidden states of a_1 and a_2. All hyperparameters are selected by prediction accuracy on the validation dataset (see Table 6 for detailed hyperparameters).
Metrics Following previous work on DialogRE, we report the macro F1 score on relations in both the standard (F1) and conversational (F1_c) settings. F1_c is computed over the first few turns of a dialogue in which the two arguments are first mentioned.

7 https://dataset.org/dialogre/

Main Results
We compare our models with previous methods, including the systems of Nan et al. (2020) and DHGAT (Chen et al., 2020); BERT_c denotes our baseline, while Hier and Dual denote the proposed models. By incorporating speaker information, BERT_s gives the best performance among the previous systems. Our BERT_c baseline outperforms BERT_s by a large margin, as BERT_c additionally considers argument representations for classification. Hier significantly (p < 0.01) 8 outperforms BERT_c in all settings, with 1.4 points of improvement in F1 score on average; a similar trend is observed for F1_c. This shows that the semantic information in AMR is beneficial to dialogue relation extraction, since AMR highlights core entities and the semantic relations between them. Dual obtains slightly better results than Hier, which shows the benefit of separately encoding a semantic structure.

Finally, the standard deviations of both Dual and Hier are lower than those of the baselines, indicating that our approaches are more robust with respect to model initialization.

Impact of Argument Distance
We split the dialogues of the DialogRE (v2) devset into five groups by the utterance-based distance between the two arguments. As shown in Figure 4, Dual gives better results than BERT_c except when the argument distance is less than 5. In particular, Dual surpasses BERT_c by a large margin when the argument distance is greater than 20. The comparison indicates that AMR can help a model better handle long-term dependencies by improving entity recall. In addition to utterance distance, we also consider word distance and observe a similar trend (as shown in Figure 7 in the Appendix).

Figure 5 shows a conversation between a manager and an employee who might have taken a leave. The baseline model incorrectly predicts that the relation between the two interlocutors is parent and child, possibly influenced by the last sentence in the conversation, assuming that it is a dialogue between family members. However, the proposed model successfully predicts the interlocutors' relation, suggesting that it extracts global semantic information from the dialogue from a comprehensive perspective.

Response Generation Experiments
We conduct experiments on the DailyDialog benchmark (Li et al., 2017), which contains 13,119 multi-turn daily conversations. On average, each dialogue has 7.9 turns and each utterance contains 14.6 tokens.

Settings
We take Transformer as a baseline. Our hyperparameters are selected by word prediction accuracy on the validation dataset; the detailed hyperparameters are given in the Appendix (see Table 6).

Metrics We set the decoding beam size to 5 and adopt BLEU-1/2/3/4 (Papineni et al., 2002) and Distinct-1/2 (Li et al., 2016) as automatic evaluation metrics. The former measures the n-gram overlap between the generated response and the target response, while the latter assesses generation diversity, defined as the number of distinct uni- or bi-grams divided by the total number of generated words. In addition, we conduct human evaluation. Following Bao et al. (2020), we ask annotators who study linguistics to evaluate model outputs on four aspects: fluency, coherence, informativeness and overall performance. Scores are on a scale of {0, 1, 2}; higher is better.

Figure 5. Dialogue: SPEAKER-1: A new place for a new Ross. I'm gonna have you and all the guys from work over once it's y'know, furnished. SPEAKER-2: I must say it's nice to see you back on your feet. SPEAKER-1: Well I am that. And that whole rage thing is definitely behind me. SPEAKER-2: I wonder if its time for you to rejoin our team at the museum? SPEAKER-1: Oh Donald that-that would be great. I am totally ready to come back to work. I…What? No! Wh-What are you doing?!! GET OFF MY SISTER!!!!!!!!!!!!! Ground-Truth: per:boss(S1, S2); Baseline: per:parent(S1, S2); Ours: per:boss(S1, S2)

Table 2 reports the performance of the previous state-of-the-art methods and the proposed models on the DailyDialog testset. Among the previous methods, PLATO and PLATO w/o L are both Transformer models pre-trained on large-scale conversational data (8.3 million samples) and fine-tuned on DailyDialog; they report the best performance among the previous systems. For completeness, we also report other systems, including Seq2Seq (Vinyals and Le, 2015) and iVAE_MI (Fang et al., 2019).
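The Distinct-n metric can be computed in a few lines; this corpus-level variant over pre-tokenized responses is an illustrative sketch, not the exact evaluation script used in the experiments:

```python
def distinct_n(responses, n):
    """Distinct-n: number of distinct n-grams divided by the total number of
    generated n-grams, computed corpus-level over tokenized responses."""
    ngrams = []
    for tokens in responses:
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```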
Our Transformer baseline is highly competitive in terms of BLEU and Distinct scores. Compared with the Transformer baseline, both Dual and Hier show better BLEU and Distinct numbers, and the gains of both models are significant (p < 0.01). This indicates that the semantic information in AMR graphs is useful for dialogue response generation. In particular, the gains come from better recall of the important entities and their relations in a dialogue history, which can lead to generating more detailed responses.

Human Evaluation Results
We conduct human evaluation on 50 randomly selected dialogues and the corresponding responses generated by the baseline and our models. As shown in Table 3, the Transformer baseline gives the lowest scores, while Dual obtains the highest scores on all aspects. Our main advantage is in Coherence, meaning that AMRs are effective at recalling important concepts and relations; as a result, our models find it easier to generate coherent replies. Examples are shown in Figure 8 in the Appendix. Comparatively, all systems achieve high Fluency scores, suggesting that this aspect is not the current bottleneck for response generation.

Analysis
This section analyzes the effects of graph features, dialogue length and model robustness. We use the Dual model for the experiments since it gives slightly better results than Hier.

Ablation on AMR graph
Table 4 shows the results of our best-performing models on the two datasets under different configurations of the dialogue AMR graphs. We report the average F1 score for DialogRE and the BLEU-1/Distinct-1 scores for DailyDialog. First, using utterance-level AMRs improves the text baseline by 1.2 F1 points and 1.5 BLEU-1 points, respectively. This indicates that the semantic knowledge in formal AMR is helpful for dialogue modeling. Second, our manually added relations (Section 2) also lead to improvements, ranging from 0.5 to 1.0 BLEU-1 points. The speaker relation is the most important for dialogue relation extraction; a possible reason is that the DialogRE dataset mainly focuses on person entities. Co-reference relations help the most in dialogue response generation, while the identical concept relations give the least improvement among the three relation types. Finally, combining all relations to build the dialogue AMR graph achieves the best performance on both datasets.

Impact of Dialogue Length
We group the devsets of DialogRE (v2) and DailyDialog into five groups according to the number of utterances in a dialogue. Figure 6 summarizes the performance of the baseline and the proposed model on the dialogue understanding (DU) and response generation (RG) tasks. In dialogue understanding, our model gives slightly better F1 scores than the baseline when a dialogue has fewer than 12 utterances, and the improvement becomes more significant for longer dialogues. This confirms our motivation that AMR can help models understand long dialogues. In dialogue response generation, our model consistently outperforms the Transformer baseline by a large margin across all groups.

Robustness Against Input
Recent studies show that neural dialogue models lack robustness (Shalyminov and Lee, 2018; Einolghozati et al., 2019). We select 100 instances from the DialogRE (v2) testset on which both the baseline and our model give correct predictions, and then manually paraphrase the source dialogues (see Appendix B.3 for the paraphrasing guidelines). Results on the paraphrased dataset are given in Table 5. The performance of the baseline model drops from 100 to 94.5, whereas our model reaches 98.5, 4 points higher than the baseline. This confirms our assumption that AMR can reduce data sparsity and thus improve the robustness of neural models.

Related Work
Semantic Parsing for Dialogue Some previous work builds domain-specific semantic schemas for task-oriented dialogue. For example, in the PEGASUS system (Zue et al., 1994), a sentence is first transformed into a semantic frame and then used for travel planning. Wirsching et al. (2012) use semantic features to help a dialogue system perform certain database operations. Gupta et al. (2018) represent task-oriented conversations as semantic trees in which intents and slots are tree nodes, and solve intent classification and slot filling via semantic parsing. Cheng et al. (2020) design a rooted semantic graph that integrates domains, verbs, operators and slots in order to perform dialogue state tracking. All these structures are designed for specific tasks only. In contrast, we investigate a general semantic representation for modeling everyday conversations.

Constructing AMRs beyond Sentence Level
There have been a few attempts to construct AMRs beyond the sentence level. Liu et al. (2015) construct document-level AMRs by merging identical concepts of sentence-level AMRs for abstractive summarization, and Liao et al. (2018) further extend this approach to multi-document summarization. O'Gorman et al. (2018) manually annotate co-reference information across sentence AMRs. We focus on creating conversation-level AMRs to facilitate information exchange more effectively for dialogue modeling. Bonial et al. (2020) adapt AMRs to dialogues by enriching the standard AMR schema with dialogue acts, tense and aspect, and construct a dataset of 340 dialogue AMRs. However, they propose theoretical changes to the schema for annotating AMRs, whereas we explore empirical solutions that leverage existing AMRs of the standard schema for dialogues.

AMR Parsing and Encoding
Our work is also related to AMR parsing (Flanigan et al., 2014;Konstas et al., 2017a;Lyu and Titov, 2018;Guo and Lu, 2018;Cai and Lam, 2020) and AMR encoding (Konstas et al., 2017b;Song et al., 2018;Zhu et al., 2019;Zhao et al., 2020;Bai et al., 2020). The former task makes it possible to use automatically-generated AMRs for downstream applications, while the latter helps to effectively exploit structural information in AMRs. In this work, we investigate AMRs for dialogue representation and combine AMRs with text for dialogue modeling.

Conclusion
We investigated the feasibility of using AMRs for dialogue modeling, describing an algorithm to construct dialogue-level AMRs automatically and exploring two ways to incorporate AMRs into neural dialogue systems. Experiments on two benchmarks show the advantages of using AMR semantic representations for both dialogue understanding and dialogue response generation.

A Experimental Settings
Table 6 lists all model hyperparameters used in the experiments. In particular, we share the word vocabulary of the encoder and decoder for response generation. We implement our baselines and the proposed models in PyTorch. The preprocessed data and source code will be released at https://github.com/muyeby/AMR-Dialogue.

B.1 Impact of Argument Distance
In addition to the utterance distance used in Figure 4, we also consider word-based distance as a metric for measuring argument distance. Figure 7 shows the F1 scores of the baseline and our model on five groups of test instances. Our model gives better results than the baseline system for all distances longer than 30. In particular, our model surpasses the baseline by 8 points when the argument distance is longer than 120.
Figure 8. Dialogue History: … SPEAKER-1: We have new room rates, sir. Will that be acceptable to you? SPEAKER-2: Well, it depends on the price, of course. What is it? SPEAKER-2: It's $308 a night. SPEAKER-1: I have no problem with that. SPEAKER-2: Great! Would you prefer smoking or nonsmoking? SPEAKER-1: Definitely nonsmoking. I can't handle that smell.
Ground-Truth: Now, is a queen-size bed okay?
Transformer: I'm sorry, sir. I'll be fine.
Ours: That'll be nonsmoking. Now, do you prefer a single queen-size bed?

B.2 Case Study for Dialogue Response Generation
Figure 8 presents a conversation between hotel staff and a guest who wants to book a room, along with its ground-truth response and the model-generated responses. The Transformer's output is generic and inconsistent with the dialogue history, while the proposed models' outputs capture the core information "room" from the history and are more relevant to the topic. Moreover, the output of the proposed model is semantically similar to the ground-truth response while using novel words, indicating that the model not only captures the surface dependency between input and output sentences but also learns deeper semantic information from the dialogue history.

B.3 Paraphrasing Guidelines
We ask the annotators to paraphrase the dialogues following three guidelines:
• do not change the original meaning;
• paraphrase the sentences using different lexical choices and syntactic structures;
• paraphrase the dialogue as much as possible.
We also ask a judge to evaluate whether the paraphrased dialogues (sentences) convey the same meaning as the original ones.