Intention Reasoning Network for Multi-Domain End-to-end Task-Oriented Dialogue

Recent years have witnessed remarkable success in end-to-end task-oriented dialogue systems, especially when incorporating external knowledge information. However, the quality of the responses generated by most existing models is still limited, mainly due to their lack of fine-grained reasoning over deterministic knowledge (w.r.t. conceptual tokens), which makes it difficult for them to capture concept shifts and identify the user's real intention in cross-task scenarios. To address these issues, we propose a novel intention mechanism to better model deterministic entity knowledge. Based on this mechanism, we further propose an Intention Reasoning Network (IR-Net), which consists of joint and multi-hop reasoning, to obtain intention-aware representations of conceptual tokens. These representations capture the concept shifts involved in task-oriented conversations, so as to effectively identify the user's intention and generate more accurate responses. Experimental results verify the effectiveness of IR-Net, showing that it achieves state-of-the-art performance on two representative multi-domain dialogue datasets.


Introduction
Task-oriented dialogue systems are designed to help users achieve specific goals, such as schedule arrangement or weather inquiry, via natural language. Compared with traditional pipeline dialogue systems (Young et al., 2013), which consist of multiple modules that each require substantial human effort to design, end-to-end approaches (Gülçehre et al., 2016; Wen et al., 2017; Zhao et al., 2017; Quan et al., 2019; Moon et al., 2019; Jung et al., 2020; Dai et al., 2020), which directly output system responses given plain text as input, have recently gained much attention. In recent years, sequence-to-sequence (Seq2Seq) models have dominated the study of end-to-end task-oriented dialogue systems, and many memory-augmented Seq2Seq models have been proposed (Bordes et al., 2017; Wu et al., 2019; Qin et al., 2019; Reddy et al., 2019; Wang et al., 2020). These models exploit both the dialogue history and a domain-specific knowledge base (KB) to incorporate KB information and perform knowledge-based reasoning for better performance.

[Figure 1: Example of a task-oriented dialogue including concept shifts from the SMD dataset. Solid arrows indicate existing relationships between entities, and dotted arrows indicate latent entity relationships (captured by IR-Net). The colored dotted boxes in the dialogue indicate generated entities.]
Though achieving remarkable progress, existing memory-augmented Seq2Seq models still suffer from two limitations. First, prior models rely heavily on the soft attention mechanism (Vaswani et al., 2017) to generate responses, adopting a weighted sum over the embeddings of memory triples (from both the dialogue history and the external KB) as the output representation. Since the representation acquired in this way is scattered by the context, it is difficult to model deterministic knowledge w.r.t. specific conceptual tokens. Take the dialogue in Figure 1 as an example: when answering the user's query about today's (Monday's) weather, the response generated by existing Seq2Seq models may be ambiguous or even incorrect due to the impact of contextual triples such as (Tuesday, weather, sunny) and (Wednesday, weather, cloudy). Second, the soft attention mechanism is inherently unsuitable for fine-grained (token-level) multi-hop reasoning, which makes it hard to capture the user's real intention and generate accurate responses, especially in complex cross-task scenarios where concept shifts may occur. For example, in Figure 1, when the system is asked "please give me the specific address for the dinner", it is expected to explore the pivot "john's_home" that connects the start token "dinner" (in the Schedule domain) with the target token "550_Alester_Ave" (in the Navigate domain), and finally return the answer "550_Alester_Ave". Existing attention-based models generally fail to perform such token-level multi-hop reasoning, which hampers them from producing accurate responses.
To address the aforementioned limitations, we propose a novel Intention Reasoning Network (IR-Net), a memory-augmented Seq2Seq model equipped with an intention reasoning module that obtains an intention-aware representation, with the goal of generating more accurate responses. Specifically, to address the first limitation, we propose a novel intention mechanism (Sec. 2.3.1), which directly incorporates the tail-token of a knowledge triple, weighted by the similarity between the query vector and the triple's head-token, to model deterministic knowledge. Based on the intention mechanism, we further address the second limitation by proposing an intention reasoning module that consists of token-level joint reasoning and multi-hop reasoning (Sec. 2.3.2), which capture specific target information in breadth and depth respectively to generate intention-aware representations, so as to improve the completeness and accuracy of the generated responses.
We conduct experiments on two publicly available multi-domain datasets, namely SMD and Multi-WOZ 2.1 (Budzianowski et al., 2018). The experimental results show that IR-Net consistently outperforms the current state-of-the-art models in both automatic and human evaluation. To the best of our knowledge, we are the first to effectively explore fine-grained token-level intention reasoning in multi-domain end-to-end task-oriented dialogue.

Model Description
Our proposed model is based on a Seq2Seq dialogue generation model (Sec. 2.1), which encodes the dialogue history X and the knowledge base B and ultimately produces a response sequence Y. An external memory module M = [X; B] is set up for knowledge queries (Sec. 2.2). Moreover, to capture potential concept shifts and the user's intention, and to generate a fluent and intention-aware response, we propose an intention reasoning module based on a novel intention mechanism (Sec. 2.3). The workflow of our proposed model is depicted in Figure 2.

Seq2Seq Dialogue Generation
We define the Seq2Seq dialogue generation task as generating the most likely response sequence Y = {y_1, y_2, ..., y_n}, given the multi-round dialogue history X and the knowledge base B as input. The probability of a response can be formally defined as

P(Y | X, B) = ∏_{t=1}^{n} p(y_t | y_1, ..., y_{t-1}, X, B),

where y_t represents the current output token. Different from the vanilla Seq2Seq dialogue generation model, we use p_C(y_t) to denote the probability that the generated token y_t is a conceptual token within M, and p_G(y_t) to denote the probability that y_t is a general token. Finally, we choose the higher of the two probabilities to generate the token y_t at time step t.

Contextual Dialog History Encoder
To handle long dialogue texts, we encode the dialogue history round by round. We first encode each sentence pair (Q_p, Y_p) ∈ X as a semantic representation, where Q_p and Y_p respectively denote the p-th round question sequence (with m tokens) and response sequence (with n tokens). To better encode the contextual information of the dialogue, we feed (Q_p, Y_p) into the pre-trained language representation model BERT (Devlin et al., 2019) to obtain the representation for the p-th round dialogue sequence, where H^p_{1:m} denotes the representation of the question sequence and H^p_{m+1:m+n} that of the response sequence. Afterward, we feed H^p_{1:m+n} into a Bidirectional Long Short-Term Memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997) to produce contextual hidden states h_enc = (h_{enc,1}, h_{enc,2}, ..., h_{enc,m+n}). Note that the first hidden state of each round is initialized with the last hidden state of the previous round, i.e., h^p_{enc,0} = h^{p-1}_{enc,m+n} (the superscript p is omitted in the following text when no confusion arises).

Hierarchical Response Decoder
We exploit a hierarchical mechanism to decode the response sequence. Specifically, when decoding y_t, we use a coarse-grained LSTM decoder and a fine-grained LSTM decoder to compute the probabilities simultaneously.
We first describe the coarse-grained decoder. Given (h_{enc,1}, h_{enc,2}, ..., h_{enc,m+n}), an LSTM repeatedly predicts outputs (y_1, y_2, ..., y_{t-1}) through the decoder hidden states (h_{dec,1}, h_{dec,2}, ..., h_{dec,t}). To generate y_t, we first calculate an attentive representation h̃_{dec,t} of the dialogue history over the encoder hidden states h_enc, and then concatenate it with h_{dec,t} to obtain the context-aware output representation, where o_{C,t} is the score (logit) for next-token generation and W_1 is a trainable parameter. The probability p_G(y_t) of the next word y_t being a general token is then computed from o_{C,t}.

Next, we describe the fine-grained decoder. In addition to incorporating h̃_{dec,t} to ensure the relevance between the generated response and the question, we further derive an intention-aware representation I_{dec,t} (detailed in Sec. 2.3) to enhance the representation of the target entity and generate more accurate responses. By concatenating h_{dec,t} with h̃_{dec,t} and I_{dec,t}, we obtain the output representation, from which the probability p_C(y_t) of y_t being a conceptual token is calculated. Finally, we take the higher of the two probabilities to emit y_t.
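The coarse/fine probability combination described above can be sketched as follows. This is a simplified illustration, not the paper's exact parameterization: we replace the two LSTM decoders and the intention-aware representation with plain logit vectors, and keep only the final "pick the more confident head" step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def select_token(coarse_logits, fine_logits, general_vocab, concept_vocab):
    """Pick the next token from whichever decoder head is more confident.

    coarse_logits scores general vocabulary words; fine_logits scores
    conceptual tokens from the memory M. Names and shapes are illustrative.
    """
    p_general = softmax(coarse_logits)   # p_G(y_t) over general words
    p_concept = softmax(fine_logits)     # p_C(y_t) over conceptual tokens
    if p_concept.max() > p_general.max():
        return concept_vocab[int(p_concept.argmax())]
    return general_vocab[int(p_general.argmax())]

tok = select_token(
    np.array([0.2, 1.5, 0.1]),           # logits for general words
    np.array([3.0, 0.4]),                # logits for conceptual tokens
    ["ok", "the", "is"],
    ["20f-30f", "monday"],
)
print(tok)  # the conceptual token wins here
```

In the real model both heads are conditioned on the shared LSTM states, so this max-selection is the only part carried over faithfully.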

External Knowledge Memory
As is well known, successful conversations in task-oriented dialogue systems heavily depend on accurate knowledge queries. We build our external knowledge memory M from two parts: the dialogue history X and the multi-domain knowledge base B, i.e., M = [X; B] = (m_1, m_2, ..., m_l). Each entry in M is represented in triple format, i.e., m_i = (h, r, t). To make the external knowledge more suitable for multi-hop reasoning and vector calculation, we embed the knowledge triples into a word vector space rich in entity relationships and semantic shift information. Specifically, for each triple m_i = (h, r, t) ∈ M, we use the TransR model (Lin et al., 2015) to perform fine-grained representation learning and obtain (e_h, e_r, e_t) as the memory embeddings. More details about TransR learning can be found in Appendix A.2.
To integrate knowledge information into the end-to-end dialogue system, the memory network (MN) (Sukhbaatar et al., 2015) is adopted to store global cross-domain knowledge, which is shared between the encoder and the decoder. For a khop MN, the external knowledge is composed of a set of trainable embedding matrices C = C 1 , . . . , C k+1 .

Query Knowledge in Encoder
We use the last encoder hidden state as the initial query vector, i.e., q^1_enc = h_{enc,m+n}. The model loops over k hops and computes the attention weights at each hop k as

p^k_i = Softmax((q^k_enc)^T c^k_i),

where c^k_i is the embedding at the i-th memory position under the embedding matrix C^k, and q^k_enc is the query vector for hop k. The model then reads out the memory o^k_enc by a weighted sum over c^{k+1}_i and updates the query vector:

o^k_enc = Σ_i p^k_i c^{k+1}_i,   q^{k+1}_enc = q^k_enc + o^k_enc,

where q^{k+1}_enc is a coarse-grained representation containing KB information, and is used to initialize the coarse-grained LSTM decoder.
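The k-hop read described above follows the standard end-to-end memory network (Sukhbaatar et al., 2015) and can be sketched as below. The matrix shapes are illustrative assumptions; the paper's embedding matrices C^1, ..., C^{k+1} become a Python list here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def khop_read(q, C):
    """One pass over a k-hop memory network.

    C is a list of k+1 embedding matrices, each of shape (l, d), where l is
    the number of memory slots and d the embedding size. Returns the final
    query vector q^{k+1}, which carries the retrieved KB information.
    """
    k = len(C) - 1
    for hop in range(k):
        p = softmax(C[hop] @ q)      # attention over the l memory positions
        o = C[hop + 1].T @ p         # weighted read from the next hop's matrix
        q = q + o                    # query update: q^{k+1} = q^k + o^k
    return q

rng = np.random.default_rng(0)
C = [rng.normal(size=(4, 8)) for _ in range(3)]  # a 2-hop memory, l=4, d=8
q0 = rng.normal(size=8)
q_final = khop_read(q0, C)
print(q_final.shape)  # (8,)
```

The residual-style update q + o is what lets deeper hops refine rather than overwrite the query.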
Through the above steps, we also obtain a global memory pointer G = (g_1, ..., g_l) to filter out worthless external knowledge for decoding, where g_i = Sigmoid((q^{k+1}_enc)^T c^{k+1}_i). Note that G is ultimately trained as an l-dimensional 0/1 prediction vector; its training details are given in Appendices A.3 and A.4.

Query Knowledge in Decoder
Recall that we adopt two LSTMs as the decoder. For the coarse-grained LSTM decoder, following Wu et al. (2019) and Qin et al. (2020), we use the concatenation of h_{dec,t} (initialized by q^{k+1}_enc) with the attentive representation h̃_{dec,t} to query knowledge.
For the fine-grained LSTM decoder, we use the concatenation of the hidden state h_{dec,t} (initialized by the q^{k+1}_dec obtained when generating the previous conceptual word), the attentive representation h̃_{dec,t}, and the intention-aware representation I_{dec,t} to query knowledge. Instead of selecting the maximum p^k_i to generate y_t, we read out the memory o^k_dec by the weighted sum over c^{k+1}_i and update the query vector q^{k+1}_dec. Note that q^{k+1}_dec is a fine-grained representation containing the user's intention, and is fed to the fine-grained LSTM decoder for the next conceptual word generation.

Intention Reasoning Module
To obtain the intention-aware representation I_{dec,t}, we first propose a novel intention mechanism (Sec. 2.3.1), based on which we further propose a fine-grained intention reasoning module (Sec. 2.3.2) that includes joint reasoning and multi-hop reasoning.

Intention Mechanism
Previous works usually use soft attention (Vaswani et al., 2017) to compute a weighted sum over all the knowledge based on the whole vector of each triple, which may not be conducive to generating accurate task-oriented responses. To address this issue, we propose a new intention mechanism that directly incorporates tail-entity information, weighted by the similarity between the query vector and the head entity. It is formally defined as

Intention(q, (e_h, e_r, e_t)) = φ(q, e_h) · e_t,   (16)

where q is the query vector and (e_h, e_r, e_t) is the representation of the selected knowledge triple. Here φ denotes a similarity score function, such as cos(·), dot product, or scaled dot product. We tried all three and finally chose cos(·) based on their performance.
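The intention mechanism can be sketched as follows, using cos(·) as the paper's chosen φ. Scaling the tail embedding by the head similarity is our reading of Eq. (16); the exact composition in the paper may differ.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def intention(q, triple):
    """Intention mechanism sketch: weight the tail embedding e_t by the
    similarity phi(q, e_h) between the query and the head embedding, so the
    deterministic tail token is incorporated directly rather than being
    averaged with unrelated triples."""
    e_h, e_r, e_t = triple
    return cosine(q, e_h) * e_t

# A query close to the head "today" should pull in the tail "monday".
q = np.array([1.0, 0.0])
e_h = np.array([0.9, 0.1])          # head embedding, close to q
e_t = np.array([0.0, 1.0])          # tail embedding
out = intention(q, (e_h, None, e_t))
print(out)
```

Contrast with soft attention: here a single matching triple dominates the output, instead of being diluted by a weighted sum over the whole memory.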

Fine-grained Intention Reasoning
Based on the intention mechanism, we further perform fine-grained intention reasoning to obtain an intention-aware representation I_{dec,t}, which captures the concept shift information for final response generation. Specifically, given the encoder query vector q_{enc,s} (i.e., h_{enc,s}), the decoder query vector q_{dec,t} (i.e., h_{dec,t}), and the global memory pointer G, I_{dec,t} is obtained by performing joint reasoning and multi-hop reasoning sequentially. Note that before conducting intention reasoning, we first use G to filter the external knowledge and obtain the target triples.
Joint Reasoning This operation improves the completeness of the generated responses. Specifically, we fuse multiple knowledge triples that share the same head entity (or the same tail entity) into a single triple. Take the triples (e_s, e^1_r, e^1_t) and (e_s, e^2_r, e^2_t) in Figure 3(a) as an example; the joint reasoning is conducted as

e_r = W_r [e^1_r; e^2_r],   e_t = W_t [e^1_t; e^2_t],

where W_t and W_r are trainable weight matrices. Then, (e_s, e_r, e_t) is regarded as a new triple for the multi-hop reasoning below.
Multi-hop Reasoning This operation aims at improving the accuracy of the generated responses. Specifically, an intention weight γ_{t,s} is calculated to evaluate the probability that a set of ordered triples forms the optimal reasoning chain. As shown in Figure 3(b), after filtering by G, two triples remain: (e_s, e^1_r, e^1_t) and (e_s, e^2_r, e^2_t). Suppose we perform 2-hop reasoning; then there are 2^2 possible chains, and their intention weights γ^i_{t,s} are calculated with the intention mechanism. Finally, we choose the maximum γ^i_{t,s} as the final γ_{t,s}. The above procedure generalizes to L hops, where L is a model hyper-parameter.
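The chain enumeration above can be sketched as follows. The chain score used here, the product of head-similarity scores with the query hopping to the previous tail, is our assumption: it matches the "pivot" behaviour described in the introduction, but is not the paper's exact γ formula.

```python
import itertools
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_chain(q, triples, L=2):
    """Enumerate all |triples|^L ordered L-hop chains and keep the best one.

    At each hop, the current query is compared with the triple's head, and
    the query then moves to that triple's tail (the pivot)."""
    best, best_score = None, -np.inf
    for chain in itertools.product(triples, repeat=L):
        query, score = q, 1.0
        for (e_h, e_r, e_t) in chain:
            score *= cosine(query, e_h)
            query = e_t                   # next hop starts from this tail
        if score > best_score:
            best, best_score = chain, score
    return best, best_score

# Two toy triples whose tail/head embeddings chain together.
t0 = (np.array([1.0, 0.0]), None, np.array([0.0, 1.0]))
t1 = (np.array([0.0, 1.0]), None, np.array([1.0, 0.0]))
chain, score = best_chain(np.array([1.0, 0.0]), [t0, t1], L=2)
print(score)  # the chain t0 -> t1 wins with score 1.0
```

Brute-force enumeration is exponential in L; the paper reports L = 3 works best, which keeps this tractable over the G-filtered triples.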
After performing L-hop reasoning, we obtain γ_{t,s} and the corresponding optimal reasoning chain of L ordered triples, denoted by {(e^1_h, e^1_r, e^1_t), ..., (e^L_h, e^L_r, e^L_t)} (note that duplicates may occur when the number of target triples is less than L). Finally, we obtain the intention-aware representation as

I_{dec,t} = Σ_{i=1}^{L} W^{(i)} Intention(q_{enc,s}, (e^i_h, e^i_r, e^i_t)),

where the W^{(i)} are trainable parameters that weigh the tail-token information obtained from the reasoning chain.

Degeneration
Note that when the currently encoded or decoded word is a general word, our model no longer performs joint and multi-hop reasoning. Accordingly, the intention weight reduces to the attentive weight, i.e., γ_{t,s} = α_{t,s}, where α_{t,s} denotes the attentive weight. In other words, the intention mechanism degenerates to the attention mechanism in this case, which demonstrates the robustness of our model.

Datasets and Metrics
Two publicly available datasets, SMD and an extended version of Multi-WOZ 2.1 (Qin et al., 2020), are used to evaluate the performance of our model. We follow prior work (Wu et al., 2019) to partition SMD, and follow Budzianowski et al. (2018) and Qin et al. (2020) to partition Multi-WOZ 2.1. The statistics of the datasets after partitioning are presented in Table 1.
Following several previous works (Wu et al., 2019; Qin et al., 2019, 2020), we use BLEU and entity F1 (both macro-F1 and micro-F1) to compare our model with existing models. Moreover, to evaluate performance at a more fine-grained level, we also adopt Rouge-1 and Rouge-2 as metrics.

Baselines
We compare our model with the following state-of-the-art baselines.
• Mem2Seq 1 : the model takes the dialogue history and KB entities as input and utilizes a pointer gate to control whether to generate a vocabulary word or to copy an entity word.
• KB-retriever (Qin et al., 2019) 2 : the model adopts a retriever module to extract the most relevant knowledge items and filter irrelevant information for response generation.
• GLMP (Wu et al., 2019) 3 : the model adopts a global-to-local pointer to query knowledge, where the global memory pointer is used to filter the external KB information, and the local memory pointer is used to instantiate a slot value generated by a sketch RNN.
• DF-Net (Qin et al., 2020) 4 : the framework uses a dynamic fusion network to dynamically exploit the correlations among all domains for fine-grained knowledge transfer, and achieves state-of-the-art performance. For the BLEU and micro-F1 scores of the above baselines, we adopt the results reported by Wu et al. (2019) and Qin et al. (2020). For the macro-F1 and Rouge scores, we rerun their public code to obtain results on the same datasets.

Implementation Details
We train our model end-to-end using the Adam optimizer (Kingma and Ba, 2015) and choose the learning rate from [1e-4, 1e-3]. The loss functions are described in Appendix A.4. The dropout ratio is selected from {0.1, 0.15, 0.2, 0.25, 0.3} and the batch size from {8, 16, 32}. Hyper-parameters such as the hidden size, dropout, batch size, and embedding dimensionality are tuned with grid search over the development set. All experiments are conducted with PyTorch, and our BERT component uses huggingface's implementation 5 . Appendix A.1 presents more details about the hyper-parameters.

Response Quality Evaluation
Automatic Evaluation Following prior work (Wu et al., 2019; Qin et al., 2020), we automatically evaluate model performance from two aspects, relevancy and novelty, with the corresponding results presented in Tables 2 and 3. From Table 2, we observe that IR-Net achieves state-of-the-art performance on both multi-domain datasets, SMD and Multi-WOZ 2.1. Specifically, on SMD, IR-Net achieves the highest BLEU among all baselines, indicating that our model generates responses closer to the golden ones. Moreover, our model outperforms DF-Net, a recent model that captures the correlations among domains for fine-grained knowledge transfer, by 2.6% and 0.5% on macro-F1 and micro-F1 respectively, which verifies the effectiveness of our intention reasoning in capturing concept shifts across multiple domains to generate more accurate and appropriate responses. A similar improvement is observed on Multi-WOZ 2.1, which further demonstrates the effectiveness of our model. From Table 3, we see that compared with the baselines, IR-Net achieves consistently lower BLEU and Rouge scores, which demonstrates its capability of generating more novel responses, possibly for two reasons: 1) the integration of cross-domain knowledge in multi-hop reasoning makes the generated responses more diverse; 2) the hierarchical LSTM decoder in IR-Net can learn more varied forms of expression.

Human Evaluation
The human evaluation mainly focuses on six aspects: helpfulness, appropriateness, correctness, fluency, friendliness, and human-likeness, all of which are important for task-oriented dialogue systems (Zhou et al., 2018; Qin et al., 2020). We first randomly selected 100 dialogues (at a 1:1 ratio) from the SMD and Multi-WOZ 2.1 datasets, and used different models, including Mem2Seq, GLMP, DF-Net, and IR-Net, to generate responses. Then, we hired human experts to score the generated responses and the golden responses on a scale from 1 to 5, simulating a real-life task-oriented conversation scenario. Averaging the scores over the above metrics yields the final manual evaluation result, shown in Table 4. IR-Net outperforms the other three models on all metrics, which is consistent with the results of the automatic evaluation.

Ablation Study
In this part, we perform ablation experiments to evaluate the effectiveness of each component. We focus on four crucial components: 1) w/o IR Module and Fine-grained Decoder: we remove the intention reasoning module and the fine-grained decoder, and only adopt the coarse-grained decoder with attentive querying of the external KB; 2) w/o Coarse-grained Decoder: we only use the attentive KB query to return the answer; 3) w/o BERT Embedding: we feed randomly initialized embeddings into the contextual dialogue encoder; 4) w/o TransR Training: we discard the TransR-based knowledge triple embedding learning. From the results in Table 5, we observe that removing any component results in a performance degradation. In particular, w/o IR Module and Fine-grained Decoder causes a 2.3% drop in entity F1 score, which further verifies the effectiveness of our model.

Case Study and Visualization
We take the dialogue in Table 6 as an example. Our model is more adept at mining potential reasoning chains, while previous attention-based models, limited by scattered attention weights, struggle to capture explicit reasoning relations. Specifically, for this dialogue, IR-Net first performs joint reasoning to derive triple 3 from triples 1 and 2. Then, it performs 2-hop reasoning to obtain a set of intention weights, as shown in Figure 4(a). Unlike the scattered attention weights in Figure 4(b), it is clear that chain 0 → 3 achieves the highest intention weight (0.8801) in Figure 4(a), indicating that (today, date, Monday) → (Monday, temperature, 20f-30f) has been identified by IR-Net as the optimal 2-hop reasoning chain. Finally, IR-Net generates a relatively accurate response, "today's temperature is 20f-30f". More analyses and experimental details regarding the visualization of intention and attention weights can be found in Appendix B.2.

Related Work
Sequence-to-sequence approaches, which use an encoder-decoder structure to capture contextual dialogue semantics and generate responses directly, have recently gained much attention in task-oriented dialogue systems (Zhao et al., 2017). These models have effective language modeling ability but do not work well for KB retrieval, even with sophisticated attention mechanisms. To alleviate this problem, copy-augmented Seq2Seq models (Gülçehre et al., 2016) have been adopted, but they still struggle to perform reasoning over KB triples. To address this problem, memory-augmented Seq2Seq models, such as the end-to-end Memory Network (Bordes et al., 2017) and DQMN, have been proposed and have shown promising results. Later, Mem2Seq and GLMP (Wu et al., 2019) further augmented memory-based methods by incorporating the copy mechanism (Gülçehre et al., 2016), which enables copying words from both the dialogue history and the KB. DSR proposed to leverage dialogue state representations to retrieve the KB implicitly. The multi-level memory model (Reddy et al., 2019) represented the KB results with a multi-level memory instead of triples. KB-retriever (Qin et al., 2019) adopted a KB retriever module to extract the most relevant knowledge items and improve the consistency of generated entities. DDMN (Wang et al., 2020) adopted a dual dynamic memory network to track the dialogue context and KB triples respectively. DF-Net (Qin et al., 2020) introduced a dynamic fusion model to capture the correlations among domains for fine-grained knowledge transfer. Different from existing models that rely on the soft attention mechanism to perform coarse-grained reasoning, our IR-Net models more deterministic knowledge and captures entity (or concept) shifts by performing fine-grained token-level reasoning based on the intention mechanism.
To our best knowledge, we are the first to effectively explore fine-grained token-level reasoning in multi-domain task-oriented dialog generation.

Conclusion
In this paper, we propose a novel intention mechanism that directly incorporates the tail-token information of a knowledge triple to better model deterministic knowledge for multi-domain task-oriented dialogue. Moreover, based on the intention mechanism, we further propose an intention reasoning module that consists of token-level joint reasoning and multi-hop reasoning to obtain an intention-aware representation, aiming at improving the completeness and accuracy of the generated responses. Experiments on two publicly available multi-domain datasets demonstrate the effectiveness and superior performance of our model in both automatic and human evaluation.

A.2 Knowledge Embedding Training
In the KB memory module, each element m_i is represented in the triple format (head, relation, tail), e.g., (dinner, time_is, 7pm), a commonly used format for knowledge items (Miller et al., 2016). The dialogue history X is stored in the dialogue memory, where speaker and temporal encodings are included, following Bordes et al. (2017), in a triple-like format; e.g., the first user utterance in Figure 1 is denoted as {($user, turn1, How's), ($user, turn1, the), ($user, turn1, weather), ($user, turn1, today)}. For each triple m_i ∈ M, we use the TransR model (Lin et al., 2015) to perform representation learning and obtain (e_h, e_r, e_t) as the memory embeddings. Specifically, for a triple m_i = (h, r, t), where h, t ∈ E and r ∈ R (E and R denote the entity space and relation space, respectively), we first use BERT to obtain initial embeddings: (e_h, e_r, e_t) = BERT(h, r, t). Then, we project e_h and e_t into the relation space through a trainable projection matrix M_r, with the score function

f_r(h, t) = || M_r e_h + e_r − M_r e_t ||^2_2.

Finally, we minimize the following margin-based loss to obtain the optimal knowledge triple embeddings:

L = Σ_{(h,r,t)∈S} Σ_{(h',r,t')∈S'} max(0, λ + f_r(h, t) − f_r(h', t')),

where S and S' are the positive and negative triple sets (sampled at a 1:3 ratio in both SMD and Multi-WOZ 2.1), respectively, and λ is the margin between the scores of positive and negative triples.
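The TransR objective above can be sketched as follows, assuming the standard TransR formulation (Lin et al., 2015) with a shared projection matrix per relation. The toy triples and the identity projection are illustrative only.

```python
import numpy as np

def transr_score(e_h, e_r, e_t, M_r):
    """TransR plausibility score f_r(h, t) = ||M_r e_h + e_r - M_r e_t||^2.
    Lower means the triple is more plausible."""
    return float(np.sum((M_r @ e_h + e_r - M_r @ e_t) ** 2))

def margin_loss(pos, neg, M_r, lam=1.0):
    """Margin-based ranking loss over positive set S and negative set S'."""
    loss = 0.0
    for (h, r, t) in pos:
        for (h2, r2, t2) in neg:
            loss += max(0.0, lam + transr_score(h, r, t, M_r)
                        - transr_score(h2, r2, t2, M_r))
    return loss

M_r = np.eye(2)                                   # identity projection for the sketch
good = [(np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([1.0, 0.0]))]
bad = [(np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 0.0]))]
loss = margin_loss(good, bad, M_r)
print(loss)  # 0.0: the positive triple already beats the corrupted one by the margin
```

In training, M_r and the BERT-initialized embeddings would be updated by gradient descent on this loss; the sketch only evaluates it.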

A.3 Description on the Global Pointer
Following prior work GLMP (Wu et al., 2019), we employ a global memory pointer to select knowledge, treating selection as a multi-label classification problem: choosing the target knowledge triples from the l candidate triples. To train the global memory pointer G, we first apply the sigmoid function to the dot product of the query vector and each memory representation, converting the multi-label classification problem into l binary classification problems (each predicted value 1/0 indicates whether the triple is selected), and use the sum of the cross-entropy terms as the loss function. G is thus an l-dimensional 1/0 prediction vector that filters out worthless knowledge triples; its training details are given in Appendix A.4.
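The pointer training described above can be sketched as l independent sigmoid classifiers with a summed binary cross-entropy, as follows. Shapes and the random memory are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_pointer_loss(q, memory, labels):
    """Score each of the l memory triples independently and sum the BCE.

    g_i = sigmoid(q . c_i) predicts whether triple i should be kept; labels
    is the 0/1 vector G_label from Appendix A.4.
    """
    g = sigmoid(memory @ q)                       # (l,) selection probabilities
    eps = 1e-12                                   # numerical guard for log
    bce = -(labels * np.log(g + eps) + (1 - labels) * np.log(1 - g + eps))
    return g, float(bce.sum())

rng = np.random.default_rng(1)
memory = rng.normal(size=(5, 8))                  # l=5 triples, embedding dim 8
q = rng.normal(size=8)
labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
g, loss = global_pointer_loss(q, memory, labels)
print(g.shape, loss > 0)
```

At inference time, thresholding g (e.g., at 0.5) yields the 0/1 filter vector G used by the intention reasoning module.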

A.4 Loss Function
The loss L used in IR-Net is similar to that of GLMP. We first define G_label = (ĝ_1, ..., ĝ_l) by checking whether the object word of each memory entry appears in the expected system response Y, i.e., ĝ_i = 1 if Object(m_i) ∈ Y and 0 otherwise, where m_i is a triple in the external knowledge M = [X; B] = (m_1, m_2, ..., m_l) and Object(·) extracts the object word from a triple. The cross-entropy loss L_g is then computed between G and G_label. As described in Sec. 2.1, we exploit a hierarchical mechanism to decode the response sequence: when decoding, the coarse-grained LSTM decoder and the fine-grained LSTM decoder generate a rough response Y^c = (y^c_1, ..., y^c_n) and a fine-grained response Y^f = (y^f_1, ..., y^f_n), respectively. We then calculate standard cross-entropy losses L_c and L_f for the two decoders. Finally, L is the weighted sum of the three losses:

L = β_g L_g + β_c L_c + β_f L_f,

where β_g, β_c, and β_f are hyper-parameters. These three weights are initialized equally (0.33 each) and then tuned on the validation set, yielding the final setting of 0.39, 0.36, and 0.25.
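The final combination is a plain weighted sum, as in this sketch (the subscripts c/f are our shorthand for the coarse- and fine-grained decoder losses):

```python
def total_loss(l_g, l_c, l_f, betas=(0.39, 0.36, 0.25)):
    """Weighted sum L = beta_g*L_g + beta_c*L_c + beta_f*L_f, using the
    tuned weights reported above."""
    b_g, b_c, b_f = betas
    return b_g * l_g + b_c * l_c + b_f * l_f

print(total_loss(1.0, 1.0, 1.0))  # approximately 1.0: the tuned weights sum to one
```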

B.1 Additional experiments
Experiments on Domain Shift In this experiment, we randomly selected 100 knowledge-query examples in each domain of the SMD test set. By parsing the global memory pointer G, we obtain the distribution of the selected knowledge, as shown in Figure 5. We find that: (1) A small fraction of knowledge queries successfully perform cross-domain knowledge selection through the attention mechanism, while the majority of knowledge is selected within the domain. This indicates that cross-domain knowledge queries do occur in task-oriented dialogue.
(2) Navigation-related queries select more knowledge from the schedule domain than from the weather domain. Similarly, schedule-related queries also select more knowledge from the navigation domain than from the weather domain. This indicates that the navigation and schedule domains are more closely related.
Analysis on L Hops To analyze the impact of the hop number L in intention reasoning, we keep the other hyper-parameters unchanged and vary L over {1, 2, 3, 4, 5, 6, 7}. From Figure 6, we observe that as L increases, the entity (micro) F1 score first increases and then decreases, reaching its best result at L = 3. It is intuitive that too few hops are insufficient to capture the user's real intention, while too many hops may introduce noise that harms the expressiveness of the obtained intention-aware representations. Hence, it is necessary to choose an appropriate number of hops for intention reasoning.

B.2 Visualization of Attention and Intention Weights
To further illustrate what our intention reasoning module has learned, we visualize the attention and intention weights (denoted by α and γ, respectively) of the dialogue generation process in dialogues #1 and #2, as shown in Figure 7 (only part of the knowledge triples are presented). Darker colors represent higher attention or intention weights, and G denotes the 0/1 selection vector derived from α. From Figure 7, we observe that: 1) joint reasoning guided by γ produces "today's temperature is 20f-30f" by combining (monday, low_temp, 20f) with (monday, high_temp, 30f); 2) there are two 2-hop reasoning chains guided by γ, one being (today, date, monday) → (monday, temperature, 20f-30f) and the other (friends_house, poi, jills_house) → (jills_house, address, 347_alta_mesa_ave). These observations illustrate that our intention reasoning module can 1) effectively perform cross-domain knowledge selection (via the attention mechanism) and 2) effectively perform fine-grained knowledge reasoning (via the intention mechanism).

B.3 Error Analysis
To better understand the limitations of our model, we conduct an error analysis on IR-Net. We randomly select 100 responses generated by IR-Net that achieve low human evaluation scores on the SMD test set, and report several reasons for the low scores, which can roughly be classified into four categories.

[Table 6: Example from dialogue #2. Question: please give me the specific address to my friend's home. Coarse-grained Response: @poi_type is at @address. Fine-grained Response: friend home is at 347 alta mesa ave. Gold: ok, you can try setting navigation to 347 alta mesa ave.]
(1) The KB information in the generated responses is incorrect (35%), especially when the equipped knowledge base is large and complex.
(2) The sentence structure of the generated responses is incorrect, with serious grammatical and semantic errors (26%).
(3) The model generates incomplete responses when multiple options correspond to the user's intention (24%). (4) The conceptual tokens generated by the fine-grained decoder cannot be well matched with the golden entities (15%).