Aligning Recommendation and Conversation via Dual Imitation

Human recommendation conversations naturally involve shifts of user interest, which can align recommendation actions with the conversation process to make accurate recommendations with rich explanations. However, existing conversational recommendation systems (CRS) ignore the advantage of user interest shift in connecting recommendation and conversation, which leads to an ineffective, loosely coupled CRS structure. To address this issue, we model recommendation actions as recommendation paths in a knowledge graph (KG) and propose DICR (Dual Imitation for Conversational Recommendation), which designs a dual imitation to explicitly align the recommendation paths and user interest shift paths in a recommendation module and a conversation module, respectively. By exchanging alignment signals, DICR achieves bidirectional promotion between the recommendation and conversation modules and generates high-quality responses with accurate recommendations and coherent explanations. Experiments demonstrate that DICR outperforms state-of-the-art models on recommendation and conversation performance under automatic, human, and novel explainability metrics.


Introduction
Conversational recommendation systems (CRS) (Liu et al., 2020; Li et al., 2022) aim to conduct recommendations during conversations with users (Gao et al., 2021). Compared with traditional recommendation systems (Wang et al., 2019; Xian et al., 2019), CRS have two main advantages: understanding the user's dynamic interest during the conversation and making persuasive responses with coherent explanations of the recommendation (Jannach et al., 2021). In both advantages, user interest shift plays an essential role. As shown in the dialog in Fig. 1, the successful recommendation of "Iron Man 3" in the final response is achieved by tracking and reasoning over the user interest shift "The Avengers→Sci-Fi→Thor→Stan Lee→Iron Man 3". Furthermore, the final response is persuasive because it utilizes part of the user interest shift, i.e., "Thor→Stan Lee→Iron Man 3", as the explanation.
Due to the limited context (Hayati et al., 2020), a recommendation module based on a knowledge graph (KG) is helpful for tracking the user interest shift in conversation. As shown in Figure 1, formally corresponding to paths in the KG, the user interest shift in conversation not only guides the reasoning-based prediction of recommendations, but also guides the explanation generation in responses.
However, existing KG-enhanced CRS models (Chen et al., 2019; Zhou et al., 2020, 2021, 2022b; Zhang et al., 2022a) have not made full use of the user interest shift to tightly align KG-based recommendation and conversation. Consequently, one issue is less accurate recommendation, caused by using unrelated entities in conversation to support the recommendation instead of the coherent entities in the user interest shift. The other issue is the lack of explanation in responses, caused by black-box representations of user preference that ignore the explicit preference logic in the user interest shift.
To address these issues in CRS, we propose to align the explicit behaviors of recommendation reasoning and the conversation process, which are described as recommendation paths and interest shift paths, respectively. As in Figure 1, a recommendation path is an explicit path in the KG consisting of explicit relations between entity nodes and ending with a predicted recommended entity; an interest shift path is an implicit path in the dialog context consisting of implicit relations between entity words. The recommendation path and the interest shift path are concrete manifestations of the user interest shift in the KG and the dialog, respectively. The sequence of interest entities shared by the two paths facilitates the alignment of recommendation reasoning and the conversation process, which can be effectively achieved by imitation learning (Ho and Ermon, 2016). Therefore, we propose a dual imitation framework named DICR (Dual Imitation for Conversational Recommendation). DICR designs bidirectional alignment signals from dual imitation learning to improve the CRS by forcing the recommendation and conversation to behave similarly to the shared user interest shift.
Precisely, in a conversation-aware recommendation module, to align the recommendation reasoning to the conversational user interest shift, the recommendation side of the dual imitation, i.e., path imitation, adopts adversarial reinforcement learning to make the recommendation reasoning policy imitate the user interest shift in conversation. The reasoned recommendation paths are provided to the conversation module as alignment signals. In a recommendation-aware conversation module, to align the conversation process to the recommendation paths, the conversation side of the dual imitation, i.e., knowledge imitation and semantic imitation, refines the weight distribution and semantic encoding of the recommendation paths by imitating the human response and the utterance statement semantics of golden explanations, respectively. These two imitations also provide the recommendation module with rewards as alignment signals indicating how well the predicted recommendation paths are consistent with the conversation context.

Our contributions are summarized as follows:
(1) To the best of our knowledge, we are the first to adopt imitation learning in CRS to tightly integrate recommendation and conversation. We design a dual imitation framework named DICR, which aligns recommendation and conversation behavior and promotes bidirectional improvement, taking recommendation paths and conversational rewards from the dual imitation as alignment signals.
(2) The dual imitation benefits knowledge acquisition and semantic generation, promoting the accuracy of recommendations and significantly improving the explanations of recommendations in generated responses with coherent knowledge.
(3) Extensive experiments demonstrate that DICR outperforms the SOTA models on both recommendation and conversation performance under automatic, human, and novel explainability metrics.

Related Work
Conversational recommendation systems (CRS) aim to obtain user interests through conversational interaction and make persuasive recommendations (Jannach et al., 2021; Gao et al., 2021; Ren et al., 2021). To track the user interest shift in conversation, an intuitive strategy is to ask related questions (Kostric et al., 2021; Zhang et al., 2022b), which leads to question-based CRS (Lei et al., 2020; Deng et al., 2021). Limited by predefined templates for asking and recommending, it is difficult for question-based CRS to flexibly adapt to different contexts and converse in a human-like manner.
Towards more flexible conversation, generation-based CRS (Li et al., 2018; Zhang et al., 2021; Liang et al., 2021) capture user interests from context and generate responses containing persuasive explanations for recommendations. Limited by sparse context and language complexity (Lu et al., 2021; Yang et al., 2022), it is challenging for generation-based CRS to track the user interest shift in conversation. As a popular solution, KG-enhanced CRS (Moon et al., 2019; Ma et al., 2021; Zhou et al., 2022a) involve knowledge graphs of explicit relations among potential interest items.
Although KG-enhanced CRS have achieved significant improvements, most approaches (Chen et al., 2019; Zhou et al., 2020, 2021, 2022b) adopt a black-box style to transfer implicit and sparse information between recommendation and conversation. The explicit interest reasoning paths in the KG and the explicit interest shift paths in conversation have a good chance of aligning the recommendation and conversation to benefit each other.
Problem Formulation

In this paper, given a conversation context C and a KG G, we aim to generate a response Y containing the recommendation set I and explanation U. We design a novel CRS model in which KG path reasoning obtains an explicit reasoning path set P from G to help generate Y containing the recommendation set I and a coherent explanation U of I.

K is a golden interest shift path connecting the interest entities e 0,1,...,l in C and Y. K also matches a recommendation path p in G. K can be extracted by identifying entities in the conversation and linking the entities to the nodes in G. The logical utterance statement of K is U, which is the explanation of the recommendation, e.g., given a one-hop reasoning path in G ("Thor", "written_by", "Stan Lee"), its tokenized U can be "Thor is written by Stan Lee.".

Architecture Overview
As shown in Fig. 2, in DICR, the conversation-aware recommendation module learns a recommendation path reasoning policy with adversarial reinforcement learning. The path imitation discriminator aligns the recommendation paths with the golden interest shift path and rewards the agent with R p,t to optimize the reasoning policy. As a result, the top tokenized recommendation paths are provided to the conversation module as alignment signals. In the recommendation-aware conversation module, the knowledge imitation aligns the prior and posterior recommendation knowledge in the tokenized recommendation paths and the human response, respectively. The semantic imitation uses Mutual Information Maximization (MIM) to align the semantic encoding of the recommendation paths with that of the utterance statement of the golden interest shift path. These imitations refine the distribution of knowledge and overall words and thus benefit the path-aware response generation. They also generate rewards R k,t and R s,t as alignment signals to guide the recommendation path reasoning. Finally, DICR performs joint training to bidirectionally promote the recommendation and conversation with the alignment signals from the dual imitation.

Conversation-aware Recommendation Module
In this module, we formalize the user interest shift in conversation as a Markov Decision Process (MDP) (Sutton and Barto, 2018) over KG paths to reason the interest shift path with adversarial reinforcement learning. We construct KG embeddings (Bordes et al., 2013) for each entity. The entities mentioned in C are extracted by fuzzy matching, and their embeddings are averaged as the preference representation of user u in the current context.
State. We start path reasoning from the starting entity e 0 in C. The initial state s 0 ∈ S is s 0 = {u, e 0 }. We encode the H-step history of entities and relations as the observed state s t ∈ S at step t, i.e., s t = {u, e t−H , . . ., r t , e t }, whose embedding s t is obtained by concatenating the embeddings of all members of s t , i.e., s t = u ⊕ e t−H ⊕ . . . ⊕ r t ⊕ e t , where u is the preference representation and ⊕ is the concatenation operator. If the path length is smaller than H, we pad s t with zeros.
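The state construction above can be sketched as follows; the list-based vectors, the `(relation, entity)` history format, and the left-padding scheme are our illustrative assumptions rather than the paper's exact implementation:

```python
def encode_state(u_emb, history, H, dim):
    """Build s_t = u ⊕ e_{t-H} ⊕ ... ⊕ r_t ⊕ e_t as one flat vector.

    u_emb:   preference embedding of user u (averaged entity embeddings).
    history: list of (relation_emb, entity_emb) pairs, oldest first; by
             convention the first entry pairs a zero relation with e_0.
    H:       number of history steps kept in the state.
    dim:     embedding size shared by entities and relations.
    """
    slots = []
    for r_emb, e_emb in history[-(H + 1):]:   # keep at most H+1 (r, e) pairs
        slots += [r_emb, e_emb]
    pad = 2 * (H + 1) - len(slots)            # zero-pad when the path is short
    slots = [[0.0] * dim] * pad + slots
    out = list(u_emb)
    for v in slots:
        out += list(v)
    return out
```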
Action. The action space A t of the state s t is defined as all outgoing edges of the entity e t in the KG G, excluding history entities and relations: A t = {(r, e) | (e t , r, e) ∈ G, e ∉ {e 0 , . . ., e t−1 }}. As an option to terminate, A t has a self-loop edge.
Transition. Given the current state s t and the action a t = (r t+1 , e t+1 ) chosen by the agent, the next state is s t+1 = {u, e t−H+1 , . . ., r t+1 , e t+1 }, where T : S × A → S refers to the state transition function.
Reward. We only give the agent a terminal reward R T,t : R T,t is 1 if the agent generates a path that ends with one of the recommended items I Y in the response Y , and 0 otherwise.
Policy Optimization. We adopt adversarial imitation learning (Zhao et al., 2020) based on the Actor-Critic method for policy optimization. The actor learns a path reasoning policy π φ (a t | s t , A t ) which selects a "good" action a t based on the current state s t and the action space A t to "fool" the discriminator in the path imitation. The critic estimates the value Q δ (s t , a t ) of each action a t in the situation of the state s t to guide the actor to choose a "good" action. We use two fully connected layers as the actor policy network: π φ (a t | s t , A t ) = η(A t f (W φ,2 f (W φ,1 s t ))), where A t denotes the action space encoded by stacking the embeddings of all actions in A t , and each action embedding a t ∈ A t is obtained by a lookup layer. η(•) is the softmax function, f (•) is the ELU activation function, and W φ,1 and W φ,2 are learnable. We design the critic network as Q δ (s t , a t ) = a δ,t f (W δ,2 f (W δ,1 s t )), where a δ,t is the embedding of action a t in the critic, f (•) is the ELU activation function, and W δ,1 and W δ,2 are learnable.
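A minimal sketch of the actor network described above, assuming plain-Python list embeddings and hypothetical toy weight matrices `W1`/`W2` (the actual model learns these parameters and operates on batched tensors):

```python
import math

def elu(x):
    # ELU activation f(•)
    return x if x > 0 else math.exp(x) - 1.0

def two_layer(vec, W1, W2):
    # f(W2 f(W1 vec)): two fully connected layers with ELU activations
    h = [elu(sum(w * v for w, v in zip(row, vec))) for row in W1]
    return [elu(sum(w * v for w, v in zip(row, h))) for row in W2]

def actor_policy(state, action_embs, W1, W2):
    """pi(a | s, A_t): softmax over <a, f(W2 f(W1 s))> for each a in A_t."""
    z = two_layer(state, W1, W2)
    scores = [sum(a * v for a, v in zip(emb, z)) for emb in action_embs]
    m = max(scores)                       # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The critic would reuse `two_layer` and score a single action embedding against the transformed state instead of normalizing over the whole action space.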
Path Imitation To guide the actor to generate a path in line with the user interest shift, we design the path imitation discriminator I p,τ , which judges whether the path segment generated by the actor at each step t is similar to the golden interest shift path segment in the current context. Given the current state s t and action a t , I p,τ (s t , a t ) outputs the probability that (s t , a t ) conforms to the golden shift path segment, where s K t and a K t respectively denote the state and action of the golden shift process at the same step t and serve as the discriminator's real samples. We further obtain the reward R p,t given by I p,τ to the actor at each step t: R p,t = log (I p,τ (s t , a t )) − log (1 − I p,τ (s t , a t )).
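The reward R p,t follows directly from the formula above; the clamping constant below is an assumption added only for numerical safety:

```python
import math

def path_imitation_reward(d_prob):
    """R_p,t = log D(s_t, a_t) - log(1 - D(s_t, a_t)).

    Positive when the discriminator believes the step matches the golden
    interest shift path (D > 0.5), negative otherwise.
    """
    eps = 1e-8                                # avoid log(0)
    d = min(max(d_prob, eps), 1.0 - eps)
    return math.log(d) - math.log(1.0 - d)
```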
Here, the aggregated reward obtained by the agent is R t = αR p,t + (1 − α)R T,t , where α ∈ [0, 1]. In the final joint training with the conversation module, the agent receives two other rewards, R k,t and R s,t , from the conversation module. Given Q δ (s t , a t ), the actor and critic are updated jointly by minimizing the loss function L φ,δ derived from the Bellman equation (Bellman, 2013). The actor, critic, and path imitation discriminator are jointly optimized.

Beam Search of Recommendation Paths After agent pre-training, we adopt beam search to generate candidate recommendation paths. Sorted by the probability of leading to an accurate recommendation, the top N p paths are tokenized into statements containing entity and relation words, which are provided to the conversation module as alignment signals.
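The beam search over candidate recommendation paths might look like the following toy sketch, where the adjacency-list KG, the entity names, and the uniform `scores` stand-in for the learned policy's probabilities are all illustrative assumptions:

```python
def beam_search_paths(graph, scores, start, max_len, beam, top_n):
    """Beam-search candidate recommendation paths in a toy KG.

    graph:  {entity: [(relation, entity), ...]} adjacency lists.
    scores: function(path) -> probability-like score of the path leading
            to an accurate recommendation (stand-in for the learned policy).
    """
    beams = [([start], 1.0)]
    for _ in range(max_len):
        nxt = []
        for path, p in beams:
            # Expand with outgoing edges, excluding entities already visited.
            edges = [(r, e) for r, e in graph.get(path[-1], []) if e not in path]
            if not edges:                     # dead end: keep path as terminal
                nxt.append((path, p))
                continue
            for r, e in edges:
                new = path + [r, e]
                nxt.append((new, p * scores(new)))
        beams = sorted(nxt, key=lambda x: -x[1])[:beam]   # prune to beam width
    return [path for path, _ in sorted(beams, key=lambda x: -x[1])[:top_n]]
```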

Recommendation-aware Conversation Module
Encoder The conversation context C, the response Y , the utterance U , and the tokenized recommendation paths P are encoded by the context encoder, the knowledge encoder, and the semantic encoder, respectively, based on Bi-RNN. Given the input sequence X = (x 1 , . . ., x N ), the forward and backward RNNs respectively generate hidden states h f t and h b t for each x t , which are concatenated to form the overall hidden state h t = [h f t ; h b t ], where [; ] is the concatenation operation. We denote the hidden states of all time steps as H = (h 1 , h 2 , . . ., h N ) and o = [h f N ; h b 1 ] as the final hidden state. We obtain these states for all input sources.

Knowledge Imitation To refine the distribution of tokenized recommendation paths toward leading to accurate recommendations with proper explanations, knowledge imitation makes the tokenized recommendation paths imitate the human response, which often contains the correct recommendation destination without a strong explanation. Given the encoded conversation context o C and the encodings o P = {o P,i } Np i=1 of the tokenized recommendation paths, o C serves as the prior information. We first obtain the prior path weight distribution P (p i | C) from the similarity between o C and each path p i . Since the recommendation paths contain the predicted interest in the response, the prior information is insufficient to calculate the recommendation path distribution. Therefore, the imitation also involves the human response as posterior information to obtain the posterior distribution P (p i | Y ) of the paths,
where W K,Y is learnable. We use a Kullback-Leibler divergence loss L KL to make P (p i | C) imitate P (p i | Y ), and a BOW loss L BOW to enforce the relevance between the recommendation path distribution and the response (Lian et al., 2019).
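A sketch of the prior/posterior path distributions and the KL objective, assuming dot-product similarity and the KL(posterior ∥ prior) direction (the paper's exact similarity function and KL direction may differ):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def path_distribution(query, path_encodings):
    """Weight each path by dot-product similarity with the query encoding
    (o_C gives the prior P(p_i|C); o_Y gives the posterior P(p_i|Y))."""
    sims = [sum(q * p for q, p in zip(query, enc)) for enc in path_encodings]
    return softmax(sims)

def kl_divergence(prior, posterior):
    """L_KL ~ KL(posterior || prior): pushes the prior path weights toward
    the posterior weights inferred from the human response."""
    eps = 1e-12
    return sum(q * math.log((q + eps) / (p + eps))
               for q, p in zip(posterior, prior))
```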
Semantic Imitation To refine the semantic encoding of the tokenized recommendation paths, semantic imitation makes the tokenized recommendation paths imitate the golden utterance statement of the correct recommendation and coherent explanation. Given the encoded conversation context o C and the hidden states o P = {o P,i } Np i=1 of the tokenized recommendation path encodings, we apply attention (Bahdanau et al., 2015) to o P to obtain the context-based path aggregation representation o S,P = Attention (o P , z (W S,P o C )), where z(•) is the tanh function and W S,P is a parameter matrix.
To make o S,P and the semantics o U of the encoded golden interest shift path behave similarly, we adopt Mutual Information Maximization (Cao et al., 2021), which forces the learned context-based aggregation representation to carry the semantics of the golden utterance statement by maximizing the mutual information between o S,P and o U . We use a binary cross-entropy loss as the mutual information estimator, where P and N represent the sets of positive and negative samples, respectively, and the negative sample's encoding o S,P is randomly sampled. I S,ϕ is a semantic imitation discriminator that scores o S,P and o U via a bilinear mapping function.
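The bilinear discriminator and the BCE-based mutual information estimator can be sketched as follows; the nested-list matrix form and the averaging over samples are our simplifying assumptions:

```python
import math

def bilinear_score(x, W, y):
    """I_S(x, y) = sigmoid(x^T W y): scores how well the path aggregation x
    matches the utterance encoding y."""
    z = sum(x[i] * sum(W[i][j] * y[j] for j in range(len(y)))
            for i in range(len(x)))
    return 1.0 / (1.0 + math.exp(-z))

def mim_bce_loss(pos_pairs, neg_pairs, W):
    """Binary cross-entropy mutual-information estimator: positive pairs
    (o_S,P with the matching o_U) should score high, negatives low."""
    eps = 1e-8
    loss = 0.0
    for x, y in pos_pairs:
        loss -= math.log(bilinear_score(x, W, y) + eps)
    for x, y in neg_pairs:
        loss -= math.log(1.0 - bilinear_score(x, W, y) + eps)
    return loss / max(len(pos_pairs) + len(neg_pairs), 1)
```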

We obtain the overall path representation v P t = Σ Np i=1 µ P,i • v P,i t , where µ P,i = P (p i | Y ) in training and µ P,i = P (p i | C) in inference. To reduce the impact of inaccurately recommended paths, we design a fusion gate g t to determine the contribution of v P t to the fusion information v t , where W g is learnable. Hence, the decoder updates its state as h t+1 = GRU (h t , [y t ; v t ]), where y t is the embedding of the predicted word at time step t. h t and v t are also used to obtain the generation probability P vocab (w t ) over the vocabulary at time step t, formalized as P vocab (w t ) = ρ ([h t ; v t ]), where ρ(•) is a two-layer MLP with a softmax function. Furthermore, we adopt a pointer copy mechanism to copy tokens from the tokenized recommendation paths P, which ensures that the logical knowledge in the paths can be copied to enrich the explanation in the response. At time step t, the probability of copying tokens from P is a weighted sum of copying tokens from all paths over the path distribution, where p j i is the j th token in the path p i and d P,i t,j is the attention weight of the j th token in p i . We use a pointer generation probability ξ gen t (See et al., 2017) to obtain the overall probability distribution, where ξ gen t = σ (W gen [y t−1 ; h t ; v t ]) and W gen is learnable. When training the conversation module, we use an additional NLL loss to quantify the difference between the golden and generated responses. In summary, the conversation module is jointly optimized by minimizing the joint loss.
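The pointer-copy mixture at one decoding step can be illustrated with a toy sketch; the dictionary-based vocabulary distribution and the token/attention shapes are illustrative assumptions:

```python
def pointer_generator_step(p_vocab, copy_attn, path_weights, path_tokens, xi_gen):
    """Mix generation and copy distributions at one decoding step:

    P(w) = xi_gen * P_vocab(w)
         + (1 - xi_gen) * sum_i mu_i * sum_j attn_ij * [token_ij == w]

    path_tokens[i][j] is the j-th token of path i; copy_attn mirrors that
    shape; path_weights holds the path distribution mu_i.
    """
    mixed = {w: xi_gen * p for w, p in p_vocab.items()}
    for mu, attn, tokens in zip(path_weights, copy_attn, path_tokens):
        for a, tok in zip(attn, tokens):
            mixed[tok] = mixed.get(tok, 0.0) + (1.0 - xi_gen) * mu * a
    return mixed
```

Because both input distributions sum to one, the mixture remains a valid distribution for any gate value in [0, 1].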

Bidirectional Improvement of Two Modules
After training the recommendation and conversation modules, we conduct bidirectional joint training. The conversation module provides the recommendation module with the rewards R k,t and R s,t from the knowledge imitation and semantic imitation, respectively; R k,t measures the knowledge consistency between the recommendation paths and the human response. The recommendation module provides the conversation module with optimized recommendation paths to guide the response generation. In this way, the bidirectional joint training optimizes the alignment between recommendation and conversation and promotes the overall performance of the CRS.

Experiment Setup
Dataset We conducted experiments on OpenDialKG (Moon et al., 2019), a dialog↔KG parallel corpus for CRS, where the mentions of KG entities and their factual connections in a dialog are annotated. The user interest shift path is extracted from context-response pairs, where its start entity is in the context and its destination entity is in the response. Each path is tokenized into an utterance statement that weaves together the entities and relations mentioned in the conversation. More details on the data and experiments are in Appendices A and B.

Implementation Details We implemented our model with PyTorch. In the recommendation module, the history length is H = 1 and the maximum length of the reasoning path is 3. The maximum action space is 250. We trained the KG embeddings with an embedding size of 128. The reward weights are α = γ = 0.006 and β = 0.001. The conversation module receives N p = 10 recommendation paths.

Models for Comparison
All encoders and decoders have 2 layers with 800 hidden units per layer. The word embeddings are initialized with word2vec with size 300. We used the Adam optimizer (Kingma and Ba, 2015) with a batch size of 32 and a learning rate of 0.0001. We trained our model in four steps: we first trained the model to minimize the L REC loss, then minimized the BOW loss and BCE loss to pre-train the knowledge imitation and semantic imitation components, then minimized the L GEN loss, and finally jointly trained the whole model.

Evaluation on Recommendation
In recommendation evaluation, we use Recall@K (K=1, 10, 25), indicating whether the top-K predicted items include the golden recommendation item.

Overall Evaluation As shown in Table 1, all imitation components contribute to the recommendation performance, with rewards within the module (PI) or across modules (KI and SI) guiding the reasoning policy learning. On the one hand, the conversational rewards (i.e., from KI and SI) serve as alignment signals and guide the recommender to learn user interest shift policies.
On the other hand, the recommendation paths reinforced by the dual rewards (i.e., from PI, KI, and SI) serve as alignment signals that in turn improve the conversation, promoting a positive cycle of bidirectional learning through the dual imitation.
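Recall@K as used above can be sketched as follows (the multi-gold averaging is an assumption; with a single golden item it reduces to a hit indicator):

```python
def recall_at_k(ranked_items, gold_items, k):
    """Recall@K for one dialog turn: the fraction of golden items that
    appear among the top-K ranked predictions."""
    top_k = set(ranked_items[:k])
    hits = sum(1 for g in gold_items if g in top_k)
    return hits / len(gold_items) if gold_items else 0.0
```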

Evaluation on Conversation
Overall Evaluation To evaluate the overall performance of the conversation, we use BLEU-1/2 and Distinct-1/2 (Dist-1/2) to evaluate the quality and diversity of the generated responses. F1 is the F1-score measuring how well the responses contain the golden knowledge. In Table 2, DICR significantly outperforms all baselines on most metrics. DICR achieves 13.3% and 9.6% improvements on BLEU-1/2 compared to the best baselines, which supports the effectiveness of our method, i.e., aligning the recommendation reasoning and conversation process. DICR achieves the best result on F1, demonstrating that explicit recommendation paths (i.e., DICR, ACRG) are superior to implicit embedding semantics (i.e., all baselines except ACRG) in guiding the generation of knowledge-rich responses.
Hit and Explainability Accurate recommendations with coherent explanations are one of our main contributions. We propose "Hit" to measure the recommendation success rate in conversation: Hit is the rate at which recommended items in the golden response are included in the generated response. The explainability of a response is evaluated by logically linked entity pairs, which are necessary for a coherent explanation. Specifically, "Inter" counts the entity links between context and response, which evaluates contextually coherent explanation across context and response. "Inner" counts the entity links within the response, which evaluates self-consistent explanation in the response. "G" counts the entity links that can be matched in the KG and thus evaluates explanations with global KG knowledge. "P" counts the entity links that can be matched in the recommendation paths in the KG and thus evaluates how well the recommendation paths support the explanation generation. Finally, we have four combined indicators, "G-Inter, G-Inner, P-Inter, P-Inner"; e.g., "G-Inter" evaluates coherent explanation according to KG knowledge.
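The "G-Inter"/"G-Inner" counters can be sketched as follows; treating KG links as undirected entity pairs is our simplifying assumption:

```python
def count_entity_links(context_entities, response_entities, kg_edges):
    """Toy versions of the Inter/Inner counters restricted to KG edges ("G").

    kg_edges: set of (head, tail) entity pairs; a link counts in either
    direction.
    """
    def linked(a, b):
        return (a, b) in kg_edges or (b, a) in kg_edges

    # G-Inter: links between context entities and response entities.
    g_inter = sum(1 for c in context_entities
                  for r in response_entities if linked(c, r))
    # G-Inner: links among entity pairs within the response.
    g_inner = sum(1 for i, a in enumerate(response_entities)
                  for b in response_entities[i + 1:] if linked(a, b))
    return g_inter, g_inner
```

The "P" variants would restrict `kg_edges` to edges appearing in the predicted recommendation paths.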
In Table 3, DICR outperforms all baselines and obtains significant improvements on Hit and explainability. First, DICR achieves a 34% improvement on Hit compared to the best baseline ACRG, which verifies the effectiveness of the dual imitation mechanism for aligning the consistent behavior of the recommendation reasoning and conversation process. Second, DICR obtains 3.6% and 9.1% gains on G-Inter and G-Inner compared to the best C 2 -CRS, which shows that DICR prefers to generate logically coherent explanations within responses and across context and response. Third, DICR improves 30.7% and 18.3% on P-Inter and P-Inner compared to the best ACRG, which indicates that the conversation side of the dual imitation (i.e., KI and SI) can effectively identify and integrate recommendation paths for response generation. Fleiss' kappa (Fleiss, 1971) measures the agreement among the annotators. As the results in Table 4 show, the superiority of DICR on all indicators supports the observations from the automatic evaluations.

Human Evaluation
Ablation Study In Tables 2 and 3: (1) We separately remove the path imitation, knowledge imitation, and semantic imitation to examine their contributions, namely w/o PI, w/o KI, and w/o SI, respectively. In the results, the path imitation mainly benefits the inner coherence of explainability ("G/P-Inner"), which verifies its designed advantage of indirectly guiding the explanation logic via accurate recommendation paths. The knowledge imitation mainly benefits the recommendation hit ("Hit"), the inter coherence of recommendation ("G/P-Inter"), and the distinctness of responses ("Dist-1/2"), which verifies its designed advantage of refining the distribution of recommendation paths for accurate recommendation and encouraging diverse explanations in responses.
The semantic imitation also mainly benefits the inner coherence of explainability and is more important to the inter coherence of recommendation than the path imitation, which verifies its designed advantage of improving the semantics of responses by promoting inner and inter coherence.
(2) We remove the fusion gate, namely w/o FG. The results show that the dynamic information fusion mechanism contributes to the overall performance.

Case Study
In Figure 3, two cases from eight models are selected, among which DICR shows two advantages: (1) The items recommended by DICR are more accurate and more likely to have explicit multi-hop relations with the items mentioned by the user, consistent with "Hit" and "G/P-Inter" in Table 3; e.g., in Dialog-2, "Higher Ground" in the response and "Tower Heist" in the context share the actor "Nina Arianda". This is evidence of improving recommendation by tracking the user interest shift in conversation, which mainly benefits from the path imitation and the knowledge imitation, as verified by the ablation study in Table 3; (2) DICR naturally tells the items' relation as an explanation, consistent with "G/P-Inner" in Table 3 and "Explain." in Table 4; e.g., in Dialog-1, the director "Martin Campbell", the movie "GoldenEye", and the genre "thriller" derive from a recommendation path with coherent relations. This is evidence of improving conversation by involving the recommendation path as an explanation, which mainly benefits from the semantic imitation, as verified in Table 3.

Conclusions
We propose DICR, which adopts dual imitation to explicitly align the recommendation and conversation behavior of CRS. Using recommendation paths and conversational rewards as alignment signals for tight interaction between recommendation and conversation, DICR achieves accurate recommendations and coherent explanations in generated responses. The effectiveness of DICR is verified by novel explainability evaluations together with human and existing automatic metrics.

Limitations
We discuss two main limitations of this work, which can be further studied in future work. The first is the reliance on explicit knowledge in the knowledge graph. Although using a knowledge graph is a common advantage of most current CRS studies, and explicit relations between entities lead to effective and reliable reasoning for recommendation, there is still a large amount of implicit knowledge in unstructured resources that cannot be extracted as explicit triplets, e.g., the multidimensional similarity between entities, but that can further supplement the dialog context. The second is the task of next-turn recommendation. Although the modeling of user interest shift, as the main contribution of this work, significantly improves the performance of making recommendations in the next-turn response, the user interest shift modeling can also naturally help guide user interests towards proper recommendations through smooth and persuasive multi-turn conversations. To address this limitation, in the future, we will extend the idea to align the KG-based reasoning and conversation process towards multi-turn conversational recommendation.

A Dataset
The statistics of OpenDialKG after preprocessing are in Table 5. We did not employ other CRS datasets because, compared with OpenDialKG, dialogs in other datasets such as REDIAL (Li et al., 2018) mention the recommended items without rich related information and tend to mention only movie names rather than an in-depth discussion of movie preference, which is considered the recommendation explanation in this paper. As reported in CRFR (Zhou et al., 2021), OpenDialKG's advantages improve the performance of CRFR and the compared CRS baselines in our experiments.

B Analysis of the Number of Recommendation Paths
We analyze the influence of the number of recommendation paths on Hit and explainability; Figure 4 presents the results. First, as shown in Figure 4(a), as the number of recommendation paths used as alignment signals from the recommendation side of the dual imitation increases, the Hit scores slightly improve with fluctuations. This indicates that the conversation side of the dual imitation can effectively identify the golden recommendation paths and prompt the conversation process to align with the recommendation reasoning.
Second, in Figures 4(b) and 4(c), G-Inter/Inner and P-Inter/Inner both improve distinctly as the number of paths increases. This improvement is attributed to the knowledge imitation and semantic imitation endowing DICR with the ability to discern and integrate the coherent knowledge in the recommendation paths as the recommendation explanation in the response. This advantage aligns the recommendation reasoning with the explanation generation, which helps the model refine the discerned knowledge and present it in the generated response.

Figure 1 :
Figure 1: The interest shift process expressed in the conversation can guide the generation of an explainable recommendation path. The explainable recommendation path, in turn, can guide the generation of an explainable response containing accurate recommendations. Recommendation and conversation maximize mutual benefits through bidirectional guidance.

Figure 3 :
Figure 3: Cases generated by different models, indicating multi-hop entities and correct/incorrect relations.

Figure 4 :
Figure 4: The influence of the number of recommendation paths on Hit and explainability. As the number of recommendation paths increases, DICR improves on the Hit and explainability metrics and outperforms the best baseline in most cases.

σ(•) is the sigmoid function and W S,ϕ is a parameter matrix. For the response generation, an MLP layer merges the learned semantics o S into the hidden state o C of the conversation context as the initial hidden state of the decoder, where o S = o U if U is available; otherwise o S is the learned o S,P . In the inference stage, U is unknown.

Path-aware Response Generation We employ a GRU to integrate context and path information to generate a response, given the decoder state h t and the output states H C and {H P,i } of each path p i .

If a path p ∈ P, then R k,t = log (µ P,i ) + log (1 − µ P,i ), where i is the index of p in P; otherwise R k,t = 0. R s,t is the semantic similarity between the path segment generated at step t and the golden utterance: R s,t = log (I S,ϕ (o p,t , o U )) + log (1 − I S,ϕ (o p,t , o U )), where o p,t is the hidden state of the tokenized path segment encoded by the semantic encoder. The aggregated reward is R t = αR p,t + βR k,t + γR s,t + (1 − α − β − γ)R T,t , where α + β + γ ∈ [0, 1]. If the path is shorter than the maximum reasoning length, β = 0.

Table 1 :
Overall evaluation on recommendation. "w/o" refers to removing the component from DICR. " * " indicates statistical significance compared with the best baseline (t-test with p-value < 0.001).

Table 2 :
Overall evaluation on conversation. "w/o" and " * " have the same meaning as those in Table 1.

Table 3 :
Evaluation on Hit and explainability. "w/o" and " * " have the same meaning as those in Table 1.
In the human evaluation, we randomly sampled 200 contexts. Each context is associated with eight responses from the eight comparison models.


Table 5 :
Statistics of our dataset after preprocessing.