CRFR: Improving Conversational Recommender Systems via Flexible Fragments Reasoning on Knowledge Graphs

Although paths of user interest shift in knowledge graphs (KGs) can benefit conversational recommender systems (CRS), explicit reasoning on KGs has not been well considered in CRS, due to the complexity of high-order and incomplete paths. We propose CRFR, which effectively performs explicit multi-hop reasoning on KGs with a conversational context-based reinforcement learning model. Considering the incompleteness of KGs, instead of learning a single complete reasoning path, CRFR flexibly learns multiple reasoning fragments that are likely to be contained in the complete paths of interest shift. A fragments-aware unified model is then designed to fuse the fragment information from item-oriented and concept-oriented KGs to enhance the CRS response with entities and words from the fragments. Extensive experiments demonstrate CRFR's SOTA performance on recommendation, conversation and conversation interpretability.


Introduction
Different from traditional one-shot recommendation systems (Jannach et al., 2020), conversational recommender systems (CRS) obtain users' interests through multi-turn conversation, and make recommendations with responses. A typical CRS consists of two parts: a recommender and a response generator. The recommender aims to understand users' dynamic preference from contextual utterances and to find the items that best match the preference. Subsequently, the response generator aims to generate appropriate sentences that ask for more information or present the recommended items with a related explanation. The recommender and the response generator are expected to be mutually beneficial.
However, contextual utterances are usually insufficient to understand users' preference (Zhou et al., 2020a). External knowledge, especially knowledge graphs (KGs), helps CRS to alleviate the problem (Chen et al., 2019; Zhou et al., 2020a). Existing KG-based CRS still have some issues.
Figure 1: An example of conversational movie recommendation. Context: "I'm looking for a suspense movie like Shutter Island, and The Prestige is also my favorite." Response: "I recommend Inception with Christopher Nolan and Leonardo DiCaprio, it has an unpredictable story." In item-oriented and concept-oriented KGs, multiple high-order reasoning paths contain fragments, from which items and concepts support the recommendation and the interpretability of response.
The first issue is the lack of explicit reasoning, especially high-order explicit reasoning, in KGs to track the deep shift of user interest in conversations. For example, in Figure 1, the user prefers "Shutter Island" and "The Prestige" due to certain attributes, i.e., actor "DiCaprio" and director "Nolan". In this case, both "DiCaprio" and "Nolan" lead to a 2-order explicit reasoning of the user's interest shift to "Inception" in the KG. Such multiple high-order explicit reasoning paths are strong evidence for recommending "Inception". Besides the item KG, explicit high-order reasoning in the concept KG can further describe the shift of concept words of interest, e.g., from "suspense" to "unpredictable" in Figure 1, and enhance the interpretability of the response. However, because KG nodes usually have many neighbors, effective high-order explicit reasoning is challenging. Therefore, instead of explicit reasoning, one alternative in existing KG-based CRS (Chen et al., 2019; Zhou et al., 2020a) is to aggregate the neighborhood of interest entities in the context, which can implicitly gather high-order relations through multi-turn aggregation. Another alternative strategy (Ma et al., 2020) is to explicitly grow a high-order tree in the KG starting from interest entities to cover the potential preference, without aiming at the best recommended items. Both strategies can improve CRS, but neither fully exploits the advantage of tracking interest shift in the KG.
The second issue is that KGs are usually too incomplete to track the path of interest shift. For example, in Figure 1, the user mentioned the two movies probably because he/she likes "Nolan" and "DiCaprio". Multiple paths in the KG help to locate the subspace of the user's interest and generate interpretive utterances in line with people's dialogue behavior, e.g., "...Inception with Christopher Nolan and Leonardo DiCaprio...". However, KGs cannot record all relations of interest entities involved in diverse real-world dialogues (Ma et al., 2020; Hayati et al., 2020; Sarkar et al., 2020). Therefore, it is often difficult to obtain a complete reasoning path of interest shift within a limited number of hops.
To address these issues, we improve Conversational Recommender Systems with Flexible Fragments Reasoning on KGs (CRFR). CRFR uses an explicit multi-hop reasoning method to model the user's interest shift in a conversation with respect to an item-oriented KG (DBpedia (Bizer et al., 2009)) and a concept-oriented KG (ConceptNet (Speer et al., 2017)). We formalize the interest shift as a Markov Decision Process and propose an explicit policy-guided reasoning model based on reinforcement learning. Because a complete reasoning path may not exist within a limited number of hops, instead of finding a single best reasoning path, the learned policy flexibly obtains a set of high-order optimal path fragments. The obtained path fragments point to the destination of the user's interest shift and are likely to be contained in complete interest shift paths.
To make recommendations with multiple reasoning path fragments, we further design a fragments-aware recommendation module that fuses fragment information to learn the final representation of user preference. Subsequently, in a reasoning information enhanced dialog module, the user preference representation is integrated into the response at the token level. In this way, both the entities and words in reasoning path fragments can be flexibly involved and thus enhance the informativeness and interpretability of the generated response.
Our contributions are summarized as follows: (1) To explicitly model the multiple shift paths of a user's interest in a conversation, we propose a multi-hop policy reasoning model based on reinforcement learning with respect to item-oriented and concept-oriented KGs.
(2) To avoid the difficulty of exact reasoning in incomplete KGs, CRFR flexibly obtains an optimal set of path fragments from heterogeneous KGs.
(3) To effectively use the obtained fragments, CRFR dynamically encodes the fragments to facilitate the recommendation and dialog response.
(4) Extensive experiments demonstrate that CRFR exceeds the state-of-the-art baselines in recommendation accuracy and generates high-quality responses with more interpretability.

Related Work
Conversational recommender systems (CRS) can be divided into two types. One is agent-driven CRS, known as "system ask, user respond" (Gao et al., 2021; Zou et al., 2020). This strategy obtains user preference by asking about predefined attributes (Sun and Zhang, 2018; Lei et al., 2020a) and responds according to the user's feedback (Xu et al., 2021). Graph reasoning (Lei et al., 2020b) and comment information can make the recommendation process of agent-driven CRS interpretable. Although agent-driven CRS is popular in specific fields, its reliance on predefined templates limits its generalization ability.
The other is user-oriented CRS, known as "user talk, system understand" (Kang et al., 2019; Liao et al., 2019; Zhou et al., 2020b). Due to insufficient context, user-oriented CRS often need the help of external knowledge, e.g., knowledge graphs (KGs). Closely related to our work, Chen et al. (2019) integrate KGs to connect the recommender and response generation. Zhou et al. (2020a) further use item-level and word-level KGs to align two different semantic spaces. However, these two approaches only implicitly aggregate neighbor information with GCN-based methods, instead of conducting explicit reasoning in KGs to track the deep user preference shift.
An ideal explicit KG reasoning method for CRS is supposed to accurately describe the shift process of user interests using a single complete path. However, this is usually challenging due to the incompleteness of KGs, the large search space and the diversity of ways in which users express interests. Different from the approaches looking for a single path (Moon et al., 2019), the reasoning tree (Ma et al., 2020) executes relatively undirected reasoning and lacks concept relations in the reasoning.

Preliminaries
In general, a knowledge graph $G$ with entity set $E$ and relation set $R$ is defined as $G = \{(e, r, e') \mid e, e' \in E, r \in R\}$, where each triplet $(e, r, e')$ represents a relation $r$ from entity $e$ to entity $e'$. In this paper, we denote the item-oriented knowledge graph DBpedia (Bizer et al., 2009) as $G_D$ and the concept-oriented knowledge graph ConceptNet (Speer et al., 2017) as $G_C$. Correspondingly, their entity sets are $E_D$ and $E_C$, respectively. A multi-hop complete reasoning path $p_{e_0, e_t}$, which connects a starting entity $e_0$ and a target entity $e_t$, is a sequence of entities linked by relations:

$$p_{e_0, e_t} = \{ e_0 \xrightarrow{r_1} e_1 \xrightarrow{r_2} \cdots \xrightarrow{r_t} e_t \}.$$

In this work, given an $n$-turn conversation history $H$, which contains the utterances of each turn, CRFR takes the conversation history $H$ with the user $u$ and the graphs $G_D$ and $G_C$ as input. It utilizes context information to perform flexible policy reasoning on $G_D$ and $G_C$ to obtain a set $P$ of high-order path fragments that point to the destination of the user's interest shift, and further uses $P$ as a guide to output the response containing the recommended item and the explanation of the recommendation.
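The definitions above can be illustrated with a small sketch. It stores a KG as a list of triplets and checks whether a candidate multi-hop path follows edges of the graph. The class and function names, as well as the relation labels in the toy graph, are hypothetical and chosen only for illustration; the entities come from the Figure 1 example.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def build_adjacency(triples):
    """Map each entity to its outgoing (relation, neighbor) pairs."""
    adj = defaultdict(list)
    for t in triples:
        adj[t.head].append((t.relation, t.tail))
    return adj


def is_valid_path(adj, entities, relations):
    """Check that e_0 -r_1-> e_1 ... -r_t-> e_t follows edges of the graph."""
    if len(entities) != len(relations) + 1:
        return False
    return all(
        (rel, tail) in adj[head]
        for head, rel, tail in zip(entities, relations, entities[1:])
    )


# Toy item-oriented graph; the relation names are illustrative assumptions.
toy_triples = [
    Triple("Shutter Island", "starring", "Leonardo DiCaprio"),
    Triple("Leonardo DiCaprio", "starring_of", "Inception"),
    Triple("The Prestige", "director", "Christopher Nolan"),
    Triple("Christopher Nolan", "director_of", "Inception"),
]
adj = build_adjacency(toy_triples)
path_ok = is_valid_path(
    adj,
    ["Shutter Island", "Leonardo DiCaprio", "Inception"],
    ["starring", "starring_of"],
)
```

Here `path_ok` corresponds to one of the two 2-hop paths from the context movies to "Inception" discussed in the introduction.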

Methodology
In this section, we introduce our proposed CRFR model. The overall framework is shown in Figure 2. CRFR executes flexible multi-hop policy reasoning on KGs $G_D$ and $G_C$ to obtain high-order path fragments that point to the destination of the user's interest shift on the two KGs, respectively. Path fragments from $G_D$ help the fragments-aware recommendation module to obtain a more accurate user preference representation. Further, the reasoning information enhanced dialog module integrates the user's preference representation at the token level to flexibly select two types of path fragment information to guide response generation.
[...] (Edwards and Xie, 2016; Kipf and Welling, 2017) to encode knowledge graphs $G_D$ and $G_C$, respectively. A representation model of entities is pre-trained by adjusting the two KGs to an identical semantic space. We formalize the user interest shift as a Markov Decision Process (MDP), and learn the shift policy by performing multi-hop path reasoning in KGs. This section uses $G_D$ as an MDP environment to learn the path reasoning policy. A similar method is applied to $G_C$.

Flexible Policy Reasoning
State. The initial representation $p_{u_0}$ of the current user $u$ is the aggregate embedding of the entities common to $G_D$ and $H$, obtained by the pre-trained representation model. $p_{u_0}$ is used as the starting state $s_0 \in S$ of path reasoning, i.e., $s_0 = p_{u_0}$. $s_t \in S$ is the state representation at the $t$-th step, obtained by concatenating $p_{u_0}$, the embeddings of the reasoning history entities $h_t = (e_0, e_1, \ldots, e_{t-1})$ and the entity $e_t$ reached at the $t$-th step. Formally, $s_t = [p_{u_0}; h_t; e_t]$.
Action. For state $s_t$, its action space $A_t$ consists of all neighbors of $e_t$ in the KG, except for the entities that have already appeared in the path.
For nodes with a large number of neighbors, we design a pruning function $g(u, a)$ to select important neighbors: $g(u, a) = p_{u_0}^{\top} e_a$, where $e_a$ is the embedding of entity $a$. The trimmed action space is $\tilde{A}_t = \{a \in A_t \mid g(u, a) \text{ is among the top-}k \text{ scores in } A_t\}$, where the size $k$ of the trimmed action space is a hyperparameter set separately for each KG. In particular, the action space $A_0$ of the initial state $s_0$ is the set of entities mentioned in $H$.
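The pruning step can be sketched as follows. This is a minimal illustration that assumes embeddings are plain lists of floats and that neighbors are given as a name-to-embedding dictionary; the function name `prune_actions` and the toy vectors are hypothetical.

```python
def prune_actions(user_vec, neighbor_vecs, k):
    """Keep the k neighbors with the largest score g(u, a) = p_u0 . e_a."""

    def score(name):
        return sum(p * e for p, e in zip(user_vec, neighbor_vecs[name]))

    ranked = sorted(neighbor_vecs, key=score, reverse=True)
    return ranked[:k]


# Toy example: the user vector points along the first dimension.
p_u0 = [1.0, 0.0]
neighbors = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5]}
kept = prune_actions(p_u0, neighbors, k=2)
```

With these toy vectors the scores are 0.9, 0.2 and 0.5, so the trimmed action space keeps `a` and `c`.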
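Transition. Given the current state $s_t = [p_{u_0}; h_t; e_t]$, the agent chooses $a_t \in \tilde{A}_t$ from the trimmed action space as the next action. The next state is $s_{t+1} = [p_{u_0}; h_{t+1}; e_{t+1}]$, where $e_{t+1} = a_t$ and $h_{t+1} = (e_0, e_1, \ldots, e_t)$.
Reward. Intuitively, we reward the decision made at each step by how well it matches the user's preference. Inspired by Lv et al. (2020), we adopt a soft reward, which uses the cosine similarity $\mathrm{sim}_{e_T, a_t}$ between $e_T$ and the action entity $a_t$ to compute the reward $R_t$ at step $t$:

$$R_t = \begin{cases} \alpha, & \mathrm{sim}_{e_T, a_t} \ge \varepsilon_1 \\ \mathrm{sim}_{e_T, a_t} + \beta, & \varepsilon_1 > \mathrm{sim}_{e_T, a_t} \ge \varepsilon_2 \\ \max(0, \mathrm{sim}_{e_T, a_t}), & \text{otherwise,} \end{cases}$$

where $e_T$ is the embedding of the target entity or target concepts, and $\alpha$, $\beta$, $\varepsilon_1$ and $\varepsilon_2$ are hyperparameters.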
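The soft reward can be sketched in a few lines. The sketch reads the reward as three similarity bands separated by the thresholds ε1 and ε2, and it assumes that the top band returns the constant α; the hyperparameter values below are made up for illustration and are not the settings used in the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def soft_reward(sim, alpha, beta, eps1, eps2):
    """Piecewise reward on the similarity between target and action entity."""
    if sim >= eps1:
        return alpha  # assumed value of the top band
    if sim >= eps2:
        return sim + beta
    return max(0.0, sim)


# Hypothetical hyperparameters, for illustration only.
HYPER = dict(alpha=2.0, beta=0.5, eps1=0.9, eps2=0.5)
reward_close = soft_reward(cosine([1.0, 0.0], [1.0, 0.0]), **HYPER)
reward_far = soft_reward(cosine([1.0, 0.0], [-1.0, 0.0]), **HYPER)
```

The last branch clips negative similarities to zero, so an action pointing away from the target is never penalized below a zero reward.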

Figure 2: Overall framework of CRFR (panel labels: Dialogue Context, Actor-Critic Network). [...] and ConceptNet ($G_C$), which are beneficial to recommendation and dialogue. Here, $p_{D_i}$ and $p_{C_i}$ are reasoning path fragments in $G_D$ and $G_C$, respectively. $p_u$ is the final user preference representation.
The goal of the agent is to learn a path-finding policy $\pi(a_t, s_t, \tilde{A}_t)$, which calculates the probability distribution over actions $a_t$ based on the current state $s_t$ and the trimmed action space $\tilde{A}_t$. At the same time, the agent learns a critic network, which calculates the value of $a_t$ according to $s_t$. We use two fully connected layers as the policy network, and each layer has an ELU activation function and a dropout layer. The output of the network is then sent to a softmax layer and a fully connected layer to obtain the probability $p(a_t \mid s_t, \tilde{A}_t) = \pi(a_t, s_t, \tilde{A}_t)$ and the value $q(a_t)$ of each action. The learning goal is to maximize the expected cumulative reward over all users, discounted by the parameter $\gamma$, with respect to the parameters $\theta_{RL}$ to be learned. Following the REINFORCE algorithm, the gradient of this learning objective is estimated from $G_t$, the discounted cumulative reward from state $s_t$ to the final time step $T$.
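For reference, the objective and its gradient can be written in the standard policy-gradient form. This is a sketch rather than the exact equations of the model: in particular, using the critic value $q(a_t)$ as a baseline inside the gradient and the summation limits are assumptions.

```latex
J(\theta_{RL}) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t} R_{t}\Big],
\qquad
\nabla_{\theta_{RL}} J(\theta_{RL}) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T}
  \nabla_{\theta_{RL}} \log \pi(a_t, s_t, A_t)\,\big(G_t - q(a_t)\big)\Big],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} R_{k}.
```

Dropping the baseline term $q(a_t)$ recovers plain REINFORCE; the baseline only reduces the variance of the estimate and does not change its expectation.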
Flexible Fragments Reasoning. Due to the incompleteness of KGs (Sarkar et al., 2020), as a key idea of this work, instead of only modeling a user's interest with the destinations of complete reasoning paths, we prefer to model the user's interest shift with partial reasoning paths, i.e., reasoning fragments.
After training the two policy networks, guided by the probability of each action given by the policy network, we employ beam search to explore the candidate path fragments $P_{candidate}$ on the two KGs $G_D$ and $G_C$, respectively. We select the fragments with the top generating probabilities. The selected fragments are supposed to be the most consistent with the user's interest shift process, and will be used in the following fragments-aware recommendation and reasoning information enhanced dialog modules.
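The fragment search can be sketched as a beam search over action probabilities. The sketch below assumes the trained policy is exposed as a function `policy(path)` that returns a dictionary from candidate next entities to probabilities; this interface, the toy probabilities and the equal initial score for every mentioned entity are illustrative assumptions.

```python
import math


def beam_search(start_entities, policy, max_hops, beam_width):
    """Return the most probable path fragments as (log-probability, path) pairs."""
    # Each mentioned entity starts a beam with log-probability 0.
    beams = [(0.0, [e]) for e in start_entities]
    for _ in range(max_hops):
        candidates = []
        for logp, path in beams:
            for action, prob in policy(path).items():
                if action in path or prob <= 0.0:
                    continue  # entities already on the path are excluded
                candidates.append((logp + math.log(prob), path + [action]))
        if not candidates:
            break  # no beam can be extended any further
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Toy policy keyed on the last entity of the path; probabilities are made up.
TOY_POLICY = {
    "Shutter Island": {"Leonardo DiCaprio": 0.7, "Martin Scorsese": 0.3},
    "Leonardo DiCaprio": {"Inception": 0.8, "Titanic": 0.2},
    "Martin Scorsese": {"Taxi Driver": 1.0},
}


def toy_policy(path):
    return TOY_POLICY.get(path[-1], {})


fragments = beam_search(["Shutter Island"], toy_policy, max_hops=2, beam_width=2)
```

In this simplified version a beam that cannot be extended is dropped at the next hop, so all returned fragments have the same length; keeping shorter fragments would require carrying finished beams forward.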

Fragments-aware recommendation
Given the fragments obtained by reasoning in the item-oriented KG $G_D$, we design a fragments-aware recommendation module to improve the representation of user preference.
First, we collect the beginning nodes of all path fragments, that is, the more important entities selected from those mentioned in $H$. The embeddings of these entities are obtained by a simple lookup operation. We stack the embeddings of the beginning entities into the matrix $B^{(H)}$. Similarly, we stack the embeddings of the destinations of all path fragments into the matrix $D^{(H)}$. Next, we apply the self-attention mechanism to obtain a single aggregate representation of each matrix. Specifically, we use self-attention with learnable parameters $W^1_\alpha$ and $W^2_\alpha$ to learn the representation $b^{(H)}$ of $B^{(H)}$. The aggregated representation $d^{(H)}$ of $D^{(H)}$ is obtained in the same way.
Then, we design an interactive aggregation method to obtain the path fragments' fusion information $p_{agg}$, where $W^1_{agg}$ and $W^2_{agg}$ are learnable parameters and $\odot$ denotes the element-wise product. Next, we use two gate networks to fuse the information of entities ($p_{u_0}$ in $G_D$) and the information of concepts ($p_{u_0}$ in $G_C$) in $H$, respectively, to strengthen the user's preference representation. The gates use the sigmoid function $\sigma(\cdot)$, the concatenation operation $\|$, and a parameter $W_g$ that is learned separately by each of the two gate networks. The final fused representation $p_{agg}$ is the user's preference representation $p_u$.
Finally, we take the inner product of user and item representations to predict their matching score: $y(u, i) = \mathrm{softmax}(p_u^{\top} e_i)$, where $e_i$ is the embedding of item entity $i$.
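One way to instantiate this module is sketched below in plain Python. The concrete functional forms are assumptions made for illustration: attention pooling with weights softmax(w2 · tanh(W1 x)), an interactive aggregation of the form W1 (b + d) + W2 (b ⊙ d), and a gate g = σ(Wg [p_agg ‖ p_u0]) that mixes p_agg and p_u0. Only the ingredients (self-attention, element-wise product, sigmoid gate, concatenation, softmax over inner products) are taken from the description above; all function names are hypothetical.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(rows, W1, w2):
    """Assumed attention pooling: weights = softmax(w2 . tanh(W1 x_j))."""
    logits = [sum(a * math.tanh(h) for a, h in zip(w2, matvec(W1, x))) for x in rows]
    weights = softmax(logits)
    return [sum(w * x[i] for w, x in zip(weights, rows)) for i in range(len(rows[0]))]


def interact(b, d, W1, W2):
    """Assumed interactive aggregation: W1 (b + d) + W2 (b * d), * element-wise."""
    added = matvec(W1, [x + y for x, y in zip(b, d)])
    multiplied = matvec(W2, [x * y for x, y in zip(b, d)])
    return [x + y for x, y in zip(added, multiplied)]


def gate(p_agg, p_u0, Wg):
    """Assumed gate: g = sigmoid(Wg [p_agg || p_u0]); returns g*p_agg + (1-g)*p_u0."""
    g = [1.0 / (1.0 + math.exp(-z)) for z in matvec(Wg, p_agg + p_u0)]
    return [gi * a + (1.0 - gi) * u for gi, a, u in zip(g, p_agg, p_u0)]


def recommend(p_u, item_embeddings):
    """y(u, i) = softmax over items of the inner product p_u . e_i."""
    return softmax([sum(a * b for a, b in zip(p_u, e)) for e in item_embeddings])


# Toy run with identity projections and a zero-initialised gate.
I2 = [[1.0, 0.0], [0.0, 1.0]]
b = attend([[1.0, 0.0], [0.0, 1.0]], I2, [1.0, 1.0])
d = attend([[1.0, 1.0]], I2, [1.0, 1.0])
p_agg = interact(b, d, I2, I2)
p_u = gate(p_agg, [1.0, 0.0], [[0.0] * 4, [0.0] * 4])
scores = recommend(p_u, [[1.0, 0.0], [0.0, 1.0]])
```

The sketch applies a single gate; the second gate for the concept-side representation would be applied analogously with its own parameter.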

Reasoning information enhanced dialog
To enhance the dialog, we use the Transformer decoder (Vaswani et al., 2017) to merge the user's preference with the information of the path fragments from $G_D$ and $G_C$, respectively, at the token level. The decoder can flexibly select semantics from the two kinds of reasoning fragments to enhance the informativeness and interpretability of the generated response. Intuitively, human interest shift is a continuous process, and the reasoning paths provide explanations of recommendations. Therefore, we encode each path fragment $p_{D_i} \in P_D$ inferred from $G_D$ into a path vector $m^{(H)}$, and concatenate the path vectors (with the concatenation operation $\|$) into the matrix $M^{(H)}_D$. For the concept information inferred from $G_C$, we merge the embeddings of the concept words of each order on the path fragments into the matrix $N^{(H)}_C$. In the decoder's $i$-th layer, we first fuse the output of the self-attention sub-layer with the word bias of the user preference representation: $\tilde{V}^{i-1} = V^{i-1} + Z(p_u)$, where $Z: \mathbb{R}^{d_u} \to \mathbb{R}^{d_w}$. Next, we add two multi-head attention modules, so that the output $\tilde{V}^{i-1}$ after fusing user preference attends to the path information $M^{(H)}_D$ and the concept information $N^{(H)}_C$ in turn. Finally, the output embedding matrix $V^i$ of the $i$-th layer is obtained through a fully connected feed-forward network.
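The order of operations inside one decoder layer can be sketched as follows. This is a heavily simplified illustration: it uses single-head dot-product attention without learned projections, layer normalization or the feed-forward sub-layer, and it models $Z$ as a plain matrix. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention with a residual connection."""
    d = len(memory[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, m)) / math.sqrt(d) for m in memory]
        weights = softmax(logits)
        context = [sum(w * m[i] for w, m in zip(weights, memory)) for i in range(d)]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


def fuse_preference(V, p_u, Z):
    """Add the user-preference bias Z(p_u) to every token representation."""
    bias = [sum(z * p for z, p in zip(row, p_u)) for row in Z]
    return [[v + b for v, b in zip(token, bias)] for token in V]


def decoder_fusion(V, p_u, Z, M_D, N_C):
    """Preference bias first, then attention over paths, then over concepts."""
    V = fuse_preference(V, p_u, Z)
    V = attend(V, M_D)
    return attend(V, N_C)


# Toy run: two tokens of width 2, a 1-dimensional preference vector.
V_prev = [[1.0, 2.0], [0.0, 1.0]]
out = decoder_fusion(V_prev, [1.0], [[1.0], [1.0]], [[0.0, 0.0]], [[0.0, 0.0]])
```

With all-zero memories the two attention steps leave the tokens unchanged, so the toy output only shows the effect of the preference bias.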
The merged path fragment information is expected to increase the likelihood that reasoning entities or words appear in the response. In this way, we can significantly improve the coherence and interpretability of the response, as demonstrated in the experimental section.

Optimization
We train the parameters in four steps. First, we pre-train the entity representations of the KGs as in Sec. 4.1. Then, we use Eq. 3 to optimize the policy parameters $\theta^{G_D}_{RL}$ and $\theta^{G_C}_{RL}$ of the two agents. Next, to optimize the recommendation, we adopt a cross-entropy loss computed over items (indexed by $i$) and conversations (indexed by $j$). After completing the optimization of recommendation, we obtain the parameters $\theta_{Rec}$. On this basis, we use the loss function of the generative model, $L_{Gen}$, to learn the dialogue parameters $\theta_{Gen}$, where $K$ is the number of turns in a conversation $H$. We calculate $L_{Gen}$ for each utterance $h_k$ from $H$ and perform gradient descent to update the parameters.
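As a point of reference, the two losses can be written in their most common forms. Both expressions are assumptions for illustration: $y_{ij}$ denotes a hypothetical 0/1 label marking the ground-truth item $i$ of conversation $j$, $\hat{y}_{ij}$ the predicted matching score $y(u, i)$ for that conversation, and the generation loss is taken to be the average negative log-likelihood of each utterance given the preceding ones.

```latex
L_{Rec} = -\sum_{j}\sum_{i} y_{ij}\,\log \hat{y}_{ij},
\qquad
L_{Gen} = -\frac{1}{K}\sum_{k=1}^{K} \log P\big(h_k \mid h_1, \ldots, h_{k-1}\big).
```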

Experimental Setup
Dataset. Experiments are conducted on the ReDial dataset (Li et al., 2018). ReDial has 10,006 multi-turn conversations and a total of 182,150 utterances of movie recommendation seekers and recommenders. Movies mentioned in utterances are manually annotated. We identified unlabeled entities in utterances by NER and linked them to DBpedia nodes. We divided the data into training, validation and test sets according to the ratio of 8:1:1.
Table 1: Evaluation of recommendation. +RandomWalk means using random walk to replace the policy reasoning part. (t-test with p-value < 0.05)
Baselines. The baselines are as follows. Popularity sorts the items only according to historical recommendation frequency. TextCNN (Kim, 2014) is a recommendation model that learns user preference representations by CNN-based encoding of contextual utterances. Transformer (Vaswani et al., 2017) is a dialog generation model based on the classical Transformer framework. REDIAL (Li et al., 2018) is the benchmark CRS model on the ReDial corpus, which mainly combines HRED (Sordoni et al., 2015; Serban et al., 2016), an autoencoder-based recommender system and a sentiment analysis module. KBRD (Chen et al., 2019) is a knowledge-based CRS model that only uses DBpedia to enhance the user's representation. KGSF (Zhou et al., 2020a) is a state-of-the-art CRS model that aligns two KGs with mutual information maximization for conversational recommendation.
Implementation Details. The default parameter settings can be found in the appendix.

Evaluation on Recommendation
To evaluate the recommendation performance of our model, we adopt the widely used Recall metrics, including Recall@1, Recall@10 and Recall@50. They evaluate whether the top-k automatically recommended items contain the ground-truth item recommended by the human recommender.
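Recall@k as described here can be sketched as follows, assuming one ground-truth item per test instance and a ranked list of candidate items from the model; the function names and toy data are hypothetical.

```python
def recall_at_k(ranked_items, ground_truth, k):
    """1.0 if the ground-truth item appears among the top-k items, else 0.0."""
    return 1.0 if ground_truth in ranked_items[:k] else 0.0


def mean_recall_at_k(predictions, k):
    """Average Recall@k over (ranked_items, ground_truth) pairs."""
    hits = [recall_at_k(ranked, truth, k) for ranked, truth in predictions]
    return sum(hits) / len(hits)


# Toy predictions: the ground truth "B" is ranked 2nd and 3rd, respectively.
toy_predictions = [(["A", "B", "C"], "B"), (["C", "A", "B"], "B")]
r1 = mean_recall_at_k(toy_predictions, 1)
r2 = mean_recall_at_k(toy_predictions, 2)
r3 = mean_recall_at_k(toy_predictions, 3)
```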
Overall Evaluation. As shown in Table 1, REDIAL exceeds the classical recommendation models, i.e., Popularity and TextCNN, by using items in the dialogue contexts. KBRD and KGSF achieve further improvement by fusing the entity and item information in the knowledge graph. Our model achieves the best results by using explicit multi-hop policy reasoning to obtain the optimal set of path fragments that capture the user's interest shift, and outperforms the best baselines by a large margin.
The Effect of Policy Reasoning. Here, we specifically examine the contribution of policy reasoning in our model. To this end, we replace the multi-hop policy reasoning with random walk (Spitzer, 2013) in the knowledge graph to obtain multi-hop path fragments. As shown in Table 1, the use of random walk significantly reduces the recommendation performance. This happens because random walk is undirected, while a significant advantage of multi-hop policy reasoning is that it is guided by the real user interest shift.
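For comparison with the policy-guided search, a random-walk fragment sampler can be sketched as below. It assumes a uniform choice among unvisited neighbors at each hop; the exact random-walk variant used for the +RandomWalk baseline is not specified here, and the toy graph is illustrative.

```python
import random


def random_walk(adj, start, max_hops, rng):
    """Sample one path fragment by moving to a uniformly chosen unvisited neighbor."""
    path = [start]
    for _ in range(max_hops):
        options = [n for n in adj.get(path[-1], []) if n not in path]
        if not options:
            break  # dead end: every neighbor is already on the path
        path.append(rng.choice(options))
    return path


toy_adj = {
    "Shutter Island": ["Leonardo DiCaprio", "Martin Scorsese"],
    "Leonardo DiCaprio": ["Inception", "Shutter Island"],
    "Martin Scorsese": ["Shutter Island"],
}
walk = random_walk(toy_adj, "Shutter Island", max_hops=2, rng=random.Random(0))
```

Unlike the learned policy, nothing in this sampler depends on the conversation, which is the sense in which the walk is undirected.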
The Effect of Explicit Reasoning. We also specifically examine the contribution of explicit reasoning in our model. As shown in Figure 3, compared with the baselines, the advantage of our model becomes more significant as the average number of neighbors of KG nodes increases. This happens because more neighbors mean more noise for reasoning. Implicit reasoning through aggregating all direct neighbors is very sensitive to such noise. In contrast, our explicit high-order reasoning is more effective at finding correct reasoning directions among noisy neighbors.

Evaluation on Conversation
Diversity and Informativeness. To evaluate the conversation, we first calculate "Distinct-n" to measure the diversity of responses, and "Item Ratio", the proportion of responses containing items, to evaluate the informativeness of responses.
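Both metrics can be sketched in a few lines. The sketch assumes whitespace tokenization, a corpus-level Distinct-n normalized by the total number of n-grams, and item detection by substring match against a given set of item names; each of these choices is an assumption, since the normalization and item-matching details are not spelled out here.

```python
def distinct_n(responses, n):
    """Number of unique n-grams divided by the total number of n-grams."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def item_ratio(responses, item_names):
    """Proportion of responses that mention at least one item."""
    with_item = sum(any(item in r for item in item_names) for r in responses)
    return with_item / len(responses) if responses else 0.0


toy_responses = ["i recommend inception", "i recommend titanic", "hello there"]
dist1 = distinct_n(toy_responses, 1)
dist2 = distinct_n(toy_responses, 2)
ratio = item_ratio(toy_responses, {"inception", "titanic"})
```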
As shown in Table 2, CRFR outperforms all baselines on corpus-level language diversity and greatly improves the item ratio. [...] significant advantages in modeling the relationship between words and items using the self-attention mechanism. On this basis, KBRD and KGSF utilize external knowledge to increase the occurrence of items, but they only use low-order KG information of entities and concepts mentioned in the context. Our method further increases the occurrence of valuable items and entities consistent with user interest by integrating the high-order reasoning path fragments into response generation. In this way, our model enhances the informativeness of responses.
Figure 3: The effect of explicit reasoning. As the average number of neighbors increases, the performance of our model is always higher than the SOTA baseline, and the improvement becomes more and more significant.
Interpretability. Secondly, we evaluate the interpretability of responses by examining whether there are two logically linked entities in the response or across the response and the context. "Logically linked" means being linked in DBpedia. The idea is that containing two logically linked entities is often a necessary condition for an interpretive expression in a response. In Table 3, "2ER" (2 Entities Ratio) indicates the proportion of responses containing at least two entities among all responses containing entities. "Inner-Con." counts the logically linked entity pairs in responses. "Inter-Con." counts the logically linked entity pairs across the response and context. As shown in Table 3, our model is better than the baselines on interpretability by a large margin. This happens because the entities in our explicit reasoning paths are logically linked by construction, which naturally increases the occurrence of logically linked entity pairs in the response or across the response and the context.
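The three interpretability statistics can be sketched as follows. The sketch assumes that entity linking has already been done, that KG links are treated as undirected, and that Inner-Con. and Inter-Con. count distinct linked pairs summed over all samples; these counting conventions and the function name are assumptions.

```python
def interpretability_counts(samples, kg_edges):
    """samples: list of (context_entities, response_entities) pairs."""

    def linked(a, b):
        return (a, b) in kg_edges or (b, a) in kg_edges

    with_entities = [resp for _, resp in samples if resp]
    two_entity = sum(len(set(resp)) >= 2 for resp in with_entities)
    inner = inter = 0
    for ctx, resp in samples:
        resp = sorted(set(resp))
        # Pairs of linked entities inside the response.
        inner += sum(linked(a, b) for i, a in enumerate(resp) for b in resp[i + 1:])
        # Pairs of linked entities across context and response.
        inter += sum(linked(a, b) for a in set(ctx) for b in resp if a != b)
    ratio = two_entity / len(with_entities) if with_entities else 0.0
    return {"2ER": ratio, "Inner-Con.": inner, "Inter-Con.": inter}


toy_edges = {
    ("Inception", "Christopher Nolan"),
    ("The Prestige", "Christopher Nolan"),
    ("Inception", "Leonardo DiCaprio"),
}
toy_samples = [
    (["The Prestige"], ["Inception", "Christopher Nolan"]),
    (["Titanic"], ["Avatar"]),
]
stats = interpretability_counts(toy_samples, toy_edges)
```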
Human Evaluation. In human evaluation, we examine "Flu." (fluency), "Coher." (coherence), "Info." (informativeness) and "Inter." (interpretability). Fluency and coherence are used to evaluate the language quality of generated responses. Informativeness evaluates whether the response has incorporated rich entity knowledge. Interpretability evaluates whether the response explains the reason for the recommended item. 100 multi-turn conversations are randomly sampled from the test set for human evaluation. The responses are scored from 1 to 3 for each indicator by 5 workers, and the average score is calculated. Results are shown in Table 4. Among all the baselines, KGSF has the best performance on all indicators. However, KGSF is more inclined to generate words that are repeated in the context and have no practical meaning, and its recommendation responses are less explanatory. Our model achieves the best performance on all human evaluation indicators. This further verifies the advantages of our model observed in automatic evaluation.
Ablation Study. In the ablation study, we examine three variants of CRFR: (1) "-Concept", which removes the reasoning path fragment information from $G_C$; (2) "-DBpedia", which removes the reasoning path fragment information from $G_D$; (3) "-Preference", which removes token-level user preference information. As shown in Table 2 and Table 3, all three features are indispensable. In particular, user preference information at the token level is the most essential; without it, the informativeness and interpretability drop significantly. Its function is to flexibly select the information of reasoning path fragments from $G_C$ and $G_D$ to improve the quality of the generated response.

Case Study
Four cases from three CRS models and the ground truth are shown in Table 5. Compared with the other two models, CRFR has four main advantages: (1) CRFR is more likely to recommend a specific film instead of chatting without recommendation, consistent with "Ratio" in Table 2; (2) The items recommended by CRFR are more likely to have an explicit relation with the items mentioned by the user, consistent with "Inter-Con." in Table 3, e.g., in the 1st case, "Jack and Jill" in the response and "Eight Crazy Nights" in the context share the actor "Adam Sandler". This benefits from our explicit multi-hop reasoning in the item-oriented KG; (3) In particular, one main advantage of CRFR is that it naturally states the items' relation as an explanation, consistent with "Inner-Con." in Table 3 and "Inter." in Table 4. Notably, this ability is often absent even in the human ground truth in the selected cases. This benefits from using the personalized embedding of reasoning fragments in the response decoder; (4) Another important advantage of CRFR is that it makes more friendly and persuasive explanations with descriptive words related to the feature words of user intention in the context, e.g., ghost vs. scary, cute vs. kid and laughs vs. comedy. This benefits from our explicit multi-hop reasoning in the concept-oriented KG.

Conclusion
We propose CRFR, which significantly improves the agent response in conversational recommendation by presenting items that have clearer higher-order relations with users' contextual intention and by including more persuasive explanations. As the essential advantage of this approach, explicit reasoning over high-order fragments in two heterogeneous knowledge graphs is performed by a reinforcement learning model. High-order path fragments obtained by explicit reasoning on the item-oriented KG help the model to better track the user preference shift in conversation. The same reasoning on the concept-oriented KG further improves the interpretability of responses with informative concept words. Heterogeneous fragments are encoded in a personalized way to finally enhance the response generation. Extensive experiments demonstrate that CRFR is superior to the SOTA baselines on recommendation, explanation and language quality.
In the future, we will explore how to make better use of path information to further improve the interpretability of responses.