CR-Walker: Tree-Structured Graph Reasoning and Dialog Acts for Conversational Recommendation

Conversational Recommender Systems (CRS), which explore user preferences through conversational interactions in order to make appropriate recommendations, have attracted growing interest. However, existing CRS still lack the ability to (1) traverse multiple reasoning paths over background knowledge to introduce relevant items and attributes, and (2) arrange the selected entities appropriately under the current system intent to control response generation. To address these issues, we propose CR-Walker, a model that performs tree-structured reasoning on a knowledge graph and generates informative dialog acts to guide language generation. Our tree-structured reasoning scheme views the entity traversed at each hop as part of a dialog act, linking how entities are selected with how they are expressed, which facilitates language generation. Automatic and human evaluations show that CR-Walker arrives at more accurate recommendations and generates more informative and engaging responses.


Introduction
Combining conversational assistants and recommender agents into one framework has attracted much research attention due to its significance and practical value (Sun and Zhang, 2018; Jannach et al., 2020), but creating a conversational recommender system (CRS) that bridges conversation and recommendation remains challenging.
One challenge lies in reasoning over background knowledge for accurate recommendation. Prior studies usually used context and knowledge as an enrichment to recommendation (Chen et al., 2019; Zhou et al., 2020a), but neglected to fully exploit the connection between entities to infer the system action. In particular, this requires the system to perform multi-path reasoning over background knowledge, since one entity may relate to several different entities through multi-hop paths. For example in Fig. 1, after the user mentions "Hemsworth", the agent chats about "Vacation" starring "Hemsworth", and further explores the user's interest in "Comedy" movies. It then recommends "Thor" based on several distinct reasoning paths over the user's preferences ("comedy" & "action").

Figure 1: First three turns of an example dialog. The dialog is shown on the left with entities on the KG in bold. The graph to the right of each dialog turn demonstrates the reasoning process of CR-Walker, with the reasoning tree marked red. Throughout this paper, candidate items are denoted with numbers, and generic classes / attributes with upper-/lower-case letters. The orange/blue color indicates that the entity is mentioned/unmentioned in the previous context.
Another challenge lies in fully utilizing the selected entities in response generation. Since different dialog actions can be applied in conversational recommendation, the selected entities need to be properly expressed under the guidance of dialog acts, an abstract representation of dialog semantics and intentions, in order to form natural, informative, and engaging utterances. However, most previous works (Moon et al., 2019; Lei et al., 2020a) stopped at inferring entities without modeling response generation. In Fig. 1 again, the agent first asks about the user's preferred genres and actors, then talks about the star and the movie to engage the user in the conversation, and finally recommends a movie based on the user's interests. In addition, the agent provides explanations in the third turn to make the recommendation more interpretable and persuasive.
To address these issues, we propose Conversational Recommendation Walker (CR-Walker) in this paper. It first selects a system intent to decide whether the system asks for information, chats about something, or makes a recommendation. Then, it performs tree-structured reasoning over a knowledge graph (KG) and the dialog context, producing a reasoning tree composed of relevant entities to be introduced in the response. The hierarchical arrangement of entities on the tree preserves the logical order of selection under the current system intent, and the tree is transformed into a dialog act. The linearized representation of the dialog act then guides a pretrained language model to generate informative and engaging responses. Results show that CR-Walker outperforms strong CRS baselines on two public datasets in both recommendation and generation tasks.
In brief, our contributions are summarized below: (1) CR-Walker conducts tree-structured reasoning over a knowledge graph and dialog context to explore background knowledge and exploit the connections between entities for more accurate recommendation; (2) CR-Walker transforms the reasoning tree into dialog acts that abstract the semantics and hierarchy of the selected entities, and thereby generates more engaging responses for recommendation; (3) We evaluate CR-Walker on two conversational recommendation datasets, achieving outstanding performance in both recommendation and response generation.


Related Work

Conversational Recommender Systems (CRS) learn and model user preferences through dialog, supporting a richer set of user interactions in recommendation (Jannach et al., 2020). Previous CRS can be roughly categorized into two types.
One type is recommendation-biased CRS (Sun and Zhang, 2018; Zhang et al., 2018, 2020; Zou et al., 2020) that asks questions about user preferences over pre-defined slots or attributes in order to recommend items. As system responses can be grouped into a few pre-defined intents, such systems can be implemented with the help of language templates. Under this simplified setting, approaches of this type do not model language generation explicitly (Lei et al., 2020a,b). Such dialogs support only limited actions without revealing why the system makes a particular recommendation (e.g., by asking about a fixed set of attributes) (Christakopoulou et al., 2016, 2018), thus leading to unsatisfactory user experience. Recently, Moon et al. (2019) improved knowledge selection by assuming a single chain of reasoning throughout the conversation, relying on fine-grained annotations that follow a single-path reasoning scheme. However, multiple entities can be selected at each reasoning hop (e.g., recommending several items within one turn, each for a different reason). We therefore propose tree-structured reasoning in this work to enable CRS to select multiple entities through multi-path reasoning for accurate recommendation. Xu et al. (2020) introduce a dynamic user memory graph to address the reasoning of user knowledge in CRS, which is beyond the scope of this paper.
The other type is dialog-biased CRS (Li et al., 2018; Kang et al., 2019; Liao et al., 2019) that makes recommendations using free text, which affords much flexibility in influencing how the dialog continues. As these systems suffer from existing limitations in NLP (e.g., understanding preferences implied in user expressions), most methods incorporate external information such as KGs and user logs to enhance the dialog semantics (Yu et al., 2019; Zhou et al., 2020a) and guide the conversation. To steer the dialog towards recommendation, Zhou et al. (2020b) incorporate topic threads to enforce active transitions, but they model CRS as an open-ended chit-chat task, which does not fully utilize the relations between items and their attributes in responses. In contrast, CRS can be regarded as a variation of task-oriented dialog systems that support users in achieving recommendation-related goals through multi-turn conversations (Tran et al., 2020). Inspired by the use of dialog acts (Traum, 1999), we choose a set of system dialog acts for CRS to facilitate information filtering and decision making, as task-oriented dialog policies (Takanobu et al., 2020) do.

Figure 2: Left: Illustration of CR-Walker's overall architecture. CR-Walker first decides the system intent and then applies walker cells to perform tree-structured reasoning on the knowledge graph in two stages. The transformed dialog acts are used to guide response generation. Right: Detailed structure of a single walker cell. A walker cell calculates the similarity between the entities on the graph and the context embedding that integrates the utterance embedding and the user portrait. Entity selection is learned by logistic regression to enable multiple selections.

CR-Walker: Conversational Recommendation Walker
In this section, we first define the key concepts of knowledge graph and dialog acts used in CR-Walker. As illustrated in Fig. 2, CR-Walker works as follows: First, the dialog history is represented in two views: the utterance embedding in the content view, and the user portrait in the user interest view. Then, CR-Walker reasons over a KG to obtain a reasoning tree, which is treated as a dialog act. Afterwards, the tree-structured dialog act is linearized into a sequence, on which CR-Walker finally generates responses with a conditional language generation module.

Key Concepts
We construct a knowledge graph G = (E, R) as follows: the entities E on the graph are divided into three categories, namely candidate items, attributes, and generic classes. There are various relations R among these entities. Each candidate item is related to a set of attributes, while each attribute is connected to its corresponding generic class. There might also exist relationships between different attributes. Taking movie recommendation as an example, the candidate movie Titanic is linked to attributes Romance, Leonardo DiCaprio and James Cameron, and these three attributes are linked to generic classes Genre, Actor and Director, respectively. We also define a set of system actions in CRS. We abstract three different system intents to represent actions commonly used in a dialog policy: recommendation that provides item recommendation and persuades the user with supporting evidence, query that asks for information to clarify user needs or explore user preference, and chat that talks on what has been mentioned to drive the dialog naturally and smoothly. Example utterances of three intents are shown in Fig. 1. Then, we define a dialog act A as an assembly of a system intent and entities selected by the system, along with their hierarchy relations.
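Concretely, such a three-level graph can be represented as a set of typed triples. The sketch below uses the movie example from the text; the relation names and the dictionary layout are illustrative, not the paper's actual data format:

```python
# Minimal sketch of the three-level KG described above: candidate items,
# attributes, and generic classes, connected by typed relations.
# Relation names here are illustrative, not taken from the paper's data.
knowledge_graph = {
    "entities": {
        "Titanic": "item",
        "Romance": "attribute",
        "Leonardo DiCaprio": "attribute",
        "James Cameron": "attribute",
        "Genre": "class",
        "Actor": "class",
        "Director": "class",
    },
    "triples": [  # (head, relation, tail)
        ("Romance", "genre_of", "Titanic"),
        ("Leonardo DiCaprio", "actor_of", "Titanic"),
        ("James Cameron", "director_of", "Titanic"),
        ("Romance", "is_a", "Genre"),
        ("Leonardo DiCaprio", "is_a", "Actor"),
        ("James Cameron", "is_a", "Director"),
    ],
}

def neighbors(graph, entity, relation=None):
    """Entities connected to `entity`, optionally filtered by relation type."""
    out = []
    for h, r, t in graph["triples"]:
        if relation is not None and r != relation:
            continue
        if h == entity:
            out.append(t)
        elif t == entity:
            out.append(h)
    return out
```

Reasoning rules can then be expressed as constraints on which relations may be followed from each node type at each hop.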

Reasoning Process
CR-Walker learns to reason over the KG to select relevant and informative entities for accurate recommendation and engaging conversation. Considering the large scale of the KG and the different system actions in CRS, we design several two-hop reasoning rules to help CR-Walker narrow down the search space, making the reasoning process more efficient on a large KG. As shown in Table 1, all reasoning rules are designed in line with the conceptual definition of the corresponding intents. The reasoning process of CR-Walker starts from one of the three intents. It then explores intermediate entities as a bridge to the final recommendation, and finally reaches the target entities at the second hop.
As explained in Sec. 2, multiple entities can be selected at each hop in CRS, thus forming a tree structure on the graph instead of a single path as in previous work (Moon et al., 2019). The child entities at the second hop are neighbors of their parent entities at the first hop on the graph, except when the intent is "recommend": we allow all candidate items to be recommended, even if some of them have no connection to other entities on the graph. In addition, we maintain the status of each entity, i.e., whether it has been mentioned in the context, to facilitate reasoning during interaction.

Dialog and Knowledge Representation
In this subsection, we describe how to represent dialog context, external knowledge and user interests in CR-Walker.
Utterance Embedding We formulate the dialog history as D = {x_1, y_1, . . . , x_{t−1}, y_{t−1}, x_t}, where x_t and y_t are the user and system utterances, respectively. At each dialog turn t, we first use BERT (Devlin et al., 2019) to encode the last system utterance y_{t−1} and the current user utterance x_t successively. The embedding of the "[CLS]" token of x_t is used as the turn's representation, denoted as BERT([y_{t−1}; x_t]). The utterance embedding u_t is then obtained through an LSTM over the turn representations to capture sentence-level dependencies:

u_t = LSTM(u_{t−1}, BERT([y_{t−1}; x_t]))

The hidden state u_t ∈ R^d of the LSTM is taken as the utterance embedding to represent the dialog context.
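The two-level encoding (a turn encoder followed by an LSTM over turns) can be sketched as follows. A hash-based stub stands in for BERT, and all weights are random and illustrative; a real system would use a pretrained encoder and a learned LSTM:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding size (the paper uses d = 128)

def encode_turn(turn_text):
    """Stand-in for BERT([y_{t-1}; x_t]) -> [CLS] embedding.
    A real system would run a pretrained encoder; here we hash to a vector."""
    local = np.random.default_rng(abs(hash(turn_text)) % (2**32))
    return local.standard_normal(d)

class MiniLSTM:
    """A single LSTM cell with randomly initialized weights (illustrative)."""
    def __init__(self, d):
        self.W = rng.standard_normal((4 * d, 2 * d)) * 0.1
        self.b = np.zeros(4 * d)
    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        sig = lambda v: 1.0 / (1.0 + np.exp(-v))
        c = sig(f) * c + sig(i) * np.tanh(g)   # cell state update
        h = sig(o) * np.tanh(c)                # hidden state = u_t
        return h, c

lstm = MiniLSTM(d)
h = c = np.zeros(d)
for turn in ["[y0; x1]", "[y1; x2]", "[y2; x3]"]:
    h, c = lstm.step(encode_turn(turn), h, c)
u_t = h  # utterance embedding for the current turn
```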

Entity Embedding
To introduce external structured knowledge into CR-Walker, we extract a KG from DBpedia (Auer et al., 2007) and add generic classes (see Sec. 3.1). We encode the graph using R-GCN (Schlichtkrull et al., 2018), by virtue of its ability to model neighboring connections more accurately by considering different relations. Formally, for each entity e ∈ E, the entity embedding h_e^{(l)} ∈ R^d at each layer l is calculated as:

h_e^{(l+1)} = σ( Σ_{r∈R} Σ_{e′∈N_e^r} (1/|N_e^r|) W_r^{(l)} h_{e′}^{(l)} + W_0^{(l)} h_e^{(l)} )

where N_e^r denotes the set of neighboring entities of e under relation r, and W_r^{(l)}, W_0^{(l)} ∈ R^{d×d} are learnable matrices for integrating relation-specific information from neighbors and the current layer's features, respectively. At the final layer L, the embedding h_e^{(L)} is taken as the entity representation, denoted as h_e ∈ R^d in the following text.
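The R-GCN propagation step can be sketched in plain NumPy. The toy graph, random weights, and ReLU nonlinearity below are assumptions for illustration; a real implementation would use a library R-GCN layer with learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
num_entities, d = 5, 4
num_relations = 2

# Toy adjacency: adj_by_rel[r][e] lists entities linked to e under relation r.
adj_by_rel = [
    {0: [1, 2], 1: [0], 2: [0]},   # relation 0
    {3: [4], 4: [3]},              # relation 1
]

H = rng.standard_normal((num_entities, d))                 # h_e^{(l)}
W_rel = rng.standard_normal((num_relations, d, d)) * 0.1   # W_r^{(l)}
W_self = rng.standard_normal((d, d)) * 0.1                 # W_0^{(l)}

def rgcn_layer(H, adj_by_rel, W_rel, W_self):
    """One R-GCN propagation step:
    h_e' = ReLU( sum_r sum_{e' in N_e^r} (1/|N_e^r|) W_r h_{e'} + W_0 h_e )"""
    H_new = H @ W_self.T                       # self-loop term W_0 h_e
    for r, adj in enumerate(adj_by_rel):
        for e, nbrs in adj.items():
            msg = sum(H[n] for n in nbrs) / len(nbrs)   # normalized neighbor sum
            H_new[e] = H_new[e] + W_rel[r] @ msg
    return np.maximum(H_new, 0.0)              # ReLU as sigma

H1 = rgcn_layer(H, adj_by_rel, W_rel, W_self)  # h_e^{(l+1)} for all entities
```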
User Portrait We build a user portrait to represent user interests using both the dialog and the KG. Given the dialog history, we perform named entity recognition (NER) with spaCy to identify the entities mentioned in the previous user utterances {x_1, . . . , x_{t−1}, x_t}, and link them to entities in the KG with simple fuzzy string matching. The status of each identified entity is updated to "mentioned". We then obtain the representations of all mentioned entities M_t ∈ R^{d×|M_t|} by looking up their entity embeddings. Following Chen et al. (2019), we calculate the user portrait p_t ∈ R^d via self-attention over M_t.
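The self-attentive pooling can be sketched as follows. The projection parameters W_a and w_b and their shapes are assumptions, since the exact parameterization in Chen et al. (2019) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_mentioned = 4, 3

M_t = rng.standard_normal((d, n_mentioned))   # embeddings of mentioned entities
W_a = rng.standard_normal((d, d)) * 0.1       # attention projection (assumed)
w_b = rng.standard_normal(d)                  # attention vector (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# alpha weighs each mentioned entity; p_t is their attention-weighted mixture.
alpha = softmax(w_b @ np.tanh(W_a @ M_t))     # shape (n_mentioned,)
p_t = M_t @ alpha                             # user portrait, shape (d,)
```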

Tree-Structured Graph Reasoning
The reasoning process of CR-Walker initiates from a system intent as the start point on the graph, and expands into multiple paths to get a reasoning tree.
First, we treat intent selection as a simple 3-way classification problem parameterized by θ_i, taking only the utterance embedding u_t as input, since we empirically find that introducing p_t does not improve intent selection. To expand a system intent into a reasoning tree, we propose the walker cell, a neural module shown in Fig. 2. Each walker cell C performs one-hop reasoning to select entities, expanding the tree from the given intent i at hop n = 1, or from a given entity e at hops n > 1. It first integrates the dialog history representations via a gate mechanism to obtain the context embedding c_t, where i_t ∈ R^d denotes the trainable embedding of the selected intent. The cell then outputs a score for each entity e using its entity embedding h_e; the estimated selection score ŝ_e indicates whether e is selected for tree expansion. By incorporating c_t^{(j<n)}, the current reasoning hop n is aware of the previous reasoning hops j. We describe the application of a single walker cell for entity selection from e (and similarly from intent i) as a function C(e), where Z_e^{(n)} is the set of legal entities that can be selected from e according to the reasoning rules in Sec. 3.2, and τ is a threshold hyper-parameter.
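A walker cell's one-hop selection can be sketched as follows. The specific gate parameterization (W_g over the concatenation of u_t, p_t, and i_t) is an assumption for illustration; the paper's exact equations may differ:

```python
import numpy as np

rng = np.random.default_rng(3)
d, num_entities = 4, 6

u_t = rng.standard_normal(d)                 # utterance embedding
p_t = rng.standard_normal(d)                 # user portrait
i_t = rng.standard_normal(d)                 # intent (or parent-entity) embedding
H = rng.standard_normal((num_entities, d))   # entity embeddings h_e
W_g = rng.standard_normal((d, 3 * d)) * 0.1  # gate weights (assumed form)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def walker_cell(u_t, p_t, i_t, H, legal, tau=0.5, m=5):
    """One-hop expansion: gate the dialog context into c_t, score each legal
    entity by similarity to c_t, and keep those above threshold tau
    (at most m, to control the tree's width)."""
    g = sigmoid(W_g @ np.concatenate([u_t, p_t, i_t]))  # gate in (0, 1)^d
    c_t = g * u_t + (1.0 - g) * p_t                     # context embedding
    scores = sigmoid(H[legal] @ c_t)                    # logistic scores s_hat_e
    ranked = sorted(zip(legal, scores), key=lambda x: -x[1])
    return [e for e, s in ranked[:m] if s > tau]

selected = walker_cell(u_t, p_t, i_t, H, legal=list(range(num_entities)))
```

Applying such a cell once per intent and then once per selected entity yields the two-hop reasoning tree.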
In practice, we select at most m entities at hop 1 to control the width of the reasoning tree. Reasoning stops when no entities are selected at a hop or when hop 2 is reached.

Conditional Language Generation
Having selected the entities on the reasoning tree, we generate the system response y_t conditioned on the user utterance x_t and the tree-structured dialog act A_t. We formulate this as a conditional language generation problem with a statistical model parameterized by θ_g. To facilitate response generation with a pretrained language model (PLM), we convert the dialog act into a token sequence. Since a dialog act of CR-Walker contains an intent and the selected entities arranged in a tree structure, we can linearize it into a token sequence in the same way a parser serializes a tree into a string, using preorder traversal. As shown in Fig. 2, the brackets characterize the hierarchy of the dialog act with respect to the logical order of entity selection.
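The preorder serialization can be sketched as follows. The bracket tokens and entity names are illustrative; Fig. 2 shows the concrete format used with the PLM:

```python
def linearize(intent, tree):
    """Serialize a reasoning tree into a bracketed token sequence by
    preorder traversal. Nodes are (entity, [children]) pairs."""
    def walk(node):
        entity, children = node
        if not children:
            return entity
        inner = " ".join(walk(c) for c in children)
        return f"{entity} ( {inner} )"
    body = " ".join(walk(c) for c in tree)
    return f"{intent} ( {body} )" if tree else intent

# A two-hop "recommend" tree: attributes at hop 1, items at hop 2.
tree = [("Comedy", [("Vacation", [])]), ("Action", [("Thor", [])])]
print(linearize("recommend", tree))
# -> recommend ( Comedy ( Vacation ) Action ( Thor ) )
```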
In this paper, we employ GPT-2 (Radford et al., 2019) as the backbone for response generation: the model successively encodes the user utterance x_t and the linearized dialog act A_t as input, and then decodes the response y_t auto-regressively. During inference, top-p sampling (Holtzman et al., 2020) is used for response decoding.

Model Optimization
At each dialog turn t, we train the parameters θ_w of the walker cells at each hop n using the standard logistic regression loss, where s_e ∈ {0, 1} is the label indicating whether entity e is selected, and E_t^{(n−1)} denotes the extracted entity set at dialog turn t and hop n−1. The generation model is trained by maximizing the log-likelihood (MLE) of the response conditioned on the user utterance and the dialog act. Note that we use the dialog acts extracted from the corpus during training.
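The per-hop selection loss is a standard binary cross-entropy over the candidate entities' sigmoid scores. A minimal illustration (the paper additionally sums the hop losses with weights λ_n):

```python
import numpy as np

def walker_loss(scores, labels):
    """Logistic-regression (binary cross-entropy) loss over candidate
    entities at one reasoning hop: `scores` are the cell's sigmoid
    outputs s_hat_e, `labels` the 0/1 selection targets s_e."""
    scores = np.clip(scores, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(labels * np.log(scores)
                    + (1 - labels) * np.log(1 - scores))

s_hat = np.array([0.9, 0.2, 0.8, 0.1])   # predicted selection scores
s_true = np.array([1.0, 0.0, 1.0, 0.0])  # gold selections
loss = walker_loss(s_hat, s_true)
```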
We jointly optimize all trainable parameters mentioned above. The final loss L is a weighted sum of all the losses.


Datasets

We use two public conversational recommendation datasets to verify the effectiveness of CR-Walker.
(1) ReDial (Li et al., 2018) is collected via crowdsourcing on Amazon Mechanical Turk (AMT). Two paired workers are assigned the roles of recommender and seeker. At least 4 different movies are mentioned in each conversation, and each movie mentioned in the dialog is annotated explicitly.
(2) GoRecDial (Kang et al., 2019) is collected in a similar way using ParlAI. In each dialog, each worker is given a set of 5 movies with corresponding descriptions. The seeker's set represents his or her watching history, while the recommender's set represents the candidate movies to choose from; the recommender should recommend the correct movie among the candidates to the seeker. We construct the KG and perform entity linking separately for GoRecDial and ReDial.

Baselines
We compare CR-Walker with several strong approaches on ReDial: (1) ReDial (Li et al., 2018): the benchmark model of the ReDial dataset, which applies an autoencoder recommender, an RNN-based NLG module, and a sentiment prediction module. We also adopt several conversational recommendation methods as baselines on GoRecDial: (1) BERT (Devlin et al., 2019): a BERT model fine-tuned on GoRecDial that encodes dialog contexts and movie descriptions; its features are used for response retrieval and movie recommendation. (2) R-GCN+GPT: a joint model combining an R-GCN (Schlichtkrull et al., 2018) for movie recommendation with a Transformer-based language model (Vaswani et al., 2017) for response generation; movies are scored, as in our walker cell, by the dot product between encoder hidden states and R-GCN embeddings. (3) GoRecDial (Kang et al., 2019): the benchmark model of GoRecDial, trained via multi-task supervised learning and bot-play learning by formulating recommendation as a task-oriented game. The KG construction details and dataset statistics are shown in the appendix.

Automatic Evaluation
The results on ReDial and GoRecDial are shown in Tables 2, 3, and 4. As can be seen, CR-Walker outperforms most baselines in both item recommendation and response generation.

Figure 3: CR-Walker's recommendation performance (Recall@1/10/50) with regard to the number of selected nodes m at the first hop during reasoning. Most metrics improve as more supporting entities are allowed to be selected.

Item Recommendation
We evaluate CR-Walker's item recommendation quality in different settings using the metrics proposed with the original datasets. On ReDial, we adopt Recall@k, since multiple movies may be recommended in a dialog. On GoRecDial, since the ground-truth movie to recommend is annotated in each dialog, we evaluate the hit rate among the top-k recommendations at each turn (T@k), and the hit rate only at the end of each dialog (C@k) to further observe the usefulness of the conversation. On ReDial, we also use Coverage to evaluate recommendation diversity, calculated as the proportion of candidate items recommended on the test set.
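As a reference, Recall@k over recommendation turns can be computed as below (toy data; the actual evaluation follows the ReDial protocol):

```python
def recall_at_k(recommended, gold, k):
    """Fraction of recommendation turns whose ground-truth item appears
    in the top-k ranked list for that turn."""
    hits = 0
    for ranked, target in zip(recommended, gold):
        if target in ranked[:k]:
            hits += 1
    return hits / len(gold)

# Each turn: a ranked list of item ids and the ground-truth item.
ranked_lists = [[3, 1, 7], [2, 5, 9], [8, 4, 6]]
targets = [1, 9, 0]
print(recall_at_k(ranked_lists, targets, k=2))  # one of three targets hit
```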
In Table 2, we can see that CR-Walker outperforms all baselines that use a single KG on recommendation quality, including ReDial, DCR, and KBRD, indicating that multi-path reasoning utilizes background knowledge more effectively. KGSF uses an additional KG from ConceptNet (Speer et al., 2017) compared with ours and performs slightly better on Recall. However, CR-Walker also obtains a performance gain by incorporating ConceptNet as an additional feature (+ConceptNet in Table 2), and even outperforms KGSF on Recall@1 and Recall@10, though this is not the focus of this paper. Regarding recommendation diversity, CR-Walker outperforms all baselines including KGSF. Tree-structured reasoning enables multiple items to be recommended at the second hop, each with its own attributes related to the earlier conversation, resulting in higher coverage of candidate items compared with 1-hop reasoning that arrives at a recommendation directly.
In Table 4, we can see that CR-Walker obtains the best performance on all recommendation metrics when the user has a clearer preference. Surprisingly, we also find that T@1 is close to C@1 for CR-Walker on GoRecDial. This is because the entity embeddings provide overly strong signals for distinguishing the correct movie among only five candidates, so the model can make good recommendations easily, even without user utterances.

Response Generation
We apply BLEU and Distinct-n (Li et al., 2016) to measure word-level overlap and diversity of the generated responses. Note that, unlike Chen et al. (2019), who calculate sentence-level Distinct, we use corpus-level Distinct for a more comprehensive assessment. Following prior work, we also adopt a knowledge F1 score to measure knowledge exploitation. Unlike the item recommendation metrics, the knowledge score is calculated over the corresponding generic classes rather than exact matches; for example, it only evaluates whether the system mentioned a genre to promote the movie recommendation, regardless of the exact genre.
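A minimal sketch of the corpus-level Distinct-n computation described above (whitespace tokenization is an assumption; the actual evaluation may tokenize differently):

```python
def corpus_distinct_n(responses, n):
    """Corpus-level Distinct-n: unique n-grams across ALL generated
    responses divided by the total n-gram count (contrast with the
    sentence-level variant, which averages per-response ratios)."""
    total, unique = 0, set()
    for resp in responses:
        tokens = resp.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

responses = ["i like action movies", "i like comedy movies"]
print(corpus_distinct_n(responses, 2))  # 5 unique bigrams out of 6 total
```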
Results show that CR-Walker outperforms all baselines on corpus-level language diversity by a large margin (dist-2,3 in Table 3). Noticeably, while CR-Walker achieves the highest BLEU in GoRecDial, BLEU in ReDial drops a little when incorporating tree-structured reasoning into the response generation process (26.6 vs. 28.0). This is because CR-Walker sometimes infers different reasoning trees, and afterwards generates sentences that differ from the references but may also be reasonable. We resort to human evaluation (Sec. 5.3) to further evaluate the language quality.
In addition, CR-Walker obtains the best knowledge recall and F1 scores. This indicates that CR-Walker reasonably utilizes informative entities during conversational recommendation. A slightly lower precision in CR-Walker is also because it produces different reasoning trees.

Ablation Study
To understand CR-Walker's advantage over other baselines, we further examine the influence of tree-structured reasoning on recommendation performance. We first study the effect of tree depth. When we simplify the reasoning process by removing the first-hop reasoning and forcing the model to directly predict the entities at the second hop (-depth=1 in Table 2), there is an apparent decline in all Recall@k. R-GCN+GPT shares a similar framework with CR-Walker-depth=1, directly recommending items using R-GCN, and CR-Walker outperforms it by a large margin on item recommendation. These results demonstrate that two-hop graph reasoning better exploits the connections between entities by exploring intermediate entities, which is crucial for successful recommendation.
We then study the effect of tree width. We control the width of the reasoning paths by setting the maximum number of entities m allowed to be selected at the first hop, and observe the performance in Recall@k, as shown in Fig. 3. Overall, the performance of CR-Walker increases as m grows. Although Recall@1 dips slightly when the width reaches around 6, performance recovers in the end. This suggests that multi-path reasoning is superior to single-path reasoning, providing the model with multiple lines of guidance to arrive at the final recommendation.

Human Evaluation
In addition to automatic evaluation, we conduct a point-wise human evaluation. 300 posts are randomly sampled from the test set. For each response generated by each model, we ask 3 workers from AMT to rate it on each metric with a 3-point scale (3/2/1 for good/fair/bad, respectively), and report the average score of each metric. Among the metrics, fluency and coherence focus on response generation quality, while informativeness and effectiveness evaluate whether the conversation is well-grounded in a recommendation scenario. In particular, informativeness evaluates whether the system introduces rich movie knowledge, and effectiveness evaluates whether the system successfully engages users towards finding a movie of interest.
We present the human evaluation results on ReDial in Table 5. We adopt GPT-2, fine-tuned on the training set to generate responses directly, as an additional baseline. It proves a solid baseline, owing to the success of PLMs in language generation and implicit knowledge incorporation. Although GPT-2 cannot make actual recommendations since it does not "select" a movie, it outperforms all previous baselines even on informativeness and effectiveness. This implies that finding the appropriate recommendation alone is insufficient to satisfy users in the conversational recommendation setting; the quality of the natural language also determines how well recommendations are accepted. CR-Walker, equipping the PLM with external knowledge and reasoning ability, further boosts GPT-2's performance by providing interpretable recommendations through its utterances. Among all the metrics, CR-Walker improves informativeness and effectiveness most significantly. We observe that CR-Walker generates utterances with more detailed attribute information to support recommendations compared to GPT-2 alone. This demonstrates that CR-Walker succeeds in generating engaging responses with tree-structured dialog acts beyond PLMs. We further study why CR-Walker can outperform human responses. In terms of system actions, CR-Walker's intent accuracy reaches only 67.8%, but we find that an intent different from the human's choice sometimes yields better informativeness and effectiveness. In Table 6, we report the scores separately for humans and CR-Walker depending on whether the selected intents are identical or different. For identical intents, CR-Walker's improvements on all four metrics are marginal, as the improvement only comes from providing more information in the first-hop reasoning. For different intents, however, human performance drops remarkably, while CR-Walker's remains consistent.
We observe several samples and find that the human usually performs perfunctory chit-chat like "haha" or "lol" in these cases. By contrast, CR-Walker replies with a relevant query or appropriate recommendation 6 . This implies that the score advantage may come from the explicit reasoning on system actions that CR-Walker learns.

Recommendations in Dialog Flow
We also analyze the flow of recommended items throughout conversations across various interaction cases, which we roughly categorize into two patterns. In one pattern, the seeker chats around a fixed topic of interest and asks for similar recommendations. This pattern is common on ReDial, and CR-Walker handles it efficiently by making appropriate recommendations through tree-structured reasoning. In a less common pattern, the user suddenly switches to a new topic, so earlier recommendations have little effect on the later ones; reasoning over previous items may then yield inappropriate recommendations. In practice, we balance the two patterns by setting a maximum dialog history length l_max, using only the last l_max utterances in D to compute the utterance embedding and user portrait. With l_max = 3, we empirically find that CR-Walker handles most topic changes while still providing appropriate recommendations during interaction.

Conclusion and Future Work
We have presented CR-Walker, a conversational recommender system that applies tree-structured reasoning and dialog acts. By leveraging intermediate entities on the reasoning tree as additional guidance, CR-Walker better exploits the connection between entities, which leads to more accurate recommendation and informative response generation. Automatic and human evaluations demonstrate CR-Walker's effectiveness in both conversation and recommendation. It is worth noting that the dialog acts used in CR-Walker are automatically obtained by entity linking to a KG with simple heuristics. Therefore, our work can be easily applied to different conversational recommendation scenarios.
There remain some topics to be explored based on CR-Walker. It would be promising to equip CR-Walker with a language understanding module to capture users' negative feedback, and to design additional reasoning rules to handle such situations.

A Notation
Notations used in this paper are summarized in Table 7.

B Pseudocode
The entire reasoning and training process of CR-Walker is described in Algorithm 1.

C Implementation Details
In our experiments, we train the model on a single Tesla V100 GPU with a learning rate of 1e-3, a batch size of 36, and a maximum of 60 epochs. Adam is used as the optimizer, with a weight decay of 1e-2. We set the maximum number of selections at the first hop to m = 5 during training, and use negative sampling for the candidate items (the second hop when the system intent is recommend), with a negative-to-positive ratio of 5. The entity embedding dimension d is set to 128, and the number of R-GCN layers L to 1. BERT-base and GPT-2-medium are taken from Wolf et al. (2020), and the parameters of the BERT encoder are frozen during training. The weights of the graph walker loss at each hop are λ1 = 1, λ2 = 0.1 for GoRecDial and λ1 = 1, λ2 = 1 for ReDial, respectively. During inference, we apply τ = 0.5 as the entity selection threshold and p = 0.9 for the response decoding strategy. In GoRecDial, bag-of-words (BOW) features of the movie descriptions are encoded with a fully connected layer as additional features. During KG construction, the generic classes we introduce are the director, actor, time, genre, and subject related to each movie. All entities are extracted directly from DBpedia, except for genres, which are taken from MovieLens. There are 12 types of relations between entities, including actor of / director of / genre of / subject of / time of / is a.

D Case Study
We finally present an interactive case to demonstrate our model's capability during interactive dialog and its explainable nature. The sequential dialog acts corresponding to the reasoning trees generated by CR-Walker are presented in Table 10 along with the dialog. We mark all mentioned entities either in bold (user turns) or in colors (system turns) according to the reasoning hop. The dialog starts with greetings between the user and CR-Walker, followed by CR-Walker proactively seeking the user's preference by asking which kind of movie he or she likes. The following few turns focus on recommending action movies, and CR-Walker provides appropriate descriptions of the recommended movies and some comments on Arnold Schwarzenegger's muscles. The topic then switches to horror movies after the user explicitly requests scary ones, with the system recommending four appropriate movies within two turns. The dialog finally ends with the user expressing gratitude and CR-Walker expressing goodwill. Overall, at the utterance level, the whole dialog contains appropriate amounts of information and various dialog acts from the model, making the conversation coherent and fluent.
The intermediate dialog acts that CR-Walker generates help us better control and understand the generated utterances. On the one hand, the entities on the reasoning tree provide additional insight into the model's particular statements. Generated sentences may contain an entity name directly, but may also paraphrase entities, as when Genre, 1980s, and Horror map to "kind of films", "old", and "scary", respectively. The model also learns to omit some entities on the reasoning path based on the dialog context, such as the entity Horror when the system recommends Shining and It. Such non-trivial paraphrasing would be hard to interpret in the absence of the reasoning tree. On the other hand, the structure of the reasoning tree even hints at the approach our model takes when it mentions an entity. An interesting case occurs in the third turn of the dialog, when CR-Walker recommends Die Hard. The predicted dialog intent is "chit-chat", and Die Hard is selected at hop 2 in the reasoning process during inference. As a result, the system talks about the attributes of Die Hard (the use of Action) instead of directly recommending it, and the tone taken by the model is more casual and relevant to the previous context (the use of "then" and the comment on "all that muscle"). Together, these advantages add to our model's explainability, allowing it to be interpreted beyond words.