RevCore: Review-augmented Conversational Recommendation

Existing conversational recommendation (CR) systems usually suffer from insufficient item information when working with short dialogue histories and unfamiliar items. Incorporating external information (e.g., reviews) is a potential solution to this problem. Given that reviews often provide rich and detailed user experiences on different interests, they are potentially ideal resources for providing high-quality recommendations within an informative conversation. In this paper, we design a novel end-to-end framework, namely, Review-augmented Conversational Recommender (RevCore), where reviews are seamlessly incorporated to enrich item information and assist in generating both coherent and informative responses. In detail, we extract sentiment-consistent reviews, perform review-enriched and entity-based recommendations for item suggestions, and use a review-attentive encoder-decoder for response generation. Experimental results demonstrate the superiority of our approach in yielding better performance on both recommendation and conversational responding.


Introduction
With the increasing popularity of intelligent assistants in users' daily lives, effectively helping users find information or finish specific tasks, such as recommendation and booking, has tremendous commercial potential. Therefore, conversational recommendation (CR) systems have attracted widespread attention as tools that provide users with potential items of interest through dialogue-based interactions. Though existing studies (Sun and Zhang, 2018; Lei et al., 2020) have proposed integrating recommender and dialogue components to provide

Corresponding author. Our code will be released at https://github.com/JD-AI-Research-NLP/RevCore.

U1: Hi, could you recommend a comedy? Something like The Heat or Bad Boys? Bad Boys was a really fun movie to watch. It has some intense action sequences.

S1: Great! Have you seen The Other Guys? With Will Ferrell and Mark Wahlberg.

U2: No, I haven't. Is it good?

S2: It's great. One early scene in the movie, in which Ferrell and Wahlberg argue over whether a lion or a tuna would win in a fight, is so well done.

Figure 1: An illustrative example of a user-system conversation on movie recommendation. The additional sentiment-matched reviews are in red. Items (movies) and entities (e.g., actors) are in bold.
user-specific suggestions through conversations, CR remains challenging because (i) typical dialogues are short and lack sufficient item information for capturing user preferences (Chen et al., 2019), and (ii) it is difficult to generate informative responses with item-related descriptions (Shao et al., 2017; Ghazvininejad et al., 2018; Wang et al., 2019b). Thus, external information in the form of structured knowledge graphs (KGs) has recently been introduced to enhance item representations using the rich entity information in KGs (Chen et al., 2019). While KG-based methods improve CR to some extent, they are still limited by (i) poor versatility resulting from the high cost of KG construction, and (ii) inadequate integration of knowledge into response generation (Lin et al., 2020). Given that users are nowadays greatly encouraged to share their consumption experiences (e.g., restaurants, traveling, movies, etc.), reviews are easily accessible over the internet. Such reviews often provide rich and detailed user comments on different factors of interest, which are crucial in suggesting recommendations to particular users. Thus one can treat reviews as promising external sources for higher-quality recommendations in a conversation.

Figure 2: The overview of the proposed method in a movie recommendation scenario, where "emb", "SA", "CA", and "sf" denote embedding, self-attention, cross-attention, and softmax, respectively.
As shown in the example in Figure 1, the CR system may be unfamiliar with the items mentioned by the user, resulting in an uninformative response such as "It's great."; the chat then fails to support the recommendation because the system lacks the necessary knowledge. In addition, another factor behind users' low acceptance rate of recommendations is that elaborations on a suggestion are seldom given, which can be alleviated with more explanatory or descriptive utterances after referring to reviews.
Therefore, to better link external knowledge to recommendation in dialogues, we propose a novel framework, Review-augmented Conversational Recommender (RevCore), which enhances CR with additional review data. In doing so, we first analyze the user's utterances with their sentiment polarities and then retrieve reviews for the items mentioned by the user, keeping the review sentiment matched to the utterances (e.g., both should be positive or both negative). The obtained reviews are thus recommendation-beneficial (Hariri et al., 2011) because they are given by people who have seen/used the mentioned items and also express interest (or lack of interest) in them. Afterward, we incorporate the selected reviews into the dialogue history, from which the CR system can learn user preferences from review-enriched item information. In addition, we also use the sentiment-coordinated reviews to enhance dialogue response generation, where a review-attentive decoder introduces item information from the selected reviews to generate coherent and informative responses. To the best of our knowledge, this is the first time the aforementioned CR issues have been addressed by incorporating external reviews. Experimental results on a widely used benchmark dataset show that RevCore is superior in both recommendation accuracy and conversation quality. Further analyses confirm that RevCore introduces reviews to CR in an appropriate and effective manner.

The Proposed Framework
We present the proposed Review-augmented Conversational Recommender (RevCore), with its overview illustrated in Figure 2. It has three main components: the review retrieval module, the recommendation component, and the conversation component. The review retrieval module takes a conversation context C as input and outputs the selected review set R from the review database R_db, which contains all reviews. The context C = {s_t}_{t=1}^{N} consists of all utterances s_t of the dialogue history, given by the user and system in turns, and the review set R includes all review sentences r ∈ R_db retrieved according to the contexts of previous turns. With C and R as input, the recommendation component outputs a set of items from the candidate item set Z as the recommendation. The dialogue component also accepts C and R as input and outputs an utterance s_{t+1} = {w_i}_{i=1}^{M} as the response, where w_i is the i-th word and M the length of s_{t+1}. The output s_{t+1} is added to the context of the next turn.
We first introduce how to retrieve proper reviews from the database in Section 2.1. Then our solutions to the recommendation and conversation tasks are described in Section 2.2 and Section 2.3, respectively, along with detailed illustrations of how reviews enhance both tasks. Without loss of generality, our method is introduced in a movie recommendation scenario.

Review Retrieval
To help dialogue with reviews, given R_db, it is of great importance to retrieve proper ones, for two reasons: (i) irrelevant reviews may harm the user representation; (ii) reviews with inconsistent attitudes inject noise into the conversation, which impedes generating coherent responses. A preliminary retrieval therefore searches R_db for proper reviews according to the items mentioned in the conversation context C.
For review filtering, we design a sentiment-aware retrieval module. The sentiment value v ∈ [0, 1] of each review r is obtained by a transformer-based sentiment predictor:

v = Sentiment(r), (1)

where Sentiment(·) denotes sentiment prediction, and v can be viewed as how well the movie is liked in review r. Similarly, the sentiment of a response to this movie can be obtained in the same way. As a result, reviews that possess a sentiment polarity similar to that of the response, v*, are selected. Considering that helpful reviews are usually long paragraphs, we only retain part of them, one sentence, for each mentioned movie. Given a context C, there exist two manners of selecting the sentences r^(C) from the raw reviews: word-wise (or phrase-wise) and sentence-wise. The first randomly chooses some words or phrases to form each "sentence", while the second directly selects one whole sentence at random. At the expense of sentence fluency, the first manner enjoys much variability due to the extensive word/phrase combinations. The process of obtaining r^(C) can be formulated as:

r^(C) = Retrieve(R_db, C, V), (2)

where Retrieve(·) denotes the retrieval operation and V is the set of all v. The obtained r^(C) is added to the review set R.
With the retrieved review sentence, one way of incorporation is to simply insert it right after the sentence in which the item appears, as in Fig. 1. However, this may perturb conversational consistency by interrupting the original dialogue. Thus we seamlessly incorporate the review embedding into the conversation component, as described in Section 2.3. More importantly, the review sentence serves as a brief introduction to, or explanation of, the mentioned movie. It enriches user information for personalized recommendations and introduces external knowledge for more informative recommendation responses.
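The retrieval step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `review_db` structure, the closest-sentiment selection rule, and the 20-token budget (mentioned later in the implementation details) are our assumptions, and the sentiment scores would in practice come from the transformer-based predictor.

```python
def retrieve_review_sentence(review_db, movie, target_sentiment, max_tokens=20):
    """Pick the review sentence whose predicted sentiment v is closest to the
    sentiment the user expressed about `movie`, then truncate it.

    review_db maps a movie title to a list of (sentence, sentiment) pairs,
    where sentiment is the predictor's score v in [0, 1].
    """
    candidates = review_db.get(movie, [])
    if not candidates:
        return None  # no reviews available for this item
    # Sentence-wise selection: keep the single sentence whose sentiment
    # polarity best matches the user's utterance.
    sentence, _ = min(candidates, key=lambda sv: abs(sv[1] - target_sentiment))
    # Retain at most `max_tokens` tokens to balance original vs. external text.
    return " ".join(sentence.split()[:max_tokens])


review_db = {
    "Bad Boys": [
        ("Non-stop action and great chemistry between the leads.", 0.9),
        ("Loud, messy, and ultimately forgettable.", 0.2),
    ]
}
# The user spoke positively about the movie, so the positive review is kept
# and the negative one is filtered out.
picked = retrieve_review_sentence(review_db, "Bad Boys", target_sentiment=0.9)
```

A real system would additionally cache the predicted sentiment of every review offline, so retrieval at dialogue time is a simple lookup plus comparison.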

Review-augmented Recommendation
The recommender component is built on a KG-based framework, where all entities in the context are extracted to generate the embedding of a user profile. In our method, the retrieved reviews enrich the entity information so that the user embedding can be augmented to improve recommendation accuracy.
Similar to previous approaches, a candidate entity embedding dictionary E is first constructed by using a GNN to learn entity representations from a KG, e.g., DBpedia (Auer et al., 2007). Given a context C, all entities E^(C) in it are extracted. Their embedding vectors are then looked up from E and concatenated into a matrix E^(C) ∈ R^{l^(C) × d}, where l^(C) is the number of entities in the context C and d denotes the embedding dimension. Next, the entity embeddings E^(C) are aggregated into a user embedding vector u^(C) through a self-attention layer (SA):

α = softmax(E^(C) W_α + b), (3)
u^(C) = α^T E^(C), (4)

where α is the attention weight vector, and W_α and b are the parameter matrix and bias vector for the linear projection. Given u^(C), a multi-layer perceptron (MLP) and a softmax operation are adopted to obtain the recommendation prediction p ∈ R^L, where L is the number of candidate movies:

p = softmax(MLP(u^(C))). (5)

To learn the parameters in the recommender component, a cross-entropy loss L_rec between the prediction p and the target movie category is computed:

L_rec = − Σ_{i=1}^{M} log p*_i, (6)

where M is the number of recommendations and p*_i is the prediction probability of the target category in the i-th recommendation.
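The self-attentive aggregation and MLP prediction described above can be sketched numerically as follows. This is a simplified illustration with random parameters, assuming an additive attention with a single projection vector and a one-layer MLP; the actual model learns these parameters and uses GNN-produced entity embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def user_embedding(E_c, W_alpha, b):
    """Aggregate entity embeddings E_c (l x d) into one user vector via
    self-attention; W_alpha (d x 1) and b are the projection parameters."""
    scores = E_c @ W_alpha + b            # (l, 1) attention logits
    alpha = softmax(scores, axis=0)       # attention weights over entities
    return (alpha * E_c).sum(axis=0)      # (d,) weighted sum of entity vectors

def recommend(u, W_mlp, b_mlp):
    """Score all L candidate movies with a one-layer MLP plus softmax."""
    return softmax(u @ W_mlp + b_mlp)

l, d, L = 4, 8, 10                        # entities, embedding dim, candidates
E_c = rng.normal(size=(l, d))             # stand-in for GNN entity embeddings
u = user_embedding(E_c, rng.normal(size=(d, 1)), 0.0)
p = recommend(u, rng.normal(size=(d, L)), np.zeros(L))
```

Training then minimizes the cross-entropy between p and the index of the movie actually suggested by the human recommender.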
The pipeline described above suffers from entity sparsity in the dialogue history, which results from the dataset construction process, where annotators are inevitably unfamiliar with some movies. Retrieved reviews can enrich E^(C) by adding more entity words. The process of obtaining review-enriched entities can be formulated as:

E^(C) = extract(C) ∪ extract(r^(C)), (7)

where extract(·) denotes the entity extraction operation, applied to the context and the retrieved review, respectively. Based on the review-enriched entities, the user embedding is expected to be better represented and to produce more precise recommendations.
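The entity-enrichment step is essentially a set union over extracted entities. The sketch below uses naive string matching as a stand-in for KG entity linking (which the paper performs against DBpedia); the vocabulary, texts, and matching rule are illustrative assumptions.

```python
def extract_entities(text, entity_vocab):
    """Naive entity extraction: keep vocabulary entries appearing in the text.
    A real system would use KG entity linking; string matching is a stand-in."""
    return {e for e in entity_vocab if e.lower() in text.lower()}

def review_enriched_entities(context, review, entity_vocab):
    # Union of entities from the dialogue context and the retrieved review,
    # mitigating entity sparsity in short dialogue histories.
    return extract_entities(context, entity_vocab) | extract_entities(review, entity_vocab)

vocab = {"Bad Boys", "Will Smith", "Martin Lawrence", "Michael Bay"}
context = "I loved Bad Boys, anything similar?"
review = "Michael Bay's debut; Will Smith and Martin Lawrence carry it."
entities = review_enriched_entities(context, review, vocab)
```

Here the context alone yields a single entity, while the review contributes three more, giving the self-attention layer a much richer user profile to aggregate.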

Review-augmented Response Generation
Reviews can also augment response generation in the conversation component. We build an encoder-decoder framework to handle the generation task. Retrieved reviews and context are first encoded separately, for the purpose of maintaining dialogue consistency. In the decoding stage, the review embedding is fused via an attention layer to generate informative responses. Considering that good modeling of the input plays an important role in achieving outstanding model performance (Mikolov et al., 2013; Song and Shi, 2018; Peters et al., 2018; Devlin et al., 2019; Song et al., 2021) and that transformer-based approaches have achieved state-of-the-art results in many NLP tasks (Vaswani et al., 2017; Chen et al., 2019; Chen et al., 2020a; Joshi et al., 2020), we adopt two transformers as the encoders for context and reviews. Given a context C and the retrieved reviews R, the context embedding X^(C) and review embedding R^(C) are first obtained:

X^(C) = Transformer(C; θ_X), R^(C) = Transformer(R; θ_R),

where θ_X and θ_R are the parameters of the two transformers. The decoding stage takes them, together with the entity embedding E^(C), as inputs to attention layers. These attention layers aim to fuse the external information from the KG and the reviews R into the context information. Given the decoding output of the last time step, Y_{i−1}, the current one Y_i is generated by:

Y_i = FFN(MHA(MHA(MHA(SA(Y_{i−1}), X^(C), X^(C)), E^(C), E^(C)), R^(C), R^(C))), (8)

where MHA(Q, K, V) denotes the multi-head attention function (Vaswani et al., 2017), which takes a query, key, and value as input:

MHA(Q, K, V) = [head_1; …; head_h] W_O, with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V), (9)

where [·] represents the concatenation operation, h is the number of heads, and the W_i are parameter matrices to learn. FFN(·) in Equation 8 denotes a fully-connected feed-forward network, which comprises two linear layers with one ReLU activation in between:

FFN(x) = W_2 ReLU(W_1 x + b_1) + b_2. (10)

As presented above, information is injected progressively into the decoding stage: first the original context, then related entity information from the KG, and finally the reviews, which contain detailed item-related information.
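The progressive fusion in the decoder can be sketched as below. To stay compact, we assume single-head attention without learned projections, residual connections, or layer normalization, all of which the real transformer decoder would include; the tensor sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # model dimension

def attend(Q, K, V):
    """Scaled dot-product attention (single head, for illustration only)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with ReLU."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def decoder_step(Y_prev, X_ctx, E_ent, R_rev, params):
    """One review-attentive decoding step: self-attend the generated prefix,
    then cross-attend context, KG entities, and reviews, in that order."""
    h = attend(Y_prev, Y_prev, Y_prev)   # self-attention over the prefix
    h = attend(h, X_ctx, X_ctx)          # fuse dialogue-context information
    h = attend(h, E_ent, E_ent)          # fuse KG entity information
    h = attend(h, R_rev, R_rev)          # fuse retrieved-review information
    return ffn(h, *params)

params = (rng.normal(size=(d, 4 * d)), np.zeros(4 * d),
          rng.normal(size=(4 * d, d)), np.zeros(d))
Y_prev = rng.normal(size=(3, d))         # 3 generated tokens so far
X_ctx = rng.normal(size=(5, d))          # encoded dialogue context
E_ent = rng.normal(size=(4, d))          # entity embeddings
R_rev = rng.normal(size=(6, d))          # encoded review tokens
Y = decoder_step(Y_prev, X_ctx, E_ent, R_rev, params)
```

The ordering of the three cross-attention calls mirrors the progressive injection described in the text: context first, then KG entities, then reviews.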
To complete the generation, the decoder output Y_i is processed through a softmax operation to predict the token distribution. Apart from the conversational consistency required in chit-chat, a CR system also needs recommendation-related responses, which usually contain relevant entities and descriptive keywords. So a copy mechanism is further adopted to introduce a vocabulary bias and thus increase the informativeness of the generation. Given the previously generated sub-sequence {y_{i−1}} = y_1, y_2, …, y_{i−1}, the generation probability of the next token y_i can be computed as:

Pr(y_i | {y_{i−1}}, G, R) = Pr_1(y_i | Y_i) + Pr_2(y_i | G, Y_i) + Pr_3(y_i | R, Y_i), (11)

where Pr_1(·) is a generation probability function over the vocabulary with Y_i as input, and G and R represent the knowledge graph and reviews we use. Pr_2(·) and Pr_3(·) are copy probability functions from KG entities and reviews, respectively, implemented by a standard copy mechanism (computing the distributions over the KG words or review words). All three probability functions are implemented with a softmax operation. To learn response generation in the dialogue component, we set a cross-entropy loss:

L_gen = − (1/N) Σ_{t=1}^{N} log Pr(s_t | s_1, …, s_{t−1}), (12)

where N is the number of turns and s_t represents the t-th utterance in the conversation.
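The copy mechanism can be illustrated as a mixture of three token distributions. The sketch below sums the distributions and renormalizes; this is a simplification of the learned copy mechanism (which conditions all three terms on the decoder state), and the toy probabilities are purely illustrative.

```python
def copy_augmented_distribution(p_gen, p_copy_kg, p_copy_rev):
    """Combine the vocabulary generation distribution with copy distributions
    over KG-entity words and review words, then renormalize.
    All inputs are dicts mapping token -> probability mass."""
    combined = dict(p_gen)
    for dist in (p_copy_kg, p_copy_rev):
        for tok, p in dist.items():
            combined[tok] = combined.get(tok, 0.0) + p
    total = sum(combined.values())
    return {tok: p / total for tok, p in combined.items()}

p_gen = {"great": 0.5, "movie": 0.3, "the": 0.2}      # vocabulary softmax
p_copy_kg = {"Ferrell": 0.7, "Wahlberg": 0.3}         # copy from KG entities
p_copy_rev = {"hilarious": 0.6, "great": 0.4}         # copy from review words

p = copy_augmented_distribution(p_gen, p_copy_kg, p_copy_rev)
```

The effect is the vocabulary bias described above: entity names and descriptive review words such as "Ferrell" and "hilarious" become reachable outputs even when the base vocabulary distribution assigns them little mass.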
Training the whole model includes three steps: (i) pre-training the sentiment predictor in the review retrieval module; (ii) training the recommender component by minimizing L_rec; and (iii) training the dialogue component by minimizing L_gen.

Dataset
REDIAL is a widely used dataset of real-world conversations around the theme of movie recommendation, generated by humans in seeker-recommender pairs. REDIAL contains 10,021 conversations related to 64,362 movies, split into training, validation, and test sets with a ratio of 8:1:1. To construct a review database, we crawled 30 reviews for each movie from the IMDb website, one of the most popular and authoritative movie databases. Each review can be queried by the corresponding movie, along with its rating and helpful score provided by IMDb. In practice, we select the 30 reviews with the highest helpful scores for each movie to guarantee the high quality of the collected reviews. Other manners of selecting the 30 reviews are described and compared in the second part of Section 4.4.
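The database construction boils down to ranking each movie's reviews by helpful score and keeping the top 30. A minimal sketch, assuming crawled reviews are dicts with `text`, `rating`, and `helpful` fields (our naming, not IMDb's):

```python
def build_review_db(raw_reviews, per_movie=30):
    """Keep the `per_movie` reviews with the highest helpful scores per movie.
    raw_reviews maps movie -> list of dicts with 'text', 'rating', 'helpful'."""
    db = {}
    for movie, reviews in raw_reviews.items():
        # Sort descending by helpful score and truncate to the quota.
        ranked = sorted(reviews, key=lambda r: r["helpful"], reverse=True)
        db[movie] = ranked[:per_movie]
    return db

raw = {"Up": [{"text": "a", "rating": 9, "helpful": 12},
              {"text": "b", "rating": 4, "helpful": 50},
              {"text": "c", "rating": 8, "helpful": 7}]}
db = build_review_db(raw, per_movie=2)
```

Section 4.4 compares this helpful-score ranking against sentiment-value ranking and random matching.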

Implementation Details
The maximum lengths of context and response are set to 256 and 30, respectively. The transformers for review encoding in dialogue generation and for sentiment prediction use the same hyper-parameters as the context encoder. For the sentiment polarity of the reviews, we threshold the star rating, with the threshold set to 5. In the dialogue context, the sentiment polarity is obtained from the user's attitude toward the mentioned entity in the utterances, which is provided by the REDIAL dataset. Other settings are kept consistent with prior work for a fair comparison. Besides, the "review sentence" is selected according to the sentiment value and in a sentence-wise manner, and the token number of incorporated review sentences is set to 20, balancing the original and external sources. We add the retrieved review sentences after the mentioned items during dialogue component training to guide it to generate review-aware responses. The sentiment predictor for reviews is trained on the collected reviews. The sentiment predictor for dialogue context is trained on the IMDb Movie Reviews Dataset (Maas et al., 2011) and then fine-tuned on the REDIAL dataset.
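The star-rating threshold described above amounts to a simple binarization rule. A sketch, assuming IMDb's 1-10 rating scale and treating ratings above the threshold as positive (the exact boundary handling at 5 is our assumption):

```python
def rating_to_polarity(star_rating, threshold=5):
    """Binarize an IMDb star rating (1-10) into a sentiment polarity label:
    1 (positive) if the rating exceeds the threshold, else 0 (negative)."""
    return 1 if star_rating > threshold else 0

labels = [rating_to_polarity(r) for r in (9, 5, 2)]
```

These polarity labels serve as supervision when pre-training the transformer-based sentiment predictor on the collected reviews.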

Baselines
Evaluated on the REDIAL dataset, we compare our approach with a variety of competitive baselines from previous studies, listed as follows: • Trans (Vaswani et al., 2017) applies an encoder-decoder framework based on the transformer for generation, and applies a transformer encoder to encode context information for recommendation.
• Redial builds a conversation component based on a hierarchical encoder-decoder architecture, and its recommender component is implemented by an auto-encoder extended with an RNN-based sentiment analysis module.
• KBRD (Chen et al., 2019) adopts DBpedia-enhanced contextual items or entities to construct a user profile for recommendation. The KG-enhanced user profile also serves as a word bias for the transformer-based generation module.
• KGSF uses MIM (Viola and Wells III, 1997) to align the semantic spaces of two KGs. The user embedding is obtained from the aligned representations of words and items for recommendation. The generation module comprises a transformer encoder and a fused KG-enhanced decoder.

Evaluation Metrics
Our method is evaluated on both the recommendation and conversation tasks. The evaluation metric for recommendation is Recall@k (R@k, k = 1, 10, 50), which indicates whether the predicted top-k items contain the ground-truth recommendation provided by human recommenders. Conversation evaluation comprises automatic and human evaluation. The metrics for automatic evaluation are perplexity (PPL) (Jelinek et al., 1977) and distinct n-grams (Dist-n, n = 2, 3, 4). Perplexity measures the fluency of natural language, where lower perplexity indicates higher fluency. Distinct n-grams measure the diversity of generated utterances and are computed at the sentence level. Since the main purpose of our dialogue component is a successful recommendation rather than imitating the ground-truth responses, we ask annotators to manually evaluate the results instead of using BLEU scores. The annotators evaluate the quality of generated dialogue responses on three aspects, i.e., coherence, fluency, and informativeness, with each score ranging from 0 to 1.
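Both automatic metrics are straightforward to compute. The sketch below shows a per-turn Recall@k and a corpus-level Dist-n, following their standard definitions (the aggregation details, such as averaging Recall@k over turns, are assumed rather than taken from the paper):

```python
def recall_at_k(ranked_items, ground_truth, k):
    """Recall@k for one recommendation turn: 1 if the human recommender's
    item appears in the model's top-k list, else 0."""
    return int(ground_truth in ranked_items[:k])

def distinct_n(sentences, n):
    """Dist-n: ratio of unique n-grams to total n-grams over all responses."""
    ngrams = []
    for s in sentences:
        toks = s.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

ranked = ["Inception", "Up", "Heat"]          # model's top predictions
r1 = recall_at_k(ranked, "Up", 1)             # ground truth not in top-1
r10 = recall_at_k(ranked, "Up", 10)           # but within top-10
d2 = distinct_n(["it is great", "it is fun"], 2)
```

In the toy Dist-2 example, the bigram "it is" repeats across the two responses, so only 3 of the 4 bigrams are unique, giving a diversity score of 0.75.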

Evaluation on Recommendation Task
For the recommendation task, we adopt Recall@k (R@1, R@10, R@50) for evaluation. As summarized in Table 1, our approach outperforms all competitive baselines and achieves 5.9% R@1, 24.0% R@10, and 41.3% R@50, which is the state-of-the-art performance on the REDIAL dataset. (We report the performance of different models on the validation sets in Appendix D, and the mean and standard deviation of the test-set results in Appendix E.) Compared with KGSF, RevCore (+KG) achieves significant improvements, with the R@1 score improved by about 156% (absolutely 2.2), the R@10 score improved by about 129% (absolutely 4.5), and the R@50 score improved by about 120% (absolutely 7.6). We also evaluate the performance of RevCore (−KG), in which the construction of E removes the relations between entities. Instead, an embedding matrix is randomly initialized and learned to represent each entity, without using the GNN-based embedding. In this version, the external knowledge source we introduce is reduced to reviews only. As shown in the last two rows of Table 1, RevCore (−KG) achieves results competitive with RevCore (+KG) and outperforms KGSF, which uses two KGs. According to our observation, although learning entity representations is harder without structured knowledge graphs, the enrichment of the dialogue history by reviews makes up for it in the embedding learning. This demonstrates that incorporating reviews is a meaningful way to improve recommendation in conversation. We hope this result inspires further research.

Evaluation on Conversation Task
Automatic Evaluation The results of automatic evaluation on the REDIAL dataset are summarized in Table 2. The proposed RevCore outperforms all competitive baselines and achieves significant improvements on most of the automatic metrics. Compared with KGSF, all of the Dist-n scores are significantly lifted, namely by +0.14 for Dist-2, +0.11 for Dist-3, and +0.08 for Dist-4, which demonstrates that our method is effective in generating diverse utterances. Besides, RevCore (+KG) achieves a PPL score comparable to KGSF's, which validates our claim that the review incorporation in our method does not cause a decline in generation fluency. The low PPL score of RevCore (+KG) is possibly related to the high fluency of the incorporated reviews, which are carefully composed by website users. RevCore (−KG) achieves higher Dist-n scores than KGSF and shows only a slight drop compared with RevCore (+KG). This demonstrates that reviews, as a richer and more accessible external source, bring more diversity than KGs.

Human Evaluation
We conduct human evaluation on a random selection of 100 multi-turn dialogues from the test set. Given one dialogue context, each generated response is scored from 0 to 1, with a higher value indicating a more coherent, fluent, and informative utterance. The final result is the average score of three annotators, as summarized in Table 3. The proposed RevCore (with or without KG) is consistently better than all the baselines, especially on informativeness, by a large margin. This further proves the effectiveness of our method and verifies its superiority in numerical results.

Ablation Study
We demonstrate the contribution of each part on the conversation task by conducting an ablation study on three variants of our complete model: (1) RevCore (−revCP), which removes the copy mechanism for reviews; (2) RevCore (−revRA), which removes the review attention layers from the transformer decoder; and (3) RevCore (−revEN), which removes the sentiment-aware review encoder (the reviews share the same encoder as the context). As shown in Table 4, all the techniques are useful for improving the final performance in generating diversified utterances. Moreover, the copy mechanism and the review attention layers appear to matter more for conversation diversity, potentially because these two components are directly related to the decoding stage. Separate encoders for review and context lead to only a slight improvement, which shows that sharing a common encoder is an alternative solution.

Case Study
In this part, we present a visualized example to illustrate how our model works in practice, as shown in Figure 3. First, the sentiment-aware review retrieval module guarantees the coherence of the incorporated reviews to some extent; for example, in Figure 3, negative comments (the last row, for the movie The Notebook) are filtered out given the positive attitude in the original utterance. Second, the incorporated reviews indeed enrich the context for better recommendation. As seen in the first yellow frame, entities like "Roshan Andrews" mentioned in the review are added to the entity set. Note that some other entities are also added from the reviews incorporated into users' utterances as described in Section 2, which is not visualized here but also improves recommendation accuracy. Last but not least, the generated responses are more informative, using more varied expressions like "the magic spell" and "the sultry dance". Besides, they uncover more movie-related details that act as explanatory sentences, making the recommendations accepted more easily and naturally.

Discussion
Longer Review, Better Performance? In our basic setting, each retrieved review sentence is formed of 20 words. We conduct a series of experiments setting the length of retrieved review sentences to 10, 20, 30, 40, and 50 to inspect the effect of review length. The results of using different lengths are shown in Table 5, among which 20 is the best across all metrics. An interesting finding is that continually increasing the review length brings no benefit after reaching 20. Over-introducing external text may suppress the original text, so 20 is a better choice for keeping the balance between them.

Figure 3: Case study. U(i) (green) and S(i) (yellow) represent user and system, respectively. In "Dialogue Generation", items are marked in blue font and explanatory sentences in red. Items are in bold font in the "Entities" frames. In the "Recommend" frames, a darker color represents a higher probability. "Review Retrieval" gives retrieved review examples, with their sentiment value (0-9) at the far left; the selected reviews are in bold.

Appropriate Reviews Help More? The "review sentence" is obtained by a 3-stage process, namely, searching item-matched reviews from the database, ranking them by helpful score or sentiment value, and constructing the "review sentence" word-wise or sentence-wise. Therefore, we conduct controlled experiments to inspect these three factors. As shown in Table 6, (i) using reviews randomly matched with items (R-H-W) results in significantly lower R@k and Dist-n scores; (ii) ranking by sentiment value (C-S-S) leads to better performance across all metrics than ranking by helpful score (C-H-S), which demonstrates the necessity of sentiment-aware review retrieval; (iii) the sentence-wise manner (C-H-S) gets a lower PPL than the word-wise one (C-H-W), which is reasonable because incorporated reviews made up of random words cause a loss of fluency.
Besides, another experiment is conducted to verify the necessity of using a movie-review database. A food-review database is constructed as a topic-irrelevant corpus (iCorpus), which results in the lowest R@k yet reasonable Dist-n scores. This shows that, despite the response diversity brought by the external corpus, unrelated entities from another domain have a negative impact on recommendation accuracy.

Related Work
Recommender systems have emerged as a separate research area and now play an indispensable role in daily social life. Traditional recommender systems tend to work statically, primarily relying on content-based approaches or the collaborative filtering hypothesis (Resnick et al., 1994; Pazzani and Billsus, 2007; Wang et al., 2019b), which assumes that similar users may have similar interests. Afterward, more sophisticated methods using neural networks were proposed and proved effective. For instance, neural factorization machines (He and Chua, 2017) and deep interest networks are used to estimate user preferences based on historical user-item interactions. Graphs are adopted in Wang et al. (2019b,a) to model complex relations among users, items, and attributes for a better representation of the data. In recent years, major advances in dialogue systems (Dodge et al., 2016; Benni et al., 2016; Bordes et al., 2017) and structured knowledge-based information-seeking techniques, including question answering (Bao et al., 2014; Yin et al., 2015; Yih et al., 2015; Shao et al., 2019) and question generation (Serban et al., 2016; Bao et al., 2018; Dušek et al., 2020), have encouraged the development of conversational recommendation systems, which dynamically obtain user preferences through interactive conversation with users. Multiple datasets have been constructed (Dodge et al., 2016; Kang et al., 2019) to facilitate the study of this task, including a standard human-to-human multi-turn dialogue dataset focusing on movie recommendations. Based on these datasets, various approaches have been proposed to address different issues in CR systems. Specifically, external information is introduced to alleviate the cold-start problem, including knowledge bases (Wang et al., 2018), social networks (Daramola et al.), and knowledge graphs (Chen et al., 2019). Christakopoulou et al. (2016) use a bandit-based explore-exploit strategy to minimize the number of user queries. Other work conducts multi-goal planning to make proactive conversational recommendations over multi-type dialogues. A multi-view method is proposed in Chen et al. (2020b) for explainable conversational recommendation. Pecune et al. (2020) build a socially aware CR system that engages its users through rapport-building dialogue to improve users' perception.
Different from all the aforementioned previous work, we offer an alternative: a conversational recommendation system augmented by incorporating reviews that are highly relevant to items. In particular, our model is able to learn better user representations from a review-enriched dialogue context, which enables high-quality recommendation and response generation.

Conclusion
In this paper, we proposed a novel CR framework with review augmentation, including a sentiment-aware retrieval module, a recommender exploiting a review-enriched user profile, an encoder enhancing the semantic embedding of selected reviews, and a review-attentive decoder integrating review information into dialogue response generation. Experimental results show that our approach achieves consistent and significant improvements over baselines on both recommendation and dialogue responding, and is able to generate informative responses without losing fluency or coherence.

A Statistics for Conversation Dataset and Reviews
Conversations in the REDIAL dataset consist of 163,820 utterances, of which 15.80% have reviews added. The vocabulary size of REDIAL is increased by 13.14% (from 23,356 to 26,427). Among the 6,927 movies mentioned in all conversations, 40% are randomly chosen and linked with reviews to keep the balance between the original and external sources. We count the ratio of movies "disliked" by the recommender to explain the improvements brought by sentiment-aware retrieval when incorporating reviews. We also report the ratio of movies unseen by the recommender to show the need to introduce reviews to "talk more". Comprehensive statistics are listed in Table 7.

B Experiment Details
Hyper-parameter Settings For a fair comparison, most hyper-parameters are kept consistent with KGSF. We did not search over more hyper-parameter combinations to achieve additional improvements beyond our main idea. The shared hyper-parameters include: the embedding dimension set to 128 in the recommender component and 300 in the dialogue component, the number of layers of both GNNs in the KG module set to 1, the batch size set to 32, word embeddings initialized via word2vec, the optimizer set to Adam, the learning rate set to 0.001, the number of epochs set to 30, etc.

Table 7: Statistics for the REDIAL dataset with incorporated reviews. "Cnd" denotes "Candidate", and "Mnt" denotes "Mentioned", indicating the movies mentioned in the conversations.

Table 8: Comparison of the three models on the number of parameters (millions), training ("Tra") time for 30 epochs (seconds), and inference ("Inf") time (seconds).

Training Strategies
To train the whole model, three steps are included: (i) pre-training the sentiment predictor in the review retrieval module; (ii) training the recommender component by minimizing L_rec; and (iii) training the dialogue component by minimizing L_gen. In the first step, the predictor takes each sentence of a review as input and outputs its sentiment, with the corresponding rating as the label. In the second and third steps, our implementation follows the training algorithm of the KGSF model: it first pre-trains the KG parameters for entity representation by minimizing the Mutual Information Maximization loss between the two KG embeddings, then trains the recommender component by minimizing the recommendation loss while also updating the parameters of the KG module, and finally trains the dialogue component by minimizing the generation loss with all other modules' parameters frozen.

C Model Size and Running Speed
The model sizes and running speeds of KGSF, RevCore (+KG), and RevCore (−KG) are listed in Table 8. All three models are implemented in PyTorch, trained for 30 epochs, and run 5 times on an NVIDIA A100-SXM4 GPU to compute the average running time.

D Results on the Validation Set
We present the validation results of RevCore with and without KG on the REDIAL dataset as a reference for reproduction. All validation results are shown in Table 9, alongside the test results.

E Mean and Standard Deviation
We run the major experiments 4 times to inspect the mean and standard deviation of RevCore's performance across all metrics. The results reported in the paper for both recommendation accuracy and conversation quality are the means. Full results are shown in Table 10.