KERS: A Knowledge-Enhanced Framework for Recommendation Dialog Systems with Multiple Subgoals



Introduction
Recommendation dialog systems have recently attracted much attention due to their significant commercial potential (Chen et al., 2019; Jannach et al., 2020). Such systems first elicit user preferences through conversation and then provide high-quality recommendations based on the elicited preferences.
Many real-world recommendation applications involve chitchat, question answering, and recommendation dialogs working together (Wang et al., 2014; Ram et al., 2018). Varied social interactions build rapport with users and gain their trust. To provide more sociable recommendations, Liu et al. (2020) proposed a conversational recommendation dialog dataset, DuRecDial, annotated with 21 subgoals, where the dialog system starts the conversation with non-recommendation subgoals, such as chitchat and question answering, to collect user information and build social relationships, and finally progresses to a recommendation subgoal. Subgoals can be seen as different dialog phases. Figure 1 shows an example dialog with multiple subgoals. All the subgoals are designed to complete the final recommendation.
An RNN-based multi-goal driven conversation generation framework (MGCG) was proposed by Liu et al. (2020) to address this task. MGCG first models the subgoals separately to plan appropriate subgoal sequences for topic transitions and final recommendations. It then extracts knowledge features from the whole knowledge graph and produces responses to complete each subgoal. However, MGCG does not investigate how to use knowledge effectively across different subgoals. As shown in Figure 1, a conversation often involves a relatively large knowledge graph and multiple subgoals. Both the question-answering and the recommendation processes require accurate knowledge, so rich and accurate knowledge is essential for generating engaging conversations. Since taking all possible knowledge as input introduces noise and high computational cost, selecting useful knowledge for each subgoal is important.
We propose KERS to use knowledge effectively in multi-subgoal conversational recommendation tasks. To control the flow of the conversation, we develop a dialog guidance module that predicts a sequence of subgoals and selects useful external knowledge with respect to each subgoal to improve generation performance. In addition, we propose a sequential attention mechanism, a noise filter, and a knowledge enhancement module to make generated responses more informative. Specifically, the sequential attention mechanism enhances subgoal guidance, the noise filter eliminates unrelated and unnecessary knowledge, and the knowledge enhancement module increases the importance of the selected knowledge in response generation. Both automatic and manual evaluations suggest that KERS outperforms state-of-the-art methods.

Figure 1: An example of rich knowledge in a multi-subgoal recommendation dialog. The conversation is grounded on a knowledge graph. The task can be viewed as completing multiple subgoals sequentially. Text in red indicates knowledge-related information and red arrows indicate selected knowledge triples.

Related Work
Most previous work on recommendation dialog systems focused on slot-filling methods to collect user preferences and recommend items (Reschke et al., 2013; Christakopoulou et al., 2016; Sun and Zhang, 2018; Christakopoulou et al., 2018; Lee et al., 2018; Lei et al., 2020). To study more sociable and informative recommendation conversations, Moon et al. (2019) and Zhou et al. (2020b) proposed new recommendation dialog datasets with knowledge graphs and incorporated knowledge into response generation. Kang et al. (2019) created a dialog dataset with clear goals. Chen et al. (2019) captured knowledge-grounded information and used a recommendation-aware vocabulary bias to improve the quality of language generation.
Recently, Liu et al. (2020) proposed utilizing subgoal sequences to plan dialog paths and presented a new recommendation dialog dataset, DuRecDial. They demonstrated that establishing a subgoal sequence is crucial for natural transitions and successful recommendations. Some previous works (Moon et al., 2019; Wu et al., 2019; Zhou et al., 2020b) also introduced topic transition approaches, similar to subgoal transitions, to improve the quality of open-domain dialogs. They built topic paths by either traversing a knowledge graph or predicting knowledge items directly. Similar to Liu et al. (2020), Hayati et al. (2020) utilized sentence-level sociable recommendation strategy labels in the INSPIRED dataset to improve the recommendation success rate. However, the INSPIRED dataset is not annotated with specific dialog subgoals. Other relevant works obtained knowledge information from all the related knowledge triples (Liu et al., 2020; Chen et al., 2019) or enhanced semantic representations by incorporating both word-oriented and entity-oriented knowledge graphs (Zhou et al., 2020a). Our work differs in that it performs fine-grained knowledge planning and accurate knowledge incorporation during generation. Moreover, we deal with more complex knowledge graphs that include both sentences and entities.

Method
KERS consists of three modules: a dialog guidance module (Section 3.1), an encoder (Section 3.2), and a decoder (Section 3.3), as shown in Figure 2. The decoder incorporates three new mechanisms: a sequential attention mechanism, a noise filter, and a knowledge enhancement module. For each conversation turn, the dialog guidance module predicts the subgoal of the turn and selects knowledge for the next response. Then, the encoder encodes the subgoal, the selected knowledge, and the dialog context. Finally, the output of the encoder is fed to the decoder to generate the final system response.

Dialog Guidance Module
To produce proactive and natural conversational recommendations, we propose a dialog guidance module to customize a reasonable sequence of subgoals and provide proper candidate knowledge. This module accomplishes two subtasks: subgoal generation and knowledge generation. To predict the next turn's subgoal G_next, we use a Transformer-based model (Vaswani et al., 2017) conditioned on the context X, the knowledge graph K, a user profile P, and the final recommendation subgoal G_T. We define K' as the union of P and K, and optimize the following loss function:

L_G = -∑_i log P(g_i^next | X, K', G_T, g_{<i}^next),

where g_i^next denotes the i-th token of G_next. We then input the predicted subgoal into another Transformer to obtain the candidate knowledge K_c. Because ground-truth responses carry no labeled knowledge, we obtain pseudo labels in an unsupervised manner. We first concatenate the knowledge items in each tuple (head, relation, tail). Then we compute the char-based F1 score (Wu et al.) between each knowledge item and the ground-truth response. Finally, we take the knowledge items with F1 scores greater than a threshold (thr = 0.35) as the pseudo label K_w. We optimize the following loss function to train the knowledge generator:

L_K = -∑_i log P(k_i^w | X, K', G_next, k_{<i}^w),

where k_i^w is a token of head or relation. We do not need to generate a complete tuple (head, relation, tail), because head and relation alone suffice to retrieve specific knowledge items. We then select the knowledge items matching the generated tuple (head, relation) as the candidate knowledge K_c. Finally, the dialog guidance module outputs G'_next = [G_next; G_T] (the concatenation of the predicted subgoal G_next and the final recommendation subgoal G_T) and K_c for the next stage.
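The pseudo-label construction above can be sketched as follows. This is a minimal illustration: the function names `char_f1` and `select_pseudo_labels` and the toy triples are our own, and the exact F1 variant of Wu et al. may differ slightly.

```python
from collections import Counter

def char_f1(knowledge: str, response: str) -> float:
    """Char-based F1 overlap between a linearized triple and a response."""
    common = Counter(knowledge) & Counter(response)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(knowledge)
    recall = overlap / len(response)
    return 2 * precision * recall / (precision + recall)

def select_pseudo_labels(triples, response, thr=0.35):
    """Keep (head, relation, tail) triples whose concatenation scores above thr."""
    return [t for t in triples if char_f1("".join(t), response) > thr]
```

Triples whose surface form barely overlaps the ground-truth response fall below the threshold and are excluded from K_w.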

Encoder
To incorporate different types of information, we use vanilla Transformer blocks as our encoder. We encode the context, the candidate knowledge, and the subgoals predicted by the dialog guidance module independently, since they have different structures. In addition, the input embedding is the sum of a word embedding, a type embedding, and a positional embedding, as shown in Figure 3. The multi-type embeddings help the encoder better distinguish different parts of the input (Wolf et al., 2018). Formally, the outputs of the encoder are computed as follows:

E_C = Transformer(I(X)), E_K = Transformer(I(K_c)), E_G = Transformer(I(G'_next)),

where I(·) is the embedding function of the input.
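The multi-type input embedding can be sketched as an element-wise sum of three embedding tables. The toy sizes and the helper name `embed` below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_types, max_len, d_model = 100, 3, 16, 8  # toy sizes (assumptions)
word_emb = rng.normal(size=(vocab_size, d_model))
type_emb = rng.normal(size=(n_types, d_model))  # e.g. context / knowledge / subgoal
pos_emb = rng.normal(size=(max_len, d_model))

def embed(token_ids, type_id):
    """Input embedding = word + type + positional, as in Figure 3."""
    positions = np.arange(len(token_ids))
    return word_emb[token_ids] + type_emb[type_id] + pos_emb[positions]

x = embed(np.array([5, 7, 9]), type_id=1)  # a 3-token knowledge segment
```

The shared type vector marks every token of a segment with its source, which is what lets a single encoder tell the three input streams apart.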

Decoder
We propose three new mechanisms incorporated into a Transformer-based decoder to generate informative responses consistent with the predicted subgoal. We describe the three mechanisms, a sequential attention mechanism, a noise filter, and a knowledge enhancement module, in detail below. The decoder produces the response autoregressively:

P(Y) = ∏_t P(y_t | y_{<t}, E_G, E_C, E_K).

Sequential Attention Mechanism
The sequential attention mechanism is designed to enhance subgoal guidance by simulating the human cognitive process: humans first form an overall idea of a recommendation and then pitch it given the current conversation context. We therefore let the decoder process the different parts of the encoder outputs at different layers and combine them in an order that resembles human cognition. Specifically, the Transformer-based decoder extracts features as follows:

O_G = MultiHead(I(Y_p), E_G, E_G),
O_KG = NF(O_G, E_C, E_K),

where MultiHead(Q, K, V) is the multi-head attention operation described in Vaswani et al. (2017), Y_p is the previously decoded tokens, I(·) is the embedding function of the input, and NF(·) denotes the noise filter. With this structure, the model captures valid information in the context and the knowledge conditioned on the subgoals, and then generates more coherent responses that are consistent with these subgoals.
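The layer ordering can be sketched with single-head attention for clarity (the paper uses multi-head attention); `decoder_layer` and the identity stand-in for the noise filter below are simplifications:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, for clarity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def decoder_layer(y_emb, e_g, e_c, e_k, noise_filter):
    """Attend to the subgoal encoding first, then fuse context and knowledge."""
    o_g = attention(y_emb, e_g, e_g)    # 1) overall plan from the subgoals
    o_kg = noise_filter(o_g, e_c, e_k)  # 2) gated context + knowledge features
    return o_kg

rng = np.random.default_rng(0)
y = rng.normal(size=(2, 4))                              # two decoded-token states
e_g, e_c, e_k = (rng.normal(size=(3, 4)) for _ in range(3))
out = decoder_layer(y, e_g, e_c, e_k, lambda o, c, k: o)  # identity filter here
```

The subgoal attention always runs before the context/knowledge fusion, which is exactly the ordering the ablation (KERS + Reverse) later tests.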

Noise Filter
Although we can generate high-quality candidate knowledge, erroneous candidate knowledge remains and can lead to unexpected responses. Moreover, since the recommender does not always provide knowledge-related responses, excessive knowledge input creates more noise. To address these problems, we propose a noise filter that selects better knowledge items, as shown in Figure 4. We filter the knowledge features with a knowledge gate. Specifically, the filter first takes the previous layer output O_G as a query to extract the features of the context encoding E_C and the knowledge encoding E_K by multi-head attention:

O_C = MultiHead(O_G, E_C, E_C), O_K = MultiHead(O_G, E_K, E_K).

Then, the knowledge gate computes a reduction weight α_k according to the matching degree between knowledge and context. Finally, the filter averages the context features and the knowledge features using α_k ∈ [0, 1] as the output O_KG:

α_k = sigmoid(W_k [O_C; O_K]), O_KG = α_k O_K + (1 − α_k) O_C,

where W_k is a trainable parameter. The noise filter controls the flow of knowledge: when responses are not knowledge-related, or the knowledge is not associated with the context, the reduction weight α_k decreases, and vice versa.
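A minimal numpy sketch of the gate follows, with single-head attention standing in for multi-head attention and W_k as a plain vector; the real parameterization of the gate may differ:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, for clarity)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def noise_filter(o_g, e_c, e_k, w_k):
    """Blend context and knowledge features with a learned reduction weight."""
    o_c = attention(o_g, e_c, e_c)  # context features
    o_k = attention(o_g, e_k, e_k)  # knowledge features
    # alpha_k = sigmoid(W_k [O_C; O_K]) gates how much knowledge flows through
    alpha_k = 1.0 / (1.0 + np.exp(-np.concatenate([o_c, o_k], axis=-1) @ w_k))
    return alpha_k[:, None] * o_k + (1 - alpha_k[:, None]) * o_c

rng = np.random.default_rng(0)
o_g = rng.normal(size=(2, 4))
e_c, e_k = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
w_k = rng.normal(size=8)  # gate parameter over the concatenated [O_C; O_K]
out = noise_filter(o_g, e_c, e_k, w_k)
```

Because the sigmoid keeps α_k in [0, 1], the output is always a convex combination of context and knowledge features, so a poorly matched knowledge item can be down-weighted without being discarded outright.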

Knowledge Enhancement Module
To generate more informative responses, we propose a knowledge enhancement module that puts more emphasis on retrieved knowledge through a set of learned weights. Specifically, we take the words in the knowledge K' as the knowledge lexicon. Then we compute the weighted probability distributions of words using a weight α_g ∈ [0, 1]:

α_g = sigmoid(W_g O), P_o = α_g softmax(W_v O) + (1 − α_g) P_k,

where W_g and W_v are trainable parameters and P_k is the distribution over the knowledge lexicon. α_g controls the weight of generating a general word; a low value of α_g highlights the words in the knowledge lexicon. During training, the model automatically learns to enhance the generation probability of knowledge words at the proper steps. The knowledge enhancement module not only helps the model produce more informative responses but also increases the presence of the selected knowledge in responses.

Table 1: Subgoal prediction accuracy.

Model                    Accuracy
CNN (Liu et al., 2020)   94.13
LSTM-CNN                 95.48
Ours                     96.60
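The weighted output distribution of the knowledge enhancement module can be sketched as below. This is an assumption-laden illustration: the knowledge-lexicon distribution is modeled here as a masked softmax over the same logits, which is one plausible reading of the mechanism rather than the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def output_distribution(o, w_v, w_g, know_mask):
    """Blend the general vocab distribution with a knowledge-lexicon one."""
    logits = o @ w_v
    p_vocab = softmax(logits)                                   # general words
    p_know = softmax(np.where(know_mask > 0, logits, -np.inf))  # lexicon words only
    alpha_g = 1.0 / (1.0 + np.exp(-(o @ w_g)))                  # scalar gate
    return alpha_g * p_vocab + (1 - alpha_g) * p_know

rng = np.random.default_rng(0)
d, vocab = 4, 6
o = rng.normal(size=d)               # decoder output state
w_v = rng.normal(size=(d, vocab))    # vocab projection
w_g = rng.normal(size=d)             # gate projection
mask = np.array([1, 0, 1, 0, 0, 0])  # words 0 and 2 appear in K'
p = output_distribution(o, w_v, w_g, mask)
```

Since both components are valid distributions, their convex combination still sums to one, and a small α_g shifts mass toward the knowledge-lexicon words.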

Training Objective
Because each module performs a different function, we train the model in two stages. First, we optimize the subgoal generation loss L_G and the knowledge generation loss L_K for the dialog guidance module. Then, we optimize the following cross-entropy loss between the predicted word distribution P_o and the ground-truth distribution o:

L_R = -∑ o log P_o.


Experiments

Dataset and Training Details
DuRecDial (Liu et al., 2020) is a Mandarin dataset for recommendation dialogs with annotated subgoals. Two crowd workers are assigned different profiles in a recommendation task with a diverse set of subgoals. There are four main categories of subgoals: 1) Chitchat: greeting, chitchat about celebrities, etc.; 2) Question answering: answering questions about weather, celebrities, movies, restaurants, music, time, etc.; 3) Recommendation: recommending movies, news, music, restaurants, etc.; 4) Task: requesting news, playing music, delivering weather reports. DuRecDial contains 10,190 recommendation dialogs, 21 subgoals, and 222,198 knowledge triples. We split the dataset into train/dev/test sets with a ratio of 6.5:1:2.5. Figure 1 shows an example dialog. We implement KERS in PyTorch. Both the encoder and decoder contain six Transformer blocks, each with 12 attention heads. The word embedding and hidden state sizes are both 768. We use a similar encoder-decoder structure for the subgoal generation and knowledge generation tasks. The vocabulary size is 30,000, and the maximum context length is 768.

Baseline Models
We compare KERS against several baselines:
• S2S+kg: We implement the seq2seq model as described in Vinyals and Le (2015) with the attention mechanism and concatenate all the related knowledge and the context as its input.
• Trans.+kg: We use a knowledge encoder to extract knowledge features. We concatenate the knowledge features and the context as the Transformer model's input.
• MGCG_G, MGCG_R: We use the generation and retrieval models based on the MGCG framework introduced by Liu et al. (2020).
To validate the effectiveness of each component, we conduct ablation studies as follows: (1) KERS w/o DiaGuidance: without the dialog guidance module; (2) KERS w/o Subgoal: without subgoal information input to the decoder; (3) KERS w/o CandidateKnow: without the candidate knowledge input to the decoder; (4) KERS + Topic: without the candidate knowledge but with the predicted topic as described in Liu et al. (2020); (5) KERS w/o NoiseFilter: without the noise filter; (6) KERS w/o KnowEnhance: without the knowledge enhancement module; (7) KERS + Reverse: KERS first extracts context and knowledge features, then extracts subgoal features; (8) KERS + Monolayer: using a monolayer attention mechanism; (9) KERS + AllKnowledge: with all the related knowledge rather than the candidate knowledge.

Moreover, we perform automatic evaluations on two subtasks, subgoal generation and knowledge generation. On these, we compare KERS against: (1) CNN: the CNN (Kim, 2014) model used in Liu et al. (2020); (2) LSTM-CNN: adding an LSTM (Hochreiter and Schmidhuber, 1997) before the CNN.

Automatic Evaluation Metrics
We evaluate the models on the original DuRecDial test set. We use perplexity (PPL), F1 (Liu et al., 2020), BLEU (Papineni et al., 2002), and DISTINCT (DIST-2) (Li et al., 2016) as common automatic evaluation metrics. Perplexity and DISTINCT measure the fluency and diversity of generated responses, respectively. F1 and BLEU measure the similarity between the generated responses and the ground truth. In addition, we compare training time (minutes/epoch) for efficiency. We also propose a knowledge F1 score to evaluate the accuracy of the selected knowledge: knowledge F1 is the F1 score computed between the generated response and the pseudo label (i.e., K_w described in Section 3.1). For the two subtasks, we compute subgoal prediction accuracy and knowledge prediction accuracy.
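As a reference point, DIST-n is conventionally the ratio of distinct n-grams to total n-grams over all generated responses. A minimal sketch (the exact tokenization used for DuRecDial may differ):

```python
def distinct_n(responses, n=2):
    """DIST-n: number of distinct n-grams / total n-grams over all responses."""
    total, seen = 0, set()
    for resp in responses:
        tokens = resp.split()
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0
```

A model that repeats the same phrasing across responses shares n-grams and scores low; fully distinct responses approach 1.0.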

Experimental Results
We first evaluate the effectiveness of subgoal prediction and knowledge prediction. Table 1 shows subgoal prediction accuracy. Our model achieves the best performance on subgoal prediction (96.60%) compared to CNN and LSTM-CNN. In addition, our model achieves a relatively high accuracy of 75.6% on knowledge prediction, which serves as a solid base to guide response generation. Since MGCG_R is a retrieval-based model and has poor results, we mainly compare our model with MGCG_G.
We present the response generation results in Table 2. Our model, KERS, achieves a significant improvement over the previous work MGCG_G: perplexity (PPL) −8.17, F1 +14.45, BLEU-1 +0.1226, BLEU-2 +0.1268, DIST-2 +0.0216, and knowledge F1 +15.36. Notably, KERS has the lowest perplexity and the highest knowledge F1, indicating the best fluency and knowledge use. Due to the advantages of retrieval models, MGCG_R has a high DIST-2, which suggests it produces more diverse responses. We also conduct an ablation study to evaluate each component's contribution to KERS's performance. Results show that removing the dialog guidance module sharply decreases KERS's performance, suggesting that the dialog guidance module plays a crucial role by providing reasonable subgoals and then selecting proper knowledge. Moreover, removing the predicted subgoals leads to worse performance but higher DIST-2. However, after careful inspection of responses generated by KERS w/o Subgoal, we find that these diverse responses are largely irrelevant to the current scene; even though they are more diverse, they do not lead to successful recommendations. We also find that using turn-level candidate knowledge boosts knowledge F1 compared to using subgoal-level topics, because turn-level candidate knowledge provides more fine-grained information to guide response generation. Although our knowledge prediction has a relatively high accuracy of 75.6%, there are still 24.4% incorrect cases: some do not need knowledge, and some receive the wrong knowledge. The noise filter is designed to address these cases and improves all the metrics, especially F1 by 3.0%. In addition, we find that removing the knowledge enhancement module sharply decreases KERS's DIST-2. We also observe that the sequential attention mechanism performs better than both the reverse attention and the monolayer structure.
This indicates that a reasonable attention sequence enables the model to better utilize subgoal and knowledge information. Furthermore, KERS outperforms KERS + AllKnowledge, especially improving knowledge F1 by 6.3%, while requiring only half its training time. This suggests that, rather than improving performance, incorporating all the knowledge introduces noise and lengthens training. Our model filters unnecessary information and is more efficient and effective.

Human Evaluation
Automatic metrics evaluate the model on several specific aspects, while humans can give a holistic evaluation. We conduct human evaluations on both turn level and dialog level to compare three models, KERS, MGCG_G, and Trans.+kg. In addition, we run a pair-wise preference test among these models.

Turn-level Evaluation
We randomly sample 200 examples from the test set and let each model generate a response according to a given context, related knowledge graph, and the final recommendation subgoal. We present the generated responses to five human evaluators. They assess the responses in terms of fluency, appropriateness, informativeness, and proactivity using a 3-point Likert scale.
The results are shown in the left portion of Table 3. Inter-rater agreement is measured using Fleiss's kappa (Fleiss and Cohen, 1973); the kappa for fluency, appropriateness, informativeness, and proactivity is 0.81, 0.76, 0.77, and 0.60, respectively. Our model outperforms all the baselines, especially on appropriateness and informativeness, indicating that KERS generates more appropriate and informative responses. Moreover, both MGCG_G and KERS obtain higher proactivity scores than Trans.+kg, suggesting that subgoal planning is vital in guiding dialogs.

Dialog-level Evaluation
We ask human evaluators to converse directly with the models through an interactive interface. Since there are 21 different subgoals with different requirements and a large number of possible subgoal sequences, evaluators must be trained with guidelines to evaluate the models effectively. Because such training is time-consuming and requires high proficiency, we recruited ten professional evaluators instead of crowd workers. To ensure that evaluators cover a wide range of conversation contexts, each evaluator interacts with the models in 6 different scenarios sampled from the test scenarios; 60 different scenarios are tested in total. After conversing with a dialog model, evaluators rate the dialog in terms of recommendation success, coherence, and engagingness on a 5-point Likert scale.
As shown in the right portion of Table 3, our model achieves a significant improvement on all three metrics, showing that KERS can complete different dialog types and make successful recommendations better than the baseline models.

Pair-wise Preference Test
We also conduct pair-wise comparisons of our model against the baseline models. We ask the ten evaluators to talk to both models in a pair under the same 60 scenarios used in the dialog-level evaluation and select the better model. Results are shown in Table 4. KERS is preferred by evaluators over MGCG_G and Trans.+kg (t-test, p < 0.05), suggesting that KERS performs better than the previous state-of-the-art models.

Case Study
To show the models' recommendation quality, we provide some examples. As shown in Table 5, KERS first answers the user's question correctly and then talks about the user's favorite star, Xiaoming Huang, to engage the user. KERS then mentions Xiaoming Huang's awards and honors, which gains the user's trust. Finally, KERS successfully recommends the movie Women Who Know How to Flirt Are the Luckiest, starring Xiaoming Huang, to the user. In contrast, MGCG_G recommends an inappropriate movie, The Bullet Vanishes, which is unrelated to the user's preferred star, and Trans.+kg recommends the correct movie title but mistakenly treats Women Who Know How to Flirt Are the Luckiest as a song. We also find that, without precise control of knowledge-aware response generation, both MGCG_G and Trans.+kg often answer questions incorrectly. These observations indicate that accurate and rich knowledge is essential to the recommendation process.

Conclusions
Providing an informative and appropriate recommendation process is vital in conversational recommendation with multiple dialog types. To improve recommendation quality, we present KERS, which enhances the accuracy and richness of knowledge in generated responses. Our model uses a dialog guidance module to provide proper subgoals and candidate knowledge, ensuring that the model interacts with the user in a planned way. In addition, we propose three new mechanisms in the decoder: a sequential attention mechanism, a noise filter, and a knowledge enhancement module. These mechanisms work together to increase the amount and accuracy of knowledge in responses. Experimental results show that KERS completes various subgoals and obtains state-of-the-art results compared to previous models. In the future, we plan to leverage knowledge graph paths to further enhance natural topic transitions in dialogs.

Ethical Considerations
Recommendation dialog systems have developed rapidly, and we must consider ethical principles in both the design and development stages. First, the ultimate goal of a recommendation system is to provide users with content they need, so the recommended content must be fair; over-recommending certain content because of a commercial relationship undermines fairness. Second, the internal mechanism of the system must be transparent, so that users can understand the nature of the system and avoid malicious selling. Similarly, during the operation of a recommendation dialog system, the collection of user information must be approved by the user, to prevent the system from being used to harvest private data. Finally, the recommended content must not be factually false or misleading; for example, recommending misleading news can spread rumors. The system needs to monitor recommended content to prevent such problems.