Learning Neural Templates for Recommender Dialogue System

The task of the Conversational Recommendation System (CRS), i.e., the recommender dialogue system, is to recommend precise items to users through natural language interactions. Though recent end-to-end neural models have shown promising progress on this task, two key challenges remain. First, the recommended items cannot always be incorporated into the generated responses precisely and appropriately. Second, only the items mentioned in the training corpus have a chance of being recommended in the conversation. To tackle these challenges, we introduce a novel framework called NTRD for recommender dialogue systems that decouples dialogue generation from item recommendation. NTRD has two key components, i.e., a response template generator and an item selector. The former adopts an encoder-decoder model to generate a response template with slot locations tied to target items, while the latter fills in the slot locations with proper items using a sufficient attention mechanism. Our approach combines the strengths of both classical slot-filling approaches (which are generally controllable) and modern neural NLG approaches (which are generally more natural and accurate). Extensive experiments on the benchmark ReDial show that our approach significantly outperforms previous state-of-the-art methods. Besides, our approach has the unique advantage of producing novel items that do not appear in the training set of the dialogue corpus. The code is available at https://github.com/jokieleung/NTRD.


Introduction
Building an intelligent dialogue system that can freely converse with humans and fulfill complex tasks such as movie recommendation and travel planning has been one of the longest-standing goals of natural language processing (NLP) and artificial intelligence (AI).

* Equal contribution. Work performed when Zujie Liang was an intern at Microsoft STCA. † Corresponding author: djiang@microsoft.com.

Thanks to the breakthrough
in deep learning, progress on dialogue systems has been greatly advanced and brought to a new frontier over the past few years. Nowadays, we are witnessing the boom of virtual assistants with conversational user interfaces such as Microsoft Cortana, Apple Siri, Amazon Alexa, and Google Assistant. Recent large-scale dialogue models such as DialoGPT, Meena (Adiwardana et al., 2020), and Blender (Roller et al., 2021) demonstrate impressive performance in practice. Besides, social bots such as XiaoIce (Shum et al., 2018) and PersonaChat (Zhang et al., 2018a) also exhibit great potential for emotional companionship with humans. These conversational techniques shed new light on search and recommender systems, since users can seek information through interactive dialogues with the system. Traditional recommender systems often rely on matrix factorization methods (Koren et al., 2009; Rendle, 2010; Wang et al., 2015; He et al., 2017) and suffer from the cold-start problem (Schein et al., 2002; Lika et al., 2014) when no prior knowledge about users is available. Moreover, existing recommendation models are trained on offline historical data and have an inherent limitation in capturing online user behaviors (Yisong, 2020). User preferences, however, are dynamic and often change over time. For instance, a user who usually prefers science fiction movies but is in the mood for comedies would likely receive a failed recommendation.
In recent years, there has been an emerging trend towards building recommender dialogue systems, i.e., Conversational Recommendation Systems (CRS), which aim to recommend precise items to users through natural conversations. Existing works in this line (Chen et al., 2019; Zhou et al., 2020a; Ma et al., 2020) usually consist of two major components, namely a recommender module and a dialogue module. The recommender module retrieves a subset of items that meet the user's interest from the item pool based on the conversation history, while the dialogue module generates free-form natural responses to proactively seek user preferences, chat with users, and provide recommendations. To incorporate the recommended items into the responses, these methods utilize a switching network (Gulcehre et al., 2016) or copy mechanism (Gu et al., 2016) to control whether to generate an ordinary word or an item at each time step. Such integration strategies cannot always incorporate the recommended items into the generated replies precisely and appropriately. Besides, current approaches do not consider the generalization ability of the model; hence, only the items mentioned in the training corpus have a chance of being recommended in the conversation.
In this paper, we propose to learn Neural Templates for Recommender Dialogue systems, i.e., NTRD. NTRD is a neural approach that first generates a response "template" with slot locations explicitly tied to the recommended items. These slots are then filled in with proper items by an item selector, which fully fuses the information from the dialogue context, the generated template, and the candidate items via sufficient multi-head self-attention layers. The entire architecture (response template generator and item selector) is trained in an end-to-end manner. Our approach combines the advantages of both classical slot-filling approaches (which are generally controllable) and modern neural NLG approaches (which are generally more natural and accurate), yielding both natural-sounding responses and more flexible item recommendation.
Another unique advantage of NTRD lies in its zero-shot capability to adapt to a regularly updated recommender system. Once a slotted response template is generated by the template generator, different recommender systems can easily be plugged into the item selector to fill in the slots with proper items. Thus, NTRD can produce diverse natural responses with the items recommended by different recommenders.
The contributions of this work are summarized as follows: (1) We present a novel framework called NTRD for recommender dialogue systems, which decouples response generation from item recommendation via a two-stage strategy; (2) NTRD first generates a response template that contains a mix of contextual words and slot locations explicitly associated with target items, and then fills in the slots with precise items via an item selector using a sufficient attention mechanism; (3) Extensive experiments on a standard dataset demonstrate that NTRD significantly outperforms previous state-of-the-art methods on both automatic metrics and human evaluation. Besides, NTRD also exhibits promising generalization ability on novel items that do not exist in the training corpus.

Related Work
In this section, we first introduce the related work on task-oriented dialogue system. Then we review the existing literature on Conversational Recommender Systems (CRS), which can be roughly divided into two categories, i.e., attribute-centric CRS and open-ended CRS.

Task-oriented Dialogue System.
From the methodology perspective, there are two lines of research on task-oriented dialogue systems, i.e., modular approaches (Young et al., 2013) and end-to-end approaches (Serban et al., 2016; Wen et al., 2017; Bordes et al., 2017; Zhao et al., 2017; Lei et al., 2018). Recent works such as GLMP (Wu et al., 2019) and the dynamic fusion network (Qin et al., 2020) attempt to dynamically incorporate external knowledge bases into the end-to-end framework. Wu et al. (2019) introduce a global-to-local memory pointer network into the RNN-based encoder-decoder framework to incorporate external knowledge in dialogue generation. By contrast, our approach moves away from the pointer-network paradigm and proposes a two-stage framework modeled with a transformer-based architecture.
Attribute-centric CRS. The attribute-centric CRS conducts recommendation by asking clarification questions about user preferences over a constrained set of item attributes. Such systems gradually narrow down the hypothesis space to search for the optimal items according to the collected user preferences. Various asking strategies have been extensively explored, such as the memory-network-based approach (Zhang et al., 2018b), the entropy-ranking-based approach, generalized-binary-search-based approaches (Zou and Kanoulas, 2019; Zou et al., 2020), reinforcement-learning-based approaches (Sun and Zhang, 2018; Hu et al., 2018; Lei et al., 2020a; Deng et al., 2021), the adversarial-learning-based approach (Ren et al., 2020), and graph-based approaches (Lei et al., 2020b; Ren et al., 2021). Most of these works (Christakopoulou et al., 2018; Zhang et al., 2018b; Deng et al., 2021) retrieve questions/answers from a template pool and fill pre-defined slots with optimal attributes. Although such systems are popular in industry due to their easy implementation, they still lack flexibility and interactivity, which leads to an undesirable user experience in practice.
Open-ended CRS. Recently, researchers have begun to explore more free-style item recommendation during response generation, i.e., open-ended CRS (Chen et al., 2019; Liao et al., 2019; Kang et al., 2019; Zhou et al., 2020a; Ma et al., 2020; Hayati et al., 2020; Zhou et al., 2020b; Zhang et al., 2021). Generally, such systems consist of two major components, namely a recommender component to recommend items and a dialogue component to generate natural responses. The first attempt in this direction released the benchmark dataset REDIAL, which collects human conversations about movie recommendation between paired crowd-workers with different roles (i.e., Seeker and Recommender). Further studies (Chen et al., 2019; Zhou et al., 2020a; Ma et al., 2020; Sarkar et al., 2020; Lu et al., 2021) leverage multiple external knowledge bases to enhance recommendation performance. The multi-goal driven conversation generation framework (MGCG) proactively and naturally leads a conversation from a non-recommendation dialogue to a recommendation-oriented one. Recently, Zhou et al. (2021) released an open-source CRS toolkit, i.e., CRSLab, to facilitate research in this direction. However, the pointer network (Gulcehre et al., 2016) or copy mechanism (Gu et al., 2016) used in these approaches cannot always accurately incorporate the recommended items into the generated replies. Moreover, with existing approaches, only the items mentioned in the training corpus have a chance of being recommended in conversations.
Our work lies in the line of open-ended CRS. Differently, we propose to decouple dialogue generation from item recommendation. Our approach first leverages a Seq2Seq model (Sutskever et al., 2014) to generate the response template, and then fills the slots in the template with proper items using a sufficient multi-head self-attention mechanism. Moreover, our work shows the unique advantage of producing novel items that do not exist in the training corpus.

Preliminary
Formally, a dialogue consisting of N conversation utterances is denoted as D = {s_t}_{t=1}^{N}. Let m denote an item from the total item set M, and w denote a word from the vocabulary V. At the t-th turn, the recommender module chooses several candidate items M_t from the item set M, while the dialogue module generates a natural language sentence s_t containing a proper item i from M_t to make the recommendation. It is noteworthy that M_t can be equal to ∅ when there is no need for recommendation; in that case, the dialogue module can continue to generate a chit-chat response or proactively explore the user's interests by asking questions. To incorporate the recommended items into the generated reply, a switching mechanism (Gulcehre et al., 2016) or CopyNet (Gu et al., 2016) is usually utilized to let the decoder decide whether it should generate a word from the vocabulary or an item from the recommender output. Specifically, the recommender predicts a probability distribution P_rec over the item set, and the dialogue module predicts a probability distribution P_dial ∈ R^{|V|} over the vocabulary. The overall probability of generating the next token is calculated as follows:

P(w_o) = p_s · P_dial(w_o) + (1 − p_s) · P_rec(w_o),   p_s = σ(W_s e + b_s),

where w_o represents either a word from the vocabulary or an item from the item set, e is the hidden representation from the final layer of the dialogue module, σ refers to the sigmoid function, and W_s and b_s are learnable parameters.
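To make the switching computation concrete, here is a minimal numpy sketch of the mixing step described above. The function names, shapes, and the concatenated output layout (vocabulary entries first, then items) are illustrative assumptions, not the original implementation:

```python
import numpy as np

def switch_prob(e, W_s, b_s):
    """Switching probability p_s = sigma(W_s e + b_s) for one time step."""
    return 1.0 / (1.0 + np.exp(-(W_s @ e + b_s)))

def next_token_dist(P_dial, P_rec, e, W_s, b_s):
    """Mix the vocabulary distribution P_dial and the item distribution
    P_rec with the switch; the first |V| entries of the result are words,
    the remaining entries are items."""
    p_s = switch_prob(e, W_s, b_s)
    return np.concatenate([p_s * P_dial, (1.0 - p_s) * P_rec])
```

Since P_dial and P_rec each sum to one, the mixed output is itself a valid probability distribution over the joint word-plus-item space.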

Method
In this section, we present the framework of learning Neural Templates for Recommender Dialogue system, called NTRD. As shown in Figure 1, NTRD mainly consists of three components: a recommendation-aware response template generator, a context-aware item selector and a knowledge graph (KG) based recommender. Given the dialogue context, the encoder-decoder based template generator focuses on generating the response template with item slots (Section 4.1). Then the blank slots are filled by the item selector according to the dialogue context, candidate items from the recommender module and the generated response template (Section 4.2). Finally, the entire framework is trained in an end-to-end manner (Section 4.3).

Response Template Generator
To generate the response template, we adopt a Transformer-based network (Vaswani et al., 2017) to model the process. Concretely, we follow Zhou et al. (2020a) in using the standard Transformer encoder architecture and a KG-enhanced decoder that effectively injects information from the KG into the generation process. We then add a special token [ITEM] to the vocabulary and mask all items in the utterances of the dialogue corpus with [ITEM] tokens. Thus, at each time step, the response template generator predicts either the special token [ITEM] or a general word from the vocabulary. Formally, the probability of generating the next token by the response template generator is given as follows:

P(w) = softmax(W_d e + b_d),

where W_d ∈ R^{|V|×d_e} and b_d ∈ R^{|V|} are weight and bias parameters, and d_e is the embedding size of the hidden representation e. After the generation process finishes, these special tokens serve as item slots in the generated templates, which will be filled with specific items by the item selector.
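The [ITEM] masking step can be illustrated with a small preprocessing sketch. The `mask_items` helper and its surface string matching are hypothetical simplifications (the actual corpus annotates movie mentions explicitly rather than matching raw strings):

```python
import re

ITEM_TOKEN = "[ITEM]"

def mask_items(utterance, item_names):
    """Replace every known item mention with the [ITEM] placeholder.
    Returns the template and the masked items in order of appearance.
    Longer names are matched first so 'Deadpool 2' beats 'Deadpool'."""
    pattern = "|".join(
        re.escape(n) for n in sorted(item_names, key=len, reverse=True)
    )
    masked = re.findall(pattern, utterance)
    template = re.sub(pattern, ITEM_TOKEN, utterance)
    return template, masked
```

During training the template generator sees only the masked form, so its vocabulary stays free of item surface forms.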

Slot Filling with Item Selector
Given the generated response templates, what remains is to fill the slot locations with proper items. Here, we first reuse the KG-enhanced recommender module (Zhou et al., 2020a) to obtain the user representation from the dialogue context.
The recommender module learns a user representation p_u by incorporating two special knowledge graphs, i.e., a word-oriented KG (Speer et al., 2017) that provides relations between words and an item-oriented KG (Bizer et al., 2009) that provides structured facts regarding the attributes of items. Given the learned user preference p_u, we compute the similarity between the user and an item m as follows:

s(u, m) = p_u^T h_m,   (4)

where h_m is the learned embedding of item m, and d_h is the dimension of h_m. Hence, we rank all items for p_u according to Eq. 4 and produce a candidate set from the total item set. Existing works (Chen et al., 2019; Zhou et al., 2020a) infer the final item based only on the dialogue context, yet the generated response template can also provide additional information for selecting the final item. For instance, as shown in the example of Figure 1, the words "romantic" and "funny" after the item slot provide contextual semantic information in the response for choosing the item to be recommended.
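The inner-product ranking of Eq. 4 can be sketched in a few lines; `rank_candidates` and its argument layout (items as rows of H) are illustrative assumptions:

```python
import numpy as np

def rank_candidates(p_u, H, k):
    """Score every item m by the inner product p_u^T h_m (Eq. 4),
    where row m of H is the item embedding h_m, and return the
    indices of the top-k candidate items."""
    scores = H @ p_u                # one similarity score per item
    return np.argsort(-scores)[:k]  # highest-scoring items first
```

The returned indices define the candidate set M_t handed to the item selector.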
Motivated by this, we propose a context-aware item selector built by stacking sufficient multi-head attention blocks, as shown in Figure 2. Formally, we define the embedding matrix E_slot for all the slots in the template, where each slot embedding is the hidden representation from the final layer of the transformer decoder. Similarly, the embedding matrix for the remaining tokens in the template is defined as E_word, the embedding matrix output by the transformer encoder is E_ctx, and H_cand is the concatenated embedding matrix of the candidate items. Hence, the calculation in the item selector is conducted as follows:

E'_slot = MHA(E_slot, E_word, E_word),
E''_slot = MHA(E'_slot, E_ctx, E_ctx),
Ê_slot = MHA(E''_slot, H_cand, H_cand),   (5)

where MHA(Q, K, V) denotes the multi-head attention function (Vaswani et al., 2017) that takes a query matrix Q, a key matrix K, and a value matrix V as input and outputs the attentive value matrix:

MHA(Q, K, V) = Concat(head_1, ..., head_h) W^O,   head_i = softmax(Q W_i^Q (K W_i^K)^T / √d_k) V W_i^V.

Note that the layer normalization with residual connections and the fully connected feed-forward network are omitted in Eq. 5 for simplicity. In this way, the item selector sufficiently fuses effective information from the generated template, the dialogue context, and the candidate items in a progressive manner, which is beneficial to selecting more suitable items to fill in the slot locations.

Figure 2: The overview of our approach. The slotted response template is first generated by the transformer decoder, and the item selector then fills in the slots with proper items. Our framework enables sufficient information interaction among the generated template, dialogue history, and candidate items in a progressive manner, which is beneficial to selecting the more suitable items to fill in the slot locations.
Finally, the item selector predicts a probability distribution over the candidate items and selects the one with the highest score to fill each slot:

P_sel = softmax(W_r ê_slot + b_r),

where ê_slot is the fused representation of the slot, and W_r ∈ R^{|M_t|×d_e} and b_r ∈ R^{|M_t|} are weight and bias parameters.
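The progressive fusion and final scoring can be sketched in numpy. This is a single-head simplification (the paper uses multi-head attention), and, as in Eq. 5, residual connections, layer normalization, and feed-forward sublayers are omitted; all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Single-head scaled dot-product attention."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def select_items(E_slot, E_word, E_ctx, H_cand, W_r, b_r):
    """Progressively fuse template words, dialogue context, and
    candidate items into the slot representations, then score the
    candidates: one probability distribution per slot."""
    h = attend(E_slot, E_word, E_word)  # fuse remaining template tokens
    h = attend(h, E_ctx, E_ctx)         # fuse dialogue context
    h = attend(h, H_cand, H_cand)       # fuse candidate items
    return softmax(h @ W_r.T + b_r)     # shape: (n_slots, |M_t|)
```

Each row of the output is the distribution P_sel for one slot; the argmax item fills that slot.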

Training Objectives
Though the entire framework is conceptually two-stage, the two modules can be trained simultaneously in an end-to-end manner. For the template generation, we optimize a standard cross-entropy loss:

L_gen = − (1/N) Σ_{t=1}^{N} Σ_i log P(w_i^t),

where N is the number of turns in a conversation D and s_t is the t-th utterance of the conversation. The loss function for the item selector is calculated as:

L_sel = − (1/|M_D|) Σ_{j=1}^{|M_D|} log P_sel(m_j),

where |M_D| is the number of ground-truth recommended items in a conversation D.
We combine the template generation loss and the slot-selection loss as:

L = L_gen + λ L_sel,

where λ is a weighting hyperparameter. During inference, we apply greedy search to decode the response template s_t = (w_1, w_2, ..., w_s).
If w i is the special token [ITEM], the item selector will be used to select the appropriate specific item based on the dialogue context, generated template and candidate items. Finally, the completed response will be sent to the user to carry on the interaction.
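The inference-time slot-filling loop described above can be sketched as follows; `fill_template` and the per-slot ranked lists are hypothetical simplifications of the selector's output:

```python
def fill_template(template_tokens, ranked_items):
    """Replace each [ITEM] slot in the decoded template with the
    top-ranked item for that slot. `ranked_items[i]` is the item
    ranking the selector produced for the i-th slot."""
    out, slot_idx = [], 0
    for tok in template_tokens:
        if tok == "[ITEM]":
            out.append(ranked_items[slot_idx][0])  # highest-scoring item
            slot_idx += 1
        else:
            out.append(tok)
    return " ".join(out)
```

Because the template and the rankings are decoupled, swapping in a different recommender only changes `ranked_items`, not the template.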

Dataset
To evaluate the performance of our method, we conduct comprehensive experiments on the REDIAL dataset (https://redialdata.github.io/website/), a recent CRS benchmark. This dataset contains high-quality dialogues about movie recommendation collected from crowd-sourcing workers on Amazon Mechanical Turk (AMT): 10,006 conversations comprising 182,150 utterances related to 6,924 movies, split into training, validation, and test sets in an 80/10/10 proportion.

Evaluation Metrics
Both automatic metrics and human evaluation are employed to assess the performance of our method. For dialogue generation, the automatic metrics include: (1) Fluency: perplexity (PPL), which measures the model's confidence in the generated responses; (2) Diversity: Distinct-n (Dist-n), defined as the number of distinct n-grams divided by the total number of words. Specifically, we use Dist-2/3/4 at the sentence level to evaluate the diversity of the generated responses.
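Dist-n, following the definition above (distinct n-grams over the total number of words), can be computed as in this sketch with whitespace tokenization assumed:

```python
def distinct_n(sentences, n):
    """Dist-n: the number of distinct n-grams across all sentences,
    divided by the total number of words, per the definition above."""
    ngrams, total_words = set(), 0
    for sent in sentences:
        toks = sent.split()
        total_words += len(toks)
        for i in range(len(toks) - n + 1):
            ngrams.add(tuple(toks[i:i + n]))
    return len(ngrams) / total_words if total_words else 0.0
```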
For the recommendation task, existing works (Chen et al., 2019; Zhou et al., 2020a) evaluate recommendation performance in isolation using Recall@k. However, the goal of open-ended CRS is to smoothly chat with users and naturally incorporate proper recommendation items into the responses; in other words, it is important for the system to generate informative replies containing the accurate items. Hence, we introduce a new metric that checks whether the ground-truth item is included in the final generated response, i.e., Recall@1 in Response (ReR@1). Similarly, if the generated response has an item token, we check whether the top-k (k=10, 50) items of the probability distribution at this position contain the ground-truth item, i.e., ReR@10 and ReR@50. Besides, we also introduce Item Diversity, which measures the percentage of recommended items mentioned in the generated responses relative to all items in the dataset. Item Ratio, introduced by Zhou et al. (2020a), measures the ratio of items in the generated responses.
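The ReR@k metric could be computed along the following lines; the exact denominator (responses containing an item token) and the single-slot representation are assumptions about the paper's protocol:

```python
def recall_in_response(responses, gold_items, k=1):
    """ReR@k sketch: each element of `responses` is the ranked item
    list predicted for the item token in that response, or None if the
    response contains no item token. Counts how often the ground-truth
    item appears in the top-k, over responses that have an item token."""
    hits, total = 0, 0
    for ranked, gold in zip(responses, gold_items):
        if ranked is None:
            continue  # no item token in this response
        total += 1
        hits += int(gold in ranked[:k])
    return hits / total if total else 0.0
```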
For human evaluation, 100 dialogues are randomly sampled from the test set. Three crowd-workers are then employed to score the generated responses in terms of Fluency and Informativeness on a scale of 1 to 3, where higher is better. We report the average score of each metric over these 100 dialogues across the three workers. Inter-annotator agreement is measured by Fleiss' kappa (Fleiss and Cohen, 1973).

Implementation Details
The models are implemented in PyTorch and trained on one NVIDIA Tesla V100 32G card. For a fair comparison, we keep the data preprocessing steps and hyperparameter settings the same as the KGSF model (Zhou et al., 2020a) in its released implementation. The embedding size d_h of items in the recommender module is set to 128, and the embedding size d_e in the dialogue module is set to 300. We follow the procedure in KGSF to pre-train the knowledge graphs in the recommender module using the Mutual Information Maximization (MIM) loss for 3 epochs. The recommender module is then trained until the cross-entropy loss converges. For the training of the response template generator, we replace the movies mentioned in the corpus with the special token [ITEM] and add it to the vocabulary. We use the Adam optimizer with a learning rate of 1e-3. The batch size is set to 32, and gradient clipping restricts gradients to [0, 0.1]. The generation loss and the item selection loss are trained simultaneously with the weight λ = 5.

Baselines
We introduce the baseline models for the experiments as follows: • REDIAL: The baseline model released with the REDIAL dataset, consisting of an auto-encoder (Wang et al., 2015) recommender, a dialogue generation model based on HRED (Serban et al., 2017), and a sentiment prediction model.
• KBRD (Chen et al., 2019): This model utilizes a KG to enhance the user representation. The transformer-based (Vaswani et al., 2017) dialogue generation model uses KG information as the vocabulary bias for generation.
• KGSF (Zhou et al., 2020a): The model proposes to incorporate two external knowledge graphs, i.e., a word-oriented KG and an itemoriented KG, to further enhance in modeling the user preferences.
Experimental Results

Evaluation on Dialogue Generation
We conduct the automatic and human evaluations to evaluate the quality of generated responses.
Automatic Evaluation. Table 1 shows the automatic evaluation results of the baseline models and our proposed NTRD on dialogue generation. As we can see, NTRD is clearly better than the baseline models on all automatic metrics. Specifically, NTRD achieves the best performance on PPL, which indicates that the generator of NTRD also produces fluent response templates. In terms of diversity, NTRD consistently outperforms the baselines by a large margin on Dist-2/3/4. This is because the generated template provides extra contextual information for slot filling, producing more diverse and informative responses.
Human Evaluation. We report the human evaluation results in Table 2. All Fleiss' kappa values exceed 0.6, indicating that the crowd-sourcing annotators reached substantial agreement. Compared to KGSF, our NTRD performs better in terms of Fluency and Informativeness. NTRD decouples response generation and item injection by first learning response templates and then filling the slots with proper items; hence, it generates more fluent and informative responses in practice.

Evaluation on Recommendation
In this section, we evaluate the recommendation performance from two aspects, i.e., conventional item recommendation to assess the recall performance and novel item recommendation to investigate the generalization ability.
Conventional Item Recommendation. To investigate the performance of NTRD on conventional item recommendation, we present the experimental results of ReR@k (k=1, 10, and 50), Item Diversity, and Item Ratio in Table 1. As can be seen, when the actual recommendation performance is evaluated on the final produced responses, the state-of-the-art method KGSF performs poorly, with only 0.889% ReR@1. This indicates that the switching network in KGSF cannot accurately incorporate the recalled items into the generated responses, violating the original intention of open-ended CRS, i.e., to not only smoothly chat with users but also recommend precise items in free-form natural text. By contrast, our NTRD framework performs significantly better, which shows that the decoupling strategy brings an obvious advantage in incorporating precise items into conversations with users. Furthermore, NTRD achieves the highest Item Ratio and Item Diversity. On the one hand, the template generator introduces the special token [ITEM] and thus reduces the vocabulary size, which increases the predicted probability of an item slot during generation. On the other hand, the item selector utilizes sufficient information from the dialogue context, the generated template, and the candidate items to help select high-quality recommended items.
Novel Item Recommendation. Existing methods have a major drawback: they cannot handle novel items that never appear in the training corpus. To validate the unique advantage of NTRD on novel item recommendation, we conduct an additional experiment. Specifically, we collect all items from the test set that do not appear in the training set, i.e., 373 novel items in total. To learn the representations of these novel items, we first include them together with the other ordinary items in the pre-training of the recommender modules of both KGSF and NTRD. When training the dialogue modules, however, we use only the normal training set, from which these novel items are excluded. We then evaluate the models on the test set. As shown in Table 3, 13.40% (50 of 373) of the novel items are successfully incorporated into the final responses and thus recommended to users, while KGSF fails to recommend any of the novel items. This verifies the promising generalization ability of NTRD on unseen items, which is crucial in real-world settings. Even though recommender modules are updated regularly to include novel items, our framework can still effectively incorporate these items into conversational responses without re-training the dialogue model.

Case Study
In this section, we present an interactive example to demonstrate how our model works in interactive dialogue; for more cases, please refer to the Appendix. In Table 4, we mark all mentioned items in blue and the user preferences in red for easy reading. The outputs of three recommenders, i.e., human, KGSF, and NTRD, are presented for comparison. The dialogue starts with greetings between the user (seeker) and the recommenders, followed by the recommenders proactively seeking the user's preference by asking which kind of movie he/she likes. Given the user's focused preference for "comedy" movies, the recommenders provide some candidate movie items with interesting comments. The responses of KGSF tend to be boring and dull, and it does not switch its recommended item even though the user rejects it by saying "I saw that one". Instead, our NTRD behaves more like a human and shares a similar taste with the human recommender (both recommend the comedy movie "Shaun of the Dead"). Then, when the user expresses the intention to watch, both our NTRD and the human recommender respond with goodwill and end the dialogue, while KGSF still tries to give a new movie recommendation. In a nutshell, our NTRD is able to provide informative and interesting item recommendations at the utterance level, and behaves more naturally and interactively at the dialogue level.

User: I saw that one. A classic parody film.
Human: Also Hot_Fuzz is funny. I really thought Ghostbusters was funny. Or you can watch Hangover.
KGSF: I would recommend Bad Santa.
NTRD: Have you seen Shaun of the Dead? It's a really good one.
User: I am a big fan of both of those movies as well.
Human: But Deadpool 2 (2018) just came out and it's funny as well. Have you seen it?
KGSF: I also like Hot Fuzz.
NTRD: I think you will like Borat.
User: Not yet, but I plan to.
Human: Ok. Hope I helped, bye.
KGSF: I also like Groundhog Day.
NTRD: I hope you enjoy your night!

Table 4: A sampled case between a real user as a seeker and the dialogue agents (human, KGSF, and our NTRD) as recommenders. Mentioned items are marked in blue, while user preferences in the user's turns are marked in red.

Conclusions
In this paper, we introduce NTRD, a novel recommender dialogue framework that decouples dialogue generation from item recommendation via a two-stage strategy. Our approach reconciles classical slot-filling approaches with modern neural NLG approaches, which makes the recommender dialogue system more flexible and controllable. Besides, our approach exhibits promising generalization ability in recommending novel items that do not appear in the training corpus. Extensive experiments show that our approach significantly outperforms previous state-of-the-art methods.
For future work, the generalization ability of NTRD could be further explored. The current method supports only a single placeholder with broad semantics to represent all item mentions in the dialogue corpus, which lacks fine-grained annotation. One possible direction is to extend it to support fine-grained item placeholders, such as replacing the placeholder with different attributes of the items, to further improve performance.