A Template-guided Hybrid Pointer Network for Knowledge-based Task-oriented Dialogue Systems

Most existing neural network based task-oriented dialog systems follow the encoder-decoder paradigm, where the decoder depends purely on the source texts to generate a sequence of words, and thus usually suffer from instability and poor readability. Inspired by traditional template-based generation approaches, we propose a template-guided hybrid pointer network for knowledge-based task-oriented dialog systems, which retrieves several potentially relevant answers from a pre-constructed domain-specific conversational repository as guidance answers, and incorporates the guidance answers into both the encoding and decoding processes. Specifically, we design a memory pointer network model with a gating mechanism to fully exploit the semantic correlation between the retrieved answers and the ground-truth response. We evaluate our model on four widely used task-oriented datasets, including one simulated and three manually created datasets. The experimental results demonstrate that the proposed model achieves significantly better performance than state-of-the-art methods across different automatic evaluation metrics.


Introduction
Task-oriented dialogue systems have attracted increasing attention recently due to broad applications such as reserving restaurants and booking flights. Conventional task-oriented dialogue systems are mainly implemented by rule-based methods (Lemon et al., 2006; Wang and Lemon, 2013), which rely heavily on hand-crafted features, establishing significant barriers to adapting the dialogue systems to new domains. Motivated by the great success of deep learning in various NLP tasks, neural network based methods (Bordes et al., 2017; Madotto et al., 2018) have dominated the study, since these methods can be trained in an end-to-end manner and scaled to different domains. (Code: https://github.com/wdimmy/THPN)
Despite the remarkable progress of previous studies, the performance of task-oriented dialogue systems is still far from satisfactory. On one hand, due to the exposure bias problem (Ranzato et al., 2016), neural network based models, e.g., sequence-to-sequence (Seq2Seq) models, tend to accumulate errors as the length of the generation increases. Concretely, the first several generated words can be reasonable, while the quality of the generated sequence deteriorates quickly once the decoder produces a "bad" word. On the other hand, as shown in previous works (Cao et al., 2018; Madotto et al., 2018), Seq2Seq models are likely to generate non-committal or similar responses that often involve high-frequency words or phrases. These responses are usually of low informativeness or readability. This may be because arbitrary-length sequences can be generated, and it is not enough for the decoder to rely purely on the source input sentence to generate informative and fluent responses.
We demonstrate empirically that in task-oriented dialogue systems, the responses to requests of similar types often follow the same sentence structure, except that different named entities are used according to the specific dialogue context. Table 1 shows two conversations from real task-oriented dialogues about navigation and weather. From the navigation case, we can observe that although the two requests are for different destinations, the corresponding responses are similar in sentence structure, replacing "children's health" with "5677 springer street". For the weather example, the model must first detect the entity "carson" and then query the corresponding information from the knowledge base (KB); after obtaining the returned KB entries, the response (the gold answer "the temperature in carson on tuesday will be low of 20f and high of 40f") is generated by replacing the corresponding entities in the retrieved candidate answer. Therefore, we argue that the gold responses of requests with similar types can provide a reference point to guide the response generation process and enable the model to generate high-quality responses for the given requests.
In this paper, we propose a template-guided hybrid pointer network (THPN) to generate the response to a user-issued query, in which the domain-specific knowledge base (KB) and potentially relevant answers are leveraged as extra input to enrich the input representations of the decoder. Here, the knowledge base refers to the database that stores the relevant and necessary information for supporting the model in accomplishing the given tasks. We follow previous works and use a triple (subject, relation, object) representation. For example, the triple (Starbucks, address, 792 Bedoin St) is an entry in the KB representing information related to Starbucks. Specifically, given a query, we first retrieve the top-n answer candidates from a pre-constructed conversational repository of question-answer pairs using BERT (Devlin et al., 2018). Then, we extend memory networks (Sukhbaatar et al., 2015) to incorporate the commonsense knowledge from the KB to learn knowledge-enhanced representations of the dialogue history. Finally, we introduce a gating mechanism to effectively utilize the candidate answers and improve the decoding process. The main contributions of this paper can be summarized as follows:
• We propose a hybrid pointer network consisting of an entity pointer network (EPN) and a pattern pointer network (PPN) to generate informative and relevant responses. EPN copies entity words from the dialogue history, and PPN extracts pattern words from the retrieved answers.
• We introduce a gating mechanism to learn the semantic correlations between the userissued query and the retrieved candidate answers, which reduces the "noise" brought by the retrieved answers.
• We evaluate the effectiveness of our model on four benchmark task-oriented dialogue datasets from different domains. Experimental results demonstrate the superiority of our proposed model.

Related Work
Task-oriented dialogue systems are mainly studied via two different approaches: pipeline based and end-to-end. Pipeline based models (Williams and Young, 2007; Young et al., 2013) achieve good stability but need domain-specific knowledge and handcrafted labels. End-to-end methods have shown promising results recently and attracted more attention since they are easily adapted to a new domain. Neural network based dialogue systems can avoid laborious feature engineering since neural networks have a great ability to learn latent representations of the input text. However, as revealed by previous studies (Koehn and Knowles, 2017; Cao et al., 2018; He et al., 2019), the performance of sequence-to-sequence models deteriorates quickly as the length of the generation increases. Therefore, how to improve the stability and readability of neural network models has attracted increasing attention. Prior work proposed a copy-augmented Seq2Seq model that copies relevant information directly from the KB. Madotto et al. (2018) proposed a generative model employing multi-hop attention over memories combined with the idea of pointer networks. Wu et al. (2019) proposed a global-to-local pointer mechanism to effectively utilize the knowledge base information, which improves the quality of the generated responses.
Previously proposed neural approaches have shown the importance of external knowledge in sequence generation (Chen et al., 2017; Zhu et al., 2018; Yang et al., 2019; Ding et al., 2019), especially in task-oriented dialogue systems, where an appropriate response usually requires correctly extracting knowledge from a domain-specific or commonsense knowledge base (Madotto et al., 2018; Zhu et al., 2018; Qin et al., 2019). However, how best to incorporate external knowledge into the model is still under active exploration. Yan et al. (2016) and Song et al. (2018) argue that retrieval-based and generative methods have their own merits and demerits, and they achieved good performance in chit-chat response generation by incorporating retrieved results into Seq2Seq based models. Zhu et al. (2018) proposed an adversarial training approach that is enhanced by retrieving related candidate answers for neural response generation, and Ghazvininejad et al. (2018) applied a similar method in a neural conversation model. In addition, in task-oriented dialogue tasks, the copy mechanism (Gulcehre et al., 2016) has also been widely utilized (e.g., Madotto et al., 2018), which shows the superiority of generation-based methods with a copy strategy.

Methodology
We build our model on a Seq2Seq dialogue generation model; the overall architecture is shown in Figure 1. Each module is elaborated in the following subsections.

Encoder Module
By checking if a word is in the given KB, we divide words into two types: entity words (EW) and non-entity words (NEW). Taking "what is the temperature of carson on tuesday" as an example, all words are NEW except for "carson" and "tuesday".
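As an illustration, this word typing step can be sketched as follows (a minimal sketch, not the paper's implementation: the KB lookup over (subject, relation, object) triples is simplified to a set of entity strings, and whitespace tokenization is assumed):

```python
def build_entity_vocab(kb_triples):
    """Collect all subjects and objects from (subject, relation, object) triples."""
    vocab = set()
    for subj, _rel, obj in kb_triples:
        vocab.add(subj)
        vocab.add(obj)
    return vocab

def tag_words(utterance, entity_vocab):
    """Label each token as an entity word (EW) or non-entity word (NEW)."""
    return [(w, "EW" if w in entity_vocab else "NEW") for w in utterance.split()]

# Toy KB entries for the running example.
kb = [("carson", "date", "tuesday"), ("carson", "low_temperature", "20f")]
tags = tag_words("what is the temperature of carson on tuesday", build_entity_vocab(kb))
```

Here only "carson" and "tuesday" are tagged EW; all other tokens are NEW.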
We represent a multi-turn dialogue as D = {(u_1, s_1), (u_2, s_2), ..., (u_T, s_T)}, where T is the number of turns in the dialogue, and u_i and s_i denote the utterances of the user and the system at the i-th turn, respectively. KB information is represented as KB = {k_1, k_2, ..., k_l}, where k_i is a tuple and l is the size of the KB. Following Madotto et al. (2018), we concatenate the previous dialogue and the KB as input. At the first turn, the input to the encoder is [u_1; KB], the concatenation of the first user request and the KB. For i > 1, the previous dialogue history is also included; namely, the input is [u_1, s_1, ..., u_i; KB]. We define the words in the concatenated input as a sequence of tokens W = {w_1, w_2, ..., w_n}, where w_j ∈ {u_1, s_1, ..., u_i, KB} and n is the number of tokens.
In this paper, we use the memory network (MemNN) proposed in Sukhbaatar et al. (2015) as the encoder module. The memories of MemNN are represented by a set of trainable embedding matrices M = {M^1, M^2, ..., M^K}, where K is the number of hops and each M^k maps the input into vectors. Different from Sukhbaatar et al. (2015) and Madotto et al. (2018), we initialize each M^k with pre-trained embeddings, whose weights are set to be trainable. At hop k, W is mapped to a set of memory vectors {m^k_1, m^k_2, ..., m^k_n} of dimension d, computed by embedding each word in a continuous space, in the simplest case using an embedding matrix. A query vector q is used as a reading head, which loops over the K hops. At hop k, the attention weight for each memory is computed by taking the inner product followed by a softmax function, p^k_i = softmax((q^k)^T m^k_i), where p^k_i is a soft memory selector that decides the relevance of each memory with respect to the query vector q^k. The model then obtains the memory readout c^k by the weighted sum over the next-hop memories, c^k = Σ_i p^k_i m^{k+1}_i. In addition, the query vector is updated for the next hop by q^{k+1} = q^k + c^k. In total, we obtain K hidden states encoded by MemNN, represented as C = {c^1, c^2, ..., c^K}.

Masking NEW in the history dialogue We observe that the ratio of non-entity words shared by the history dialogue and the expected response is extremely low. Therefore, to prevent the model from copying non-entity words from the history dialogue, we introduce an array R_h whose elements are zeros and ones, where 0 denotes NEW and 1 denotes EW. When w_i is pointed to, if i is the sentinel location or R_h[i] = 0, then w_i will not be copied.

Figure 1: The overall structure of our model. During test time, given a user query q, we retrieve at most 3 questions similar to q using BERT from the QA pairs repository, and the corresponding answers are used as our answer templates.
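The multi-hop reading operation described above can be sketched in a few lines of NumPy (an illustrative sketch with random memories; the per-hop embedding matrices M^k are abstracted into precomputed memory arrays):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memnn_read(memories, q):
    """One pass of the MemNN reading head over K hops.

    memories: list of K+1 arrays of shape (n, d); memories[k] plays the role
              of the hop-k memory m^k and memories[k+1] of m^{k+1}.
    q:        query vector of shape (d,), used as the reading head.
    Returns the per-hop readouts c^1..c^K and the final query vector.
    """
    K = len(memories) - 1
    cs = []
    for k in range(K):
        p = softmax(memories[k] @ q)   # p^k_i = softmax((q^k)^T m^k_i)
        c = p @ memories[k + 1]        # c^k = sum_i p^k_i m^{k+1}_i
        q = q + c                      # q^{k+1} = q^k + c^k
        cs.append(c)
    return cs, q

rng = np.random.default_rng(0)
mems = [rng.normal(size=(5, 8)) for _ in range(4)]  # n = 5 tokens, d = 8, K = 3 hops
cs, q_final = memnn_read(mems, rng.normal(size=8))
```

The K readouts `cs` correspond to C = {c^1, c^2, ..., c^K} used later by the decoder.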
The retrieved answers, as well as the dialogue history and KB information, are then utilized for the response generation. In particular, we utilize the gating mechanism to filter out noise from unrelated retrieval results. Finally, words are either generated from the vocabulary or copied directly from the multi-source information using a hybrid pointer network.

Retrieval Module
For each dataset, we use the corresponding training data to pre-construct a question-answer repository. In particular, we treat each post-response pair (u_i and s_i) in a dialogue as a question-answer pair. To effectively retrieve potentially relevant answers, we adopt a sentence matching based approach, in which each sentence is represented as a dense vector and the cosine similarity serves as the selection metric. We explored several unsupervised text matching methods, such as BM25 (Robertson et al., 2009), Word2Vec (Mikolov et al., 2013b), and BERT (Devlin et al., 2018), and found that BERT achieves the best performance. In addition, based on our preliminary experiments, we observed that the number of retrieved answer candidates has an impact on model performance, so we define a threshold θ to control the number of retrieved answer candidates.
Specifically, for each question in the preconstructed database, we pre-compute the corresponding sentence embedding using BERT. Then, for each new user-issued query u q , we embed u q into u e q , and search in the pre-constructed database for the most similar requests based on cosine similarity. The corresponding answers are selected and serve as our answer candidates.
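The retrieval step can be sketched as follows (a sketch only: in our setting the question vectors `q_vec` would be BERT sentence embeddings precomputed offline, which are abstracted here to arbitrary toy vectors):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, qa_repo, top_n=3):
    """Rank precomputed question embeddings by cosine similarity and
    return the answers paired with the top-n most similar questions."""
    scored = sorted(qa_repo, key=lambda qa: cosine(query_vec, qa["q_vec"]), reverse=True)
    return [qa["answer"] for qa in scored[:top_n]]

# Toy repository; in practice q_vec = BERT(question).
repo = [
    {"q_vec": np.array([1.0, 0.0]), "answer": "a1"},
    {"q_vec": np.array([0.9, 0.1]), "answer": "a2"},
    {"q_vec": np.array([0.0, 1.0]), "answer": "a3"},
]
answers = retrieve(np.array([1.0, 0.05]), repo, top_n=2)
```

The selected answers then serve as the answer candidates fed to the encoder and decoder.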
Masking EW in the retrieved answers In real dialogue scenes, the sentence structure of a reply might be similar, but the involved entities are usually different. To prevent the model from copying these entities, we introduce another array R_r, similar to the R_h mentioned before. Finally, the retrieved candidate answers are encoded into low-dimensional distributed representations, denoted as AN = {a_1, a_2, ..., a_m}, where m is the total number of words. Moreover, through an interaction between c^K and AN, i.e., attention weights α_j = softmax((c^K)^T a_j) followed by the weighted sum h_a = Σ_j α_j a_j, we obtain a dense vector h_a as the representation of the retrieved answers.

Decoder Module
We first apply a Gated Recurrent Unit (GRU) (Chung et al., 2014) to obtain the hidden state h_t = GRU(φ_emb(y_{t-1}), h_{t-1}), where φ_emb(·) is an embedding function that maps each token to a fixed-dimensional vector. At the first time step, we use the special symbol "SOS" as y_0, and the input to the hidden state consists of three parts, namely, the last hidden state h_{t-1}, the attention over C = {c_1, c_2, ..., c_K} from the encoder module, denoted as H_c, and H_g, which is calculated by linearly transforming the last state h_{t-1} and h_a with a multi-layer perceptron network, H_g = MLP([h_{t-1}; h_a]). We formulate H_c as follows.

Attention over C = {c_1, c_2, ..., c_K} Since MemNN consists of multiple hops, we believe that different hops are relatively independent and carry their own semantic meanings over the history dialogue. At different time steps we need different semantic information to generate different tokens, so our aim is to obtain a context-aware representation. We achieve this by applying an attention mechanism to the hidden states obtained at the different hops, H_c = Σ_k α_k c_k with α_k = softmax(η(h_{t-1}, c_k)), where η is the function that scores the correspondence for attention, usually approximated by a multi-layer neural network.
Template-guided gating mechanism As reported in Song et al. (2018), the top-ranked retrieved reply is not always the one that best matches the query, and multiple retrieved replies may provide different reference information to guide the response generation. However, using multiple retrieved replies also increases the probability of introducing "noisy" information, which adversely reduces the quality of the response generation. To tackle this issue, we add a gating mechanism to the hidden state of the candidate answers, aiming at extracting valuable "information" at different time steps. Mathematically, we use element-wise multiplication to model the interaction between the candidate answers (h_a) and the last hidden state of the GRU, g_t = σ(W_g h*_{t-1}) and h̃_a = g_t ⊙ h_a, where h*_{t-1} is obtained by concatenating h_{t-1}, H_c, and H_g.
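A minimal NumPy sketch of this gate (shapes and the projection matrix W_g are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_answers(h_star_prev, h_a, W_g):
    """Element-wise gate over the candidate-answer representation h_a.

    h_star_prev: concatenation [h_{t-1}; H_c; H_g], shape (3d,)
    h_a:         retrieved-answer representation, shape (d,)
    W_g:         trainable projection, shape (d, 3d)
    """
    g = sigmoid(W_g @ h_star_prev)  # gate values in (0, 1)^d
    return g * h_a                  # element-wise filtering of h_a

d = 4
rng = np.random.default_rng(1)
gated = gate_answers(rng.normal(size=3 * d), rng.normal(size=d), rng.normal(size=(d, 3 * d)))
```

Because the gate lies in (0, 1), each dimension of h_a is attenuated rather than flipped, letting the decoder suppress noisy retrieval features at each step.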
Hybrid pointer networks We use another MemNN with three hops for the response generation, where h_t of the GRU serves as the initial reading head, as shown in Figure 1. Besides a candidate softmax P_v used for generating a word from the vocabulary, we adopt the idea of Pointer Softmax from Gulcehre et al. (2016), and introduce an Entity Pointer Network (EPN) and a Pattern Pointer Network (PPN), where EPN is trained to copy entity words from the dialogue history (or KB), and PPN is responsible for extracting pattern words from the retrieved answers. For EPN, we use a location softmax P_h, a pointer network where each output dimension corresponds to the location of a word in the context sequence. Likewise, we introduce a location softmax P_r for PPN. P_v is generated from the concatenation of the first-hop attention read-out and the current query vector. For P_r and P_h, we take the attention weights at the second and third MemNN hops of the decoder, respectively: P_r = p^2_o and P_h = p^3_o. The output dimensions of P_h and P_r vary according to the length of the corresponding source sequences.
With the three distributions, the key issue is deciding which distribution should be chosen to generate the word w_i at the current time step. Intuitively, entity words are relatively important, so we set the selection priority order as P_r > P_h > P_v. Instead of using a gate function for selection (Gulcehre et al., 2016), we adopt the sentinel mechanism proposed in Madotto et al. (2018). If the expected word does not appear in the memories, P_h and P_r are trained to produce a sentinel token. When both P_h and P_r choose the sentinel token or a masked position, our model generates the token from P_v; otherwise, it takes the memory content using P_r or P_h.
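The source-selection rule can be sketched as follows (a sketch: the sentinel is modeled as a reserved index in each location softmax, and masked positions are assumed to have been zeroed out beforehand):

```python
def pick_word(P_r, P_h, P_v, sentinel_r, sentinel_h, ans_words, hist_words, vocab):
    """Choose the output word with priority P_r > P_h > P_v.

    P_r, P_h: location softmaxes over the retrieved answers / dialogue history,
              each with a reserved sentinel position.
    P_v:      softmax over the vocabulary.
    """
    i = max(range(len(P_r)), key=P_r.__getitem__)
    if i != sentinel_r:
        return ans_words[i]          # copy a pattern word from retrieved answers
    j = max(range(len(P_h)), key=P_h.__getitem__)
    if j != sentinel_h:
        return hist_words[j]         # copy an entity word from history / KB
    return vocab[max(range(len(P_v)), key=P_v.__getitem__)]  # generate from vocab

# Both pointers choose their sentinel (last index) -> fall back to the vocabulary.
word = pick_word(
    P_r=[0.1, 0.2, 0.7], P_h=[0.3, 0.1, 0.6], P_v=[0.2, 0.8],
    sentinel_r=2, sentinel_h=2,
    ans_words=["low", "of", "<s>"], hist_words=["carson", "tuesday", "<s>"],
    vocab=["the", "temperature"],
)
```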

Datasets
We use four public multi-turn task-oriented dialog datasets to evaluate our model: bAbI, In-Car Assistant, DSTC2 (Henderson et al., 2014) and CamRest (Wen et al., 2016). bAbI is automatically generated, and the other three datasets are collected from real human dialogs.
bAbI We use tasks 1-5 from the bAbI dialog corpus for restaurant reservation to verify the effectiveness of our model. For each task, there are 1000 dialogs for training, 1000 for development, and 1000 for testing. Tasks 1-2 test dialog management, checking whether the model can track the dialog state implicitly. Tasks 3-4 test whether the model can leverage the KB tuples for the task-oriented dialog system. Task 5 combines Tasks 1-4 to produce full dialogs.
In-Car Assistant This dataset consists of 3,031 multi-turn dialogs in three distinct domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogs have an average of 2.6 conversation turns, and the KB information is complicated. Following the data processing in Madotto et al. (2018), we obtain 2,425/302/304 dialogs for training/validation/testing, respectively.

DSTC2
The dialogs were extracted from the Dialogue State Tracking Challenge 2 for restaurant reservation. Following Bordes et al. (2017), we use merely the raw text of the dialogs and ignore the dialog state labels. In total, there are 1618 dialogs for training, 500 dialogs for validation, and 1117 dialogs for testing. Each dialog is composed of user and system utterances, and API calls to the domain-specific KB for the user's queries.
CamRest This dataset consists of 676 human-to-human dialogs in the restaurant reservation domain. It has many more conversation turns, with 5.1 turns on average. Following the data processing in Wen et al. (2017), we divide the dataset into training/validation/testing sets with 406/135/135 dialogs, respectively.

Implementation Detail
We use the 300-dimensional word2vec vectors to initialize the word embeddings. The size of the GRU hidden units is set to 256. The recurrent weight parameters are initialized as orthogonal matrices. We initialize the other weight parameters with the normal distribution N(0, 0.01) and set the bias terms to zero. We train our model with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1e-4. By tuning the hyperparameters with grid search over the validation sets, we find the other best settings of our model as follows. The number of hops for the memory network is set to 3, and gradients are clipped with a threshold of 10 to avoid explosion. In addition, we apply dropout (Hinton et al., 2012) as a regularizer to the input and output of the GRU, where the dropout rate is set to 0.4.

Baseline Models
We compare our model with several existing end-to-end task-oriented dialogue systems (part of the experimental results of the baseline models are directly extracted from the corresponding published papers):
• Retrieval method: This approach directly uses the retrieved result as the answer to the given utterance. Specifically, we use BERT-Base as a feature extractor for the sentences, use the cosine distance of the features as our retrieval scores, and then select the answer with the highest score.
• MemNN: An extended Seq2Seq model where the recurrence reads from an external memory multiple times before outputting the target word (Sukhbaatar et al., 2015).
• PtrUnk: An augmented sequence-tosequence model with attention based copy mechanism to copy unknown words during generation (Gulcehre et al., 2016).
• CASeq2Seq: A copy-augmented Seq2Seq model that learns attention weights over the dialogue history with a copy mechanism.
• Mem2Seq: A memory network based approach with multi-hop attention for attending over dialogue history and KB tuples (Madotto et al., 2018).
• BossNet: A bag-of-sequences memory architecture is proposed for disentangling language model from KB incorporation in task-oriented dialogues (Raghu et al., 2019).
• WMM2Seq: This method adopts a working memory to interact with two separated memory networks for dialogue history and KB entities (Chen et al., 2019).
• GLMP: This is an augmented memory based model with a global memory pointer and a local memory pointer to strengthen the model's copy ability (Wu et al., 2019).

Automatic Evaluation Metrics
On the bAbI dataset, we adopt a common metric, per-response accuracy (Bordes et al., 2017), to evaluate model performance. Following previous works (Madotto et al., 2018), for the three real human dialog datasets, we employ the bilingual evaluation understudy (BLEU) (Papineni et al., 2002) and Entity F1 scores to evaluate the model's ability to generate relevant entities from the knowledge base and to capture the semantics of the user-initiated dialogue flow.
BLEU We use BLEU to measure the n-gram (up to 4-gram) matching between the generated responses and the reference responses. Formally, let Y and Ŷ be the ground-truth and predicted responses, and let η(Ỹ, ·) denote the number of occurrences of an n-gram Ỹ in a response. We compute the clipped n-gram precision for the generated response Ŷ as p_n = Σ_Ỹ min(η(Ỹ, Y), η(Ỹ, Ŷ)) / Σ_Ỹ η(Ỹ, Ŷ), where Ỹ traverses all candidate n-grams of Ŷ. Given the precisions, the BLEU score is then calculated as BLEU = ν(Y, Ŷ) · exp(Σ_{n=1}^{4} β_n log p_n), where β_n = 1/4 is a weight and ν(Y, Ŷ) is a brevity penalty that penalizes short sentences. A higher BLEU score indicates better performance of the conversation system.
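A sentence-level sketch of this computation (equal weights β_n = 1/4, clipped counts, and the standard brevity penalty; corpus-level aggregation and smoothing are omitted for brevity):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=4):
    """Sentence-level BLEU with clipped n-gram precisions and brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        p_n = overlap / total
        if p_n == 0:
            return 0.0            # any zero precision zeroes the geometric mean
        log_p += math.log(p_n) / max_n      # beta_n = 1/4
    # Brevity penalty: 1 if the candidate is longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_p)

score = bleu("the temperature in carson will be low of 20f",
             "the temperature in carson will be low of 20f")
```

An identical candidate and reference yield all p_n = 1 and a brevity penalty of 1, so the score is 1.0.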

Per-response Accuracy
We adopt the per-response accuracy metric to evaluate the dialog system's capability of generating exact, correct responses. A generated response is considered right only if every word of the system output matches the corresponding word in the gold response. The final per-response accuracy score is the percentage of responses that are exactly the same as the corresponding gold responses. Per-response accuracy is a strict evaluation measure, which may only be suitable for the simulated dialog datasets.
Entity F1 The Entity F1 metric is used to measure the system's capability of generating relevant entities from the provided task-oriented knowledge base. Each utterance in the test set has a set of gold entities. The Entity F1 is computed by micro-averaging over all the generated responses.
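The micro-averaged Entity F1 can be sketched as follows (a sketch assuming the gold and predicted entities of each response are already extracted as sets):

```python
def entity_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over the entities of all generated responses."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # entities correctly generated
        fp += len(pred - gold)   # spurious entities
        fn += len(gold - pred)   # missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

f1 = entity_f1(
    gold_sets=[{"carson", "tuesday"}, {"starbucks"}],
    pred_sets=[{"carson", "tuesday"}, {"starbucks", "monday"}],
)
```

Micro-averaging pools the counts over all responses before computing precision and recall, so frequent entities weigh more than in a per-response macro average.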

Automatic Evaluation on Four Datasets
bAbI The dataset is automatically generated based on some rules, thus many requests and their corresponding replies are quite similar in terms of syntactic structure and wording. According to the results shown in Table 5, we can see that our model achieves the best per-response scores on all five tasks. It is also believed that the retrieved results contribute to guiding the response generation in this case, which can be inferred from the high threshold value (θ = 0.8).
In-Car Assistant Dataset As shown in Table 6

DSTC2 and CamRest Datasets
We also present the evaluation on DSTC2 and CamRest datasets in Table 8 and Table 9, respectively. By comparing the results, we can notice that our model performs better than the compared methods. On the DSTC2, our model achieves the state-of-the-art performance in terms of both Entity F1 score and BLEU metrics, and has a comparable per-response accuracy with compared methods. On the CamRest, our model obtains the best Entity F1 score but has a drop in BLEU in comparison to Mem2Seq model.

Ablation Study
An ablation study typically refers to removing some components or parts of the model, and seeing how that affects performance. To measure the influence of the individual components, we evaluate the proposed THPN model with each of them removed separately, and then measure the degradation of the overall performance. Table 7 reports ablation study results of THPN on bAbI and DSTC2 datasets by removing retrieved answers (w/o IR), removing EPN and PPN in decoding (w/o Ptr), removing answer-guided gating mechanism (w/o Gate), respectively. For example, "w/o Gate" means we do not use the answer-guided gating mechanism while keeping other components intact.
If the retrieved answer is not used, the performance drops dramatically, which can be interpreted as follows: without the guiding information from the retrieved answer, the decoder may deteriorate quickly once it produces a "bad" word, since it relies solely on the input query.
If no copy mechanism is used, we can see that the Entity F1 score is the lowest, which indicates that many entities are not generated, since these entity words may not be included in the vocabulary. Therefore, the best way to generate such unseen words is to copy them directly from the input query, which is consistent with the findings of previous work (Madotto et al., 2018).
If the gate is excluded, we can see around 2% drop for DSTC2. A possible reason is that some useless retrieved answers introduce "noise" to the system, which deteriorates the response generation.

Effect of Masking Operation
To validate the effectiveness of the masking operation, we carry out a comparison experiment on In-Car Assistant and present the results in Table 2. From Table 2, we can see that R_h^+ & R_r^+ achieves the best performance while R_h^- & R_r^- has the lowest scores. Digging into the experimental results, we find that if we do not mask EW in the retrieved answers, the model copies many incorrect entities from the retrieved answers, which reduces the Entity F1 scores. If we do not mask NEW in the history dialogue, the percentage of NEW copied from the history dialogue is high, most of which are unrelated to the gold answer, thus bringing down the BLEU score.

Analysis on Retrieved Results
Comparison of Different Retrieval Methods According to our preliminary experimental results, we observed that better retrieved candidate answers could further improve the overall model performance in response generation. Therefore, we also conduct experiments to evaluate the effectiveness of three popular text matching methods, including BM25 (Robertson et al., 2009), word2vec (Mikolov et al., 2013a) and BERT (Devlin et al., 2018). Here, BLEU is utilized as our evaluation criterion. From the experimental results shown in Table 4, we can see that using BERT (Devlin et al., 2018), a transformer-based pre-trained language model, achieves the highest BLEU scores. A possible reason is that the size of each training dataset is limited, so word co-occurrence based algorithms (e.g., BM25) may not capture the semantic information, thus resulting in poor retrieval performance.
One vs. Multiple Retrieved Answers Cosine similarity is not an absolute criterion, and there is no guarantee that a candidate with a higher cosine value will always provide more reference information for the response generation. Therefore, we conduct an experiment to investigate the effect of the number of retrieved answers. By setting different cosine threshold values θ, we retrieve different numbers of answer candidates. In particular, if no answer candidate satisfies the given threshold, we choose the one with the highest cosine value. To limit the number of retrieved answers, we only select the top-3 results if more than three answer candidates have higher cosine values than the given threshold θ. Table 3 gives the experimental results on the DSTC2 dataset under different threshold values θ. When θ is set to 1.0, it is considered a special case where only one answer is retrieved. We can observe that using multiple answer candidates yields higher performance than using only one result. Intuitively, the model will be misguided if the single retrieved answer has no relation to the given request, and using multiple candidate answers can ameliorate this issue.
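The candidate-selection rule described above (threshold θ, top-3 cap, fallback to the single best answer) can be sketched as:

```python
def select_candidates(scored, theta, top_k=3):
    """Keep answers whose cosine score reaches theta, capped at top_k;
    fall back to the single best answer if none passes the threshold.

    scored: list of (answer, cosine_score) pairs.
    """
    ranked = sorted(scored, key=lambda x: x[1], reverse=True)
    kept = [a for a, s in ranked if s >= theta][:top_k]
    return kept if kept else [ranked[0][0]]

cands = [("a1", 0.92), ("a2", 0.85), ("a3", 0.81), ("a4", 0.79)]
picked = select_candidates(cands, theta=0.8)
```

With θ = 0.8 this keeps the three candidates above the threshold; with a very high θ (e.g., 0.95) it falls back to the single best-scoring answer.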
Setting of θ Although using more retrieved answers might improve the chance of including the relevant information, it may also bring more "noise" and adversely affect the quality of the retrieved answers. From Table 3, we can see that as the value of θ decreases, the average number of retrieved candidate answers increases, but the model performance does not improve accordingly. Experimental results on the other datasets demonstrate that θ is not fixed and needs to be adjusted according to the experimental data.

Conclusion
In task-oriented dialog systems, the words and sentence structures are relatively limited and fixed, thus it is intuitive that the retrieved results can provide valuable information in guiding the response generation. In this paper, we retrieve several potentially relevant answers from a pre-constructed domain-specific conversation repository as guidance answers, and incorporate the guidance answers into both the encoding and decoding processes. We copy the words from the previous context and the retrieved answers directly, and generate words from the vocabulary. Experimental results over four datasets have demonstrated the effectiveness of our model in generating informative responses. In the future, we plan to leverage the dialogue context information to retrieve candidate answers turn by turn in multi-turn scenarios.