Contextualize Knowledge Bases with Transformer for End-to-end Task-Oriented Dialogue Systems

Incorporating knowledge bases (KB) into end-to-end task-oriented dialogue systems is challenging, since it requires properly representing each KB entity, which is associated with both its KB context and the dialogue context. Existing works represent the entity by perceiving only a part of its KB context, which can lead to less effective representations due to information loss and adversely affect KB reasoning and response generation. To tackle this issue, we explore fully contextualizing the entity representation by dynamically perceiving all the relevant entities and the dialogue history. To achieve this, we propose a COntext-aware Memory Enhanced Transformer framework (COMET), which treats the KB as a sequence and leverages a novel Memory Mask to force each entity to focus only on its relevant entities and the dialogue history, while avoiding distraction from irrelevant entities. Through extensive experiments, we show that our COMET framework achieves superior performance over the state of the art.


Introduction
Task-oriented dialogue systems aim to achieve specific goals such as hotel booking and restaurant reservation. The traditional pipelines (Young et al., 2013; Wen et al., 2017) consist of natural language understanding, dialogue management, and natural language generation modules. However, designing these modules often requires additional annotations such as dialogue states. To simplify this procedure, end-to-end dialogue systems have been proposed to incorporate the KB (normally relational databases) into the learning framework, where the KB and dialogue history can be directly modeled for response generation, without explicit dialogue states or dialogue actions.
Table 1 (caption, partially recovered): The top part shows the entities in the KB and the bottom part shows a two-turn dialogue between the user and the system.
An example of the end-to-end dialogue systems is shown in Tab. 1. When generating the second response about the "traffic info": (1) the targeted entity "no traffic" is associated with its same-row entities (KB context) like "Tom's house", "friend's house" and "6 miles". These entities help enrich the information in its representation and model the structure of the KB. (2) The entity is also related to the dialogue history (dialogue context), which provides clues about the goal-related row (like "Tom's house" and "580 Van Ness Ave" in the first response). These clues can be leveraged to further enhance the corresponding representations and activate the targeted row, which benefits the retrieval of "no traffic". Therefore, how to fully contextualize the entity with its KB and dialogue contexts is the key point of end-to-end dialogue systems (Madotto et al., 2018; Qin et al., 2020), where the full-context enhanced entity representation can make reasoning over the KB and response generation much easier.
However, existing works contextualize the entity by perceiving only parts of its KB context and ignore the dialogue context: (1) (Madotto et al., 2018; Qin et al., 2020) represent an entity as a triplet (cf. Fig. 1(a)), i.e., (Subject, Relation, Object). However, breaking one row into several triplets can only model the relation between two entities, whereas the information from the other same-row entities and the dialogue history is ignored.
Figure 1: Four ways to represent the KB, where $e_{i,j}$ denotes the entity representation for the $j$-th entity of the $i$-th row; $R_i$ denotes the row representation of the $i$-th row; $e_{\cdot,j}$ denotes the entities shared between different rows, like "no traffic" in Tab. 1; $D$ denotes the dialogue context. Note that the three existing representations (a-c) only consider parts of the KB context and ignore the dialogue context, whereas our method (d) can fully contextualize the entity with both of them.
(2) (Gangi Reddy et al., 2019; Qin et al., 2019) represent the KB in a hierarchical way, i.e., with row-level and entity-level representations (cf. Fig. 1(b)). This representation only partially alleviates the issue, and only at the row level. At the entity level, the entity can only perceive the information of itself and is isolated from the other KB and dialogue contexts.
(3) (Yang et al., 2020) convert the KB to a graph (cf. Fig. 1(c)). However, they fail to answer what the optimal graph structure for the KB is, which indicates that their graph structure may need manual design. Also, the dialogue context is not encoded into the entity representation, which can also lead to a suboptimal entity representation.
To sum up, these existing methods cannot fully contextualize the entity, which leads to vulnerable KB reasoning and response generation.
In this work, we propose the COntext-aware Memory Enhanced Transformer (COMET), which provides a unified solution to fully contextualize the entity with the awareness of both the KB and dialogue contexts (shown in Fig. 1(d)). The key idea of COMET is that a Memory-Masked Encoder is used to encode the entity sequence of the KB, along with the information of the dialogue history. The designed Memory Mask ensures that the entity can only interact with its same-row entities and the information in the dialogue history, whereas distractions from other rows are prohibited.
More specifically, (1) for the KB context, we represent the entities in the same row as a sequence. Then, a Transformer Encoder (Vaswani et al., 2017) is leveraged to encode them, where the same-row entities can interact with each other. Furthermore, to retain the structure of KB and avoid the distractions from the entities in different rows, we design a Memory Mask (shown in Fig. 3) and incorporate it into the encoder, which only allows the interactions between the same-row entities.
(2) For the dialogue context, we create a Summary Representation (Sum. Rep) to summarize the dialogue history, which is input into the encoder to interact with the entity representations (gray block in Fig. 2). We also utilize the Memory Mask to let the Sum. Rep attend to all of the entities for better entity representations, which will serve as the context-aware memory for further response generation.
By doing so, we essentially extend the KB entity to an $(N + 1)$-tuple representation, where $N$ is the number of entities in one row and "1" is for the Sum. Rep of the dialogue history. By leveraging the KB and dialogue contexts, our method can effectively model the information existing in the KB and activate the goal-related entities, which benefits entity retrieval and response generation. Please note that fully contextualizing the entity is achieved in a unified way by the designed Memory Mask scheme, which is the key of our work.
We conduct extensive experiments on two public benchmarks, i.e., SMD (Madotto et al., 2018) and Multi-WOZ 2.1 (Budzianowski et al., 2018; Yang et al., 2020). The experimental results demonstrate significant performance gains over the state of the art. This validates that contextualizing the KB with a Transformer benefits entity retrieval and response generation.
In summary, our contributions are as follows:
• To the best of our knowledge, we are the first to fully contextualize the entity representation with both the KB and dialogue contexts, for end-to-end task-oriented dialogue systems.
• [...] awareness of both the relevant entities and dialogue history.
• Extensive experiments demonstrate that our method achieves state-of-the-art performance.

Methodology
In this section, we first introduce the general workflow for this task. Then, we elaborate on each part of COMET, i.e., the Dialogue History Encoder, Context-aware Memory Generation, and Response Generation Decoder (as depicted in Fig. 2). Finally, the objective function will be introduced.

General Workflow
Given a dialogue history with $k$ turns, denoted as $H = \{u_1, s_1, u_2, s_2, ..., u_k\}$ (where $u_i$ and $s_i$ denote the $i$-th turn utterances of the user and the system, respectively), the goal of the dialogue system is to generate the $k$-th system response $s_k$ with an external KB $B$, which has $r$ rows and $c$ columns. Formally, this procedure consists of three steps: we first derive the dialogue history representation (Section 2.2), then generate the Context-aware Memory, a.k.a. the contextualized entity representation (Section 2.3), and finally use these two parts to generate the response $s_k$ (Section 2.4).

Dialogue History Encoder
We first transform $H$ into a word-by-word sequence $\hat{H}$ that includes a special token [SUM], which is used to globally aggregate information from $H$. Then, $\hat{H}$ is encoded by a standard Transformer Encoder to generate the dialogue history representation $H^{enc}_N$, where $H^{enc}_{N,1}$ is denoted as the Summary Representation (Sum. Rep) of the dialogue history. It will be used to make the memory aware of the dialogue context.

Context-aware Memory Generation
In this subsection, we describe how to "fully contextualize KB". That is, the Memory Mask is leveraged to make each KB entity aware of all of its related entities and the dialogue history, which is the key contribution of our method.

Memory Generation
Different from existing works, which fail to capture all the useful context information for the entity representation, we treat the KB as a sequence, along with the Sum. Rep. Then, a Transformer Encoder with the Memory Mask is utilized to model it, which dynamically generates the entity representation with the awareness of all its favorable contexts, i.e., the same-row entities and the dialogue history, while blocking distraction from irrelevant entities. The procedure of memory generation is as follows.
Firstly, the entities in the KB $B$ are flattened into a memory sequence, i.e., $M = [b_{11}, ..., b_{1c}, ..., b_{r1}, ..., b_{rc}] = [m_1, m_2, ..., m_{|M|}]$, where each memory entity $m_i$ is an entity of the KB belonging to one row (row $\lceil i/c \rceil$ under this row-major ordering). By doing so, the Memory-Masked Transformer Encoder can let the same-row entities interact with each other while retaining the structure information of the KB. Then, $M$ is transformed into the entity embeddings, i.e., $E = [e_{m_1}, ..., e_{m_{|M|}}]$, where $e_{m_i}$ corresponds to $m_i$ in $M$ and is the sum of the word embedding $u_i$ and the type embedding $t_i$, i.e., $e_{m_i} = u_i + t_i$. Note that the entity types are the corresponding column names, e.g., "poi_type" in Table 1. For entities which have more than one token, we simply treat them as one word, e.g., "Stanford Exp" → "Stanford_Exp".
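The flattening step can be illustrated with a minimal sketch. The function name `flatten_kb`, the toy rows, and the column names other than "poi_type" are assumptions made for illustration and are not taken from the released implementation; the sketch only shows the row-major ordering, the use of column names as entity types, and the joining of multi-token entities into one word.

```python
from typing import Dict, List, Tuple


def flatten_kb(
    rows: List[Dict[str, str]], columns: List[str]
) -> Tuple[List[str], List[str], List[int]]:
    """Flatten a KB table row by row into a memory sequence.

    Returns the entity words, their types (column names), and the row
    index of every memory position.
    """
    entities: List[str] = []
    types: List[str] = []
    row_ids: List[int] = []
    for r, row in enumerate(rows):
        for col in columns:
            # A multi-token entity is treated as one word: "Stanford Exp" -> "Stanford_Exp".
            entities.append("_".join(row[col].split()))
            types.append(col)  # the entity type is the column name
            row_ids.append(r)  # remembered so a row-wise mask can be built later
    return entities, types, row_ids


# Toy KB (hypothetical values, for illustration only).
toy_kb = [
    {"poi": "Stanford Exp", "poi_type": "hospital", "traffic_info": "no traffic"},
    {"poi": "Tom's house", "poi_type": "friend's house", "traffic_info": "no traffic"},
]
toy_columns = ["poi", "poi_type", "traffic_info"]
toy_entities, toy_types, toy_row_ids = flatten_kb(toy_kb, toy_columns)
```

Keeping `row_ids` alongside the flattened sequence is a design convenience of this sketch: it preserves the row structure that the flattening itself discards.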
Next, the entity embeddings are concatenated with the Sum. Rep $H^{enc}_{N,1}$ from the Dialogue History Encoder to form $E_0$. The purpose of introducing $H^{enc}_{N,1}$ is to pass information from the dialogue history and further enhance the entity representation with the dialogue context.
Finally, $E_0$ and the Memory Mask $M^{mem}$ are used as the input of the Transformer Encoder ($\mathrm{tf\_enc}(\cdot)$) to generate the context-aware memory (a.k.a. the contextualized entity representation) $E_K$, where $K$ is the total number of Transformer Encoder layers. $E_K \in \mathbb{R}^{(|M|+1) \times d_m}$ is the generated memory, which is queried for entity retrieval when generating the response.

Memory Mask Construction
We design a special Memory Mask scheme to take ALL the contexts grounded by the entity into account, where the Memory Mask ensures that the entity can only attend to its context part, which is the key contribution of this work. This is in contrast to the standard Transformer Encoder, where each entity can attend to all of the other entities. The rationale of our design is that, by doing so, we can avoid the noisy distraction of the non-context part.
A detailed illustration of the Memory Mask construction is shown in Fig. 3. With this designed Memory Mask, a masked attention mechanism is leveraged to make the entity attend only to the entities within the same row and the Sum. Rep.
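A minimal sketch of such a mask is given below. It assumes that the Sum. Rep occupies position 0 of the sequence (the ordering is an assumption of this sketch), that entities may attend to same-row entities and to the Sum. Rep, and that the Sum. Rep may attend to every position. The function name `build_memory_mask` is hypothetical.

```python
from typing import List


def build_memory_mask(row_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Position 0 is assumed to hold the Sum. Rep; position p >= 1 holds the
    memory entity whose KB row index is row_ids[p - 1].
    """
    n = len(row_ids) + 1
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == 0 or j == 0:
                # The Sum. Rep sees every position, and every entity sees the Sum. Rep.
                mask[i][j] = True
            else:
                # Entities interact only with entities of the same KB row.
                mask[i][j] = row_ids[i - 1] == row_ids[j - 1]
    return mask


# Two rows with two entities each -> a 5 x 5 mask.
example_mask = build_memory_mask([0, 0, 1, 1])
```

Note that here True means "allowed". A boolean `attn_mask` in `torch.nn.MultiheadAttention` uses the opposite convention (True marks positions that are not allowed to attend), so the mask would need to be inverted before being passed to that module.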

Response Generation Decoder
Given the dialogue history representation $H^{enc}_N$ and the generated memory $E_K$, the decoder uses them to generate the response for a specific query. In COMET, we use a modified Transformer Decoder, which has two cross-attention modules to model the information in $H^{enc}_N$ and $E_K$, respectively. Then, a gate mechanism is leveraged to adaptively fuse $H^{enc}_N$ and $E_K$ for the decoder, so that the response generation is tightly anchored by them.
Following (Qin et al., 2020; Yang et al., 2020), we first generate a sketch response that replaces the exact slot values with sketch tags. Then, the decoder links the entities in the memory to their corresponding slots.

Sketch Response Generation
For the $k$-th turn, the sketch response being generated, $Y = [y_1, ..., y_{t-1}]$, is converted to the word representation $H^{dec}_0$, which is built from $v_i$ and $p_i$, the word embedding and absolute position embedding of the $i$-th token in $Y$.
Afterward, $N$ stacked decoder layers are applied to decode the next token with the inputs $H^{dec}_0$, $E_K$ and $H^{enc}_N$. In each decoder layer, the inputs $\{Q, K, V, M\}$ of the Multi-Head Attention $\mathrm{MHA}(Q, K, V, M)$ are the query, key, value, and optional attention mask, and $\mathrm{FFN}(\cdot)$ denotes the Feed-Forward Network. $M^{dec}$ is the decoder mask, which ensures that a decoded word can only attend to the previous words. $\mathrm{FC}(\cdot)$ is a fully-connected layer that generates the gating signals by mapping a $d_m$-dimensional feature to a scalar. $N$ is the total number of decoder layers.
After obtaining the final $H^{dec}_N$, the posterior distribution for the $t$-th token, $p^v_t \in \mathbb{R}^{|V|}$ ($|V|$ denotes the vocabulary size), is calculated from $H^{dec}_N$.
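As a rough illustration of the gate, the sketch below fuses two $d_m$-dimensional features with a scalar gate produced by a fully-connected map followed by a sigmoid. The exact input of the fully-connected layer and the fusion form (a convex combination) are assumptions of this sketch, and the names `gated_fuse`, `hist_feat` and `mem_feat` are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(
    hist_feat: List[float],
    mem_feat: List[float],
    weight: List[float],
    bias: float = 0.0,
) -> Tuple[float, List[float]]:
    """Fuse the history-attended and memory-attended features with a scalar gate.

    Assumption: the fully-connected layer (d_m -> scalar) is applied to the
    element-wise sum of the two features, and the gate mixes them convexly.
    """
    summed = [h + m for h, m in zip(hist_feat, mem_feat)]
    gate = sigmoid(sum(w * s for w, s in zip(weight, summed)) + bias)
    fused = [gate * h + (1.0 - gate) * m for h, m in zip(hist_feat, mem_feat)]
    return gate, fused


# With zero weights the gate is 0.5, so the two sources are averaged.
gate_value, fused_feat = gated_fuse([1.0, 2.0], [3.0, 4.0], weight=[0.0, 0.0])
```

A scalar gate keeps the fusion cheap: one number per decoding position decides how much the next token relies on the dialogue history versus the KB memory.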

Entity Linking
After the sketch response generation, we replace the sketch tags with the entities in the context-aware memory. We denote the representation from the decoder at the $t$-th time step, i.e., the $t$-th token, as $H^{dec}_{N,t}$, and denote the set of time steps at which sketch tags need to be replaced with entities as $T$. The probability distribution over all possible linked entities, $p^s_t$, can then be calculated from $H^{dec}_{N,t}$ and $E_K$, where $E_K$ is the final generated memory.
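One plausible way to turn a decoder state and the memory into a linking distribution is a dot-product score followed by a softmax. The sketch below assumes exactly that scoring function, which is an assumption of this example rather than a statement of the paper's formula; `link_entity` is a hypothetical name.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def link_entity(
    dec_state: List[float], memory: List[List[float]]
) -> Tuple[List[float], int]:
    """Score every memory slot against one decoder state and pick the best.

    Assumption: the score is a plain dot product between the decoder state
    and each memory vector.
    """
    scores = [sum(d * e for d, e in zip(dec_state, slot)) for slot in memory]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, best


# Toy memory with three slots of dimension 2.
toy_probs, toy_best = link_entity([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
```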

Objective Function
For the training process of COMET, we use the cross-entropy loss to supervise the response generation and entity linking (see footnote 5). Moreover, we propose an additional regularization term to further regularize $p^s_t$. The regularization is based on the prior knowledge that, for a given response, only a small subset of entities should be linked. Formally, we construct the entity linking probability matrix $P^s = [p^s_{t_1}, p^s_{t_2}, ..., p^s_{t_{|T|}}]$ and minimize its $L_{2,1}$-norm (Nie et al., 2010), $\|P^s\|_{2,1} = \sum_i \sqrt{\sum_{t \in T} (p^s_{t,i})^2}$, where $p^s_{t,i}$ denotes the $i$-th dimension of $p^s_t$. This regularization term encourages the network to select a small subset of entities to generate the response. The same idea has been investigated in (Nie et al., 2010) for multi-class feature selection.
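The effect of the regularizer can be checked numerically with a small sketch. It assumes the standard row-wise definition of the $L_{2,1}$-norm, with one row per entity and one column per linking step: the sum over entities of the L2 norm of that entity's probabilities across steps. The function name `l21_norm` and the toy distributions are illustrative only.

```python
import math
from typing import List


def l21_norm(link_probs: List[List[float]]) -> float:
    """L2,1-norm of the entity linking probabilities.

    link_probs[t][i] is the probability of linking entity i at step t.
    Assumption: the norm sums, over entities i, the L2 norm of entity i's
    probabilities across all linking steps.
    """
    if not link_probs:
        return 0.0
    num_entities = len(link_probs[0])
    return sum(
        math.sqrt(sum(step[i] ** 2 for step in link_probs))
        for i in range(num_entities)
    )


# Two linking steps over three entities.
concentrated = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # both steps pick entity 0
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # steps pick different entities
penalty_concentrated = l21_norm(concentrated)
penalty_spread = l21_norm(spread)
```

In this toy case the concentrated assignment yields a smaller penalty than the spread one, which matches the stated goal of selecting a small subset of entities per response.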
Finally, COMET is trained by jointly minimizing the combination of the above three losses.

Datasets
Two public multi-turn task-oriented dialogue datasets are used to evaluate our model, i.e., SMD and Multi-WOZ 2.1 (Budzianowski et al., 2018). Note that, for Multi-WOZ 2.1, to accommodate the end-to-end setting, we use the revised version released by (Yang et al., 2020), which equips every dialogue with the corresponding KB. We follow the same partition as (Madotto et al., 2018) on SMD and (Yang et al., 2020) on Multi-WOZ 2.1.

Experimental Settings
The dimensions of the embeddings and hidden vectors are all set to 512. The number of layers ($N$) in the Dialogue History Encoder and the Response Generation Decoder is set to 6. The number of layers for Context-aware Memory Generation ($K$) is set to 3. The number of heads in each part of COMET is set to 8. A greedy strategy is used without beam search during decoding. The Adam optimizer (Kingma ...
Footnote 5: The label construction procedure of the entity linking module can be found in Appendix A.

Baselines
We compare COMET with the following methods:
• Mem2Seq (Triplet) (Madotto et al., 2018): Mem2Seq incorporates the multi-hop attention mechanism in memory networks into the pointer networks.
• KB-retriever: KB-retriever improves the entity-consistency by first selecting the target row and then picking the relevant column in this row.
• GLMP (Triplet): GLMP uses a global memory encoder and a local memory decoder to incorporate the external knowledge into the learning framework.
• DF-Net (Triplet) (Qin et al., 2020): DF-Net applies a dynamic fusion mechanism to transfer knowledge in different domains.
• GraphDialog (Graph) (Yang et al., 2020): GraphDialog exploits the graph structural information in the KB and in the dependency parsing tree of the dialogue.

Results
Following the existing works (Qin et al., 2020; Yang et al., 2020), we use the BLEU and Entity F1 metrics to evaluate model performance. The results are shown in Tab. 2. We observe that COMET achieves the best performance on both datasets, which indicates that our COMET framework can better leverage the information in the dialogue history and the external KB to generate more fluent responses with more accurately linked entities. Specifically, for the BLEU score, it outperforms the previous methods by at least 2.9% on the SMD dataset and 2.1% on the Multi-WOZ 2.1 dataset. COMET also achieves the highest Entity F1 score on both datasets, with improvements of 0.9% and 7.3% on the SMD and Multi-WOZ 2.1 datasets, respectively. In each domain of the two datasets, improvement or competitive performance can be clearly observed.
The results indicate the superiority of our COMET framework.
Notably, KB-Transformer (E. et al., 2019) also leverages the Transformer, but our COMET outperforms it by a large margin. On the SMD dataset, the BLEU score of COMET is higher than that of KB-Transformer by 3.4%, and the improvement on the Entity F1 score is as large as 26.5%. This shows that naively introducing the Transformer into an end-to-end dialogue system will not necessarily lead to higher performance. A careful design of the whole dialogue system, such as our proposed one, plays a vital role.

Ablation Study
In this subsection, we first investigate the effects of the different components, i.e., the Memory Mask, Sum. Rep, gate mechanism, and $L_{2,1}$-norm regularization (Tab. 3). Then, we design careful experiments to further demonstrate the effect of the Memory Mask, which is the key contribution of this work: (1) we replace the context-aware memory of COMET with the three existing representations of KB (i.e., triplet, row-entity, and graph) to show the superiority of the fully contextualized entity (Tab. 4); (2) we also replace our Memory Mask with full attention layer by layer, which further shows the importance of our Memory Mask (Tab. 5). Our ablation studies are based on the SMD dataset.
The effects of the key components in the COMET framework are reported in Tab. 3. As observed, when any key component of COMET is removed, both the BLEU and Entity F1 metrics degrade to some extent. More specifically: (1) If the Memory Mask is removed, the Entity F1 score drops to 49.6. This significant discrepancy demonstrates the importance of restricting self-attention as our designed Memory Mask does.
(2) For the variant without the Sum. Rep, the Entity F1 score drops to 61.4. This indicates the effectiveness of contextualizing the KB with the dialogue history, which can further boost the performance. (3) We also remove the gate and only use the information from the dialogue history ($H^{enc}_N$) or the memory ($E_K$). The former case only achieves an Entity F1 score of 61.1, while the latter achieves 61.4. This shows that using the gate mechanism to fuse both information sources is helpful for entity linking. (4) When the $L_{2,1}$-norm is removed, the performance also drops to 62.3, which means that regularizing the entity-linking distribution can further benefit the performance.
Table 4: The performance of replacing the context-aware memory with Triplet, Row-Ent and Graph representations in COMET. Note that in the second row, we also report the result of a variant which only considers the KB context and ignores the dialogue context.
We also replace our context-aware memory with other ways of representing the KB, while the other parts of our framework are kept unchanged (see footnote 8). The result is reported in Tab. 4. After replacing our context-aware memory with the three existing representations of KB, the performance drops substantially in all the metrics: the BLEU score drops by at least 2.4% and the Entity F1 score by at least 3.8%. Besides, the result of the variant which only considers the KB context (i.e., w/o Sum. Rep) is also reported, so as to compare more fairly with the aforementioned KB representations. The result shows that, even when only the KB context is considered, our method still outperforms the other KB representations by at least 1.6% in Entity F1. This further indicates that fully contextualizing the entity with its relevant entities and the dialogue history can better represent the KB for dialogue systems.
Footnote 8: The implementation details are in Appendix A.3.
We also conduct an experiment which replaces the Memory Mask with full attention, layer by layer. That is, the first (n-k) layers use the proposed Memory Mask (M) and the last k layers use full attention (F). As shown in Tab. 5, the more full attention is added, the more the performance of COMET drops in all of the metrics, since full attention introduces too much distraction from other rows. The result further indicates that the Memory Mask is indeed a better choice, as it takes the inductive bias of the KB into account.
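The layer-wise replacement can be written down as a simple schedule. The sketch below is a hypothetical helper (`mask_schedule` is not a name from the paper) that lists, for each memory encoder layer, whether the Memory Mask ("M") or full attention ("F") is applied; the value of three layers follows the experimental settings.

```python
from typing import List


def mask_schedule(num_layers: int, num_full: int) -> List[str]:
    """Per-layer attention type: the first (num_layers - num_full) layers use
    the Memory Mask ("M") and the last num_full layers use full attention ("F")."""
    if not 0 <= num_full <= num_layers:
        raise ValueError("num_full must lie between 0 and num_layers")
    return ["M"] * (num_layers - num_full) + ["F"] * num_full


# All variants for a three-layer memory encoder, from fully masked to fully unmasked.
variants = ["".join(mask_schedule(3, k)) for k in range(4)]
```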
Note that we also explore other Memory Mask schemes, but these schemes can not further boost the performance, where the results are omitted due to the page limitation. For further improvement, more advanced techniques like Pre-trained Model (Devlin et al., 2018;Radford et al., 2019) may be needed to deeply understand the dialogue and KB context, which we leave for future work.

Case Study
To demonstrate the superiority of our method, several examples from the SMD test set, generated by our COMET and the existing state-of-the-art methods GLMP and DF-Net (Qin et al., 2020), are given in Tab. 6. As reported, compared with GLMP and DF-Net, COMET can generate more fluent, informative, and accurate responses.
Table 6: Responses generated by our COMET, GLMP and DF-Net (Qin et al., 2020) from the SMD dataset. Goal means the row that the user queries. ✓ and × mean that the right or wrong entity is linked.
Specifically, in the first example, GLMP and DF-Net lack the necessary information "11am" or provide the wrong entity "5pm", whereas COMET obtains all the correct entities and is therefore more informative. In the second example, our method generates the response with the right "distance" information, but GLMP and DF-Net cannot. In the third example, GLMP and DF-Net cannot even generate a fluent response, let alone the correct temperature information, whereas COMET still performs well. The fourth example is more interesting: the user queries information about "starbucks", which does not exist in the current KB. GLMP and DF-Net both fail to respond faithfully, whereas COMET can better reason over the KB to generate the right response and even provide an alternative option.

Related Work
Task-oriented dialogue systems can be mainly categorized into two types: modularized (Williams and Young, 2007; Wen et al., 2017) and end-to-end. Among the end-to-end task-oriented dialogue systems, the earliest end-to-end method can only link to the entities in the dialogue context, and no KB is incorporated. To effectively incorporate the external KB, a key-value retrieval mechanism was proposed to sustain grounded multi-domain discourse. (Madotto et al., 2018) augment the dialogue systems with end-to-end memory networks (Sukhbaatar et al., 2015). Another work models a dialogue state as a fixed-size distributed representation and uses this representation to query the KB. (Lei et al., 2018) design belief spans to track dialogue beliefs, allowing task-oriented dialogue systems to be modeled in a sequence-to-sequence way. (Gangi Reddy et al., 2019) propose a multi-level memory to better leverage the external KB. GLMP, a global-to-local memory pointer network, is proposed to reduce the noise caused by the KB. (Lin et al., 2019) propose Heterogeneous Memory Networks to handle heterogeneous information from different sources. (Qin et al., 2020) propose a dynamic fusion mechanism to transfer knowledge among different domains. (Yang et al., 2020) exploit the graph structural information in the KB and the dialogue. Other works explore how to combine Pre-trained Models (Devlin et al., 2018; Radford et al., 2019) with end-to-end task-oriented dialogue systems. (Madotto et al., 2020a) directly embed the KB into the parameters of GPT-2 (Radford et al., 2019) via fine-tuning. (Madotto et al., 2020b) propose a dialogue model that is built with a fixed pre-trained conversational model and multiple trainable light-weight adapters.
We also notice that some existing works combine the Transformer with a memory component, e.g., (Ma et al., 2021). However, our method differs from them, since existing works like (Ma et al., 2021) simply inject the memory component into the Transformer. In contrast, inspired by the dynamic generation mechanism (Gou et al., 2020), the memory in COMET (i.e., the entity representation) is dynamically generated by fully contextualizing the KB and dialogue context via the Memory-Masked Transformer.

Conclusion
In this work, we propose a novel COntext-aware Memory Enhanced Transformer (COMET) for the end-to-end task-oriented dialogue systems. By the designed Memory Mask scheme, COMET can fully contextualize the entity with all its KB and dialogue contexts, and generate the (N + 1)-tuple representations of the entities. The generated entity representations can further augment the framework and lead to better capabilities of response generation and entity linking. The extensive experiments demonstrate the effectiveness of our method.

A.1 Label Construction of Entity Linking
In practice, the datasets do not provide the golden linked entities. However, we can obtain pseudo annotations by following (Qin et al., 2019) and using a distant supervision method. Specifically, we match the entities in the golden response against the entities in the memory $M$ and use the matching result as the golden entity. For entities like "no_traffic", one may find matches in multiple rows. We resolve this ambiguity by choosing the entity from the row which has the most matches for all entities in the utterances.
Following prior work, we randomly mask a small number of entities with an unknown token to improve the generalization of our model. Besides, in the sketch generation and entity linking stages, we also use label smoothing to regularize the model. The hyper-parameters such as the dropout rate are tuned on the development set by grid search (using Entity F1 for both datasets). The model is implemented in PyTorch. The hyper-parameters used for the two datasets are shown in Tab. 7.
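The distant-supervision matching for entity-linking labels (Appendix A.1) can be sketched as follows. The function name `build_link_labels`, the token-level exact matching, and the tie-breaking rule (the first candidate wins) are assumptions of this sketch; only the rule of preferring the row with the most matched entities comes from the description.

```python
from typing import Dict, List, Optional


def build_link_labels(
    response: List[str],
    context: List[str],
    memory: List[str],
    row_ids: List[int],
) -> List[Optional[int]]:
    """For each response token, return the memory index of its pseudo-gold
    entity, or None when the token is not a KB entity."""
    # Count, per KB row, how many of its entities occur in the utterances.
    seen = set(context) | set(response)
    row_hits: Dict[int, int] = {}
    for entity, row in zip(memory, row_ids):
        if entity in seen:
            row_hits[row] = row_hits.get(row, 0) + 1

    labels: List[Optional[int]] = []
    for token in response:
        candidates = [i for i, entity in enumerate(memory) if entity == token]
        if not candidates:
            labels.append(None)
        else:
            # Ambiguous entities are resolved by the row with the most matches.
            labels.append(max(candidates, key=lambda i: row_hits.get(row_ids[i], 0)))
    return labels


# Toy memory with two rows; "no_traffic" appears in both (hypothetical values).
toy_memory = ["Stanford_Exp", "hospital", "no_traffic",
              "Tom's_house", "friend's_house", "no_traffic"]
toy_rows = [0, 0, 0, 1, 1, 1]
toy_labels = build_link_labels(
    response=["Tom's_house", "has", "no_traffic"],
    context=["where", "is", "Tom's_house"],
    memory=toy_memory,
    row_ids=toy_rows,
)
```

In the toy example the second row matches two entities and the first row only one, so the ambiguous "no_traffic" is linked to the copy in the second row.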

A.3 Implementation Details of Other KB Representations with Transformer
To further compare the different methods of representing the KB with our method, we also adopt the triplet, row-entity, and graph representations to replace our contextualized entity representation, where we keep the other parts of COMET unchanged. Specifically, for the triplet representation, we follow (Madotto et al., 2018; Qin et al., 2020) to implement Transformer+Triplet, where the entity representation is the sum of the subject, relation, and object. Besides, multi-hop reasoning (Sukhbaatar et al., 2015) is leveraged to further boost the performance. For the row-entity representation, we refer to (Gangi Reddy et al., 2019; Qin et al., 2019) to implement Transformer+Row&Ent, where the bag-of-words embedding and the entity-type embedding are used for the row-level representation and the entity-level representation. Besides, the row-level and entity-level representations are hierarchically queried, where the distribution of the entity-level embedding is used for response generation. For the graph representation, we adopt the memory part of GraphDialog (Yang et al., 2020) to implement Transformer+Graph, where the entity embedding is further augmented by Graph Neural Networks (Veličković et al., 2018). Besides, the last hop of the triplet and graph representations, and the entity-level representation of the Row&Ent representation, are also used to adaptively fuse the information of the KB in the Decoder of COMET. More details can be found in the aforementioned papers.