Constraint based Knowledge Base Distillation in End-to-End Task Oriented Dialogs

End-to-End task-oriented dialogue systems generate responses based on dialog history and an accompanying knowledge base (KB). Inferring the KB entities that are most relevant for an utterance is crucial for response generation. Existing state of the art scales to large KBs by softly filtering over irrelevant KB information. In this paper, we propose a novel filtering technique that consists of (1) a pairwise similarity based filter that identifies relevant information by respecting the n-ary structure in a KB record, and (2) an auxiliary loss that helps in separating contextually unrelated KB information. We also propose a new metric -- multiset entity F1 -- which fixes a correctness issue in the existing entity F1 metric. Experimental results on three publicly available task-oriented dialog datasets show that our proposed approach outperforms existing state-of-the-art models.


Introduction
Task oriented dialog systems interact with users to achieve specific goals such as restaurant reservation or calendar enquiry. To satisfy a user goal, the system is expected to retrieve necessary information from a knowledge base and convey it using natural language. Recently, several end-to-end approaches (Bordes and Weston, 2017; Wu et al., 2018; He et al., 2020b; Madotto et al., 2018) have been proposed for learning these dialog systems.
Inferring the most relevant KB entities necessary for generating the response is crucial for achieving task success. To effectively scale to large KBs, existing approaches (Wu et al., 2018) distill the KB by softly filtering irrelevant KB information based on the dialog history. For example, in Figure 1 the ideal filtering technique is expected to filter just row 1, as the driver is requesting information about the dinner with Alex. But existing techniques often filter some irrelevant KB information along with the relevant KB information. For example, in Figure 1 row 3 may also get filtered along with row 1. Our analysis of the best performing distillation technique (Wu et al., 2018) revealed that the embeddings learnt for entities of the same type are quite close to each other. This may be due to entities of the same type often appearing in similar contexts in the history and the KB. Such embeddings hurt the overall performance as they reduce the gap between relevant and irrelevant KB records. For example, in Figure 1 row 3 may not get distilled out if Alex and Ana have similar embeddings.

Figure 1: An example dialog between a driver and a system along with the associated knowledge base.

* D. Raghu and A. Jain contributed equally to this work. † D. Raghu is an employee at IBM Research. This work was carried out as part of PhD research at IIT Delhi.
In this paper, we propose Constraint based knowledge base Distillation NETwork (CDNET), which uses (1) a novel pairwise similarity based distillation computation that distills the KB at a record level, and (2) an auxiliary loss that helps distill contextually unrelated KB records by enforcing constraints on the embeddings of entities of the same type. We noticed that the popular entity F1 evaluation metric has a correctness issue when the response contains multiple instances of the same entity value. To fix this issue, we propose a new metric called multiset entity F1. We empirically show that CDNET performs either significantly better than or comparable to existing approaches on three publicly available task oriented dialog datasets.

Related Work
We first discuss approaches that are closely related to our work. Wu et al. (2018) perform KB distillation but fail to capture the relationship across attributes in KB records. They represent a KB record with multiple attributes as a set of (subject, predicate, object) triples. This breaks the direct connection between record attributes and requires the system to reason over longer inference chains. In Figure 1, if the event field is used as the key to break the record into triples, then the distillation has to infer that (dinner, invitee, Alex), (dinner, date, 1st Feb) and (dinner, time, 10am) are connected. In contrast, CDNET performs KB distillation while maintaining the attribute relationships. Other recent approaches perform distillation using the similarity between the dialog history representation and each attribute representation in a KB record, whereas CDNET uses word based pairwise similarity for distillation.
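As an illustration of this difference (our own example, not code from the paper), the snippet below contrasts the two KB representations for row 1 of Figure 1: a flat record that keeps all attributes directly connected, versus the triple decomposition used by Wu et al. (2018), which the model must re-assemble at inference time.

```python
# Row 1 of Figure 1, kept as a single record: all attributes stay directly connected.
record = {"event": "dinner", "invitee": "Alex", "date": "1st Feb", "time": "10am"}

# The same row decomposed into (subject, predicate, object) triples, as in Wu et al. (2018).
# The model must now infer that these three triples describe the same KB row.
triples = [
    ("dinner", "invitee", "Alex"),
    ("dinner", "date", "1st Feb"),
    ("dinner", "time", "10am"),
]
```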
We now briefly discuss approaches that improve other aspects of task oriented dialogs. He et al.

CDNET
CDNET (code at https://github.com/dair-iitd/CDNet) has an encoder-decoder architecture that takes as input (1) the dialog history $H$, modelled as a sequence of utterances $\{u_i\}_{i=1}^{k}$, where each utterance $u_i$ is a sequence of words $\{w_i^j\}$, and (2) a knowledge base $K$ with $M$ records $\{r_m\}_{m=1}^{M}$, where each record $r_m$ has $N$ key-value attribute pairs $\{(k_n, v_m^n)\}_{n=1}^{N}$. The network generates the system response $Y = (y_1, \ldots, y_T)$ one word at a time.

CDNET Encoder
Context Encoder: The dialog history $H$ is encoded using a hierarchical encoder (Sordoni et al., 2015). Each utterance representation $u_i$ is computed using a Bi-GRU (Schuster and Paliwal, 1997). We denote the hidden state of the $j$-th word in the $i$-th utterance as $w_i^j$. The context representation $c$ is generated by passing the $u_i$'s through a GRU.

KB Encoder: We encode the KB using the multi-level memory proposed by Reddy et al. (2019), as its structure allows us to perform distillation over KB records. The KB memory contains two levels. The first level is a set of KB records, where each KB record is represented as the sum of its attribute embeddings, $r_m = \sum_n \Phi_e(v_m^n)$, with $\Phi_e$ the embedding matrix. In the second level, each record is represented as a set of attributes. Each attribute is a key-value pair, where the key $k_n$ is the attribute type embedding and the value $v_m^n$ is the attribute embedding.
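A minimal sketch of this two-level KB memory, assuming a shared embedding matrix and simple summation over attribute value embeddings; the module name and tensor shapes are our own illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class KBMemory(nn.Module):
    """Two-level KB memory: record-level vectors plus attribute key/value embeddings."""

    def __init__(self, vocab_size: int, emb_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # shared embedding matrix Phi_e

    def forward(self, kb_values: torch.Tensor, kb_keys: torch.Tensor):
        # kb_values: (M, N) token ids of attribute values v_m^n
        # kb_keys:   (N,)  token ids of attribute types k_n
        value_emb = self.embed(kb_values)      # (M, N, emb_dim) second-level values
        key_emb = self.embed(kb_keys)          # (N, emb_dim)    second-level keys
        record_emb = value_emb.sum(dim=1)      # (M, emb_dim)    first-level records r_m
        return record_emb, key_emb, value_emb
```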

KB Distillation
The KB distillation module softly filters irrelevant KB records based on the dialog history by computing a distillation distribution $P_d$ over the KB records. To compute $P_d = [d_1, \ldots, d_M]$, we first score each KB record $r_m$ against the dialog history $H$ by aggregating pairwise cosine similarities (CosSim) between the words in $H$ and the attribute values of $r_m$. The distillation likelihood $d_m$ for each record $r_m$ is then given by $d_m = \exp(s_m) / \sum_{q=1}^{M} \exp(s_q)$. Defining the distillation distribution over KB records rather than KB triples has two main advantages: (1) attributes in a KB record (such as invitee, event, time and date in Figure 1) are directly connected and thus easy to distill, and (2) it helps to distill the right records even when the record keys are not unique. In Figure 1, row 3 would be distilled even though it shares the same event name.
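The exact aggregation behind the score $s_m$ is not reproduced here, so the sketch below is only illustrative: it assumes $s_m$ averages, over the attribute values of a record, the maximum cosine similarity to any context word, and then applies a softmax to obtain $P_d$.

```python
import torch
import torch.nn.functional as F

def distillation_distribution(context_word_emb: torch.Tensor,
                              kb_value_emb: torch.Tensor) -> torch.Tensor:
    """Soft record-level filter P_d over KB records (illustrative aggregation).

    context_word_emb: (L, emb_dim) embeddings of the words in the dialog history H
    kb_value_emb:     (M, N, emb_dim) embeddings of attribute values v_m^n
    Returns: (M,) distillation distribution P_d.
    """
    ctx = F.normalize(context_word_emb, dim=-1)              # (L, d)
    vals = F.normalize(kb_value_emb, dim=-1)                 # (M, N, d)
    # Pairwise cosine similarity between every attribute value and every context word.
    sim = torch.einsum("mnd,ld->mnl", vals, ctx)             # (M, N, L)
    # Assumed aggregation: best-matching context word per attribute, averaged per record.
    scores = sim.max(dim=-1).values.mean(dim=-1)             # (M,) record scores s_m
    return F.softmax(scores, dim=0)                          # d_m = exp(s_m) / sum_q exp(s_q)
```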

CDNET Decoder
Following Wu et al. (2018), we first generate a sketch response which uses an entity type (or sketch) tag in place of each entity. For example, The @meeting with @invitee is at @time is generated instead of The dinner with Alex is at 10pm. When an entity tag is generated, we choose an entity suggested by the context and KB memory pointers.

Figure 2: Architecture of the CDNET model.

The generate distribution $P_g$ is computed using the decoder hidden state $h_t$ and an attended summary of the dialog context $g_t$. The summary $g_t = \sum_i \sum_j a_{ij} w_i^j$, where $a_{ij}$ are the Luong attention (Luong et al., 2015) weights over the context word representations $w_i^j$.

Context Memory Pointer: At each time step $t$, we generate the copy distribution over the context $P_{con}$ by performing multi-hop Luong attention over the context memory. The initial query $q_t^0$ is set to $h_t$. $q_t^0$ is then attended over the context to generate an attention distribution $a^1$ and a summarized context $g_t^1$. We represent this as $g_t^1 = \text{Hop}(q_t^0, x)$. In the next hop the same process is repeated with the updated query $q_t^1 = q_t^0 + g_t^1$. The attention weights after $H$ hops are used for computing the context pointer $P_{con}$.

KB Memory Pointer: At each time step $t$, we generate the copy distribution over the KB $P_{kb}$ using (1) the Luong attention weight $\beta_m^t$ over the KB record $r_m$, (2) the Luong attention weight $\gamma_n^t$ over the attribute keys $k_n$ in a record, and (3) the distillation weight $d_m$ over the KB record $r_m$.

The two copy pointers are combined using a soft gate $\alpha$ (See et al., 2017) to obtain the final copy distribution $P_c$.
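A rough sketch of how these quantities could be combined, assuming $P_{kb}$ is proportional to the product of the record-level attention, the key-level attention and the distillation weight, and that $\alpha$ is a scalar gate; the exact parameterization in the paper (Appendix B) may differ.

```python
import torch

def kb_copy_distribution(beta: torch.Tensor, gamma: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Illustrative KB pointer: combine record attention (beta), key attention (gamma)
    and the distillation weights (d) into a distribution over the M x N KB cells."""
    # beta, d: (M,)   gamma: (N,)
    cell_scores = (beta * d).unsqueeze(1) * gamma.unsqueeze(0)   # (M, N)
    return cell_scores / cell_scores.sum()                       # normalize to a distribution

def final_copy_distribution(p_con: torch.Tensor, p_kb: torch.Tensor,
                            alpha: torch.Tensor) -> torch.Tensor:
    """Soft gate between the context pointer and the KB pointer (See et al., 2017 style).
    Both distributions are assumed to already be mapped onto the same output vocabulary."""
    return alpha * p_con + (1.0 - alpha) * p_kb
```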

Loss
We guide the distillation module using two auxiliary loss terms: an entity constraint loss $L_{ec}$ and a distillation loss $L_d$. Often, entities of the same type (e.g., Ana and Alex) have embeddings similar to each other. As a result, records with similar but unrelated entities are incorrectly assigned a high distillation likelihood. To alleviate this problem, we encourage the cosine similarity between two entities of the same type to be as low as possible. This is captured by the entity constraint loss $L_{ec}$, computed over $E$, the set of entity pairs in the KB that belong to the same entity type.
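A minimal sketch of one way to realize such a constraint, assuming $L_{ec}$ averages the clamped cosine similarities over the same-type entity pairs in $E$; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def entity_constraint_loss(embed: torch.nn.Embedding, same_type_pairs: torch.Tensor) -> torch.Tensor:
    """Penalize high cosine similarity between entities of the same type.

    same_type_pairs: (|E|, 2) tensor of entity token-id pairs that share an entity type.
    """
    e1 = F.normalize(embed(same_type_pairs[:, 0]), dim=-1)
    e2 = F.normalize(embed(same_type_pairs[:, 1]), dim=-1)
    cos = (e1 * e2).sum(dim=-1)            # cosine similarity per pair
    return cos.clamp(min=0).mean()         # minimizing this pushes same-type entities apart
```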
The distillation likelihood $d_m$ of a KB record $r_m$ depends on the similarity between the entities in the record and the words mentioned in the dialog context. We compute the distillation loss $L_d$ by defining a reference distillation distribution $d_m^* = s_m^* / \sum_{q=1}^{M} s_q^*$, where $s_m^*$ is the number of times any attribute of $r_m$ occurs in $H$ and in the gold response. The overall loss function is $L = L_g + L_c + L_{ec} + L_d$, where $L_g$ and $L_c$ are the cross entropy losses on $P_g$ and $P_c$ respectively. Detailed equations are described in Appendix B.

Evaluation Metrics: We measure the performance of all models using BLEU (Papineni et al., 2002), our proposed multiset entity F1 and, for completeness, the previously used entity F1 (Wu et al., 2018).

MultiSet Entity F1 (MSE F1):
The entity F1 metric is used to measure the model's ability to predict relevant entities from the KB. It is computed by micro-averaging over the set of entities in the gold responses and the set of entities in the predicted responses. This metric suffers from two main problems. First, when the gold response has multiple instances of the same entity value, the value is accounted for just once in the set representation. For example, in Table 1 the entity value 11am occurs twice in the gold response but is accounted for just once in the set representation. As a result, the recall computation does not penalize the prediction pred-1 for missing an instance of 11am. Second, the existing metric fails to penalize models that stutter. For example, in Table 1 the precision of pred-2 is not penalized for repeating the entity value 8th.
We propose a simple modification to the entity F1 metric to fix these correctness issues. The modified metric, named MultiSet Entity F1, is computed by micro-averaging over multisets of entities rather than sets. As multisets allow multiple instances of the same entity value, the metric (1) accounts for entity values mentioned more than once in the gold response by penalizing recall for missing any instance, and (2) accounts for models that stutter by penalizing precision.
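A small, self-contained sketch of the metric as described above, using multisets (Counters) of entities per response and micro-averaging the counts over the corpus; the entity extraction step is assumed to be given.

```python
from collections import Counter
from typing import List

def multiset_entity_f1(gold_entities: List[List[str]], pred_entities: List[List[str]]) -> float:
    """Micro-averaged entity F1 over multisets of entities (one list per response)."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_entities, pred_entities):
        gold_ms, pred_ms = Counter(gold), Counter(pred)
        overlap = gold_ms & pred_ms              # multiset intersection: min of counts
        tp += sum(overlap.values())
        fp += sum((pred_ms - gold_ms).values())  # extra / repeated predictions hurt precision
        fn += sum((gold_ms - pred_ms).values())  # missing repeated mentions hurt recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# In the spirit of Table 1: "11am" appears twice in the gold but once in the prediction,
# so recall is penalized, unlike with the plain set-based entity F1.
# multiset_entity_f1([["11am", "11am", "8th"]], [["11am", "8th"]])
```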

Results
The results are shown in Table 2. On CamRest and SMD, CDNET outperforms existing models in both MSE F1 and BLEU. On WOZ, CDNET achieves the best score only in MSE F1. We observed that the responses generated by CDNET on WOZ were appropriate, but did not have good lexical overlap with the gold responses. To investigate this further, we perform a human evaluation of the responses predicted by CDNET, FG2Seq and EER.
Human Evaluation: We conduct a human evaluation to assess two dimensions of the generated responses: (1) Appropriateness: how useful are the responses for the given dialog context and KB, and (2) Naturalness: how human-like are the predicted responses. We randomly sampled 75 dialogs from each of the three datasets and requested two judges to evaluate the responses on a Likert scale (Likert, 1932). The results are summarized in Table 3. CDNET outperforms both FG2Seq and EER on appropriateness across all three datasets. Despite having a lower BLEU score on WOZ, CDNET performs on par with the other two baselines on naturalness.
Ablation Study: We perform an ablation study by defining three variants. (1) We remove the entity constraint loss $L_{ec}$ from the overall loss $L$. (2) We replace our pairwise similarity based score $s_m$ used for KB distillation with the global pointer score ($s_m = r_m^T c$) proposed by Wu et al. (2018); we refer to this setting as naive distillation. (3) We replace our pairwise similarity based score $s_m$ with the entry-level attention proposed in prior work. Table 4 shows the MSE F1 and BLEU for these settings on the CamRest and SMD datasets. We see that both our contributions, the pairwise similarity scorer for computing the distillation distribution and the entity constraint loss, contribute to the overall performance.
Discussion: We now discuss the effect of the entity constraint loss $L_{ec}$ on the KB entity embeddings. Figure 3 shows the t-SNE plots (Van der Maaten and Hinton, 2008) of the entity embeddings of CDNET and GLMP, where entities of the same type are shown in the same colour. We see that entities of the same type (e.g., father and boss of the type invitee) are clustered together in the embedding space of GLMP, while they are spread across the space in CDNET. This shows that the entity constraint loss helps reduce the embedding similarity between entities of the same type and ensures that KB records with similar but unrelated entities are filtered out by the KB distillation. A visualization of how the distillation distribution helps identify relevant KB entities is shown in Appendix C.

Conclusion
We propose CDNET for learning end-to-end task oriented dialog systems. CDNET performs KB distillation at the level of KB records, thereby respecting the relationships between connected attributes. CDNET uses a pairwise similarity based score function to better distill the relevant KB records. By defining constraints over the embeddings of entities of the same type, CDNET filters out contextually unrelated KB records. We also propose a simple modification to the entity F1 metric that fixes its correctness issues; we refer to the new metric as multiset entity F1. CDNET significantly outperforms existing approaches on multiset entity F1 and appropriateness, while being comparable on naturalness and BLEU. We release our code for further research.

A Training Details

All the hyperparameters are finalised after a grid search over the dev set. We sample learning rates (LR) from {2.5 × 10^-4, 5 × 10^-4, 10^-4}. The Disentangle Label Dropout (DLD) rate (Raghu et al., 2019) is sampled from {0.0, 0.05, 0.10, 0.15, 0.20}. The number of hops H in the response decoder is sampled from {1, 3, 5}. We ran each hyperparameter setting 10 times and use the setting with the best validation entity F1. The best performing hyperparameters for all datasets are listed in Table 5. All experiments were run on a single Nvidia V100 GPU with 32GB of memory. CDNET has an average runtime of 3 hours (6 min per epoch), 10 hours (20 min per epoch) and 24 hours (36 min per epoch) on CamRest, SMD and WOZ respectively. CDNET has a total of 2.8M trainable parameters (400K for the embedding matrix, 720K for the context encoder, 240K for the sketch RNN and 1440K for the memory pointers).

B Detailed Equations
In this section, we describe the details of the context encoder, the CDNET decoder and the loss.

B.1 Context Encoder
Given a dialog history $H$, we compute the utterance representation $u_i$ and the context representation $c$, where $\tau_i$ is the number of words in utterance $u_i$ and $w_i^j$ is the $j$-th word in the $i$-th utterance.
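A compact sketch of the hierarchical context encoder described in Section 3.1: a Bi-GRU over the words of each utterance followed by a GRU over the utterance representations. The hidden sizes and the pooling of utterance vectors from the Bi-GRU states are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
from typing import List

class ContextEncoder(nn.Module):
    """Hierarchical dialog-history encoder (Sordoni et al., 2015 style)."""

    def __init__(self, vocab_size: int, emb_dim: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.utt_rnn = nn.GRU(2 * hidden, hidden, batch_first=True)

    def forward(self, utterances: List[torch.Tensor]):
        utt_vecs, word_states = [], []
        for utt in utterances:                        # utt: (tau_i,) token ids
            states, _ = self.word_rnn(self.embed(utt).unsqueeze(0))  # (1, tau_i, 2*hidden)
            word_states.append(states.squeeze(0))     # w_i^j: per-word hidden states
            utt_vecs.append(states[:, -1, :])         # assumed pooling: last state as u_i
        _, c = self.utt_rnn(torch.cat(utt_vecs, dim=0).unsqueeze(0))  # context representation c
        return word_states, c.squeeze(0).squeeze(0)
```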

B.2 CDNET Decoder
Let $h_t$ and $y_t$ be the decoder hidden state and the predicted word at time step $t$ respectively. The hidden state $h_t$ is produced by the decoder RNN from the previously generated word. We then compute multi-hop Luong attention over the word representations $w_i^j$ in the context memory. We set the initial query $q_t^0$ to $h_t$ and apply Luong attention, where $W_1$ and $W_2$ are trainable parameters. We then compute the summarized context representation $g_t^1$ and the next-hop query, and repeat this for $H$ hops. The attention vector after $H$ hops is denoted $a^H$. The generate distribution $P_g$ is computed from the decoder state and the attended context, where $W_3$ and $b_1$ are trainable parameters. The context copy distribution $P_{con}$ is obtained from the final-hop attention weights. The KB copy distribution $P_{kb}$ is computed from the record-level and key-level attention weights together with the distillation distribution, where $W_4$, $W_5$, $W_6$ and $W_7$ are trainable parameters. Finally, we compute the gate $\alpha$ to combine $P_{con}$ and $P_{kb}$ into the final copy distribution $P_c$, where $W_8$ is a trainable parameter.
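Since the equations themselves are not reproduced above, the following is only a schematic of one hop of Luong-style (general) attention with the additive query update $q^{h+1} = q^h + g^{h+1}$ described in the decoder section; the exact roles of $W_1$ and $W_2$ in the paper may differ.

```python
import torch
import torch.nn.functional as F

def luong_hop(query: torch.Tensor, memory: torch.Tensor, W: torch.Tensor):
    """One hop of (assumed) Luong 'general' attention over the context memory.

    query:  (d,)  current query q^h
    memory: (L, d) context word representations w_i^j, flattened over utterances
    W:      (d, d) trainable bilinear weight
    Returns attention weights a (L,) and the summarized context g (d,).
    """
    scores = memory @ (W @ query)        # (L,) alignment scores
    a = F.softmax(scores, dim=0)
    g = a @ memory                       # (d,) attention-weighted summary
    return a, g

def multi_hop(query: torch.Tensor, memory: torch.Tensor, W: torch.Tensor, hops: int = 3):
    """Repeat the hop, updating the query additively: q^{h+1} = q^h + g^{h+1}."""
    a = None
    for _ in range(hops):
        a, g = luong_hop(query, memory, W)
        query = query + g
    return a, query                      # final-hop weights a^H drive P_con
```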

B.3 Loss
We compute the cross entropy losses $L_g$ and $L_c$ over the generate distribution $P_g$ and the copy distribution $P_c$ respectively.

C Distillation Visualisation
We show the visualisation of how the KB distillation distribution helps the decoder rectify the incorrect KB memory pointer inference in Figure 4. Figure 5 shows how the KB distillation distribution helps increase the confidence associated with the correct entity in the KB.

D Datasets
We present statistics of SMD, CamRest and WOZ in Table 6.

E Domain-Wise Results
Table 7 and Table 8 show the domain-wise entity F1 scores on the SMD and WOZ datasets respectively. We note that CDNET has either the best or the second-best performance in the domain-wise scores.

F Qualitative Example
Table 9 shows responses predicted by CDNET, EER and FG2Seq for an example from the WOZ dataset.
G Human Evaluation
Figure 6 shows a screenshot of the task used for collecting human judgements.

Figure 4: Attention visualization of a decode time step of an example from the SMD dataset. $P_{kb}$ corresponds to the sketch tag @poi and is computed by combining the output of the KB memory pointer and the distillation distribution $P_d$.
Figure 5: Attention visualization of a decode time step of an example from the SMD dataset. $P_{kb}$ corresponds to the sketch tag @address and is computed by combining the output of the KB memory pointer and the distillation distribution $P_d$.