Knowledge Enhanced Fine-Tuning for Better Handling Unseen Entities in Dialogue Generation

Although pre-training models have achieved great success in dialogue generation, their performance drops dramatically when the input contains an entity that does not appear in the pre-training and fine-tuning datasets (an unseen entity). To address this issue, existing methods leverage an external knowledge base to generate appropriate responses. In real-world practice, however, the entity may not be included in the knowledge base, or knowledge retrieval may be imprecise. To deal with this problem, instead of introducing the knowledge base as input, we force the model to learn a better semantic representation by predicting the information in the knowledge base based only on the input context. Specifically, with the help of a knowledge base, we introduce two auxiliary training objectives: 1) Interpret Masked Word, which conjectures the meaning of the masked entity given the context; 2) Hypernym Generation, which predicts the hypernym of the entity based on the context. Experimental results on two dialogue corpora verify the effectiveness of our methods under both knowledge-available and knowledge-unavailable settings.


Introduction
Owing to large amounts of conversation data and pre-training models (Zhang et al., 2020; Roller et al., 2020), generation-based chatbots have achieved significant advances and even reach human parity on specific test sets (Zhang et al., 2018; Dinan et al., 2019). However, the robustness of pre-trained models is still low with regard to unseen entities (Zhang et al., 2016; Dinan et al., 2019). In practice, users often talk with chatbots about the latest news and recent hot topics (Morris et al., 2016), which may not appear in the pre-training or fine-tuning corpus. For instance, "COVID-19" is a new term that does not appear in the training data of Blender (Roller et al., 2020), leading to poor performance when a user mentions "COVID-19". As shown in Figure 1(a), given the utterance "I've tested positive for COVID-19", Blender yields a bad response, "That's great news", because it misreads the utterance on the basis of the word "positive". This poses a real challenge for building an accurate and robust generation-based chatbot.
Figure 1: (a) Non-knowledge dialogue generation. (b) Knowledge grounded dialogue generation. Note that the knowledge of "COVID-19" cannot be retrieved from the knowledge base, because it is a new term.
* Contribution during internship at MSRA.
Existing methods leverage external knowledge to tackle the problem (Ghazvininejad et al., 2018): chatbots retrieve relevant knowledge about the entities from an external knowledge base and use the retrieved knowledge to help generate appropriate responses. However, these methods depend heavily on the coverage of the knowledge base and the accuracy of knowledge retrieval, and may fail when the entity is not included in the knowledge base (Wu et al., 2021) or the retrieved knowledge is inappropriate (Lian et al., 2019). As shown in Figure 1(b), the knowledge retriever fails to retrieve "COVID-19" from the knowledge base, yielding an incorrect response. According to our statistics, knowledge retrieval failure is not rare in practice. Taking Reddit as an example, we collect 407 dialogues over 40 topics on the Trendings panel and find that 24.8% of the topic words are polysemous, indicating a risk of incorrect knowledge retrieval, and 47.9% of the topic words are not included in Wikipedia. To date, few studies have investigated how to build a dialogue generation model for which knowledge may be unavailable during inference.
We solve this problem by proposing a knowledge enhanced fine-tuning method, trying to understand semantic information of entities based on the context. For example, given the sentence "I want to submit a paper to EMNLP", a person may not know what "EMNLP" is, but he/she can guess that it should be a conference or a journal, based on the context. Similarly, we aim to enhance the semantic representation of unseen entities by guiding the model to learn the meaning of the words only based on the context information.
To achieve this, we take Blender (Roller et al., 2020) as our backbone model and propose two auxiliary training objectives (Figure 1(c)) for fine-tuning; we dub the resulting model Knowledge Enhanced Blender (KE-Blender). The first objective is Interpret Masked Word, which predicts the word's definition based on the context, where the definition is obtained from a knowledge base. The second is Hypernym Generation, which predicts the corresponding hypernym of the word given by WordNet. These two training objectives force the model to learn semantic information from the external knowledge base during training, guessing the meaning of an unseen entity from its context, so that it better understands the input utterance and generates relevant responses during inference. Neither training objective requires further human labeling, which makes it possible to extend the method to large-scale pre-training.
Results on the Wizard of Wikipedia benchmark show that the proposed model improves performance. The proposed method achieves 14.9 and 18.4 PPL on Wizard Test Unseen in the knowledge-available and knowledge-unavailable settings, respectively, outperforming the Blender baselines (16.3 and 19.9 PPL). To further verify the effectiveness of our method in real-world scenarios, we collect 407 dialogues on the Reddit Trendings panel, demonstrating the effectiveness of the proposed method in practice. We release our code and dataset at https://github.com/Nealcly/KE-Blender.

Related Work

Knowledge Enhanced Pre-training
BAIDU-ERNIE uses entity-level and phrase-level masking strategies to inject knowledge into the language model. THU-ERNIE incorporates contextual representations with separate KG embeddings. LUKE (Yamada et al., 2020) proposes an entity-aware self-attention to boost the performance of entity-related tasks. SenseBERT (Levine et al., 2020) uses WordNet to infuse lexical semantic knowledge into BERT. KnowBERT (Peters et al., 2019) incorporates a knowledge base into BERT using knowledge attention. TNF (Wu et al., 2021) accelerates pre-training by taking notes for rare words. Compared with these methods, which enhance the pre-trained encoder by utilizing named entities or a knowledge base, we inject knowledge to improve the generation ability of seq2seq models when given an unseen word.

Knowledge Grounded Dialogue Generation
With advances in deep learning, pre-trained language models have shown promising results in dialogue generation (Lewis et al., 2020; Zhang et al., 2020; Roller et al., 2020). To equip the models with external knowledge, Zhang et al. (2018) first show that adding user profile information produces more consistent and engaging responses. Dinan et al. (2019) propose a Transformer memory network to retrieve knowledge from Wikipedia. Another line of work uses two-step decoding, which first generates a response based on the context, and then takes the generated response and the relevant knowledge as input to generate a new response. Kim et al. (2020) focus on knowledge selection in dialogue generation by utilizing a sequential latent variable model. Chen et al. (2020) further enhance the selection module with posterior information. Zhao et al. (2020b) use reinforcement learning to optimize knowledge selection with unlabeled data. Different from their work, our KE-Blender does not take knowledge as input, because knowledge is only used to enhance our model during training.

Task
Suppose that we have a training set $D_S = \{(U^S_i, K^S_i, R^S_i)\}_i$, where $U^S_i$, $K^S_i$ and $R^S_i$ are the dialogue context, the external knowledge retrieved from the knowledge base and the response, respectively. In addition to $D_S$, we have a test dataset $D_P = \{U^P, R^P\}$. Unlike $D_S$, $D_P$ does not contain external knowledge, because the background knowledge associated with an unseen word is difficult to obtain in real time during inference. Our goal is to learn a dialogue generation model $P(R \mid U; \theta)$ with the help of $K^S$, where $\theta$ denotes the parameters of the model. Note that the dialogue generation model $P(R \mid U; \theta)$ generates the response $R$ based only on the input context $U$, without using knowledge $K$ as input.
In the following sections, we first introduce the model structure, and then show how to leverage the external knowledge $K$ to enhance the generation model $P(R \mid U; \theta)$ with our two proposed training objectives.

Baseline
We consider Blender and Knowledge Grounded Blender (KG-Blender) as our baselines; Blender generates from the dialogue context alone, whereas KG-Blender additionally takes the retrieved knowledge as input.
Blender. Given a dialogue context $U = \{u_1, \ldots, u_{l-1}\}$, we first concatenate $U$ into a single sequence $U = \{x_1, x_2, \ldots, x_T\}$. The response is denoted as $R = \{y_1, y_2, \ldots, y_T\}$. We train our model on the basis of Blender, which is a standard Seq2Seq Transformer architecture with pre-training. In particular, we feed the dialogue context $U$ to the encoder of the transformer and obtain hidden representations of the sentence:
$$h^{enc} = \mathrm{TRANSFORMER\_ENCODER}(U) \tag{1}$$
At the $t$-th step of the decoder, $h^{enc}$ and the previous output tokens $y_{1:t-1}$ are taken as inputs, yielding a representation using attention (Vaswani et al., 2017):
$$h^{dec}_t = \mathrm{TRANSFORMER\_DECODER}(h^{enc}, y_{1:t-1}) \tag{2}$$
The generative probability distribution of $y_t$ is given by
$$p(y_t \mid U, y_{1:t-1}) = \mathrm{softmax}(W_o h^{dec}_t + b_o) \tag{3}$$
where $W_o$ and $b_o$ are trainable parameters. We use standard Maximum Likelihood Estimation to optimize the model parameters $\theta$. Given a training pair $(U, R)$, we minimize:
$$\mathcal{L}_{dialogue} = -\sum_{t=1}^{T} \log p(y_t \mid U, y_{1:t-1}) \tag{4}$$
We adopt Blender-90M (Roller et al., 2020) to initialize our Seq2Seq Transformer model, which has been pre-trained on 1.5B training examples from Reddit 2019.
Knowledge Grounded Blender. One intuitive baseline for using knowledge is to take both the context and the knowledge as input. In particular, the concatenation of the context $U$ and the associated knowledge $K$ is fed to the transformer encoder:
$$h^{enc} = \mathrm{TRANSFORMER\_ENCODER}([U; K]) \tag{5}$$
Similar to Eq 4, given a training pair $(U, K, R)$, the loss function is
$$\mathcal{L}_{dialogue} = -\sum_{t=1}^{T} \log p(y_t \mid U, K, y_{1:t-1}) \tag{6}$$
Note that it is difficult to use KG-Blender directly when knowledge is unavailable, because KG-Blender relies on knowledge as input.
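To make the maximum-likelihood objective concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the decoder has already produced one list of vocabulary logits per target position; the helper names `log_softmax`, `sequence_nll` and `perplexity` are hypothetical. The same computation applies to Blender and KG-Blender, since the two differ only in what is fed to the encoder.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_nll(step_logits, target_ids):
    """Negative log-likelihood of the gold response, summed over decoding steps."""
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, target_ids))


def perplexity(step_logits, target_ids):
    """Exponentiated per-token average of the negative log-likelihood."""
    return math.exp(sequence_nll(step_logits, target_ids) / len(target_ids))


# Toy example: 3 decoding steps over a 4-word vocabulary with uniform logits.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
gold_response = [0, 1, 2]
loss = sequence_nll(uniform_logits, gold_response)
ppl = perplexity(uniform_logits, gold_response)
```

With uniform logits every token has probability 1/4, so the loss is 3 log 4 and the perplexity equals the vocabulary size. Perplexity computed this way is the same quantity reported as PPL in the experiments, up to tokenization details that this sketch does not model.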

Knowledge Enhanced Blender
To build a robust model for the knowledge-unavailable setting, we consider adding two auxiliary losses during fine-tuning. People try to understand an unseen word based on the context, even if a dictionary is unavailable. To simulate this behavior, we explicitly guide the Blender model to learn the meaning of words based only on the context information.
Interpret Masked Word. The first objective asks the model to restore the definition of masked words. Different methods can be used to select which words are masked. For example, we could mask proper nouns in the utterance, or pre-defined topic words for a specific dataset². Suppose the input text is "I submit a paper to the EMNLP". "EMNLP" is replaced by [MASK], yielding "I submit a paper to the [MASK]". The definition retrieved from Wikipedia is "EMNLP is a leading conference in the area of natural language processing and artificial intelligence". The pre-trained model is then required to restore the definition by consuming the masked utterance as input. In this way, the model is explicitly guided to understand the background knowledge of the masked word given the context.
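The construction of one Interpret Masked Word training instance can be sketched as follows. This is an illustrative sketch rather than the released code: the function name `build_interpret_instance` is hypothetical, and plain substring replacement is assumed in place of whatever tokenizer-level masking the actual implementation uses.

```python
def build_interpret_instance(utterance, topic_word, definition, mask_token="[MASK]"):
    """Return (masked source, definition target) for one utterance."""
    if topic_word not in utterance:
        raise ValueError(f"topic word {topic_word!r} not found in utterance")
    # Plain substring match: every occurrence of the topic word is hidden,
    # so the model has to rely on the surrounding context.
    source = utterance.replace(topic_word, mask_token)
    return source, definition


source, target = build_interpret_instance(
    utterance="I submit a paper to the EMNLP",
    topic_word="EMNLP",
    definition=("EMNLP is a leading conference in the area of natural language "
                "processing and artificial intelligence"),
)
```

Here `source` is the encoder input and `target` is the sequence the decoder is trained to produce.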
Formally speaking, given a single utterance $u_{l-1} = \{x_1, x_2, \ldots, x_T\}$, we assume that $x_i$ is the topic word in $u_{l-1}$, and its corresponding definition is denoted as $K_{x_i}$. We replace $x_i$ with [MASK] and use the masked utterance $\tilde{u}_{l-1} = \{x_1, \ldots, x_{i-1}, \text{[MASK]}, x_{i+1}, \ldots, x_T\}$ as the input of Eq 1 in Section 3.2. To distinguish this task from the original dialogue generation task, we use a specific start token [DEFI] to mark that the target sequence is the definition. Given a training pair $(\tilde{u}_{l-1}, K_{x_i})$, the training objective of interpret masked word is:
$$\mathcal{L}_{interpret} = -\sum_{t=1}^{|K_{x_i}|} \log p(k_t \mid \tilde{u}_{l-1}, k_{1:t-1}) \tag{7}$$
where $k_t$ denotes the $t$-th token of $K_{x_i}$.
² The Wizard of Wikipedia dataset defines topic words for each dialogue.
Hypernym Generation. We also reconstruct the input utterance by replacing the topic words with the corresponding hypernym. Compared with a topic word, the semantic field of its hypernym is more general. We use WordNet to construct our training instances. For instance, given an utterance $u_{l-1}$ = {I submit a paper to the EMNLP}, we use "conference" to replace "EMNLP", where "conference" is the hypernym of "EMNLP", yielding the target sequence $u'_{l-1}$ = {I submit a paper to the conference}. This training objective aims to guide the model to understand the semantic information of unseen words. We use a specific start token [HYPE] to mark that the target sequence is for hypernym generation. Given a training pair $(u_{l-1}, u'_{l-1})$, the training objective of hypernym generation is:
$$\mathcal{L}_{hypernym} = -\sum_{t=1}^{|u'_{l-1}|} \log p(x'_t \mid u_{l-1}, x'_{1:t-1}) \tag{8}$$
where $x'_t$ denotes the $t$-th token of $u'_{l-1}$.
Training. We optimize the dialogue generation loss together with the two auxiliary losses:
$$\mathcal{L} = \mathcal{L}_{dialogue} + \mathcal{L}_{interpret} + \mathcal{L}_{hypernym} \tag{9}$$
Reddit Trendings is a test set that simulates real-world settings, built by crawling user dialogues from the Reddit Trendings panel in 2021. The Reddit Trendings panel contains the latest hot topics, most of which are not included in external knowledge bases. We first obtain topic words from the Reddit Trendings panel, then crawl dialogues based on the topic words. We further filter the dataset by keeping only dialogues that include at least 2 utterances, yielding a dataset similar to the Wizard setting. Finally, the dataset consists of 407 utterances over 40 trending topics.
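Returning to the Hypernym Generation objective and the joint training objective described in this section, the sketch below shows how one hypernym instance might be built and how the three losses might be combined. Several details are assumptions made for illustration: `TOY_HYPERNYMS` is a hypothetical lookup standing in for a WordNet query, the [HYPE] marker is realized by prepending it to the target text, the source utterance is left unmasked, and the three losses are summed with equal weights.

```python
# Hypothetical toy lookup standing in for a WordNet query.
TOY_HYPERNYMS = {"EMNLP": "conference"}


def build_hypernym_instance(utterance, topic_word, hypernyms, start_token="[HYPE]"):
    """Return (source, target): the original utterance and its hypernym rewrite."""
    hypernym = hypernyms[topic_word]  # raises KeyError if no hypernym is known
    # Assumption: the task marker is prepended to the target as plain text.
    target = f"{start_token} {utterance.replace(topic_word, hypernym)}"
    return utterance, target


def total_loss(dialogue_loss, interpret_loss, hypernym_loss):
    """Unweighted sum of the three objectives (equal weights are an assumption)."""
    return dialogue_loss + interpret_loss + hypernym_loss


source, target = build_hypernym_instance(
    "I submit a paper to the EMNLP", "EMNLP", TOY_HYPERNYMS
)
combined = total_loss(2.0, 1.5, 0.5)
```

Because all three tasks share one encoder-decoder, the start token is the only signal telling the decoder which kind of target to produce, which is why the sketch keeps it explicit.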

Setup
We implement KE-Blender with transformers and choose blenderbot-90M as the pre-trained language model. AdamW with a batch size of 128 is used to optimize parameters. The initial learning rate is set as 1e-5, which is halved in each training iteration. We set the maximum input tokens as 512. To ensure that KE-Blender also works well in knowledge available settings, we also create extra training instances by concatenating the context with the associated knowledge as input.

Baselines
We compare KE-Blender with Blender and KG-Blender, and also list the following state-of-the-art methods for reference:
Transformer (Vaswani et al., 2017) is a standard transformer model for dialogue generation. It takes the concatenation of context utterances and the associated knowledge as input.
SKT (Kim et al., 2020) uses a sequential latent variable model for knowledge selection, and then generates the response based on the context and the selected knowledge.
DRD (Zhao et al., 2020a) is a pre-training model designed for low-resource dialogue generation, which decomposes the decoder into independent components.
SKT + PIPM + KDBTS (Chen et al., 2020) uses posterior information to help the prior knowledge selection module, and trains the decoder with knowledge distillation.
KnowledGPT (Zhao et al., 2020b) adopts reinforcement learning to optimize the knowledge selection module, which gives state-of-the-art performance on Wizard.
Blender-FT. Blender is a large-scale dialogue pre-training model. We fine-tune Blender on the Wizard training set without utilizing external knowledge.
KG-Blender. We fine-tune Blender on the Wizard training set by concatenating the context and the associated knowledge as the input. In the setting where external knowledge is unavailable, only the context is used to generate the response.

Metrics
Automatic evaluation metrics: Following Dinan et al. (2019) and Kim et al. (2020), models are measured using the perplexity of the ground-truth response (PPL) and the unigram F1-score (F1). Ghazarian et al. (2019) show that BERT can be used to evaluate the generated response. As a supplement to PPL and F1, we employ BERT-based evaluation metrics to assess whether the generated response is knowledgeable. As shown in Table 3, the dialogue generation model is first required to generate a response $\hat{R}$ based on the dialogue context $U = \{u_1, \ldots, u_{l-1}\}$. We use the special token [MASK] to replace the topic word in the last context utterance $u_{l-1}$. Then a masked language model (i.e., BERT-large) is used to predict the masked topic word from the last context utterance $u_{l-1}$ and the generated response $\hat{R}$. The recall@k of the masked language model is used to measure the knowledge stored in the dialogue generation model. Intuitively, the more knowledgeable a dialogue generation model is, the better the masked language model can predict the masked topic word from the generated response $\hat{R}$ and the last context utterance $u_{l-1}$.
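The two overlap-style metrics above can be sketched in a few lines. This is a simplified illustration under stated assumptions: tokens are obtained by lowercasing and whitespace splitting (the actual evaluation scripts may normalize text differently), and the ranked candidate lists passed to `recall_at_k` are assumed to come from a masked language model that is not included here. Both function names are hypothetical.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """Harmonic mean of unigram precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(ranked_candidates, gold_words, k):
    """Fraction of examples whose gold topic word is among the top-k predictions."""
    hits = sum(
        1 for ranked, gold in zip(ranked_candidates, gold_words) if gold in ranked[:k]
    )
    return hits / len(gold_words)


f1 = unigram_f1("the cat sat", "the cat ran")
r_at_2 = recall_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=2)
```

In the toy call, two of three tokens overlap in each direction, so precision, recall and F1 are all 2/3; the gold word is found in the top 2 for one of the two examples, giving a recall@2 of 0.5.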
Human evaluation metrics: Manual evaluations are essential for evaluating dialogue generation (Ritter et al., 2011). We conduct human evaluations to compare KE-Blender with our baselines Blender and KG-Blender by randomly sampling 200 instances from Wizard Test Unseen. We define three metrics for manual evaluation: fluency, knowledgeability and coherence. Each aspect is scored on three grades, 0, 1 and 2, representing "bad", "normal" and "good", respectively. Following Wu et al. (2018), we employ three annotators to do a side-by-side human evaluation, and report Fleiss' Kappa (Fleiss et al., 1971) to show the agreement among human annotators.
[...]ing that KE-Blender is more robust when it comes to Test Unseen.

Results
w/ Knowledge During Inference. Previous work (Kim et al., 2020; Zhao et al., 2020b) focuses on how to better leverage knowledge for dialogue generation. Compared with these strong baselines, KE-Blender performs competitively in the knowledge-available setting, even though it is designed for the knowledge-unavailable setting. Notably, it achieves the best reported PPL on Wizard. The results are consistent with our intuition: knowledge grounded methods perform well when knowledge is provided during inference, and our method is robust and does not degrade in the w/ knowledge setting.
w/o Knowledge During Inference. When external knowledge is unavailable during inference, knowledge grounded methods cannot be directly applied, since they require knowledge as an input. Hence, we compare KE-Blender with Blender and KG-Blender. As can be seen from Table 1, our method shows large advantages in all metrics, achieving a 16.7 F1 score. This shows that our auxiliary training objectives help the model generalize better when it meets unseen words.
Table 2 shows the recall of masked topic words predicted by a BERT model, where a higher recall score indicates a stronger correlation between the knowledge and the response. The human response obtains a higher score, which suggests that our evaluation metric is reasonable and that there is still a gap between human and machine replies. Our method gives the best performance in both settings, demonstrating strong performance when knowledge is absent, which shows that our auxiliary training objectives are able to help the model learn a better semantic representation. Surprisingly, it also outperforms simple knowledge grounded methods when knowledge is available.
Table 3 compares KE-Blender with the baselines using human evaluation. All models are able to produce fluent responses thanks to the power of pre-training. Inference with the retrieved knowledge is particularly helpful for generating more knowledgeable and coherent responses. When knowledge is unavailable, KE-Blender significantly outperforms Blender and KG-Blender (p<0.01) in both knowledgeability and coherence, and is also highly competitive with the model that uses knowledge as input.

Human Evaluation
The value of Fleiss' Kappa (Fleiss et al., 1971) exceeds 0.59 on all models, showing a high inter-rater agreement among annotators.
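For readers unfamiliar with the agreement statistic, the standard Fleiss' Kappa computation can be sketched as below for a setup like the one used here (three annotators, three grades). The rating counts in the example are invented toy values for illustration, not data from the paper, and the function name is hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa; ratings[i][j] = number of annotators giving item i grade j.

    Every item must be rated by the same number of annotators. The statistic is
    undefined (division by zero) when all ratings fall into a single grade.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_grades = len(ratings[0])
    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall distribution of grades.
    grade_share = [
        sum(row[j] for row in ratings) / (n_items * n_raters) for j in range(n_grades)
    ]
    expected = sum(share * share for share in grade_share)
    return (observed - expected) / (1 - expected)


# Toy counts over grades (0, 1, 2) from three annotators per item.
perfect = fleiss_kappa([[3, 0, 0], [0, 3, 0], [0, 0, 3]])
split = fleiss_kappa([[2, 1, 0], [2, 1, 0]])
```

Perfect agreement across varied grades gives a Kappa of 1, while the second toy case, where annotators disagree in the same way on every item, gives a negative value because the observed agreement falls below chance.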
Low-Knowledge-Resource Setting. To simulate a low-knowledge-resource setting, we start from the full knowledge in Wizard Test Unseen and gradually reduce the amount of knowledge by randomly removing some entries. Figure 4 shows the trends when different percentages of the knowledge in Test Unseen are removed. As more knowledge is removed, the performance of the two methods decreases significantly.
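The random removal procedure can be sketched as follows. This is only one plausible reading of the setup: it assumes one knowledge entry per test example, marks a removed entry as `None` so that the example itself stays in the test set, and uses a fixed seed for reproducibility. The function name `drop_knowledge` is hypothetical.

```python
import random


def drop_knowledge(knowledge, drop_ratio, seed=0):
    """Replace a random fraction of knowledge entries with None."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    n_drop = round(len(knowledge) * drop_ratio)
    dropped = set(rng.sample(range(len(knowledge)), n_drop))
    # Positions are preserved so each entry still lines up with its dialogue.
    return [None if i in dropped else entry for i, entry in enumerate(knowledge)]


entries = [f"knowledge-{i}" for i in range(10)]
reduced = drop_knowledge(entries, drop_ratio=0.3)
```

Sweeping `drop_ratio` from 0 to 1 and re-evaluating at each step would produce a curve of the kind described for Figure 4.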

Analysis
Ablation Study. An interesting question is how much each of the two auxiliary losses contributes in training. The results are shown in Table 4. Each loss contributes substantially in automatic evaluation, with F1 increasing markedly when either objective is added. Combining the two losses still brings an improvement, but a marginal one, which indicates that the two losses may play similar roles for the pre-trained model.
Table 6 shows an example of the model responses on the Wizard of Wikipedia Test Unseen. Under the knowledge-available setting, all models generate reasonable responses with the help of relevant knowledge. Both models mention that "Elvis" was in "jailhouse rock" by consulting the external knowledge base. When knowledge is unavailable, Blender-FT gives a non-informative response because it cannot understand the word "Elvis". In contrast, KE-Blender produces an informative and knowledgeable response, which directly points out that "Elvis" appears in a lot of movies and also in a few TV shows. This case shows that our model can significantly improve response quality when knowledge is absent, while sustaining good performance when knowledge is available.

Case Study
Knowledge Mining. Although we add two additional tasks in training, it is unclear how well the model performs on these two tasks. Therefore, we further evaluate whether explicit knowledge can be recovered from our model given an unseen entity. First, we find that the perplexity of the ground-truth Wikipedia knowledge on Test Unseen is only 6.81. As shown in Table 7, our model is able to produce reasonable definitions based on context information and pre-trained knowledge, and to generate hypernyms for a given word in context. These results show that rich knowledge is stored in KE-Blender during knowledge enhanced fine-tuning, which potentially allows us to ground open-domain dialogues without external knowledge.

Conclusion
We presented KE-Blender for better handling response generation based on unseen words, which enables a model to generate knowledgeable responses without external knowledge during inference. To explicitly inject knowledge into the model, we proposed two training objectives: interpret masked word and hypernym generation. To simulate real-world scenarios, we also released a test set built from Reddit Trendings. Results on Wizard and Reddit Trendings show that KE-Blender outperforms several state-of-the-art methods and strong baselines, both when external knowledge is available and when it is unavailable.