KILM: Knowledge Injection into Encoder-Decoder Language Models

Large pre-trained language models (PLMs) have been shown to retain implicit knowledge within their parameters. To enhance this implicit knowledge, we propose Knowledge Injection into Language Models (KILM), a novel approach that injects entity-related knowledge into encoder-decoder PLMs via a generative knowledge infilling objective during continued pre-training. This is done without architectural modifications to the PLMs and without adding additional parameters. Experimental results over a suite of knowledge-intensive tasks spanning numerous datasets show that KILM enables models to retain more knowledge and hallucinate less, while preserving their original performance on general NLU and NLG tasks. KILM also demonstrates improved zero-shot performance on tasks such as entity disambiguation, outperforming state-of-the-art models with 30x more parameters.


Introduction
Large pre-trained language models (PLMs) (Radford et al., 2019; Lewis et al., 2020a; Raffel et al., 2020) have achieved great success across NLP tasks. However, recent studies also reveal that PLMs are susceptible to memorizing the pre-training corpora rather than capturing the knowledge within them (Niven and Kao, 2019; Talmor et al., 2020; Yasunaga et al., 2022; Li et al., 2022). Particularly for generation tasks, PLMs are notorious for hallucinating text that is factually incorrect or hard to verify (Logan et al., 2019; Sun et al., 2020; Lin et al., 2020; Longpre et al., 2021). To address these issues, one approach is to retrieve relevant knowledge and integrate it explicitly with PLMs (He et al., 2020; Liu et al., 2021b). Another direction is incorporating additional knowledge sources into the pre-training step (Zhang et al., 2019; Xiong et al., 2019; Liu et al., 2022; Wang et al., 2021b). While the former suffers from the issue of models falling back on themselves when no information is retrieved (Krishna et al., 2021), knowledge-focused pre-training can be complementary to those methods (Longpre et al., 2021) and shows its advantage in generalization.

* Work done in part while Yan was an intern at Amazon Alexa AI.
1 The code is available at https://github.com/alexa/kilm.
In this paper, we propose an approach for injecting knowledge into encoder-decoder PLMs, such as BART, as a continued pre-training process. We refer to it as Knowledge Injection into Language Models (KILM). Instead of introducing additional parameters to PLMs or modifying the model architectures to incorporate additional knowledge, KILM infills knowledge sentences by adopting a novel knowledge infilling objective that includes a knowledge reconstruction step in addition to the original pre-training objectives of BART.
The aim of KILM is to teach PLMs additional content about concepts and entities that they encounter in a given context, so that the models are able to ground an entity mention with additional information and "describe" what that entity is (see Figure 1). In this process, the context is especially important when an entity mention can refer to multiple entities, e.g., Titanic, which can refer to the British ship or to the 1997 movie. We utilize the short descriptions of entities in Wikipedia, which consist of entity definitions, as the knowledge source (§3.1). Although existing works leverage similar knowledge for PLM enhancement, they ignore the relationship among entities, contexts, and entity-centric knowledge, and restrict their applications to NLU tasks. In contrast, we propose a distinct structure (§3.2) to augment Wikipedia articles with short descriptions of the entity mentions in the context. This structure models this essential relationship, forcing PLMs to learn the correlation between entities and contexts and to differentiate between entities with similar surface forms during continued pre-training. Following recent work that highlights the need for explicit grounding for PLMs to truly understand text (Merrill et al., 2021), we posit that KILM takes a step in that direction.

Figure 1: Illustration of the proposed KILM technique for injecting knowledge into PLMs. In the given example, the mention Joker is linked to the page of the Wikipedia entity Joker (character). While the figure only shows the knowledge infilling, knowledge masking, and masked knowledge reconstruction steps, the proposed method is combined with the original pre-training objectives of PLMs for continued pre-training.
The proposed structure for knowledge infilling in KILM is further leveraged as a structured prompt in downstream tasks (see §4.2). We demonstrate better knowledge retention with KILM in zero-shot settings on entity disambiguation and appositive generation tasks, showing the effectiveness of the proposed method. We also find that BART with KILM outperforms BART on QA tasks and is less prone to hallucination on tasks such as knowledge-grounded response generation. As mentioned earlier, KILM relies on continued pre-training of PLMs, which presents the possibility of catastrophic forgetting of the original skills of the PLM. We mitigate this by retaining the original training objectives of BART during the continued pre-training stage. We empirically verify that our proposed objective neither degrades the general language modeling ability of the PLM nor affects the fluency of these models on natural language generation (NLG) tasks. Although we focus on short descriptions of entities as the knowledge source for KILM, other forms of knowledge can also be used, which we leave for future exploration.
We summarize our contributions as follows: (1) We propose a novel approach, KILM, to leverage Wikipedia annotations in the pre-training of PLMs. We inject knowledge into BART solely through continued pre-training, with no change to the architecture of the PLM. KILM enables entity-based knowledge injection with knowledge in natural-language form. KILM's distinct structure also offers a direct way to probe the entity knowledge retained in pre-trained models.
(2) We show that KILM enhances the performance of BART on knowledge-intensive tasks while maintaining its original performance on other downstream tasks. KILM demonstrates improved zero-shot performance on the entity disambiguation task, outperforming state-of-the-art models with 30x more parameters.

Related Work
Knowledge-Enhanced LMs To enhance PLMs' use of knowledge, a number of works have attempted to augment them with external knowledge sources, such as knowledge graphs (KGs) (Yin et al., 2022). Some recent work introduced additional non-parametric memories into the models (Zhang et al., 2019; Rosset et al., 2020) to obtain entity embeddings and modified the model structures to accommodate extra information (Yamada et al., 2020; Wang et al., 2021a,b), while others changed the masking schema based on the additional information (Sun et al., 2019; Wang et al., 2022), or converted the external KGs into natural language text as an additional pre-training corpus (Xiong et al., 2019; Zhou et al., 2020; Liu et al., 2022; Agarwal et al., 2021; Li et al., 2022).

Modeling with Text Linking and Enrichment
Our motivation bears similarity to text linking (Yasunaga et al., 2022; Deng et al., 2021; Arora et al., 2022) during pre-training and to text enrichment (Elazar et al., 2022). Modeling the links between documents or metadata is motivated by the fact that PLMs, pre-trained on plain text, are not directly trained to capture inter-dependencies between documents. The similarity between the above tasks and ours lies in the ways humans implicitly link information when reading or generating language. However, the former tasks are restricted to relationships within the text, while our goal is to ground concepts and entities to their related descriptions in encyclopedic resources.
Pre-training with Hypertext Besides PLMs that are pre-trained on natural language corpora, HTLM (Aghajanyan et al., 2021) directly pre-trains on simplified crawled HTML data, building on BART models, and CM3 (Aghajanyan et al., 2022) extends HTLM to a multimodal setting with causal masked language modeling. The target of HTLM and CM3 is to better leverage the enormous web-scraped data source for pre-training. In contrast, our work aims to leverage hypertext to explore how to inject extra knowledge into PLMs with a custom-designed structure that furnishes advantages to PLMs in performing knowledge-intensive tasks.

Methodology
Although KILM is model-agnostic and could be used for any PLM (more on this in §5), in this work, due to high computation costs, we focus on applying KILM to BART (Lewis et al., 2020a).

Preliminaries
Wikipedia is a widely-used text corpus for LM pre-training. It is often processed as a collection of individual articles in the form of flat natural language text. However, due to the hyperlinks in its text, Wikipedia is also a complex web of connected Wikipedia topics, also known as Wikipedia entities. These hyperlinks build connections between different Wikipedia entities and establish a rich source of information that is mostly ignored in current pre-training approaches. Moreover, most Wikipedia articles come with a short description of the entity (topic) discussed in the article. These short descriptions provide definitions for Wikipedia entities. In this work, we take an initial step towards using this additional information within Wikipedia articles by utilizing the "short descriptions" of entities for continued pre-training of PLMs. Note that the proposed approach could be extended to other annotated text corpora.

KILM: Knowledge Injection into Language Models
We propose KILM, which extends the text infilling objective to a knowledge infilling objective through continued pre-training. KILM, as shown in Figure 1, consists of three steps: (1) knowledge infilling, (2) knowledge masking, and (3) masked knowledge reconstruction.
Knowledge Infilling As mentioned in §3.1, in this work, we mainly focus on injecting hyperlinks and entity descriptions, as the entity-related knowledge, into PLMs. Specifically, we process Wikipedia data such that entity mentions in Wikipedia articles (which are annotated by hyperlinks) are marked with a start-of-entity token <ent> and an end-of-entity token </ent>. Also, each entity mention is followed by an entity-related knowledge sentence marked with <ent_desc> and </ent_desc> as start- and end-of-description tokens. The inserted knowledge component (highlighted in blue in Figure 1) consists of the corresponding hyperlinked entity (which might differ from the entity's surface form in the text) and the entity's short description, connected with the <sep> token, where the short description is obtained from a lookup table extracted from the Wikipedia dump. We denote this knowledge infilling transformation as KNINFILL.
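The KNINFILL transformation described above can be sketched as follows; this is a minimal illustration using the special tokens from the paper, and `kninfill` and its arguments are our own hypothetical names, not the released implementation:

```python
# Illustrative sketch of the KNINFILL transformation: mark one hyperlinked
# mention with <ent>...</ent> and append the knowledge component
# "<ent_desc> Entity<sep>Short description </ent_desc>" right after it.
ENT, ENT_END = "<ent>", "</ent>"
DESC, DESC_END = "<ent_desc>", "</ent_desc>"
SEP = "<sep>"

def kninfill(text: str, mention: str, entity: str, short_desc: str) -> str:
    """Annotate one mention and insert its entity-related knowledge sentence."""
    knowledge = f"{DESC} {entity}{SEP}{short_desc} {DESC_END}"
    marked = f"{ENT} {mention} {ENT_END} {knowledge}"
    return text.replace(mention, marked, 1)  # annotate first occurrence only

example = kninfill(
    "The Joker is a comic book series published by DC Comics.",
    mention="Joker",
    entity="Joker (character)",
    short_desc="Fictional character throughout the DC Universe",
)
```

The resulting sequence matches the structure shown in Figure 1, with the hyperlinked entity name and its short description joined by the <sep> token.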
Knowledge Masking The processed data is used for the continued pre-training of a PLM. During this step, we apply a knowledge masking transformation (denoted as KNMASK), and the model is trained to reconstruct the whole inserted knowledge component from a single <mask> token with respect to the context. More specifically, assuming the i-th token t_i is a mention of an entity, the masked input sequence X and the output sequence Y can be denoted as:

X = (t_1, ..., <ent> t_i </ent> <ent_desc> <mask> </ent_desc>, t_{i+1}, ..., t_n),
Y = (t_1, ..., <ent> t_i </ent> <ent_desc> k_1, ..., k_L </ent_desc>, t_{i+1}, ..., t_n),

where t_n represents the n-th token of the original target sequence and k_l represents the l-th token in the knowledge sequence of length L.
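A minimal sketch of KNMASK, assuming the knowledge component format produced by KNINFILL (`knmask` is our own hypothetical helper, not the released code):

```python
# Illustrative sketch of KNMASK: collapse the inserted knowledge component
# into a single <mask> token to form the model input X, while the
# reconstruction target Y keeps the full component.
import re

def knmask(infilled: str) -> tuple[str, str]:
    """Return (masked input X, reconstruction target Y)."""
    target = infilled
    masked = re.sub(r"<ent_desc>.*?</ent_desc>", "<mask>", infilled, count=1)
    return masked, target

x, y = knmask(
    "The <ent> Joker </ent> <ent_desc> Joker (character)<sep>"
    "Fictional character throughout the DC Universe </ent_desc> is a comic book series ."
)
# x keeps the entity markers but replaces the whole knowledge span with <mask>
```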

Masked Knowledge Reconstruction
The parameters θ of the PLM are optimized by a masked knowledge reconstruction loss:

L_recon(θ) = - Σ_{l=1}^{L} log P_θ(k_l | X, k_{<l}),

where k_l is the l-th token of the knowledge sequence of length L, as defined above. Since our goal is to inject entity-related knowledge without disrupting the function of the original BART as a general PLM, the masked knowledge reconstruction loss is combined with the original text infilling objective of BART during continued pre-training. 2 At training time, the model is optimized by minimizing the reconstruction loss over the whole target sequence instead of only the recovered masked spans. As a result, the training objectives force the model to learn to copy tokens from the input sequence when a token is not a mask token during the pre-training process. This helps the model recognize the inserted knowledge components in the training sequences and ensures the fluency of the PLM on NLG tasks. The weights of the different objectives in the loss are calculated based on the proportion of the corresponding spans across the entire sequence. We summarize the proposed KILM algorithm in Appendix B. The advantages of leveraging this structure for training are two-fold. First, this structure builds an alignment between the entity-related knowledge and the corresponding mention in the paragraphs. Second, the injected knowledge can be easily induced by probing the PLM with the structured prompts proposed for KILM (§4.2).
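The span-proportional weighting described above can be sketched numerically; this is a toy illustration of the weighting rule only, not the actual training code:

```python
# Toy sketch: weight each objective by its share of the target sequence,
# as described in the text ("proportion of the corresponding spans across
# the entire sequence"). All numbers below are made up for illustration.
def combined_loss(infill_loss: float, n_infill_tokens: int,
                  recon_loss: float, n_knowledge_tokens: int) -> float:
    """Combine text-infilling and knowledge-reconstruction losses."""
    total = n_infill_tokens + n_knowledge_tokens
    w_infill = n_infill_tokens / total
    w_recon = n_knowledge_tokens / total
    return w_infill * infill_loss + w_recon * recon_loss

loss = combined_loss(infill_loss=2.0, n_infill_tokens=480,
                     recon_loss=3.0, n_knowledge_tokens=20)
```

With a short knowledge span (20 of 500 tokens here), the reconstruction term receives a proportionally small weight, so the combined objective stays close to the original text infilling loss.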

Experiments
We start by exploring the performance of BART+KILM on knowledge-intensive tasks (§4.2). Later, we also demonstrate that KILM does not degrade the original language modeling skills of BART on both NLU and NLG benchmarks (§4.3).

2 The comparison between the text infilling and sentence permutation objectives shows the advantage of the former over the latter (Lewis et al., 2020a), so we only preserve the text infilling objective for KILM to simplify the continued pre-training task.

Pre-training Details
Data To extract the short descriptions and the hyperlinks from Wikipedia articles, we process a Wikipedia dump from scratch. We assign the first sentence of the Wikipedia page as the short description if the "short description" attribute is missing in the raw data. Our primary training corpus uses only the paragraphs from the summary sections of Wikipedia (denoted as the primary setting), while we also explore a data upscaling setting that uses the entire Wikipedia articles. We split the articles with document strides of 512 and consider one snippet as a data sample. We randomly select one entity from the paragraphs in each iteration for dynamic entity-centric knowledge injection. After data preprocessing, we obtain 5.70 million data samples for the primary setting and 7.85 million data samples for the data upscaling setting. We split the corpus into a training set and a validation set of around 10k samples for evaluation. In the following sections, KILM without a subscript indicates the default primary setting, while KILM under the data upscaling setting is denoted as KILM DU. For pre-training in the primary setting, the model is continually trained for 7,000 steps; for the data upscaling setting, the model is trained for 50,000 steps. Refer to Appendix C.1 for details.

Baselines Besides the original BART, we also report on another BART-base baseline that is continually pre-trained on a merged corpus of Wikipedia and short descriptions for 7,000 steps (the same number of steps as KILM) with only the text infilling objective. The short descriptions are converted to general text based on the format: "<Entity> is <Short Desc>". This model is denoted as BART-base+Merge. We demonstrate the input and output formats of pre-training in Table C6.
This baseline is introduced to separately evaluate the role of the distinct structure that is introduced in this work, as well as the additional training steps and data.
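The stride-based splitting of articles into data samples, described in the Data paragraph above, can be sketched as follows; the helper name and the exact non-overlapping windowing are our assumptions:

```python
# Illustrative sketch of splitting a tokenized article into training
# snippets with a document stride of 512, one snippet per data sample.
def split_with_stride(tokens: list[str], stride: int = 512) -> list[list[str]]:
    """Cut the token sequence into consecutive windows of `stride` tokens."""
    return [tokens[i:i + stride] for i in range(0, len(tokens), stride)]

# A 1,100-token article yields three snippets of 512, 512, and 76 tokens.
snippets = split_with_stride([f"tok{i}" for i in range(1100)], stride=512)
```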

Knowledge-Intensive Tasks
First, we study the effectiveness of KILM on knowledge-intensive tasks (Petroni et al., 2019; Roberts et al., 2020; Petroni et al., 2021). As shown in Table 1, we evaluate BART+KILM on entity disambiguation and appositive generation tasks, which have objectives similar to the continued pre-training of KILM. We also evaluate whether KILM can contribute to downstream tasks where the pre-training objective of KILM is not fully aligned with those of the downstream tasks. Specifically, we include question answering (QA) and knowledge-grounded response generation (KGRG) tasks.
Zero-shot Entity Disambiguation The entity disambiguation task requires the model to link a mention q to the correct entity, given a context D and several candidate entities. Without fine-tuning, we evaluate BART+KILM by picking, among the candidate entities {E_i}_{i=1}^{N}, the candidate with the lowest perplexity of generating its short description S_i using structured prompts. 6 It can be expressed as:

Ê = argmin_{i ∈ {1, ..., N}} PPL_θ(S_i | D, q, E_i).

We use the same datasets and candidate sets as those in Le and Titov (2018). For a subset of the knowledge-intensive tasks, we also report results under the data upscaling setting. InKB micro-F1 results are shown in Table 2, where CM3, a series of huge PLMs trained with multimodal hypertext (see §2), is tested in a zero-shot setting. We also include the performance of BART and BART-base+Merge for reference. 7 BART+KILM outperforms CM3-large, which has over 30x more parameters, on half of the datasets, and BART+KILM DU outperforms CM3-large on four out of six datasets. CM3 as a PLM has impressive performance on the entity disambiguation task with no additional training, and this comparison shows that BART+KILM can outperform CM3 with far fewer parameters. We also present results comparing BART+KILM with BLINK in Table C1, where we see that it performs competitively with BLINK (which is fine-tuned for entity disambiguation). Moreover, the large gap between the performance of BART+KILM and BART+Merge shows that the proposed distinct structure (and not merely the data) plays a key role in the performance of BART+KILM on this task.

6 Note that the reference entities in this task come from Wikipedia, hence we can use the associated entity description for each reference entity.
7 More details are included in Appendix C.3.
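The perplexity-based selection rule above can be sketched as follows; `description_ppl` and `disambiguate` are hypothetical stand-ins for scoring candidate descriptions with a real PLM, and the log-probabilities below are toy values, not model outputs:

```python
# Illustrative sketch of zero-shot entity disambiguation: score each
# candidate entity by the perplexity of its short description under the
# structured prompt, then pick the candidate with the lowest perplexity.
import math

def description_ppl(token_logprobs: list[float]) -> float:
    """Perplexity from per-token log-probabilities of a description."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def disambiguate(candidates: dict[str, list[float]]) -> str:
    """candidates: entity name -> log-probs of its description given context."""
    return min(candidates, key=lambda e: description_ppl(candidates[e]))

pred = disambiguate({
    "Titanic (1997 film)": [-0.2, -0.3, -0.1],  # toy scores, not model output
    "RMS Titanic":         [-1.5, -2.0, -1.8],
})
# the candidate whose description is most probable (lowest perplexity) wins
```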
Appositive Generation Appositive generation is the task of adding background information for named entities in a sentence in the form of an appositive phrase. As shown in Table 1, we construct structured prompts to probe PLMs without fine-tuning on ApposCorpus (Kementchedjhieva et al., 2020). We consider the generated text recovered from the mask tokens in the short description field as the generated appositive. 8

8 Since the pre-training corpus of BART includes Wikipedia articles, BART can also recover appositives from mask tokens without further task adaptation.

Table 3: Human evaluation results on Appositive Generation in the News and Wikipedia domains on org- and person-type entities (see Appendix C.7). Ap., Pref., and NH. stand for Is Appositive, Preference, and Not Hallucinated. Numbers in bold are significantly better than those from BART at a p-value of 0.05 in a pairwise t-test.
Since automatic metrics only assess text-overlap-based performance (see Table C3 in Appendix C.4 for comparisons with SOTA), we conduct a human evaluation for a more comprehensive assessment along three aspects: Is Appositive (Ap.), Preference (Pref.), and Not Hallucinated (NH.). Ap. evaluates whether the generation is an appositive or not, while Pref. evaluates the suitability of the generated appositive to the context. NH. evaluates whether the model generates a hallucinated appositive or not, verifying whether the generated appositive is factually correct. Pairwise A/B testing is used to compare the performance of BART before and after KILM (in the primary setting) on all four subsets of ApposCorpus. For each comparison, the same context and two options generated by the models under comparison are randomly shuffled and then shown to the annotators. Each comparison requires three judgments, and 50 data samples are randomly selected from each subset. More details of the human evaluation are included in Appendix C.7. Table 3 lists the human evaluation results in terms of the winning rate (ties are counted as wins for both), where we observe that BART+KILM generates better appositives and hallucinates less on all four subsets. These results indicate that BART+KILM possesses more entity-related knowledge than BART.
In-Context Few-Shot QA The implicit knowledge embedded in the parameters can support large PLMs in obtaining competitive results on open-domain QA tasks without accessing external knowledge (Roberts et al., 2020; Radford et al., 2019; Brown et al., 2020). We conduct in-context few-shot experiments in the primary setting of KILM, using the structured prompts shown in Table 1, where the example QA pairs are retrieved with a TF-IDF retriever from the corresponding training set. The tokens recovered from the mask tokens by the decoder are considered the generated answers. We illustrate the learning trends with different numbers of "shots" in Figure 2 on all three datasets. Interestingly, BART+KILM mostly performs worse than the original BART under the zero-shot setting. However, appending demonstrations to the contexts enables BART+KILM to outperform the original BART by a large margin. With the data upscaling setting, KILM DU provides comparable (or even larger) improvements over BART under the few-shot setting while slightly improving BART's zero-shot performance. Though far from perfect, these results suggest that KILM significantly improves the in-context learning ability of BART on all three QA datasets. KILM also enables BART to pack factoid knowledge more effectively within its parameters, which supports QA. BART-base+KILM outperforms BART-large under the in-context few-shot setting on the NQ and WQ datasets. The baseline model, BART+Merge, shows a similar trend to BART+KILM with little advantage on the NQ and WQ datasets, indicating that pre-training with data in the "<Entity> is <Short Desc>" format can also be of some help on QA.

Knowledge-Grounded Response Generation We also evaluate on knowledge-grounded response generation using the Wizard of Wikipedia (WoW) dataset, comparing with SKT (Kim et al., 2019). Note that the performance of BART+Merge shows no difference from BART+KILM, which suggests that the introduced distinct structure does not affect BART's application of injected knowledge on WoW.
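The demonstration-retrieval step of the in-context few-shot QA setup described above can be sketched as follows; the toy TF-IDF scorer and the prompt wording are illustrative assumptions, with the actual prompt format following Table 1:

```python
# Illustrative sketch of assembling an in-context few-shot QA prompt: score
# training questions against the test question with a toy TF-IDF measure,
# keep the top-k as demonstrations, and append the masked test question.
import math
from collections import Counter

def tfidf_score(query: str, doc: str, docs: list[str]) -> float:
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    n = len(docs)
    score = 0.0
    for term in q:
        df = sum(1 for x in docs if term in x.lower().split())
        if df:
            score += q[term] * d[term] * math.log(n / df + 1.0)
    return score

def build_prompt(question: str, train_pairs: list[tuple[str, str]], shots: int) -> str:
    docs = [q for q, _ in train_pairs]
    ranked = sorted(train_pairs, key=lambda p: tfidf_score(question, p[0], docs),
                    reverse=True)[:shots]
    demos = "".join(f"Question: {q} Answer: {a} " for q, a in ranked)
    return f"{demos}Question: {question} Answer: <mask>"

prompt = build_prompt("Who wrote Hamlet?",
                      [("Who wrote Macbeth?", "Shakespeare"),
                       ("What is the capital of France?", "Paris")],
                      shots=1)
```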
While automatic metrics are important in KGRG evaluation, they do not always tell the whole story (Hazarika et al., 2022); therefore, we also conduct a human evaluation on the WoW test sets along three aspects, namely Fluency (Flu.), Informativeness (Info.), and Not Hallucinated (NH.). Flu. focuses on whether the responses are fluent and consistent with the conversation so far, while Info. evaluates whether the responses contain verifiable factual information. The evaluation of NH. is only valid when a response is informative. The settings of the human evaluation are the same as those for appositive generation (see Appendix C.7). The results in Table 5 demonstrate that BART+KILM performs comparably with BART in terms of fluency and informativeness, while it tends to hallucinate less when generating factual information in the responses, especially in unseen domains.

General Tasks
We now evaluate the impact of KILM on models' performance on general NLU and NLG tasks using the GLUE benchmark (Wang et al., 2018) and the summarization datasets CNN/DailyMail (Hermann et al., 2015) and XSUM (Narayan et al., 2018), fine-tuning both BART and BART+KILM for comparison. A summary of the results is shown in Table 6, with detailed results in Table C4 and Table C5. BART+KILM outperforms BART marginally on GLUE, and the differences on the summarization datasets are small. These results suggest that KILM preserves the performance of the original BART on downstream NLU and NLG tasks, and in some cases even improves it. They also verify that KILM does not cause catastrophic forgetting of the original capabilities of BART, making BART+KILM a reliable PLM.

Discussions
Roles of Introduced Special Tokens The special tokens marking the beginning and end of entities (<ent>, </ent>) and entity descriptions (<ent_desc>, </ent_desc>) form a distinct structure in pre-training samples, which inserts entity-centric knowledge into the pre-training corpora and thus injects knowledge into PLMs. We discuss the roles of these special tokens from the following aspects. Entity Knowledge Probing: This distinct structure provides a tool for probing the entity-related knowledge retained in PLMs. To demonstrate this, we probe BART+KILM by prompting it to generate short descriptions for entities in the validation set of the pre-training corpus. The probing format and the corresponding results are shown in Appendix A.1 and Table A1. BART+KILM achieves around 60 unigram F1 with no performance gap relative to data samples from a subset of the training set. These results indicate that we can easily recall the entity description knowledge in different contexts without sensitivity to prompt design. The proposed pre-training structure is the main contributor to the improvements on entity-related datasets, especially in the zero-shot setting. By leveraging the introduced special tokens, the knowledge retained in PLMs can be used more efficiently on downstream tasks.
Structured Prompt: The special tokens also provide convenient knowledge probing for zero-shot entity-centric tasks, such as entity disambiguation and appositive generation ( §4.2).
Are New Special Tokens Needed? There are a few reasons for introducing new special tokens in KILM to mark entities and their descriptions instead of reusing existing tokens, such as commas or parentheses. First, many entities have commas and parentheses in their names, making the entity descriptions indistinguishable from the contexts. For instance, there are 378,093 entities in English Wikipedia with a comma in their names, such as the entity "Mars, Aurgazinsky District, Republic of Bashkortostan". Second, using commas or parentheses could break the fluency of the text. In a context like "The Baltic states [...] is used to group three countries: Estonia, Latvia, and Lithuania", adding a short description for the entity "Estonia" using a comma would break the fluency of the sentence. Finally, using commas or parentheses would overload their meanings; when prompting the model for knowledge probing, the model would lack clarity as to how a comma or parenthesis should be interpreted.
Is KILM's impact equal on different domains and tasks? Despite the above-mentioned gains, BART+KILM appears to be less knowledgeable than BART about person-type entities, as manifested in the performance gap between organization- and person-type entities in appositive generation (Table 3). That may be due to the type of knowledge content injected by KILM. The entity knowledge required for generating appositives varies vastly, from biographies to relationships with other people. However, short descriptions in Wikipedia for person-type entities focus mostly on nationality and occupation, and many of them are similar. This problem also affects the performance on the G-RE datasets in the LAMA benchmark (Table A2). More analyses are in Appendix A.2. We leave the study of enriching the knowledge content for pre-training as future work.
The proposed pre-training structure shows its strength on entity-related tasks. Nevertheless, KILM may degrade to conventional knowledge-augmented pre-training (BART+Merge) when the pre-training objective of KILM is not fully aligned with those of the downstream tasks.

Placement of Knowledge Component
An ablation study on the knowledge component placement in KILM is presented in Appendix A.3, where we show that putting short descriptions right after entity mentions results in better performance compared to placing them at the end of sentences.

Extending KILM for Other PLM Architectures
In this paper, we choose BART as the default PLM; however, KILM can also be applied to other PLMs by adjusting their training objectives for knowledge infilling. For decoder-only PLMs, such as GPT-2, the knowledge component, i.e., the short description, can be moved to the end of the target sequence (similar to CM3) instead of being adjoined to the surface form of the entity. As for encoder-only PLMs, such as BERT, the contrastive training strategy introduced in LinkBERT (Yasunaga et al., 2022) is one option for the training objective of KILM. Due to the substantial computational cost of training these models, we leave these explorations for future work.
Justification of the additional cost during pre-training Injecting additional knowledge text into pre-training corpora may introduce additional costs during the pre-training process. However, the entity descriptions used in this paper are usually a one-sentence definition of an entity, with an average length of 13.81 words. Considering that we split the Wikipedia articles with document strides of 512, the tokens inserted for short descriptions take up only about 2.6% of the length of the whole sequence, which does not add much training cost.
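The 2.6% figure follows directly from the numbers above:

```python
# Back-of-the-envelope check of the stated overhead: a 13.81-word average
# description inserted into a 512-token stride adds only a small fraction
# of the final sequence length.
avg_desc_len = 13.81
stride = 512
overhead = avg_desc_len / (stride + avg_desc_len)
# overhead is roughly 0.026, i.e. the ~2.6% reported in the text
```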

Conclusion
In this paper, we propose a novel method, KILM, to inject entity-related knowledge into large PLMs through continued pre-training. Our approach enhances the performance of the original PLMs on knowledge-intensive tasks, especially in zero- and few-shot settings, while not causing catastrophic forgetting of the knowledge in the original PLMs. The proposed distinct structure for entity knowledge shows its effectiveness in flexibly probing the injected knowledge in different contexts.

Limitations
In this paper, we propose a continued pre-training method to inject knowledge into large pre-trained language models. Eight V100 GPUs are involved in each pre-training experiment, and the whole pre-training process takes 5 days for the base-size model and 13 days for the large-size model in the primary setting. These numbers are significantly greater in the data upscaling setting (30 days for the large-size model). Despite its advantage of reducing resource needs at inference time, KILM is both time-consuming and resource-intensive at training time. Similar to any model-based generation system, KILM could be prone to generating factually incorrect statements about entities. These statements might also be biased with regard to ethnicity, race, and sexual orientation.

A.1 Entity Description Probing
We analyze the quality of the knowledge injection process by evaluating the model's performance on entity description probing with structured prompts. This task is aligned with our proposed pre-training objective and reflects the effect of the continued pre-training. It can be considered a plug-and-play process for knowledge induction, achieved by simply inserting the proposed distinct structure. We conduct the evaluation on the validation set and on a subset of the training set with around 10k data samples from our pre-training corpus. The data samples in the training subset are randomly selected, whereas the data samples in the validation set are not included in the training process. More specifically, the entities in the validation set may appear in the training set, but the contexts of the entities in the paragraphs do not. In the structured prompts for entity description probing, the encoder input and the decoder prompt are the same until the <ent_desc> token (marked with underline). Similar to decoder-only models, the model is expected to continue generating the entity description following the prompt, until the </ent_desc> token is generated. The generated entity descriptions are evaluated with exact match (EM) and unigram F1 scores. As shown in Table A1, for KILM in the primary setting, BART models with KILM achieve around 40 EM and 60 F1 scores. Interestingly, there is only a marginal performance gap between the seen and unseen sets. The results indicate that our model not only embeds the knowledge within its parameters, but can also recall the injected knowledge in unseen contexts without much performance loss.

A.2 LAMA Knowledge Probing
Petroni et al. (2019) proposed the LAMA benchmark to provide an in-depth study of relational knowledge in PLMs by probing the answers to "fill-in-the-blank" cloze statements. Different types of relational knowledge are evaluated with statements semi-manually constructed from different knowledge sources, including Google-RE (G-RE), T-REx (Elsahar et al., 2018), ConceptNet (C-Net) (Speer et al., 2012), and SQuAD (Rajpurkar et al., 2016). We follow the original LAMA settings, while only keeping the data samples whose answer length is 1 after tokenization. The probing input and output format for BART and BART+KILM is as follows:

Input/Prompt: The Teatr Wielki is a <MASK> .
Target: theatre

Similar to the entity description probing in Appendix A.1, "Input" and "Prompt" (with underline) are inputs to the BART encoder and decoder, respectively. A generation is considered correct only if it is exactly the same as the "Target". We present the probing results in Table A2, where we also include the results of BERT and of Rosset et al. (2020) for reference. However, because of differences in tokenization and pre-training, different PLMs are not directly comparable on the LAMA benchmark (Jiang et al., 2020). Even though KILM does not inject relational knowledge into PLMs, we still observe improvements after KILM on all the datasets except G-RE. As discussed in §5, the injected knowledge about person-type entities is not aligned with the knowledge required by G-RE, since the samples from G-RE focus on the date_of_birth and place_of_birth relations in the person domain. Under the data upscaling setting, KILM DU further enhances the relational knowledge required for SQuAD, while LAMA performance is negatively impacted for the other datasets. The results indicate that injecting entity description knowledge also helps models better understand the relationships between specific entities.
Moreover, the results of KILM DU suggest that the injected knowledge is more closely related to the knowledge required for SQuAD, and further from that required by G-RE and T-REx.
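The single-token filtering and exact-match scoring described above can be sketched as follows; `tokenize` stands in for the model's actual subword tokenizer, and the function names are ours, not the paper's.

```python
def filter_single_token(samples, tokenize):
    """Keep only LAMA samples whose answer is a single token under the
    given tokenizer (a stand-in here; KILM uses the BART tokenizer)."""
    return [s for s in samples if len(tokenize(s["answer"])) == 1]

def lama_accuracy(predictions, targets):
    """A generation counts as correct only if it exactly matches the target."""
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)
```

With the real BART tokenizer, `tokenize` would be something like `tokenizer.tokenize`, which splits multi-word or rare answers into several subwords and thus filters them out.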

A.3 Ablation Study
We conduct an ablation study on the position of the knowledge component in KILM. We compare our method with a KILM variant that moves the knowledge component (highlighted in blue in Figure 1), including <ent_desc> and </ent_desc>, to the end of the target sequence. The variant of the target sequence in Figure 1 is as follows:

The Joker is a comic book series published by DC Comics starring the supervillain the <ent> Joker </ent> . It ran for nine ... </s></s> <ent_desc> Joker (character)<sep>Fictional character throughout the DC Universe </ent_desc>

We denote this KILM variant as KILM End. We evaluate the two models on entity description probing and zero-shot entity disambiguation tasks. As shown in Table A1 and Table C1, BART with KILM consistently outperforms BART with KILM End on both tasks. Despite the performance gap, KILM End has the advantage that it can also be applied to decoder-only models, such as GPT-2, for entity knowledge injection.
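A minimal sketch of how the two target-sequence variants could be constructed from a document, an entity mention, and its description; the helper names and exact whitespace are illustrative assumptions, not the paper's implementation.

```python
ENT, ENT_END = "<ent>", "</ent>"
DESC, DESC_END = "<ent_desc>", "</ent_desc>"
SEP, EOS = "<sep>", "</s>"

def knowledge_component(name: str, desc: str) -> str:
    """The injected knowledge span: entity name, separator, short description."""
    return f"{DESC} {name}{SEP}{desc} {DESC_END}"

def inline_target(before, mention, after, name, desc):
    """KILM: knowledge component placed right after the entity mention."""
    kn = knowledge_component(name, desc)
    return f"{before} {ENT} {mention} {ENT_END} {kn} {after}"

def end_target(before, mention, after, name, desc):
    """KILM_End variant: knowledge component appended after the document."""
    kn = knowledge_component(name, desc)
    return f"{before} {ENT} {mention} {ENT_END} {after} {EOS}{EOS} {kn}"
```

The only difference between the two variants is where the `<ent_desc> ... </ent_desc>` span lands, which is exactly what the ablation isolates.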

A.4 Data Scaling Laws
As mentioned in §4.1, we conduct continued pre-training under two settings: the primary setting and the data upscaling setting. While the primary setting only uses the paragraphs in Wikipedia summary sections, the data upscaling setting extends the training corpus to the whole Wikipedia corpus, which enlarges the training set by more than two million data samples and doubles the pre-training time. To study the effect of data scaling, we compare the performance of BART-large+KILM under the primary and data upscaling settings on knowledge-intensive tasks, including entity disambiguation, LAMA, and closed-book QA tasks. The evaluation on entity disambiguation involves six datasets, and we only compare the average InKB F1 scores, since during data scaling the performance is consistently improved across all the datasets.
In Figure A1, we show the performance difference between BART-large+KILM (or KILM DU) and the corresponding baseline models on entity disambiguation, LAMA (in the first row), and QA (including three datasets under 0/5-shot, in the second row) tasks. We also display the performance differences along with each bar, where a positive number denotes better performance of BART+KILM.

Figure A1: The performance difference between BART-large+KILM (or KILM DU) and the corresponding baseline models on entity disambiguation, LAMA and QA (TriviaQA, NQ, and WB) tasks. More specifically, the baseline models of the entity disambiguation tasks are CM3-large and BLINK with GENRE and BLINK candidates, while the baseline model of both the LAMA and QA tasks is the original BART-large. We also display the performance differences along with each bar, where a positive number denotes better performance of BART+KILM vs. the baseline.

According to the comparison, KILM in both settings shows little benefit for the Google-RE and T-REx datasets in the LAMA benchmark and makes it harder for the model to recall the relational knowledge in those specific domains. On the other hand, for entity-based tasks such as entity disambiguation, the knowledge injected through KILM equips BART with strong zero-shot ability compared to the strong baseline models, as discussed in §4.2. For QA tasks, BART+KILM in the primary setting performs worse than the original BART model in a zero-shot manner; however, BART+KILM in the data upscaling setting works comparably with the original BART in this case. Putting all these comparisons together, we conclude that KILM, as a novel technique for entity-related knowledge injection, largely benefits the model in terms of zero-shot ability on entity-based knowledge-intensive tasks. However, even though we jointly pre-train the model with the original text infilling objective of BART, catastrophic forgetting of some specific knowledge is unavoidable, especially in the data upscaling setting.

A.5 Case Study
Selected data samples from ApposCorpus and WoW are shown in Table A3 and Table A4. For the zero-shot appositive generation task, while the original BART-base model tends to generate appositives with surface forms similar to the gold ones, or a piece of text that fits the context, it hallucinates heavily. BART-base+KILM is more knowledgeable about the actual meaning of the entities; however, it still makes mistakes regarding dates and specific occupations. For the KGRG task with task-specific training, both models are able to generate fluent responses. At the same time, BART+KILM tends to hallucinate less by including slightly less information in some cases.

B KILM Algorithm
We denote the data transformations of the text infilling and sentence permutation objectives for BART as TEXTMASK and SENTPERM. In the original pre-training process of BART, given a target sequence with M tokens Y = {t_1, t_2, ..., t_M} and the corresponding corrupted input sequence X = {t'_1, t'_2, ..., t'_N} with N tokens, the model, parameterized by θ, is optimized by minimizing the reconstruction loss over the whole sequence Y:

L_rec = − Σ_{i=1}^{M} log P_θ(t_i | X, t_{<i}). (4)

For the proposed KILM continued pre-training, the original document, the selected entity, and the corresponding injected knowledge are represented as S = {t_1, t_2, ..., t_N}, E, and K = {k_1, k_2, ..., k_L}, respectively. The data transformation procedure can be represented as

Y = KNINFILL(S, E, K), (5)
X = KNMASK(Y). (6)

The final loss can be denoted as

L = (1 − α − β) L_copy + α L_infill + β L_kn, (7)

where α and β are calculated based on the proportion of the corresponding spans across the entire sequence. The resulting KILM algorithm for continual pre-training is summarized in Algorithm 1.
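A PyTorch sketch of the span-weighted loss in Eq. (7), assuming per-token boolean masks that flag infilled-span and knowledge-component tokens in the target; this is our reading of the objective, not the authors' code.

```python
import torch
import torch.nn.functional as F

def kilm_loss(logits, labels, infill_mask, kn_mask):
    """Span-weighted KILM loss (Eq. 7). `infill_mask` and `kn_mask` flag
    target tokens belonging to infilled spans and to the injected knowledge
    component; all remaining tokens are copied tokens. alpha and beta are
    the span proportions over the target sequence."""
    per_tok = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    ).view_as(labels).float()
    copy_mask = ~(infill_mask | kn_mask)
    alpha = infill_mask.float().mean()
    beta = kn_mask.float().mean()

    def span_mean(mask):
        return per_tok[mask].mean() if mask.any() else per_tok.new_zeros(())

    return ((1 - alpha - beta) * span_mean(copy_mask)
            + alpha * span_mean(infill_mask)
            + beta * span_mean(kn_mask))
```

When every span carries the same per-token loss, the weighting reduces to the plain sequence-level cross-entropy, since the coefficients sum to 1.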

C Additional Details for Experiments C.1 Pre-training Settings
We initialize the model with the original BART weights and continually pre-train it on eight V100 GPUs with a batch size of 8,192. The models are optimized with the Adam optimizer using a linear scheduler and a weight decay of 0.01. The peak learning rate is 5e-5. Moreover, the maximum length of sequences with a knowledge component is set to 640. The mask probability and the hyper-parameter λ of the Poisson distribution are the same as those of BART. The implementation is mainly based on the HuggingFace Transformers (Wolf et al., 2020) and Datasets (Lhoest et al., 2021) packages.
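The reported optimizer settings could be wired up roughly as follows; we use `AdamW` since weight decay is applied, and the warmup length is an assumption the paper does not specify.

```python
import torch

def build_optimizer(model, total_steps, warmup_steps=500,
                    peak_lr=5e-5, weight_decay=0.01):
    """Optimizer/scheduler sketch matching the reported settings (Adam-style
    optimizer, linear schedule, weight decay 0.01, peak LR 5e-5). The
    warmup length is a hypothetical default."""
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                            weight_decay=weight_decay)

    def linear(step):
        # Linear warmup to the peak LR, then linear decay to zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, linear)
    return opt, sched
```

In a training loop, `sched.step()` would be called once per optimizer step so the learning rate follows the linear schedule.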
It is worth mentioning that more than 2.3 million entities with short descriptions are involved in the pre-training, and the occurrence of entities in Wikipedia articles is far from uniformly distributed. For instance, while only 2,526 entities appear more than 1,000 times in the primary setting, 40.5% of the entities appear only once in the training corpus.

C.2 Pre-training Format
We use a piece of a Wikipedia article to demonstrate the input and output formats of the pre-trained models involved, shown in Table C6.

C.3 Zero-shot Entity Disambiguation
As shown in §4.2, we include the performance of BART and BART+Merge for reference. Since there is no conventional method for evaluating BART models on zero-shot entity disambiguation tasks, we take inspiration from the entity disambiguation model BLINK: we evaluate BART and BART+Merge by selecting the candidate whose corresponding Wikipedia summary/short description has the lowest perplexity when generated from the given context. In addition, we use the same datasets and candidate sets as BLINK for further experiments. The InKB micro-F1 results are shown in Table C1, where BLINK is an entity linking model trained on the TACKBP-2010 dataset. BLINK outperforms BART+KILM in the primary setting on all but one of the datasets, but BART+KILM DU in the data upscaling setting largely closes the performance gap with BLINK. It should be noted that BART+KILM is a general PLM, while BLINK is not.

Entity Frequency in Pre-training Data To study how the frequency with which entities appear in the pre-training text affects entity linking performance, Figure C1 also shows the results on data samples with different minimum frequencies of sampling the target entity during KILM pre-training in the primary setting. As the minimum frequency increases, the gap between BART+KILM and BLINK shrinks.
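The lowest-perplexity candidate selection could be sketched as below; `nll_fn` is a placeholder for the BART scoring call (average per-token negative log-likelihood of the description given the context), injected as a dependency so the selection logic stands alone.

```python
import math

def disambiguate(context, candidates, nll_fn):
    """Zero-shot entity disambiguation sketch: pick the candidate whose
    Wikipedia summary/short description has the lowest perplexity when
    generated from the context. `nll_fn(context, text)` is assumed to
    return the model's average per-token negative log-likelihood."""
    best, best_ppl = None, float("inf")
    for cand in candidates:
        ppl = math.exp(nll_fn(context, cand["description"]))
        if ppl < best_ppl:
            best, best_ppl = cand, ppl
    return best
```

Because exp is monotonic, ranking by perplexity is equivalent to ranking by average negative log-likelihood; the explicit exponentiation just mirrors the perplexity framing in the text.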

C.4 Appositive Generation
We conduct zero-shot probing on ApposCorpus (Kementchedjhieva et al., 2020). We display the structured prompts of BART with KILM in Table 3. The distinction between the human evaluation results and the automatic metrics demonstrates that the latter do not capture important dimensions such as hallucination.

C.5 In-Context Few-Shot QA
In Table C2, we list the QA results when one example QA pair is provided in the input (1-shot) to BART models with and without KILM. Aligned with the QA example in Table 1, the general evaluation format is as follows:

Question: Example Q Answer: Example A\n
Question: Test Q Answer: <mask> .
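The prompt format above could be assembled as follows for an arbitrary number of shots; the exact spacing and newline placement are assumptions based on the example shown.

```python
def build_fewshot_input(examples, test_q):
    """Build a k-shot QA input: each (question, answer) example on its own
    line, followed by the test question with a <mask> answer slot."""
    shots = "".join(f"Question: {q} Answer: {a}\n" for q, a in examples)
    return f"{shots}Question: {test_q} Answer: <mask> ."
```

With an empty example list this degenerates to the zero-shot format, so the same helper covers both evaluation modes.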
Besides BART, we also compare our performance with KALM (Rosset et al., 2020) under an 8-shot setting, for which the eight examples are human-written, and with two fine-tuned models of similar size. Despite the performance gap with the fine-tuned models, BART+KILM shows a significant advantage over the original model and KALM on all the datasets, especially for the large-size models. The 1-shot results of BART-base+KILM are even higher than those of KALM-large, which has many more trainable parameters.

C.6 Fine-tuning Experiments
For the fine-tuning experiments, including the GLUE, summarization, and KGRG tasks, we run each experiment with random seeds 0, 42, and 852. The numbers reported in Table 6, Table C4, Table C5, and Table 4 are the averages of the results over the three random seeds. The results of BART are re-run with the original settings, except that the maximum sequence length is set to 1,024 for the summarization tasks. Pairwise t-tests are conducted to verify the significance of the results of BART+KILM over the baseline model.

C.7 Human Evaluation
For both the appositive generation and KGRG tasks, we conduct a human evaluation for a comprehensive study. Pairwise A/B testing is utilized to compare the performance of BART before and after KILM (in the primary setting) on both ApposCorpus and WoW. For each comparison, the same context and the two generations from the compared models are first randomly shuffled and then shown to the annotators. Both tasks evaluate whether the generations are hallucinated, named Not Hallucinated (NH.). We also include two more factors for each task: for ApposCorpus, we additionally evaluate the generated appositives on Is Appositive (Ap.) and Preference (Pref.), while for WoW we evaluate Fluency (Flu.) and Informativeness (Info.). Because of the nature of the dialogue task, for WoW we only consider the NH. factor when the generated response is informative. The human evaluation is carried out by a group of experts fluent in English from countries across Asia. The annotators choose among "generation A", "generation B", "both", and "neither". For the NH. factor in particular, the annotators are asked to search the Internet to validate hallucinations. Each comparison requires three judgments. We randomly sample 50 data samples from each subset of ApposCorpus and 100 data samples from each WoW test set. In total, 600 annotations are collected for the two tasks.
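The shuffling and three-judgment aggregation described above can be sketched as below; the majority-vote rule and the model labels are our illustrative assumptions, since the paper does not describe the aggregation mechanics.

```python
import random
from collections import Counter

def present_pair(gen_kilm, gen_base, rng):
    """Randomly shuffle the two generations before showing them to an
    annotator, keeping a mapping back to the originating model."""
    pair = [("KILM", gen_kilm), ("baseline", gen_base)]
    rng.shuffle(pair)
    return pair

def aggregate(judgments):
    """Majority vote over the three judgments for one comparison; each
    judgment is 'A', 'B', 'both', or 'neither'."""
    winner, n = Counter(judgments).most_common(1)[0]
    return winner if n >= 2 else "no consensus"
```

Seeding the `random.Random` instance per comparison would make the shuffling reproducible across annotation rounds.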

D Datasets
A number of datasets for downstream task evaluation are involved in this work.

GLUE Benchmark The GLUE benchmark is a collection of text classification datasets that is widely used to evaluate the language modeling ability of large PLMs. Nine datasets are involved in this benchmark, including binary QA and NLI tasks.
In this paper, we exclude the WNLI (Morgenstern and Ortiz, 2015) task during evaluation because there are label conflicts in the dataset (see https://gluebenchmark.com/faq).

Summarization Datasets Text summarization is considered an essential NLG task that requires the model to generate short summaries of long texts. In this paper, we test our models on two summarization datasets, CNN/DailyMail and XSUM. Summaries in CNN/DailyMail tend to be more extractive, whereas XSUM contains highly abstractive summaries.

Entity Disambiguation Datasets
The entity disambiguation task is a subtask of entity linking. Given an entity mention in context, the model is expected to select the correct entity among a set of similar candidates. Following BLINK and GENRE (De Cao et al., 2020), we test our models on six entity disambiguation datasets: the AIDA-CoNLL dataset (Hoffart et al., 2011), MSNBC, AQUAINT, ACE2004, WNED-CWEB (CWEB) (Gabrilovich et al., 2013), and WNED-WIKI (WIKI) (Guo and Barbosa, 2018). We use the candidate sets from BLINK and GENRE, respectively, where those of GENRE are originally from Le and Titov (2018).
ApposCorpus Appositives are phrases that appear next to a named entity to provide background information (Bauer, 2017; Kang et al., 2019). They help readers understand the semantics of the named entities in context. ApposCorpus (Kementchedjhieva et al., 2020) is the first end-to-end dataset for the appositive generation task. The selected entities are Person and Organization entities from Wikipedia (Wiki) and News articles. Three types of appositives are included: constrained, empty, and a third type denoted as non-empty in this paper. Constrained appositive samples leverage WikiData for appositive generation, empty appositive samples do not require the model to generate any appositive, and non-empty samples require more general knowledge for the appositive generation. In this paper, since we do not conduct task-related training, we only evaluate our models on the constrained and non-empty appositive samples.

Open-domain Question Answering Datasets
We evaluate our models on the open-domain QA datasets used in the main experiments, namely TriviaQA, NQ, and WB.

Wizard of Wikipedia (WoW) Dataset WoW is a common crowd-sourced KGRG dataset that relies on Wikipedia knowledge to augment dialogue responses when discussing various topics. During data collection, two speakers are provided with an initial topic to start the conversation. Two test sets, a seen test set and an unseen test set, are split off for evaluation: the initial topics of the dialogue samples in the seen test set appear in the training set, whereas those in the unseen test set do not.