LM-CORE: Language Models with Contextually Relevant External Knowledge

Large transformer-based pre-trained language models have achieved impressive performance on a variety of knowledge-intensive tasks and can capture factual knowledge in their parameters. We argue that storing large amounts of knowledge in the model parameters is sub-optimal given the ever-growing amounts of knowledge and resource requirements. We posit that a more efficient alternative is to provide explicit access to contextually relevant structured knowledge to the model and train it to use that knowledge. We present LM-CORE, a general framework to achieve this, which allows decoupling of the language model training from the external knowledge source and allows the latter to be updated without affecting the already trained model. Experimental results show that LM-CORE, having access to external knowledge, achieves significant and robust outperformance over state-of-the-art knowledge-enhanced language models on knowledge probing tasks; can effectively handle knowledge updates; and performs well on two downstream tasks. We also present a thorough error analysis highlighting the successes and failures of LM-CORE.


Introduction
Large pre-trained language models (PLMs) (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020) have achieved state-of-the-art performance on a variety of NLP tasks. Much of this success can be attributed to the significant semantic and syntactic information captured in the contextual representations learned by PLMs. In addition to applications requiring linguistic knowledge, PLMs have also been useful for a variety of tasks involving factual knowledge, and it has been shown that models such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020) store significant world knowledge in their parameters (Petroni et al., 2019).
PLMs are typically fed a large amount of unstructured text, which leads to linguistic nuances and world knowledge being captured in the model parameters. This implicit storage of knowledge in the form of parameter weights not only leads to poor interpretability when analyzing model predictions but also constrains the amount of knowledge that can be stored. It is not practical to pack all the ever-evolving world knowledge into the language model parameters given the great financial and environmental costs incurred by training PLMs. Further, since PLMs acquire knowledge from the text corpora they are trained on, they tend to be sensitive to contextual and linguistic variations (Jiang et al., 2020). Moreover, PLMs do not contain explicit grounding to real-world entities, and hence often find it difficult to recall factual knowledge. For example, the model may not be able to recall the correct information and successfully complete the sentence "The birthplace of Barack Obama is [MASK]" if it has only seen this fact in a different context during training (e.g., "Barack Obama was born in Honolulu, Hawaii.").
Large-scale structured knowledge bases (KBs) such as YAGO (Suchanek et al., 2007) and Wikidata (Vrandečić and Krötzsch, 2014) offer a rich resource of high-quality structured knowledge that can provide PLMs with explicit grounding to real-world entities. Consequently, efforts have been made to integrate factual knowledge into PLMs and create entity-enhanced language models (Zhang et al., 2019; Sun et al., 2020; Liu et al., 2020; Wang et al., 2021a,b). However, these works either update the PLM parameters or modify the architecture to facilitate the storage of factual knowledge in the model layers and parameters, making it expensive to update knowledge.

Figure 1: Language Model Pre-Training with Contextually Relevant External Knowledge: (1) Using a sentence sampled from the pre-training corpus, an input (x) is created by selecting an entity mention at random from the potential mask candidates (underlined in red). (2) An NER tagger is then applied to the masked input sequence (x) to identify named entities (underlined in black). (3) For the identified entities, the Knowledge Retrieval module fetches the set T_x of all the triples from the Knowledge Base and then (4) scores all the retrieved triples using input-triple and input-relation similarity (details in Section 3.2). (5) The top-k triples are fed to the Language Model encoder along with the input sequence (x) and the model is trained to predict the masked token.
In this work, we step back and ask: what if, instead of focusing on storing the knowledge in the language model parameters, we provide the model with contextually relevant external knowledge and train it to use this knowledge? This approach offers several potential advantages: (i) we can utilize already available high-quality large-scale knowledge bases such as YAGO and Wikidata; (ii) not all the knowledge needs to be packed into the parameters of the model, resulting in lighter, smaller, and greener models; and (iii) as new knowledge becomes available, the knowledge base can be updated independently of the language model. Our Contributions: We present LM-CORE, a framework for augmenting language models with contextually relevant external knowledge. The LM-CORE framework is summarized in Figure 1 and consists of a contextual knowledge retriever that fetches relevant knowledge from an external KB and passes it to the language model along with the input text. The language model is then trained with a modified entity-oriented masked language modeling objective (Section 3). Our proposed solution is simple, yet highly effective. Experiments on benchmark knowledge probes show that the proposed approach leads to significant performance improvements over base language models as well as state-of-the-art knowledge-enhanced variants of the language models (Section 4). We find that with access to contextually relevant external knowledge, LM-CORE is less sensitive to contextual variations in input text. We also show how LM-CORE can handle knowledge updates without any re-training and compare the performance of LM-CORE on two knowledge-intensive downstream tasks. Finally, we present an in-depth analysis of cases where our proposed approach gives incorrect answers, paving the way for further research in this direction (Section 4.7).

Related Work
Augmenting Additional Knowledge in PLMs: Previous works on augmenting PLMs with additional knowledge can be grouped into two categories. One line of work adopts a retrieve-and-read framework where the model is trained to retrieve relevant information, followed by a reading comprehension step to perform the downstream task (Lee et al., 2019a; Guu et al., 2020; Agarwal et al., 2021). While our proposal has similarities with this line of work in terms of retrieving contextual knowledge, there are two major differences. First, most of these works consider external knowledge in the form of unstructured text (such as Wikipedia documents). However, extracting factual knowledge from unstructured text is hard and error-prone due to ambiguities in natural language and infrequent mentions of entities of interest. This issue can be alleviated by using a structured knowledge base where the knowledge is represented (mostly) unambiguously: each fact is a triple in the knowledge base. Further, these approaches employ explicit supervision during pre-training to train the model to fetch relevant passages from the text. This results in systems that are more complex and resource-hungry than the base PLMs used and also makes it difficult to reuse or adapt the models to different sources of knowledge.
The second body of work has focused on injecting factual knowledge directly into the model parameters by feeding more data to the model during pre-training (Poerner et al., 2020). A promising direction explored recently is utilizing structured knowledge bases to augment Transformer-based LMs. ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) are notable efforts in this direction, where entity information from knowledge bases is explicitly linked with the input text during pre-training, yielding entity-enhanced variants of BERT with entity representations integrated within the Transformer layers. An alternative way of training entity-aware language models is illustrated by frameworks such as CoLAKE (Sun et al., 2020) and KEPLER (Wang et al., 2021b) that jointly learn the language and knowledge representations, thereby producing language models augmented with factual knowledge and knowledge embeddings enhanced with textual context. However, these approaches, by design, lead to larger and larger models to store the ever-growing abundant knowledge. Further, due to the strong coupling between the knowledge and language signals, updating or adding knowledge requires re-training of the model.
Examining the knowledge contained in PLMs: Petroni et al. (2019) posit that while training over large amounts of input text, PLMs may also store (implicit) relational knowledge in their parameters, and proposed the Language Model Analysis (LAMA) framework to measure the relational knowledge stored in a PLM. Jiang et al. (2020) argue that due to the sensitivity of PLMs to the input context, such manually created prompts are sub-optimal and might fail to retrieve facts that the PLM does know, thus providing only a lower-bound estimate of the knowledge contained in it. Subsequent work (Shin et al., 2020; Zhong et al., 2021) has attempted to generate better prompts in order to tighten this estimate. Poerner et al. (2020) introduced LAMA-UHN (UnHelpful-Names), a much harder subset of LAMA where the input probes provide little or no helpful contextual signal from other tokens in the probe, thus measuring the innate ability of the PLM to recall information.

LM-CORE: Knowledge Retrieval and Training Framework
Task setting and Overview: Consider a language model L (such as BERT or RoBERTa) and a knowledge base K = {t_hrt = <h, r, t> | h, t ∈ E; r ∈ R}. Here, we consider the knowledge base K as a set of triples such that each triple t_hrt represents the relationship r between entities h and t. E is the set of all entities, and R is the set of all relationship types present in the knowledge base. Given a text input x, the proposed LM-CORE framework retrieves a set of triples T_x ⊆ K such that the triples in T_x are contextually relevant to x. The language model is then presented with the original input x and the contextually relevant knowledge in the form of T_x and is trained to make predictions using this additional knowledge.
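The KB abstraction above, a set of triples indexed by the entities they mention, can be sketched in a few lines. This is an illustrative toy, not the paper's implementation; the entity and relation names are invented for the example:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

class TripleStore:
    """Minimal in-memory KB: every triple is indexed under both its head
    and tail entity, so all facts touching an entity can be fetched
    directly (mirroring the retrieval of T_x in spirit only)."""
    def __init__(self, triples):
        self._by_entity = defaultdict(set)
        for t in triples:
            self._by_entity[t.head].add(t)
            self._by_entity[t.tail].add(t)

    def facts_for(self, entity):
        return self._by_entity.get(entity, set())
```

A real KB such as YAGO or Wikidata5M would of course be stored on disk with millions of entries, but the interface, fetching all facts involving a given entity, is the same.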
We posit that the model essentially needs to learn relevant semantic associations between natural language input text and various relation types present in the knowledge base. Identifying the correct relation types will help the model leverage the corresponding relevant facts in order to make an accurate prediction. This is accomplished via a modified Masked Language Modeling (MLM) (Devlin et al., 2019) pre-training objective. Figure 1 summarizes the complete workflow of our proposed LM-CORE framework and we describe the three main components in detail in the following sub-sections.

Entity span masking
Masked Language Modeling (MLM) is a popular task used for training PLMs where the objective is to predict the masked token in the input sequence. In order to improve the model's grounding to real-world entities, previous works have adopted different strategies for explicitly masking entity information in the input text: using entity representations obtained from knowledge base embeddings (Zhang et al., 2019), using named entity recognizers (NER) and regular expressions (Guu et al., 2020), and verbalizing knowledge base triples (Agarwal et al., 2021). These approaches often result in noisy masks due to the limitations of the underlying rules, NER systems, and entity linking systems. To overcome these limitations, we propose a novel way of creating high-quality and accurate entity masks by using Wikipedia as the base corpus for training. Note that in order to create entity masks, we need to identify the corresponding entity mentions in the input text, for which we utilize the human-annotated links in Wikipedia. The official style guidelines of Wikipedia require editors to link mentions of topics to their corresponding Wikipedia pages. In Figure 2, the left textbox shows a screenshot of the Wikipedia article about Batman, where various related topics, or concepts, are linked to their corresponding Wikipedia pages (underlined in red in the figure, and displayed as blue anchor text in Wikipedia). This information provides us with high-quality human annotation of entity mentions in the input text.
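The anchor links that provide these mention annotations can be recovered from Wikipedia markup with a simple parser. The sketch below handles only the basic [[target]] and [[target|surface]] link forms and ignores templates and files; it is a simplification for illustration, not the paper's pipeline:

```python
import re

def wiki_anchor_spans(markup):
    """Strip [[...]] link markup from a Wikipedia sentence and return the
    plain text together with (start, end) character offsets of each
    linked mention in that plain text."""
    out, spans, pos = "", [], 0
    for m in re.finditer(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]", markup):
        out += markup[pos:m.start()]
        # The displayed surface form is the piped text if present,
        # otherwise the link target itself.
        surface = m.group(2) or m.group(1)
        spans.append((len(out), len(out) + len(surface)))
        out += surface
        pos = m.end()
    out += markup[pos:]
    return out, spans
```

The recovered offsets are exactly the "potential mask candidates" of Figure 1: human-curated entity mentions with no NER or entity-linking noise.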
As illustrated in Figure 2, the underlined tokens (such as DC Comics, Bob Kane, Bill Finger) constitute the set of entity tokens that could be masked. For each such mask, we can also obtain the corresponding contextual knowledge from the external knowledge base (illustrated for Bob Kane in the right text box). By masking only the entity tokens (instead of randomly sampled words) and providing the model with contextually relevant knowledge retrieved from the knowledge base (as described in the next subsection), we expect the model to learn to predict the masked entity tokens by utilizing the external knowledge.
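A minimal sketch of this entity-span masking step, assuming mention offsets have already been recovered from the Wikipedia anchor links (the example sentence and spans below are illustrative):

```python
import random

def mask_entity(sentence, entity_spans, rng=None):
    """Replace one annotated entity mention with a [MASK] placeholder.
    entity_spans holds (start, end) character offsets of linked mentions;
    returns the masked sentence and the gold (masked-out) span."""
    rng = rng or random.Random(0)
    start, end = rng.choice(entity_spans)
    return sentence[:start] + "[MASK]" + sentence[end:], sentence[start:end]
```

One training example is produced per selected sentence; the gold span becomes the prediction target for the modified MLM objective.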

Contextual Knowledge Retrieval
After preparing the masked input for training, the second component in our framework fetches contextually relevant knowledge to feed to the language model. Consider the sentence "Warren Buffett is the chairman of [MASK]", where the masked token is Berkshire Hathaway. In the typical MLM setting, the model only has access to the linguistic and contextual clues present in the input text to predict the masked token. However, if contextually relevant information is available as additional input, the model can use it to output the correct token.
We consider the problem of finding contextually relevant facts given the input query text as an information retrieval (IR) problem and adopt a retrieve and re-rank approach that has empirically been found to perform well in a variety of tasks (Chen et al., 2017; Wang et al., 2017; Das et al., 2019; Yang et al., 2019). Recall the example input discussed above: "Warren Buffett is the chairman of [MASK]". Intuitively, there are two important signals in this input text that the retriever needs to utilize: entity and relation information. The entity mention Warren Buffett indicates that we need to fetch facts related to Warren Buffett from the knowledge base. Typically, there are numerous facts related to a given entity in the knowledge base, especially for popular entities such as Warren Buffett. Thus, the retriever also needs to utilize the presence of the word chairman to retrieve facts (KB triples) representing the management or executive relation.
Given an input text, our retriever pipeline performs Named Entity Recognition (NER) to identify named entity mentions in the input text. We use the NER model from FLAIR (Akbik et al., 2019) to identify named entity mentions and then select KB entities having maximum overlap with the mention span of the identified entities. For instance, if the input query is "Buffett was born in [MASK]", all of the entities containing Buffett (Warren_Buffett, Howard_Warren_Buffett, Howard_Graham_Buffett, Volcano_(Jimmy_Buffett_song), etc.) are selected, but if the query is "Warren Buffett was born in [MASK]", only the first two entities will be chosen. Once these entities are selected, all the facts from the KB involving these entities are retrieved (denoted by T_x in Figure 1).
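The entity-candidate selection described above can be approximated with a simple token-overlap rule: keep a KB entity if every token of the detected mention appears in the entity's name. This heuristic reproduces the Buffett example, but the paper's exact matching rule may differ:

```python
def candidate_entities(mention_tokens, kb_entities):
    """Return KB entities whose underscore-separated name contains all
    tokens of the detected mention (simplified overlap heuristic)."""
    mention = {t.lower() for t in mention_tokens}
    out = []
    for e in kb_entities:
        name_tokens = {t.lower() for t in e.split("_")}
        if mention <= name_tokens:
            out.append(e)
    return out
```

With the single-token mention "Buffett" every Buffett-related entity survives, while the two-token mention "Warren Buffett" narrows the candidates as in the example above.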
After retrieving the facts involving the entities mentioned in the input, we next need to rank these triples based on their relevance to the input. In order to measure the contextual relevance of a given triple t to the input x, we compute the following two scores. Query-Triple similarity: We obtain representations of the input text x as well as the triple t and compute the inner product of the representations to obtain the similarity score as follows.

sim_triple(x, t) = Emb(x) · Emb(t)
Here, Emb(·) is obtained using the Sentence Transformer (Reimers and Gurevych, 2019). While it is straightforward to obtain representations of the input x, the sentence transformer cannot be applied directly to KB triples. Applying KB embeddings such as TransE (Bordes et al., 2013) is also not feasible, as the representations of the input text and the triples would then lie in different embedding spaces. To overcome this, we adopt a simple approach of verbalizing the knowledge base triples by concatenating the head entity, relationship, and tail entity, and obtain the representation of the verbalized triple from the sentence transformer. For example, the triple (Warren_Buffett, hasOccupation, Investor) is verbalized as Warren Buffett has occupation Investor and is fed as input to the sentence transformer. Relation-based scoring: A triple is highly relevant to the input text if the triple represents the same relationship that is being talked about in the text. To capture this intuition, we embed all the relation types in the KB in the same embedding space as the triples using the sentence transformer and compute the similarity between the input text and the relation type r_t of the triple as follows.

sim_rel(x, t) = Emb(x) · Emb(r_t), with r_t ∈ R
where R is the set of all relations in the KB. The final relevance score for the triple t, relevance(x, t), is obtained by taking the product of the above two scores.
Based on this final score, we select the top-k triples that constitute the contextual knowledge to be fed as input along with x to the LM. We use k = 8 in this work (see Section 4.3 for the effect of varying k). Some illustrative examples of the final retrieved knowledge base triples are presented in Appendix A.3.
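Putting the pieces together, a toy version of the verbalization and retrieve-and-re-rank scoring might look as follows. A bag-of-words counter stands in for the Sentence Transformer embedding, and the relevance score is the product of query-triple and query-relation similarity as described above; everything else is an assumption for illustration only:

```python
import re
from collections import Counter

def verbalize(head, relation, tail):
    """Render a triple as a pseudo-sentence, e.g. (Warren_Buffett,
    hasOccupation, Investor) -> 'Warren Buffett has occupation Investor',
    so the same text encoder can embed both queries and triples."""
    rel_words = re.sub(r"(?<!^)(?=[A-Z])", " ", relation).lower()
    return f"{head.replace('_', ' ')} {rel_words} {tail.replace('_', ' ')}"

def embed(text):
    """Bag-of-words stand-in for the sentence encoder; only the
    inner-product interface matters for this sketch."""
    return Counter(text.lower().split())

def dot(u, v):
    return sum(c * v[w] for w, c in u.items())

def top_k_triples(query, triples, k=8):
    """Score each (head, relation, tail) triple by the product of
    query-triple and query-relation similarity and keep the top k."""
    q = embed(query)
    def relevance(t):
        _, relation, _ = t
        rel_words = re.sub(r"(?<!^)(?=[A-Z])", " ", relation).lower()
        return dot(q, embed(verbalize(*t))) * dot(q, embed(rel_words))
    return sorted(triples, key=relevance, reverse=True)[:k]
```

On the running example, the chairmanOf triple outranks the birthPlace triple because both the verbalized fact and the relation name overlap with the word "chairman" in the query.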

Language Model Pre-training with Contextual Knowledge
With the masked training corpus and the module to fetch contextually relevant knowledge in place, we now train the model to utilize the additional contextual knowledge to predict the masked token. From the masked corpus, we select a sentence and choose a valid entity span at random out of all the potential spans in the sentence. We mask this span to create the input text x. We filter out sentences starting with pronouns such as he, she, her, and they, as we observed that most such sentences do not contain other useful signals to unambiguously predict the masked words. For instance, if the input example is "He developed an interest in investing in his youth, eventually entering the Wharton School of the University of Pennsylvania" and Wharton School of the University of Pennsylvania is masked, the remaining words in the sentence do not provide any informative signals to the model for predicting the masked tokens. Given the input sentence thus selected, the contextual knowledge retriever fetches the relevant triples from the knowledge base. The representations of the input sentence and the retrieved triples are then concatenated and fed to the model, and the model is trained to minimize the following MLM loss.
L_MLM = -(1/M) Σ_{m=1}^{M} log P(x_{ind_m} | x, T_x)

where M is the number of masked tokens in x and ind_m is the index of the m-th masked token.
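The encoder input itself can be sketched as a simple concatenation of the masked sentence and the verbalized top-k triples. The [SEP] separator and the ordering below are assumptions; the paper only states that the representations are concatenated and fed to the model:

```python
def build_model_input(masked_sentence, verbalized_triples, sep="[SEP]"):
    """Join the masked sentence and its retrieved triples into one
    encoder input string (separator choice is a sketch assumption)."""
    return f" {sep} ".join([masked_sentence] + list(verbalized_triples))
```

The resulting string is then tokenized and passed to the LM encoder, which is trained to fill the [MASK] position using the appended facts.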
With the additional contextual information available to the model, we expect the model to learn the associations between linguistic cues in the input text and relevant relationship information in the triples. For example, we expect the model to associate different ways in which someone's date of birth could be mentioned in natural language (such as X was born on, the birthday of X is, and numerous other linguistic variations) to the KB relation birthDate and utilize the information from the corresponding triple. Note that since the types of relations in the knowledge base are relatively small in number, and do not change often, we expect the model to generalize well and be more robust to linguistic variations.

Experiments and Discussions
Data Sources and Pre-processing: We create our pre-training corpus using the December 20, 2018 snapshot of English Wikipedia that contains about 5.5M documents. Processing all the articles following the masking strategy described in Section 3.1 resulted in a total of ∼46.3M sentences with valid masks, from which we randomly sample sentences to create input examples.
In order to illustrate the general nature of LM-CORE, we used two different PLMs as our base language encoders (BERT and RoBERTa) and two different knowledge bases.

Table 1: Mean precision at one (P@1) of various models on the LAMA probe. We group all the models based on the base language model used (BERT or RoBERTa). For LM-CORE, (·, ·) indicates the variant (b, r correspond to BERT and RoBERTa, respectively, and y, w indicate YAGO and Wikidata5M, respectively). Best results in each column are highlighted in bold and the second best performance is underlined.

We use YAGO (Suchanek et al., 2007) as one knowledge base and preprocess it to obtain our retrieval corpus consisting of roughly 17M triples spanning 4.9M entities and 131 unique relations. For Wikidata, we used the Wikidata5M version (Wang et al., 2021b) that consists of roughly 21M triples covering 821 unique relations and 4.8M entities. Further details regarding retrieval corpus generation and processing can be found in the Appendix (Section A.2). For computing triple representations for retrieval (Section 3.2), we concatenate the subject (head), relation, and object (tail) of each triple and embed them using the Sentence Transformers (Reimers and Gurevych, 2019) to obtain 768-dimensional embeddings (same as the LM encoder dimensions).

Does External Knowledge Help PLMs in Knowledge Intensive Tasks?
We now present an analysis of how much, and whether, having access to external knowledge can help PLMs in knowledge-intensive tasks. A popular way of assessing a model's ability to perform such tasks is by using benchmark knowledge probes. We use the LAMA probe (Petroni et al., 2019) for this purpose, with KEPLER (Wang et al., 2021b) and CoLAKE (Sun et al., 2020) (both based on RoBERTa) as the representative knowledge-enhanced language models. Both KEPLER and CoLAKE have used Wikidata5M as the knowledge base. We used author-provided code and checkpoints for obtaining the reported numbers. For LM-CORE, we use four variants with different knowledge base and language encoder combinations as described above.
We observe that our approach of providing external knowledge to the PLMs leads to substantially improved performance over the base language models and their SoTA knowledge-enhanced variants. LM-CORE(b,w) achieves a P@1 of 42.83% compared to 25.44% for BERT-large. Likewise, LM-CORE(r,w) achieves a P@1 of 41.96%, significantly outperforming RoBERTa-large (24.24%). We also report the numbers on the four different subsets of LAMA, revealing interesting insights. For all the models considered, we note that the performance on the T-REx subset is higher than on the Google-RE subset. We attribute this to the nature of knowledge required for probes in the four subsets. Note especially the column for Date of Birth (DoB) in the table: all the models except LM-CORE(b,y) and LM-CORE(r,y) perform extremely poorly. This is because Wikidata5M does not have a date entity type, hence the poor performance of models using Wikidata.

Sensitivity to Contextual Signals in Input
PLMs are often sensitive to linguistic variations in the input and are overly reliant on the surface form of entity names for making their predictions (Poerner et al., 2020). For example, BERT may predict that a person with an Italian-sounding name was born in Italy even if this is factually incorrect. In order to evaluate the sensitivity and robustness of different models, we report the P@1 numbers for the LAMA-UHN (UnHelpful-Names) probing benchmark (Table 2), a much harder subset of LAMA where input probes with helpful entity names are removed and the PLM has little or no helpful contextual signal from other tokens in the probe. We observe that the LM-CORE variants significantly outperform the base language models and their knowledge-enhanced variants. Further, note that while all the baseline models suffer a significant fall in performance (expected due to the hardness of LAMA-UHN), the drop in performance of the LM-CORE variants is much smaller. This indicates that having access to relevant external knowledge helps reduce the dependence on linguistic signals and results in the robust outperformance of the LM-CORE variants.

Effect of Varying Number of Input Triples to LM-CORE
We analyze the effect of varying the number of retrieved candidates (k) in Figure 3. We discuss results on the Google-RE and T-REx subsets, as our factual knowledge triples are most relevant for answering queries in these subsets (in comparison to the commonsense queries in ConceptNet). We plot Precision@1 (P@1) against increasing k values from 1 to 10 for the LM-CORE(b,y) and LM-CORE(r,w) variants. We do not observe a consistent optimal k value across variants and data subsets. Moreover, there is no significant difference between P@1 values as k varies from 4 to 10. Hence, in order to maximize recall while keeping the computational expense in mind, we select k = 8 for our experiments.

Role of LM-CORE Pre-training and Retrieved Knowledge
We now study the role LM-CORE pre-training plays in helping the model access and utilize the retrieved knowledge, and verify that the model does not simply rely on the knowledge stored in its parameters. We also study the effect of augmenting the base LMs with knowledge retrieved by LM-CORE. In addition to providing insight into the quality of the knowledge retrieved by LM-CORE, this also helps us better understand the ability of LM-CORE to utilize the retrieved knowledge. We consider the following four variants on the LAMA probe (Table 3): LM-CORE(r,w) with retrieved triples, RoBERTa-base with the same retrieved triples, LM-CORE(r,w) with random triples, and RoBERTa-base without any triples. We observe that LM-CORE(r,w)'s performance (41.69 P@1) significantly exceeds RoBERTa-base's performance using the same triples in input (30.06 P@1), demonstrating that our training procedure equips the model with the capability of identifying and using relevant external knowledge effectively. There is a large drop in performance (from 41.69 P@1 to 19.51 P@1) when LM-CORE(r,w) is provided with random triples in input. This shows that the model relies on the external knowledge to answer queries correctly. While the performance drops, it is important to note that the P@1 is similar to RoBERTa-base (20.46 P@1), highlighting that our training procedure does not lead to catastrophic forgetting and the model is able to rely on the knowledge stored in its parameters when semantically relevant triples are not provided in the input. Finally, although RoBERTa-base augmented with contextually relevant triples does not perform competitively with LM-CORE, it demonstrates considerable improvement over the base RoBERTa model. This shows that high-quality relevant external knowledge has the potential to improve factual prediction, further reinforcing our motivation to train models to efficiently retrieve and use this knowledge.

Downstream Tasks
We consider two downstream tasks to study the effectiveness of LM-CORE for different NLP applications. We take Zero-Shot Relation Extraction (ZSRE) (Levy et al., 2017) and open-domain question answering over the Web Questions (WQ) (Berant et al., 2013) dataset as the representative knowledge-intensive tasks. Tables 4 and 5 report the performance of LM-CORE and various other baselines for the two tasks, respectively. We use the LM-CORE(b,w) variant for these experiments as most baselines use BERT as the LM and Wikipedia as the knowledge source. For the ZSRE task, we use the data splits and evaluation systems provided as part of the KILT benchmark (Petroni et al., 2021). We find that for the ZSRE task, LM-CORE achieves a significantly higher F-1 score (74.80) compared to the second-best RAG model (49.95). Also, note that the online evaluator for the task considers exact string match (including casing, punctuation, etc.) for computing accuracy numbers but not for computing other metrics. Hence, the reported accuracy number for LM-CORE represents a lower bound, as we do not have access to the same pre-processing pipeline to process its output. For the WQ dataset, we find that LM-CORE outperforms BERT with BM25 and neural retrievers, and the DrQA system. We observe that LM-CORE is outperformed by ORQA, designed explicitly for this task, and RAG (a retrieval-augmented generative model). However, note that all the models except LM-CORE have access to a much larger knowledge source (the complete Wikipedia corpus, ≈2B words), whereas LM-CORE only has access to the KB triples (21M triples, ≈140M words). As we show in the following subsection, with access to additional external knowledge, the performance of LM-CORE can improve significantly.

Handling Knowledge Updates

We randomly sample 100 instances from the LAMA probe where the model failed and manually analyze these instances to identify the cases where the corresponding fact was not present in the YAGO KB.
There were a total of 41 such instances, and we manually added the correct facts needed to answer the corresponding questions to YAGO. We then presented the 41 inputs again to the model with the updated KB. This time, the model used the newly added knowledge and corrected its prediction without any re-training in 36 out of the 41 cases (87.8%). As discussed in the following sub-section (4.7), a majority of the errors made by LM-CORE are due to missing facts in the KB, and we expect that most of these errors can be corrected by having access to a larger, more comprehensive knowledge base.
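Because the knowledge source is decoupled from the model, such an update touches only the retrieval index. A sketch, with a dict-of-sets standing in for the KB index:

```python
def add_fact(kb_index, head, relation, tail):
    """Add a new fact to the retrieval index. The triple is indexed
    under both of its entities and becomes retrievable immediately;
    no language-model re-training is involved (illustrative sketch)."""
    t = (head, relation, tail)
    kb_index.setdefault(head, set()).add(t)
    kb_index.setdefault(tail, set()).add(t)
    return t
```

The next query mentioning either entity will surface the new triple through the usual retrieve-and-re-rank path, which is exactly how the 41 manually added facts were consumed.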

Discussions
We now present some representative examples to illustrate the successes and failures of LM-CORE. Consider a test probe from the Google-RE subset of LAMA: "Phil Mogg is a member of [MASK]". Here, the correct output token is UFO (the band), yet the BERT model incorrectly predicts parliament as the output token. This highlights the sensitivity of PLMs to context; BERT's prediction seems to be derived from its memorization of the frequently encountered phrase member of parliament during pre-training. We argue that the contextual knowledge retrieved by LM-CORE, which includes the relevant fact <Phil Mogg; member of; UFO (band)>, has helped the model produce the correct output. We present more such successful examples in the Appendix (Tables 10 and 11).
Next, we analyzed the cases where the proposed framework produced incorrect output and observed three major reasons for errors: (i) the required knowledge was not present in the knowledge base; (ii) the required knowledge was not retrieved despite being present in the knowledge base; and (iii) the system made errors even after retrieving the relevant knowledge. The first cause can be addressed by enhancing the knowledge base, as shown in Section 4.6. The other two causes of failure highlight the scope for improvement in our retrieval module as well as the pre-training module, where further training could help the model make better use of the retrieved knowledge. Some representative examples of these different cases are presented in the Appendix (Table 12). Finally, we noticed some errors that can be attributed to the characteristics of the LAMA probe itself. Specifically, there are input probes that refer to entities without providing any additional context for disambiguation. For example, the sentence "James Johnson was born in [MASK]" has no clues to determine whether the prompt is referring to the basketball player, the Virginia congressman, or the Governor of Georgia with this name. We also noticed certain probes where there are multiple correct completions but the benchmark considers only one of these as the correct answer. For example, "Michelangelo is a [MASK] by profession" can be correctly completed by poet, painter, or architect, but the evaluation considers only poet as the correct answer. We also noticed some input examples with highly ambiguous language. For example, "X died in [MASK]" can refer to either X's place of death or date of death, but only the former is accepted as the correct answer. Lastly, there are cases where slight (and correct) variations of the expected answer are evaluated as incorrect by the probe. For example, for the prompt "Harashima is [MASK] citizen.", Japan is provided as the correct answer while the prediction made by LM-CORE (Japanese) is considered incorrect.

Conclusion
We presented LM-CORE, a framework to train language models with contextually relevant external knowledge. We showed that having access to external knowledge leads to significant and robust outperformance over base language models and their knowledge-enhanced versions on knowledge probing and two downstream tasks. We also showed how LM-CORE can handle knowledge updates and presented a thorough error analysis that helped us identify possible directions for future work.

A.1 Pre-Training corpus
We use the English Wikipedia (December 20, 2018) snapshot 3 to create our pre-training corpus and WikiExtractor 4 to process the dumps. This Wikipedia version contains about 5.5M documents. We retain the hyperlinks while extracting Wikipedia articles as we use them for creating entity masks (Section 3.1). Following the entity masking strategy described in Section 3.1, we obtain a pre-training corpus containing ∼46.3M sentences in total. While pre-training the base LMs, we sample sentences containing valid masks. The pre-training corpus is kept consistent across all LM-CORE variants.
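Concretely, the link-based masking can be pictured with a small sketch: WikiExtractor (when run with its option to preserve links) keeps hyperlinked mentions as `<a href="...">` spans, and each linked mention can be replaced by a mask token while recording the target entity. The function name and mask token below are illustrative only; the paper's actual masking strategy is described in Section 3.1.

```python
import re

MASK = "[MASK]"

def mask_entities(sentence):
    """Replace hyperlinked entity mentions (as preserved by WikiExtractor)
    with a mask token, recording (link target, surface form) pairs.

    Illustrative sketch, not the exact masking procedure of the paper.
    """
    entities = []

    def repl(match):
        entities.append((match.group("target"), match.group("surface")))
        return MASK

    pattern = re.compile(r'<a href="(?P<target>[^"]+)">(?P<surface>[^<]+)</a>')
    return pattern.sub(repl, sentence), entities

masked, ents = mask_entities(
    'Barack Obama was born in <a href="Honolulu">Honolulu</a>, Hawaii.')
# masked -> 'Barack Obama was born in [MASK], Hawaii.'
```

Sentences with at least one such masked span would then be the "sentences containing valid masks" sampled during pre-training.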

A.2 Retrieval corpus
We use two popular knowledge bases (KBs) in LM-CORE: YAGO 4 (Suchanek et al., 2007) and Wikidata5M (Wang et al., 2021b). The statistics of the KBs (number of facts, entities, and relations) can be found in Table 6. We describe the pre-processing steps followed to obtain the respective final retrieval corpora in the following subsections.

A.2.1 YAGO
YAGO 4 is distributed in RDFS format. YAGO facts are derived from Wikidata; however, all entities are arranged in a taxonomy mapped to schema.org.
We pre-process YAGO to remove triples involving relations such as image, logo, and url that point to meta-data such as images and other files. We also filter out triples whose objects are RDF literals or Wikidata URLs.
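As an illustration, this filtering could look like the following sketch over N-Triples-style (subject, predicate, object) fields. The predicate list and helper name are hypothetical; this is not our exact pre-processing code.

```python
# Hypothetical set of predicate local names whose triples point to
# meta-data such as images and other files.
META_PREDICATES = {"image", "logo", "url"}

def keep_triple(subj, pred, obj):
    """Heuristic filter for YAGO 4 triples (illustrative sketch):
    drop meta-data relations, RDF literals, and Wikidata URLs."""
    local_name = pred.rstrip(">").rsplit("/", 1)[-1].lower()
    if local_name in META_PREDICATES:
        return False          # meta-data relation (image, logo, url)
    if obj.startswith('"'):
        return False          # RDF literal, e.g. "1961-08-04"
    if "wikidata.org" in obj:
        return False          # Wikidata URL
    return True
```

Triples passing the filter would form the final YAGO retrieval corpus.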

A.2.2 Wikidata5M
We use the Wikidata5M subset of Wikidata as made available by Wang et al. (2021b). This subset of Wikidata is aligned with Wikipedia such that each entity in Wikidata5M has a corresponding entry in Wikipedia. We used the raw graph as provided in the dataset, the statistics of which are reported in Table 6.

3 https://archive.org/download/enwiki-20181220/enwiki-20181220-pages-articles.xml.bz2
4 https://github.com/attardi/wikiextractor

A.3 Example Retrieved Triples
We provide a closer look into our pre-training approach by showing examples of masked input sentences and the retrieved triple candidates from the knowledge base (Table 7). We observe that the facts retrieved are highly relevant for predicting the masked entities in the input context.

B Additional Experiments
B.1 How LM-CORE Compares with Other Retrieval Paradigms
Table 8 also reports results for REALM (Guu et al., 2020), a retrieval-based language model that retrieves relevant documents from a text corpus during pre-training. We observe that LM-CORE outperforms REALM on the ConceptNet, DoB (Google-RE), and 1-1 (T-REx) subsets, while REALM outperforms the proposed solution on the other subsets of the LAMA probe. We specifically highlight an absolute 15-point improvement on the date-of-birth relation despite REALM using explicit date masks during training, whereas our training corpus only has entity masks. This indicates that our model can use the contextual knowledge provided by the retriever module even though it is not explicitly shown such knowledge during training.
Note that while REALM is similar to our proposed solution as far as the idea of retrieving relevant knowledge is concerned, the key difference between the two approaches lies in the source of knowledge being used. REALM relies on an unstructured text corpus (Wikipedia) as the source of knowledge and employs a computationally complex retrieve-and-read paradigm requiring additional training of the knowledge retriever model. Our proposed solution, on the other hand, uses structured knowledge, which offers the advantage of being (almost) unambiguous and less resource-hungry compared to unstructured text. We present the resource requirements of our approach and REALM in Table 9. Note that the size of the external knowledge (in number of words) used by REALM is an order of magnitude greater, and REALM requires three times the number of parameters compared to our model. Furthermore, REALM was trained for 200K steps with a batch size of 512 on a cluster of 80 TPUs, whereas our proposed solution is much more efficient, being trained for 1K steps with a batch size of 512 on a machine with 8 Nvidia A100 GPUs.
The computational efficiency of our proposed solution allows us to keep improving performance by enhancing the structured knowledge base, bridging the performance gap with more complex and computationally expensive models such as REALM.

C LAMA Evaluation
We use the official LAMA code and data for evaluating the P@1 numbers in Table 1. All BERT-based models are evaluated using this repository. The LAMA code provides functionality for evaluating RoBERTa models trained in the fairseq framework; hence, we evaluate RoBERTa-base, RoBERTa-large, and KEPLER (Wang et al., 2021b) using this code. The KEPLER repo also points to this code for evaluation. CoLAKE (Sun et al., 2020) has adapted the official code to accept huggingface transformers checkpoints as input, and hence this code is used for the CoLAKE, LM-CORE(r,y), and LM-CORE(r,w) evaluations. We ensure the model vocabularies and data are consistent across evaluations. We have used author-provided/recommended code and publicly available checkpoints from the official code repositories for all baselines.
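For reference, P@1 simply measures how often the top-ranked prediction matches the gold answer. The sketch below is a simplified view of the metric (it ignores LAMA's vocabulary intersection and multi-token handling); the function name is our own.

```python
def precision_at_1(predictions, gold):
    """Fraction of probes whose top-ranked prediction equals the gold
    answer. `predictions` is a list of ranked candidate lists; `gold`
    is the list of gold answers, one per probe."""
    hits = sum(1 for ranked, answer in zip(predictions, gold)
               if ranked and ranked[0] == answer)
    return hits / len(gold)

preds = [["Honolulu", "Chicago"], ["painter", "poet"]]
gold = ["Honolulu", "poet"]
# precision_at_1(preds, gold) -> 0.5 (only the first probe's top answer matches)
```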

C.2 LAMA: Error Analysis
We present various failure cases for LM-CORE in Table 12. These are representative of the types of errors we encountered; however, we observed that the majority of errors resulted from correct facts missing from the KB.

C.3 Complete LAMA-UHN results
The complete LAMA-UHN results over all subsets of Google-RE and T-REx can be found in Table 13.

D Downstream Evaluation
We discuss the experimental setup and hyperparameter settings for our downstream tasks.

D.1 Zero Shot RE
We consider the open-domain version of Zero Shot RE (Levy et al., 2017) from Petroni et al. (2021). The dataset is split into three disjoint sets: train (147,909 samples, 84 relations), dev (3,724 samples, 12 relations), and test (4,966 samples, 24 relations). The systems are evaluated on relations never seen during training.
We fine-tune our model for 2 epochs with a batch size of 96. We use the Adam (Kingma and Ba, 2015) optimizer and a learning rate of 3e-5. We performed multiple trials by tuning the number of epochs over {1, 2, 5}.

D.2 Web Questions
Web Questions (Berant et al., 2013) was created using questions sampled from the Google Suggest API. We used the same splits as Lee et al. (2019b), with training, dev, and test sets containing 3,417, 361, and 2,032 samples, respectively.
We fine-tuned our model for 20 epochs; we experimented with the number of epochs over {10, 20, 30, 50}. We use the Adam (Kingma and Ba, 2015) optimizer and a learning rate of 3e-5.

E Risks Statement
This work considers training large language models using large textual corpora as well as structured knowledge bases. The model learns the nuances of the language and correlations between different real-world entities based on the data used for training. Hence, there is a chance that biases and noise in the training data will creep into the model parameters as well, which can lead to biased model behavior. We need to be careful in deploying the model and extrapolating its output in applications such as search, conversational systems, and recommendation systems, where the model's inherent biases can have catastrophic impacts on users.