Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases

Previous literature shows that pre-trained masked language models (MLMs) such as BERT can achieve competitive factual knowledge extraction performance on some datasets, suggesting that MLMs could be a reliable knowledge source. In this paper, we conduct a rigorous study of the underlying prediction mechanisms of MLMs over different extraction paradigms. By investigating the behaviors of MLMs, we find that the previously reported strong performance mainly owes to biased prompts that overfit dataset artifacts. Furthermore, incorporating illustrative cases and external contexts improves knowledge prediction mainly because of entity type guidance and gold answer leakage. Our findings shed light on the underlying prediction mechanisms of MLMs, and strongly question the previous conclusion that current MLMs can potentially serve as reliable factual knowledge bases.


Introduction
Recently, pre-trained language models (Peters et al., 2018; Devlin et al., 2019; Brown et al., 2020) have achieved promising performance on many NLP tasks. Apart from utilizing the universal representations from pre-trained models in downstream tasks, some studies have shown the potential of pre-trained masked language models (e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b)) to serve as factual knowledge bases (Petroni et al., 2019; Bouraoui et al., 2020; Jiang et al., 2020b; Shin et al., 2020; Jiang et al., 2020a; Wang et al., 2020; Kassner and Schütze, 2020a; Kassner et al., 2020). For example, to extract the birthplace of Steve Jobs, we can query MLMs like BERT with "Steve Jobs was born in [MASK]", where Steve Jobs is the subject of the fact, "was born in" is a prompt string for the relation "place-of-birth", and [MASK] is a placeholder for the object to predict. The MLM is then expected to predict the correct answer "California" at the [MASK] position based on its internal knowledge. To help MLMs better extract knowledge, the query may also be enriched with external information such as illustrative cases (e.g., (Obama, Hawaii)) (Brown et al., 2020) or external context (e.g., "Jobs lives in California") (Petroni et al., 2020). Such paradigms have been reported to achieve decent performance on benchmarks like LAMA (Petroni et al., 2019).

Figure 1: This paper explores three different kinds of factual knowledge extraction paradigms from MLMs, and reveals the underlying prediction mechanisms behind them.
Despite some reported success, there is currently no rigorous study looking deeply into the underlying mechanisms behind these achievements. It is also unclear whether such achievements depend on certain conditions (e.g., datasets, domains, relations). The absence of such studies undermines our trust in the predictions of MLMs: we can neither determine whether the predictions are reliable nor explain why MLMs make a specific prediction, which significantly limits MLMs' further applications and improvements.
To this end, this paper conducts a thorough study on whether MLMs can be reliable factual knowledge bases. Throughout our investigations, we analyze the behaviors of MLMs, figure out the critical factors for MLMs to achieve decent performance, and demonstrate how different kinds of external information influence MLMs' predictions. Specifically, we investigate factual knowledge extraction from MLMs 2 over three representative paradigms, as shown in Figure 1:

• Prompt-based retrieval (Petroni et al., 2019; Jiang et al., 2020b; Shin et al., 2020), which queries an MLM for the object answer given only the subject and the corresponding relation prompt, e.g., "Jobs was born in [MASK]."

• Case-based analogy (Brown et al., 2020; Madotto et al., 2020; Gao et al., 2020), which enhances prompt-based retrieval with several illustrative cases, e.g., "Obama was born in Hawaii. [SEP] Jobs was born in [MASK]."

• Context-based inference (Petroni et al., 2020; Bian et al., 2021), which augments prompt-based retrieval with external relevant contexts, e.g., "Jobs lives in California. [SEP] Jobs was born in [MASK]."

Surprisingly, the main conclusions of this paper somewhat diverge from previous findings in the published literature; they are summarized in Figure 1. For the prompt-based paradigm (§3), we find that the prediction distribution of MLMs is significantly prompt-biased. Specifically, prompt-based retrieval generates similar predictions on totally different datasets, and predictions are spuriously correlated with the applied prompts rather than with the facts we want to extract. Therefore, the previously reported decent performance mainly stems from the prompt overfitting the dataset answer distribution, rather than from MLMs' knowledge extraction ability. Our findings strongly question the conclusions of previous studies, and demonstrate that current MLMs cannot serve as reliable knowledge bases under the prompt-based retrieval paradigm.
2 This paper reports the experimental results on BERT-large because previous work has shown that it achieves the best factual knowledge extraction performance among all MLMs. In the Appendix, we also report the experimental results on RoBERTa-large, which support the same main conclusions.
For the case-based paradigm (§4), we find that the illustrative cases mainly provide "type guidance" for MLMs. To show this, we propose a novel algorithm that induces the object type of each relation based on the Wikidata taxonomy. According to the induced types, we find that the performance gain brought by illustrative cases mainly owes to improved object type recognition. By contrast, the cases cannot help MLMs select the correct answer from entities of the same type: the rank of the answer within its entity type changes randomly after introducing illustrative cases. That is to say, under the case-based paradigm, although MLMs can effectively analogize between entities of the same type, they still cannot identify the exact target object based on their internal knowledge and the provided illustrative cases.
For the context-based paradigm (§5), we find that context helps factual knowledge extraction mainly because it explicitly or implicitly leaks the correct answer. Specifically, the performance improvement mainly occurs when the introduced context contains the answer. Furthermore, when we mask the answer in the context, the performance still improves significantly as long as MLMs can correctly reconstruct the masked answer from the remaining context. In other words, in these instances, the context itself serves as a delegator of the masked answer, and therefore MLMs can still obtain sufficient implicit answer evidence even though the answer does not explicitly appear.
All the above findings demonstrate that current MLMs are not reliable in factual knowledge extraction. Furthermore, this paper sheds some light on the underlying predicting mechanisms of MLMs, which can potentially benefit many future studies.
Related Work

Currently, there are three main paradigms for knowledge extraction from PLMs: prompt-based retrieval (Schick and Schütze, 2021; Li and Liang, 2021), case-based analogy (Schick and Schütze, 2020a,b), and context-based inference. For prompt-based retrieval, current studies focus on seeking better prompts, either by mining them from corpora (Jiang et al., 2020b) or by learning them from labeled data (Shin et al., 2020). For case-based analogy, current studies mostly examine whether good cases lead to good few-shot abilities, across a wide range of tasks (Brown et al., 2020; Madotto et al., 2020; Gao et al., 2020). For context-based inference, current studies focus on enhancing the prediction by seeking more informative contexts, e.g., for knowledge extraction (Petroni et al., 2020) and CommonsenseQA (Bian et al., 2021). However, no previous work has systematically studied the underlying predicting mechanisms of MLMs under these paradigms.

Prompt-based Retrieval
Prompt-based retrieval extracts factual knowledge by querying MLMs with (subject, prompt, [MASK]). For example, to extract the "place-of-birth" of Steve Jobs, we can query BERT with "Steve Jobs was born in [MASK]." and regard the predicted "California" as the answer. We consider three kinds of prompts: the manual prompts T_man created by Petroni et al. (2019), the mining-based prompts T_mine from Jiang et al. (2020b), and the automatically searched prompts T_auto from Shin et al. (2020).
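A query under this paradigm is simply the relation template with the subject slot filled in and the object slot left masked. A minimal sketch (the [X] slot convention follows LAMA-style templates; the helper name is ours):

```python
def build_query(subject: str, template: str) -> str:
    """Fill the subject slot [X] of a relation template; the [MASK]
    slot is left for the MLM to predict the object."""
    return template.replace("[X]", subject)

# A manual LAMA-style template for the relation "place-of-birth".
template = "[X] was born in [MASK] ."
print(build_query("Steve Jobs", template))
# Steve Jobs was born in [MASK] .
```

The resulting string is fed to the MLM, and the probability distribution at the [MASK] position is read off as the prediction.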

Overall Conclusion
Conclusion 1. Prompt-based retrieval is prompt-biased. As a result, previous decent performance actually measures how well the applied prompts fit the dataset answer distribution, rather than the factual knowledge extraction ability of MLMs.

Specifically, we conduct studies and find that: 1) Prompt-based retrieval generates similar responses given quite different datasets. To show this, we construct a new dataset from Wikidata, WIKI-UNI, which has a totally different answer distribution from the widely used LAMA dataset (Petroni et al., 2019). However, we find that the prediction distributions on WIKI-UNI and LAMA are highly correlated, and this spurious correlation holds across different prompts. Such results reveal that there is only a weak correlation between the predictions of MLMs and the factual answer distribution of the dataset. 2) The prediction distribution is dominated by the prompt, i.e., the prediction distribution using only (prompt, [MASK]) is highly correlated with the prediction distribution using (subject, prompt, [MASK]). This indicates that it is the applied prompts, rather than the actual facts, that determine the predictions of MLMs. 3) The performance of a prompt can be predicted by the divergence between the prompt-only distribution and the answer distribution of the dataset. All these findings reveal that previous decent performance in this field actually measures the degree of prompt-dataset fitness, rather than universal factual knowledge extraction ability.

Different Answers, Similar Predictions
Finding 1. Prompt-based retrieval will generate similar responses to quite different datasets.
Table 1: Average percentage of instances covered by the top-k answers or predictions. For the answer distribution, the top-5 objects in LAMA cover 6.2 times as many instances as those in WIKI-UNI; for the prediction distribution, however, they are almost the same. As a result, precision drops significantly on WIKI-UNI.

A reliable knowledge extractor should generate different responses to different knowledge queries.
To verify whether MLMs meet this standard, we manually construct a new dataset, WIKI-UNI, which has a comparable size but a totally different answer distribution from LAMA, and then compare the prediction distributions on the two. For a fair comparison, we follow the construction criteria of LAMA: we use the same 41 relations and filter out the queries whose objects are not in the MLMs' vocabulary. Compared with LAMA, the major difference is that WIKI-UNI has a uniform answer distribution, i.e., for each relation, we keep the same number of instances for each object. Please refer to the Appendix for more construction details. Figure 2a shows the answer distributions of LAMA and WIKI-UNI on the relation "place-of-birth". We can see that the answers in LAMA are highly concentrated on the head object entities, while the answers in WIKI-UNI follow a uniform distribution. Given LAMA and WIKI-UNI, we investigate the predicting behaviors of MLMs. Surprisingly, the prediction distributions on these two totally different datasets are highly correlated. Figure 2b shows an example: the prediction distribution on WIKI-UNI is very similar to that on LAMA, and both distributions are close to the answer distribution of LAMA but far away from the answer distribution of WIKI-UNI.
To investigate whether this spurious correlation is a common phenomenon, we compute the Pearson correlation coefficient between the prediction distributions on LAMA and WIKI-UNI across different relations and three kinds of prompts. The boxplot in Figure 3 shows a very significant correlation between the prediction distributions on LAMA and WIKI-UNI: for all three kinds of prompts, the correlation coefficient exceeds 0.8 in more than half of the relations. These results demonstrate that prompt-based retrieval leads to very similar prediction distributions even when test sets have vastly different answer distributions. Furthermore, we find that the prediction distribution clearly does not correspond to the answer distribution of WIKI-UNI. From Table 1, we can see that, on average, the top-5 answers of each relation in WIKI-UNI cover only 7.78% of instances. By contrast, the top-5 predictions of each relation in WIKI-UNI cover more than 52% of instances, which is close to both the answer distribution and the prediction distribution on LAMA. As a result, the performance on WIKI-UNI (mean P@1: 16.47) is significantly worse than that on LAMA (mean P@1: 30.36). In conclusion, the facts of a dataset cannot explain the predictions of MLMs, and therefore previous evaluations of MLMs' ability at factual knowledge extraction are unreliable.
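The correlation statistic here is the standard Pearson coefficient between the two per-object prediction-count vectors, computed over a shared candidate set. A self-contained sketch (the toy counts are illustrative, not taken from the paper):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy prediction counts over the same top candidate objects on two datasets:
# nearly the same shape despite the uniform answer distribution of WIKI-UNI.
lama_preds = [120, 60, 30, 10, 5]
uni_preds  = [118, 57, 33, 12, 4]
print(pearson(lama_preds, uni_preds))  # close to 1.0
```

A coefficient near 1 on such count vectors is exactly the pattern Figure 3 reports across relations.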

Prompts Dominate Predictions
Finding 2. The prediction distribution is severely prompt-biased.
To investigate the underlying factors behind the predicting behavior of MLMs, we compare the prompt-only prediction distribution using only (prompt, [MASK]) and the full prediction distribution using (subject, prompt, [MASK]). To obtain the prompt-only distribution, we mask the subject and use ([MASK], prompt, [MASK]) to query MLMs (e.g., "[MASK] was born in [MASK]"). Because there is no subject information in the input, MLMs can only rely on the applied prompt to make a prediction at the second [MASK]. Therefore, we regard the probability distribution at the second [MASK] position as the prompt-only distribution.
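Concretely, the prompt-only query just masks the subject slot as well, and the distribution is read at the object's mask position (the last [MASK] token). A sketch with our own helper names:

```python
def prompt_only_query(template: str) -> str:
    """Mask the subject slot too, so only the prompt carries information."""
    return template.replace("[X]", "[MASK]")

def object_mask_position(tokens):
    """Position where the object distribution is read: the last [MASK]."""
    return len(tokens) - 1 - tokens[::-1].index("[MASK]")

q = prompt_only_query("[X] was born in [MASK] .")
print(q)                                # [MASK] was born in [MASK] .
print(object_mask_position(q.split()))  # 4
```

Real tokenizers split words into subwords, so in practice the mask position must be located in the tokenizer's output rather than in a whitespace split; the sketch only shows the indexing logic.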
We then analyze the correlations between the prompt-only distribution and the prediction distribution on the WIKI-UNI dataset. Figure 4 shows the boxplot. For all three kinds of prompts, the correlation coefficient between the prompt-only distribution and the prediction distribution on WIKI-UNI exceeds 0.6 in more than half of the relations. This demonstrates that in these relations, the prompt-only distribution dominates the prediction distribution. Combined with the findings in Section 3.2, we can conclude that prompt-based retrieval is mainly based on guided guessing, i.e., predictions are generated by sampling from the prompt-biased distribution under the moderate influence of the subject.
Note that for a small subset of relations, the correlation between the prompt-only distribution and the prediction distribution is relatively low. We find that the main reason is the type selectional preference provided by the subject entities; Section 4 further discusses the impact of this type-guidance mechanism on MLMs.

Better Prompts are Over-Fitting
Finding 3. "Better" prompts are prompts that fit the answer distribution better, rather than prompts with better retrieval ability.
Several previous studies attempt to find better prompts for factual knowledge extraction from MLMs. However, as mentioned above, the prompt itself leads to a biased prediction distribution. This raises the concern of whether the discovered "better" prompts really have better knowledge extraction ability, or whether their better performance just comes from over-fitting between the prompt-only distribution and the answer distribution of the test set.
To answer this question, we evaluate the KL divergence between the prompt-only distribution and the answer distribution of LAMA for different kinds of prompts. The results are shown in Table 2. We find that this KL divergence is a strong indicator of a prompt's performance: the smaller the KL divergence between the prompt-only distribution and the answer distribution of the test set, the better the prompt performs. Furthermore, Table 3 shows several comparisons between different kinds of prompts and their performance on LAMA. We can easily observe that the better-performing prompts actually over-fit the dataset, rather than better capturing the underlying semantics of the relation. As a result, previous prompt-searching studies are actually optimizing the spurious prompt-dataset compatibility, rather than universal factual knowledge extraction ability.
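The indicator can be sketched directly: a discrete KL divergence over a shared object vocabulary, where a prompt whose prompt-only distribution hugs the dataset's answer distribution gets a low score. The direction of the divergence, the smoothing constant, and the toy distributions below are our assumptions for illustration:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy example: a prompt-only distribution that closely fits the dataset's
# answer distribution has a much lower KL, mimicking a "better" prompt.
answers  = [0.50, 0.30, 0.15, 0.05]  # dataset answer distribution
fitted   = [0.48, 0.31, 0.16, 0.05]  # prompt-only dist. of an over-fitted prompt
unfitted = [0.10, 0.20, 0.30, 0.40]  # prompt-only dist. of a poorly fitted prompt
print(kl_divergence(answers, fitted) < kl_divergence(answers, unfitted))  # True
```

Under the paper's finding, the first prompt would score higher P@1 on this dataset without being a better knowledge extractor.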

Case-based Analogy
The case-based analogy paradigm enhances the prompt-based paradigm with several illustrative cases. For example, to find the "place-of-birth" of Steve Jobs, we first sample cases such as (Obama, place-of-birth, Hawaii) and combine them with the original query. In this way, we use "Obama was born in Hawaii. [SEP] Steve Jobs was born in [MASK]." to query MLMs.

Overall Conclusion
Conclusion 2. Illustrative cases guide MLMs to better recognize the object type, rather than to better predict facts.
To show this, we first design an effective algorithm that induces the type of an entity set based on the Wikidata taxonomy, which can identify the object type of a relation. According to the induced types, we find that the benefits of illustrative cases mainly stem from improved object type recognition. In other words, case-based analogy gives MLMs better type prediction ability but contributes little to entity prediction ability. In the following, we first describe our type induction algorithm, and then explain how we reach this conclusion.

Figure 5: Illustration of our type induction algorithm. The numbers to the right of each type indicate how many entities the type covers. The type of an entity set is the finest-grained type in the type graph that covers a sufficient number of the instances in the entity set, which is City in this example.

Entity Set Type Induction
To induce the object type of a relation, we first collect all of its objects in LAMA to form an entity set. We then induce the type of this entity set with a simple but effective algorithm. The main intuition is that a representative type should be the finest-grained type that covers a sufficient number of the instances in the entity set. Figure 5 shows an example. Given a set of entities in Wikidata, we first construct an entity type graph (ETG) by recursively introducing all ancestor entity types according to the instance-of and subclass-of relations. In the example in Figure 5, Chicago is in the entity set and is an instance-of Big City, and Big City is a subclass-of City; as a result, Chicago, Big City, and City are all introduced into the ETG. Then we apply topological sorting (Cook, 1985) to the ETG to obtain a fine-to-coarse entity type sequence. Finally, based on this sequence, we select the first type that covers more than 80% of the entities in the entity set (e.g., City in Figure 5). Table 4 lists several induced types; please refer to the Appendix for details.
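The three steps above can be sketched end-to-end with Python's stdlib graphlib for the fine-to-coarse ordering. The data-structure choices, helper names, and the handling of the 80% threshold are our assumptions; real Wikidata would supply the instance-of/subclass-of maps:

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def induce_type(entities, instance_of, subclass_of, threshold=0.8):
    """Return the finest-grained type covering >= threshold of `entities`.
    `instance_of`: entity -> set of direct types.
    `subclass_of`: type -> set of parent types (a DAG, as in Wikidata)."""
    # 1) Build the entity type graph (ETG): every direct type of every
    #    entity plus all ancestors, with per-type coverage sets.
    covered = defaultdict(set)

    def add_ancestors(t, seen):
        for parent in subclass_of.get(t, ()):
            if parent not in seen:
                seen.add(parent)
                add_ancestors(parent, seen)

    for e in entities:
        types = set(instance_of.get(e, ()))
        for t in list(types):
            add_ancestors(t, types)
        for t in types:
            covered[t].add(e)

    # 2) Topologically sort the ETG fine-to-coarse: a type's subclasses
    #    (finer types) are its predecessors, so they are emitted first.
    preds = {t: set() for t in covered}
    for child in covered:
        for parent in subclass_of.get(child, ()):
            if parent in preds:
                preds[parent].add(child)
    order = TopologicalSorter(preds).static_order()

    # 3) Pick the first type in the sequence with sufficient coverage.
    need = threshold * len(entities)
    for t in order:
        if len(covered[t]) >= need:
            return t
    return None

# Toy taxonomy mirroring the Figure 5 example.
instance_of = {"Chicago": {"Big City"}, "Naperville": {"Big City"},
               "Springfield": {"City"}, "Gary": {"City"}, "Peoria": {"City"}}
subclass_of = {"Big City": {"City"}, "City": {"Human Settlement"}}
print(induce_type(list(instance_of), instance_of, subclass_of))  # City
```

Big City only covers 2 of the 5 entities, so the search moves up the sequence and stops at City, the finest type covering at least 80% of the set.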

Cases Help Type Recognition
Finding 4. Illustrative cases help MLMs to better recognize the type of objects, and therefore improve factual knowledge extraction.
Figure 6: Percentages of queries whose overall rank (among all candidates) and in-type rank (among candidates of the same type) of the gold answer changed. The illustrative cases mainly raise the overall rank but not the in-type rank, which means the performance improvements mainly come from better type recognition.

For case-based analogy, the first thing we want to know is whether illustrative cases can improve knowledge extraction performance. To this end, for each (subject, relation) query in LAMA, we randomly sample 10 illustrative cases. To avoid answer leakage, we ensure that the objects of these cases do not contain the gold answer of the query. Then we use (cases, subject, prompt, [MASK]) as the analogical query to MLMs.
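The sampling constraint and query construction can be sketched as follows; the helper names, the [SEP] joining, and the fixed seed are our assumptions:

```python
import random

def sample_cases(triples, query_subject, gold_object, k=10, seed=0):
    """Sample k illustrative (subject, object) cases for a relation,
    excluding any case whose object equals the query's gold answer."""
    pool = [(s, o) for s, o in triples
            if s != query_subject and o != gold_object]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))

def build_analogy_query(cases, subject, template):
    """Prepend the filled-in cases to the masked query, joined by [SEP]."""
    parts = [template.replace("[X]", s).replace("[MASK]", o) for s, o in cases]
    parts.append(template.replace("[X]", subject))
    return " [SEP] ".join(parts)

triples = [("Obama", "Hawaii"), ("Einstein", "Ulm"), ("Jobs", "California")]
cases = sample_cases(triples, "Jobs", "California", k=1)
print(build_analogy_query(cases, "Jobs", "[X] was born in [MASK] ."))
```

The filter on the gold object is the answer-leakage control described above; without it, case-based gains would be confounded with the leakage effect analyzed in Section 5.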
Results show that case-based analogy can significantly improve performance: after introducing illustrative cases, the mean precision increases from 30.36% to 36.23%. Besides, we find that 11.81% of instances benefit from the introduced cases while only 5.94% are undermined. This shows that case-based analogy really helps MLMs make better predictions.
By analyzing the predicting behaviors, we observe that the main benefit of introducing illustrative cases comes from better type recognition. To verify this observation, we investigate how the types of predictions change after introducing illustrative cases. Table 4 shows the results on relations whose precision improves by more than 10% after introducing illustrative cases. From the table, it is obvious that illustrative cases enhance factual knowledge extraction by improving type prediction: 1) for queries whose predictions are correctly reversed (from wrong to right), the vast majority of the reversals stem from a revised type prediction; 2) even for queries whose predictions are mistakenly reversed (from right to wrong), the type of the majority of predictions remains correct. In conclusion, introducing illustrative cases significantly improves knowledge extraction by making object type recognition more accurate. That is, illustrative cases mainly provide guidance on the object type.

Table 4: Detailed analysis on relations where the mean precision increased by more than 10%. Precision ∆ and Type Prec. ∆ represent the precision changes on the answer and on the type of the answer, respectively. "w/ Type Change" and "w/o Type Change" denote whether the type of the prediction changed after introducing illustrative cases. "-" indicates there are no queries whose predictions are mistakenly reversed.

Cases do not Help Entity Prediction
Finding 5. Illustrative cases are of limited help in selecting the answer from entities of the same type.
To show this, we introduce a new metric referred to as in-type rank: the rank of the correct answer among the entities of the same type for a query. By comparing the in-type rank under the prompt-based and case-based paradigms, we can evaluate whether illustrative cases actually help entity prediction beyond type recognition. Figure 6 shows the percentages of queries whose overall rank (among all candidates) and in-type rank (among candidates of the same type) of the gold answer changed. Unfortunately, we find that illustrative cases are of limited help for entity prediction: the change in in-type rank is nearly random. The percentages of queries with raised/unchanged/dropped in-type rank are nearly the same: 33.05% vs. 35.47% vs. 31.47%. Furthermore, the in-type MRR only changes from 0.491 to 0.494, showing little improvement after introducing illustrative cases. These results show that the rise in the overall rank of the gold answer does not come from better prediction within the same type. In conclusion, illustrative cases cannot effectively guide entity prediction; they mainly benefit factual knowledge extraction by providing guidance for object type recognition.
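The metric can be computed from a model's full candidate ranking plus the induced types. A sketch with our own helper names and toy data:

```python
def in_type_rank(ranked_predictions, gold, type_of, gold_type):
    """Rank of the gold answer among candidates sharing its induced type
    (1-based), or None if the gold answer is absent from the ranking."""
    same_type = [c for c in ranked_predictions if type_of.get(c) == gold_type]
    return same_type.index(gold) + 1 if gold in same_type else None

# Toy model ranking (best first) and induced types.
type_of = {"California": "State", "Hawaii": "State",
           "Paris": "City", "Texas": "State"}
preds = ["Paris", "Texas", "California", "Hawaii"]
print(in_type_rank(preds, "California", type_of, "State"))  # 2
```

Here the gold answer's overall rank is 3, but after removing the off-type candidate "Paris", its in-type rank is 2; the paper's point is that illustrative cases improve the former metric while leaving the latter essentially unchanged.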

Context-based Inference
The context-based inference paradigm augments the prompt-based paradigm with external contexts. For example, to find the "place-of-birth" of Steve Jobs, we can use the external context "Jobs was from California." and form the context-enriched query "Jobs was from California. [SEP] Steve Jobs was born in [MASK]." to query MLMs. Specifically, we use the same context retrieval method as Petroni et al. (2020): for each instance, given the subject and relation as the query, we use the first paragraph of the document retrieved by DrQA (Chen et al., 2017) as the external context.

Table 5: Comparison between prompt-based and context-based paradigms, grouped by whether the answer is present in the context. Only contexts containing the answer significantly improve performance.

Overall Conclusion
Conclusion 3. Additional contexts help MLMs predict the answer because they contain the answer, explicitly or implicitly.

Several studies (Petroni et al., 2020; Bian et al., 2021) show that external context can help knowledge extraction from MLMs. To investigate the underlying mechanism, we evaluate which kinds of information in the contexts contribute to fact prediction, and find that the improvement mainly comes from answer leakage in the context. Furthermore, we find that answers can be leaked not only explicitly, but also implicitly when the context provides sufficient information.

Explicit Answer Leakage Helps
Finding 6. Explicit answer leakage significantly improves the prediction performance.
Table 7: Comparison between prompt-based and context-based paradigms, grouped by whether the masked answer in the context can be reconstructed from the remaining context. Contexts from which the masked answer can be reconstructed are much more likely to improve performance.

To show this, we split LAMA into two parts according to whether the additional context contains the answer. Table 5 shows the results on the two parts. We can see that the improvements diverge significantly. For contexts containing the answer, context-based inference significantly improves factual knowledge extraction performance. However, there is even a slight performance drop for instances whose context does not contain the answer. This indicates that the improvement in factual knowledge extraction is mainly due to the explicit presence of the answer in the context.
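The split itself is a simple string-containment check over each instance's retrieved context; the field names below are our assumptions:

```python
def split_by_leakage(instances):
    """Partition instances by whether the retrieved context contains the
    gold answer (explicit answer leakage)."""
    leaked, clean = [], []
    for inst in instances:
        (leaked if inst["answer"].lower() in inst["context"].lower()
         else clean).append(inst)
    return leaked, clean

data = [{"answer": "California", "context": "Jobs was from California."},
        {"answer": "Hawaii", "context": "Obama served as president."}]
leaked, clean = split_by_leakage(data)
print(len(leaked), len(clean))  # 1 1
```

A substring check is a lower bound on leakage; morphological variants of the answer would need a tokenized or lemmatized match, which the sketch deliberately omits.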

Implicit Answer Leakage Helps
Finding 7. Implicit answer leakage can also significantly improve the prediction performance.

As mentioned above, explicit answer leakage significantly helps answer prediction. An answer-leaking context may explicitly provide the answer, or implicitly guide the prediction by providing answer-specific information. To understand the underlying mechanism, we mask the answer in the context and verify whether the performance gain persists. Table 6 shows the results. We find that the performance gain is still very significant after masking the answer, which indicates that contexts previously containing the answer remain very effective even when the answer is not explicitly present. To further investigate the reason, we split the masked version of the answer-leaked instances into two groups according to whether MLMs can correctly reconstruct the masked answer from the remaining context. The results are shown in Table 7. The performance gain diverges significantly between the two groups: the improvements mainly come from the instances whose in-context answer can be reconstructed, which we refer to as implicit answer leakage. That is to say, for these instances, the context serves as a sufficient delegator of its answer, and therefore MLMs can still obtain sufficient answer evidence even though the answer does not explicitly appear. For contexts that cannot reconstruct the masked answer, the improvements are relatively minor. In conclusion, the real efficacy of context-based inference comes from the sufficient answer evidence provided by the context, either explicitly or implicitly.
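The masking step can be sketched as a case-insensitive replacement of the answer string in the context (helper name ours); whether the MLM can then re-fill the mask from the remaining context is what separates implicit leakage from no leakage:

```python
import re

def mask_answer(context: str, answer: str, mask_token: str = "[MASK]") -> str:
    """Replace every occurrence of the answer in the context with the mask
    token, so the model cannot copy the answer string verbatim."""
    return re.sub(re.escape(answer), mask_token, context, flags=re.IGNORECASE)

print(mask_answer("Jobs was from California.", "California"))
# Jobs was from [MASK].
```

Feeding the masked context back to the MLM and checking whether the original answer tops the prediction at the mask position implements the reconstruction test described above.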

Conclusions and Discussions
In this paper, we thoroughly study the underlying mechanisms of MLMs on three representative factual knowledge extraction paradigms. We find that the prompt-based retrieval is severely promptbiased, illustrative cases enhance MLMs mainly via type guidance, and external contexts help knowledge prediction mostly because they contain the correct answer, explicitly or implicitly. These findings strongly question previous conclusions that current MLMs could serve as reliable factual knowledge bases.
The findings of this paper can benefit the community in many directions. By explaining the underlying predicting mechanisms of MLMs, we provide reliable explanations for many previous knowledge-intensive techniques. For example, our findings explain why and how incorporating external contexts helps knowledge extraction and CommonsenseQA (Talmor et al., 2019). Our findings also reveal why PLM probing datasets may be unreliable and how evaluation can be improved by designing de-biased evaluation datasets.
This paper also sheds light on future research directions. For instance, knowing that the main benefit of illustrative cases comes from type guidance, we can enhance many type-centric prediction tasks such as NER (Lample et al., 2016) and factoid QA (Iyyer et al., 2014). Moreover, based on the mechanism of incorporating external contexts, we can better evaluate, seek, and denoise external contexts for different tasks using MLMs. For example, we can assess and select appropriate facts for CommonsenseQA based on whether they can reconstruct the candidate answers. This paper focuses on masked language models, which have been shown to be very effective and are widely used. We also want to investigate another representative category of language models, generative pre-trained models (e.g., GPT-2/3 (Radford et al., 2019; Brown et al., 2020)), which have been shown to have quite different mechanisms; we leave this for future work due to the page limit.

A WIKI-UNI Construction Details
To construct WIKI-UNI, we first collect from Wikidata (Vrandečić and Krötzsch, 2014) all triples belonging to the same 41 relations as LAMA, and then randomly sample 50K triples with a single-token object for each relation. As with LAMA, we filter out instances whose object is not in the MLMs' vocabulary. For each relation, we group the instances by object, denote the frequency of each object by f_o, and denote the median of f_o by f_m. For groups where f_o > f_m, we randomly sample f_m instances; we delete the groups where f_o < f_m. This yields a dataset, WIKI-UNI, with a uniform answer distribution. There are 70K facts in WIKI-UNI and 34K facts in LAMA. Since BERT and RoBERTa have different vocabularies, the datasets used for their evaluation differ slightly.
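The down-sampling scheme described above can be sketched as follows; the function and variable names are ours, and f_m is the median object frequency:

```python
import random
from collections import Counter, defaultdict
from statistics import median

def uniformize(triples, seed=0):
    """Down-sample (subject, object) pairs so every surviving object appears
    equally often: objects rarer than the median frequency f_m are dropped,
    more frequent ones are down-sampled to f_m."""
    groups = defaultdict(list)
    for s, o in triples:
        groups[o].append((s, o))
    f_m = int(median(len(g) for g in groups.values()))  # truncate if even count
    rng = random.Random(seed)
    out = []
    for g in groups.values():
        if len(g) >= f_m:
            out.extend(rng.sample(g, f_m))
    return out

# Toy relation: object A appears 4 times, B twice, C once (f_m = 2).
triples = [("s1", "A"), ("s2", "A"), ("s3", "A"), ("s4", "A"),
           ("s5", "B"), ("s6", "B"), ("s7", "C")]
print(Counter(o for _, o in uniformize(triples)))
```

After uniformizing, A is down-sampled to 2, B is kept whole, and C is dropped, so every remaining object has the same frequency, which is exactly the uniform answer distribution WIKI-UNI targets.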

B Results on RoBERTa-large
Our conclusions are similar on BERT-large and RoBERTa-large; we therefore report the results of BERT-large in the article and the results of RoBERTa-large here.

B.1 Prompt-based Retrieval

Figure 7 shows the very significant correlation between the prediction distributions on LAMA and WIKI-UNI for RoBERTa-large: for all three kinds of prompts, the Pearson correlation coefficient between the two prediction distributions exceeds 0.9 in most relations. Table 8 shows the percentage of instances covered by the top-k object entities for RoBERTa-large.

B.2 Case-based Analogy

Table 9 shows the performance improvement after introducing illustrative cases for the RoBERTa-large model; we can see that illustrative cases also significantly increase knowledge extraction performance for RoBERTa-large. Table 14 shows how the entity types of predictions changed after introducing illustrative cases for the RoBERTa-large model; the conclusion is similar to that for BERT-large. Figure 8 shows the percentage of changes in overall rank and in-type rank for the RoBERTa-large model.

Another finding is that BERT-large has better type prediction ability than RoBERTa-large, even without illustrative cases. We calculate the overall type precision under the prompt-based paradigm (the percentage of predictions whose type is correct): the type precision is 68% for BERT-large but only 51% for RoBERTa-large, which partly explains why the performance of RoBERTa-large is significantly worse than that of BERT-large on the LAMA dataset.

B.3 Context-based Inference

Table 10 shows the comparison of contexts grouped by whether they contain the answer for RoBERTa-large. For contexts containing the answer, context-based inference significantly improves factual extraction performance, while there is a performance drop for instances whose context does not contain the answer. Table 11 shows the overall performance improvements when introducing different external contexts for RoBERTa-large. Table 12 shows the comparison of masked contexts based on whether the masked answer can be reconstructed for RoBERTa-large; the improvements mainly come from the instances whose in-context answer can be reconstructed.

Table 13 shows the detailed analysis of all relations using the case-based analogy paradigm for BERT-large, and Table 14 shows the corresponding results for RoBERTa-large. Because of the page limit, another finding we did not mention in the article is that, apart from "type guidance", illustrative cases can also provide "surface form guidance" for a few relations (e.g., part of, applies to jurisdiction, subclass of). Here, "surface form guidance" means that the object entity name (e.g., Apple) is a substring of the subject entity name (e.g., Apple Watch). This phenomenon is also mentioned in Poerner et al. (2020).

Table 13: A detailed analysis of all relations using the case-based analogy paradigm for BERT-large, corresponding to Table 4 in the article. "-" indicates that the number of queries whose predictions are reversed correctly or mistakenly is less than 3.

Table 14: A detailed analysis of all relations using the case-based analogy paradigm for RoBERTa-large, corresponding to Table 4 in the article. "-" indicates that the number of queries whose predictions are reversed correctly or mistakenly is less than 3.