When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Despite their impressive performance on diverse tasks, large language models (LMs) still struggle with tasks requiring rich world knowledge, implying the difficulty of encoding a wealth of world knowledge in their parameters. This paper aims to understand LMs’ strengths and limitations in memorizing factual knowledge, by conducting large-scale knowledge probing experiments on two open-domain entity-centric QA datasets: PopQA, our new dataset with 14k questions about long-tail entities, and EntityQuestions, a widely used open-domain QA dataset. We find that LMs struggle with less popular factual knowledge, and that retrieval augmentation helps significantly in these cases. Scaling, on the other hand, mainly improves memorization of popular knowledge, and fails to appreciably improve memorization of factual knowledge in the tail. Based on those findings, we devise a new method for retrieval-augmentation that improves performance and reduces inference costs by only retrieving non-parametric memories when necessary.


Introduction
Large language models (LMs; Brown et al. 2020;Raffel et al. 2020) have been shown to be competitive on diverse NLP tasks, including knowledgeintensive tasks that require fine-grained memorization of factual knowledge (Chowdhery et al., 2022;Yu et al., 2022). Meanwhile, LMs have also been shown to have limited memorization for less frequent entities (Kandpal et al., 2022), are prone to hallucinations (Shuster et al., 2021), and suffer from temporal degradation (Kasai et al., 2022;Jang et al., 2022). Incorporating non-parametric knowledge (i.e., retrieved text chunks) largely helps address those issues stemming from reliance on LMs' parametric knowledge-knowledge stored 1 Our code and data are available at https://github. com/AlexTMallen/adaptive-retrieval. in their parameters (Izacard et al., 2022b)-but it is unclear whether it is strictly superior or complementary to parametric knowledge. Understanding when we should not trust LMs' outputs is also crucial to safely deploying them in real-world applications (Kadavath et al., 2022). This work conducts a large-scale knowledge probing of LMs on factual knowledge memorization, to understand when we should and should not rely on LMs' parametric knowledge, and how scaling and non-parametric memories (e.g., retrievalaugmented LMs) can help. In particular, we aim to address the following research questions:

Not memorized in parameters
(RQ 1 ) How much factual knowledge is memorized by LMs and what factors affect the memorization? (Section 4) (RQ 2 ) To what extent can non-parametric memories alleviate the shortcomings of parametric memories of LMs? (Section 5) (RQ 3 ) Can we build a system to adaptively combine non-parametric and parametric memories? (Section 6) We hypothesize that factual knowledge frequently discussed on the web is easily memorized by LMs, while the knowledge that is less discussed may not be well captured and thus they require retrieving external non-parametric memories. We evaluate ten large LMs of three families (i.e., OPT, with varying scales on the open-domain question answering (QA) task in a zero-or few-shot prompting manner. We construct a new dataset, POPQA, consisting of 14k questions to cover factual information in the long tail that might have been missed in popular QA datasets (Kwiatkowski et al., 2019). We use Wikipedia page views as a measure of popularity and convert knowledge triples from Wikidata, with diverse levels of popularity, into natural language questions, anchored to the original entities and relationship types. We also use EntityQuestions (Sciavolino et al., 2021), an open-domain QA dataset with a long-tail distribution.
On both datasets, LMs' memorization (RQ 1 ) is often limited to the popular factual knowledge and even GPT-3 davinci-003 fails to answer the majority of the long-tail questions. Moreover, on such questions, scaling up models does not significantly improve the performance (e.g., for the 4,000 least popular questions in POPQA, GPT-j 6B has 16% accuracy and GPT-3 davinci-003 has 19% accuracy). This also suggests that we can predict if LMs memorize certain knowledge based on the information presented in the input question only.
We next investigate whether a semi-parametric approach that augments LMs with retrieved evidence can mitigate the low performance on questions about less popular entities (RQ 2 ). Nonparametric memories largely improve performance on long-tail distributions across models. Specifically, we found that retrieval-augmented LMs are particularly competitive when subject entities are not popular: a neural dense retriever (Izacard et al., 2022a)-augmented GPT-neo 2.7B outperforms GPT-3 davinci-003 on the 4,000 least popular questions. Surprisingly, we also find that retrieval augmentation can hurt the performance of large LMs on questions about popular entities as the retrieved context can be misleading.
As a result, we devise a simple-yet-effective retrieval-augmented LM method, Adaptive Retrieval, which adaptively combines parametric and non-parametric memories based on popularity (RQ 3 ). This method further improves performance on POPQA by up to 10%, while significantly reducing the inference costs, especially with larger LMs (e.g., reducing GPT-3 API costs by half), indicating the potential for future research in more efficient and powerful retrieval-augmented LMs.

Related Work
Parametric and non-parametric knowledge. Petroni et al. (2019) demonstrate that large pretrained LMs such as BERT (Devlin et al., 2019) memorize the significant amount of world knowledge in their parameters (parametric knowledge), and Roberts et al. (2020) show that fine-tuned T5 without any reference documents (closedbook QA) can achieve competitive performance on open-domain QA. More recent and powerful LMs (Brown et al., 2020;Chowdhery et al., 2022) further improve performance on diverse knowledge-intensive tasks, leveraging their strong parametric memories (Kandpal et al., 2022;Yu et al., 2022). However, relying solely on their parameters to encode a wealth of world knowledge requires a prohibitively large number of parameters and the knowledge can become obsolete quickly (Kasai et al., 2022;Jang et al., 2022). Recent work shows that augmenting LMs with nonparametric memories (i.e., retrieved text chunks) enables much smaller models to match the performance of larger models (Izacard et al., 2022b;Khandelwal et al., 2020;Min et al., 2022), although Chen et al. (2022) and Longpre et al. (2021) show that even those models can ignore non-parametric knowledge and rely on parametric knowledge.
Understanding memorization. Several prior work establishes a positive relationship between string frequency in pre-training corpora and memorization (Carlini et al., 2022;Razeghi et al., 2022). Concurrent to our work, Kandpal et al. (2022) show that the co-occurrence of the question and answer entities in pretraining corpora has a positive correlation with models' QA accuracy on popular open-domain QA benchmarks such as Natural Questions (Kwiatkowski et al., 2019). This work, instead, attempts to predict memorization using the variables available in the input question only and uses popularity to obtain a proxy for how frequently an entity is likely to be discussed on the web. Importantly, by constructing a new dataset, we can conduct fine-grained controlled experiments across a wide range of popularities, allowing the investigation of hypotheses that might have been missed in prior analysis using existing open QA datasets. We further analyze the effectiveness and limitations of Figure 2: POPQA is created by sampling knowledge triples from Wikidata and converting them to natural language questions, followed by popularity calculation. retrieval-augmented LMs and introduce Adaptive Retrieval. Prior work investigates the effectiveness of deciding when to use non-parametric memories at the token level in k-nn LM (He et al., 2021). This work is the first work to study the effectiveness of deciding whether to retrieve for each query and show their effectiveness in retrieval-augmented LM prompting.

Evaluation Setup
We evaluate LMs' ability to memorize factual knowledge through closed-book QA tasks with fewshot samples. We evaluate LMs on our new dataset, POPQA (

Focus and Task
Focus: factual knowledge. Among diverse types of world knowledge, this work focuses on factual knowledge (Adams, 2015) of entities-knowledge about specific details of the target entities. We define factual knowledge as a triplet of (subject, relationship, object) as in Figure 2 left. given a question, a model predicts an answer without any pre-given ground-truth paragraph. 2 As in Kandpal et al. (2022), we study few-shot settings and prompt LMs without any parameter updates, instead of fine-tuning them on QA datasets such as in Roberts et al. (2020). Metrics: accuracy. We mark a prediction as correct if any substring of the prediction is an exact match of any of the gold answers.

Dimensions of Analysis
We hypothesize that factual knowledge that is less frequently discussed on the web may not be wellmemorized by LMs. Previous research often uses the term frequency of object entities in pretraining corpora to understand memorization (Févry et al., 2020;Kandpal et al., 2022;Razeghi et al., 2022). Instead, we investigate whether it's possible to predict memorization based on the input information only, and then apply the findings for modeling improvements, unlike prior analyses. Therefore, our work focuses on the other two variables in a factual knowledge triple: the subject entity and the relationship type. Subject entity popularity. We use the popularity of the entities measured by Wikipedia monthly page views as a proxy for how frequently the entities are likely to be discussed on the web, instead of using the occurrence of entities or strings in the pretraining corpus (Carlini et al., 2022;Kandpal et al., 2022;Razeghi et al., 2022). Calculating frequencies over large pretraining corpora requires massive computations to link entities over billions of tokens, or can result in noisy estimations. 3 Our initial studies show that this is much cheaper 4 and aligns well with our intuition. Relationship type. We also consider the relationship types as key factors for factual knowledge memorization. For example, even given the same combinations of the subject and object entities, model performance can depend on the relationship types; relationship types widely discussed can be easier to be memorized, while types that are less discussed may not be memorized much.  Figure 3.
To construct POPQA, we randomly sample knowledge triples of 16 diverse relationship types from Wikidata and convert them into natural language questions, using a natural language template (depicted in Figure 2). We verbalize a knowledge triple (S, R, O) into a question that involves substituting the subject S into a template manually written for the relationship type R. The full list of templates is found in Table 2 of the Appendix. The set of acceptable answers to the question is the set of entities E such that (S, R, E) exists in the knowledge graph. We tried various templates and found that the results were fairly robust to the templates. Since POPQA is grounded to a knowledge base, links to Wikidata entities allow for reliable analysis of popularity and relationship types. EntityQuestions. We test on another popular opendomain QA dataset, EntityQuestions (Sciavolino et al., 2021), which also covers a long-tail entity distribution. They use Wikipedia hyperlink counts as a proxy of the frequency of entities and sample knowledge triples from WikiData, from the frequency distributions. Unlike POPQA, EntityQuestions doesn't provide entity annotations, so we only use 82% of the questions, where the mention of the subject entity has a unique match with a Wikidata entity.

Memorization Depends on Popularity and Relationship Type
We evaluate a range of LMs with varying numbers of parameters, to quantify how much factual knowledge they memorize and how different factors affect those memorization behaviors (RQ 1 ). Instructions and demonstrations. We use a simple template "Q: <question> A:" to format all of our questions for generative prediction. More sophisticated instructions were attempted in preliminary experiments but they did not improve upon the simple template significantly enough to merit using them, especially given that they may overfit to the model. While we use zero-shot prompting for GPT-3 to reduce API costs, 6 we use 15-shot prompting for all GPT-neo and OPT models.

Results
Overall model performance. The top left column of Figure 4 illustrates the overall performance on POPQA. As shown, even without using incontext examples, larger LMs exhibit reasonable performance: GPT-3 achieves 35% accuracy, and GPT-Neo 20B achieves 25% accuracy. This indicates that large LMs memorize factual knowledge in their parameters to some extent. This section examines which types of knowledge are better memorized and what factors influence memorization.
Subject entity popularity predicts memorization. Figure 4 (bottom) shows that there is a positive correlation between subject entity popularity and models' accuracy for almost all relationship types. This supports our hypothesis that subject entity popularity can be a reliable indicator of LMs' factual knowledge memorization. In general, the correlations between subject entity popularity and accuracy are stronger for larger LMs; GPT-3 003 shows the highest positive correlation (roughly 0.4) while GPT-Neo-1.3B shows relatively weak positive correlations (approximately 0.1).
Relationship types affects memorization. We find that models have a higher average performance for some relationship types than for others. While this is evidence that factual knowledge of some relationship types are more easily memorized than others, we also observe that questions of certain relationship types can be easily guessed without memorizing the knowledge triple. Specifically, certain relationship types (e.g., nationalities) allow models Correlation between log(popularity) and accuracy to exploit surface-level artifacts in subject entity names (Poerner et al., 2020;Cao et al., 2021). Additionally, models often output the most dominant answer entities for questions about relationship types with fewer answer entities (e.g., red for the color relationship type). In Figure 4, relationships with lower correlation (e.g., country, sport) often shows higher accuracy, indicating that on those relationship types, models may exploit surface-level clues. On the other hand, for relationship types with relatively low accuracy (e.g., occupation, author, director), larger LMs often show a high correlation. Further details are in Appendix C.1.
Scaling may not help with tail knowledge. As seen in the left column of Figure 4, there are clear overall performance improvements with scale on the POPQA dataset. However, Figure 5 shows that on both POPQA and EntityQuestions, most of scaling's positive effect on parametric knowledge comes from questions with high popularity. Specifically, for the questions about the entities whose log 10 (popularity) is larger than 4, there is an improvement in accuracy as model size increases (red and yellow lines), while performance on questions with lower popularity remains relatively constant (blue and green lines). For the 4,000 least popular questions, GPT-Neo 6B, 20B, and GPT-3 davinci-003 have 15%, 16%, and 19% accuracy, respectively.
This somewhat dampens prior works' findings that scaling up models significantly improves their factual knowledge memorization (Roberts et al., 2020;Kandpal et al., 2022). We hypothesize that this is because their evaluations are often conducted on QA datasets with popular entities. In sum, scaling lowers the threshold of popularity for knowledge to be reliably memorized, but is not projected to move the threshold far into the long tail for practical model scales.
Relationship type results breakdown. Figure 6 provides a closer look at the relationship between popularity, accuracy, and relationship type; it shows model accuracy over the popularity distributions for director and country. For the first two types, we can see a clear positive trend between popularity and accuracy across models, and as the model size gets larger, the LMs memorize more.
On the other hand, in the "country" relationship type, no models show trends, while overall the accuracy is high, indicating the LMs often exploit artifacts to answer less popular questions. We show example models' predictions in Appendix Section C.3.

Non-parametric Memory Complements Parametric Memory
Our analysis indicates that even the current stateof-the-art LMs struggle with less popular subjects or certain relationship types, and increasing the model size does not lead to further performance improvements. In light of this, we extend our analysis to non-parametric sources of knowledge, as outlined in Section (RQ 2 ). Specifically, we investigate the effectiveness of retrieval-augmented LMs (Borgeaud et al., 2022;Lewis et al., 2020), which leverage non-parametric memories (i.e., retrieved text) to improve performance.

Experimental Setup
Augmenting input. In this work, we try a simple retrieval-augmented LM approach, where we run an off-the-  and Contriever (Izacard et al., 2022a). BM25 is a static term-based retriever without training, while Contriever is pretrained on large unlabeled corpora, followed by fine-tuning on MS MARCO (Bajaj et al., 2016). We also experiment with a parametric augmentation method, GenRead (Yu et al., 2022), which prompts LMs to generate rather than retrieve a contextual document to answer a question. We use the ten LMs in Section 4, resulting in 40 LMs and retrieval-augmented LMs. Retrieving non-parametric memories significantly improves the performance of smaller models. Complete results on POPQA are found in Figure 13. Enti-tyQuestions results are in Figure 14 of the Appendix.

Results
Retrieval largely improves performance. Figure 7 shows that augmenting LMs with nonparametric memories significantly outperforms unassisted vanilla LMs. A much smaller LM (e.g., GPT-Neo 2.7B) augmented by the Contriever retrieval results outperforms vanilla GPT-3. Large LMs such as GPT-3 also enjoy the benefits of nonparametric memories. Contriever gives 7% accuracy gains on top of GPT-3 davinci-003. Gen-Read shows little-to-no performance improvement over vanilla parametric knowledge for smaller models, while the technique shows sizeable gains for GPT-3, especially davinci-003. In addition to its limited effectiveness with smaller LMs, GenRead has potentially prohibitive inference time costs, with GPT-NeoX 20B taking 70 seconds per query. Non-parametric memories are effective for less popular facts. How does retrieval augmentation lead to such significant improvements? Figure 8 shows the relationship between the entity popularity and models' QA performance. It can be seen that retrieval-augmented LMs guided by Contriever or BM25 have a clear advantage over unassisted vanilla LMs, especially on less popular entities, resulting in a significant performance gain.  of Contriever for questions that GPT-3 davinci-003 answered correctly and incorrectly with and without retrieval on POPQA. The percent of questions falling in each category is shown in parentheses. For 10% of questions, retrieval is harmful due to low-quality retrieved text (0.14 recall@1).
ready memorized the answers, and augmenting input with retrieved-context doesn't help much or even hurts the performance. Interestingly, Gen-Read generally outperforms vanilla LMs despite relying on LMs' parametric memory. This demonstrates the effectiveness of elicitive prompting (Wei et al., 2022;Sun et al., 2022) as observed in prior work. However, like vanilla LMs, GenRead shows low performance on less popular entities.

Non-parametric memories can mislead LMs.
We conduct an in-depth analysis of why retrievalaugmented models suffer in more popular entities. We hypothesize that retrieval results may not always be correct or helpful, and can mislead LMs.
To test this hypothesis, we group the questions based on two axes: whether unassisted GPT-3 davinci-003 predict correctly or not, and whether retrieval-augmented predictions are correct or not. For each of the four categories, we calculate re-call@1 (whether a gold answer is included in the top 1 document; Karpukhin et al. 2020). Table 1 shows recall@1 for each group with percentages of the questions falling into each of the categories. For 10% of questions, retrievalaugmentation causes the LM to incorrectly answer a question it could otherwise answer correctly. We found that on those questions, recall@1 is significantly lower than the overall recall@1 (0.14 vs 0.42 overall), indicating that failed retrieval can result in performance drops. Conversely, for the 17% of questions for which retrieval causes the LM to correctly answer a question it would otherwise have failed to answer, the recall@1 is 0.88. We include examples of both cases in Appendix Section C.3.

Adaptive Retrieval: Using Retrieval
Only Where It Helps While incorporating non-parametric memories helps in long-tail distributions, powerful LMs have already memorized factual knowledge for popular entities, and retrieval augmentation can be harmful. As outlined in (RQ 3 ), can we achieve the best of both worlds? We propose a simple-yeteffective method, Adaptive Retrieval, which decides when to retrieve passages only based on input query information and augments the input with retrieved non-parametric memories only when necessary. We show that this is not only more powerful than LMs or retrieval-augmented LMs always retrieving context, but also more efficient than the standard retrieval-augmented setup.

Method
Adaptive Retrieval is based on our findings: as the current best LMs have already memorized more popular knowledge, we can use retrieval only when they do not memorize the factual knowledge and thus need to find external non-parametric knowledge. In particular, we use retrieval for questions whose popularity is lower than a threshold (popularity threshold), and for more popular entities, do not use retrieval at all. Using a development set, the threshold is chosen to maximize the adaptive accuracy, which we define as the accuracy attained by taking the predictions of the retrieval-augmented system for questions below the popularity threshold and the predictions based on parametric knowledge for the rest. We determine the popularity threshold independently for each relationship type. Adaptive Vanilla/BM25 Figure 9: POPQA performance of GPT-neo models and GPT3 davinci-003, with different retrieval methods. Adaptive Retrieval robustly outperforms approaches that always retrieve, especially for larger LMs.

Results
Adaptive Retrieval improves performance. Figure 9 shows the results when we adaptively retrieve non-parametric memories based on the perrelationship type thresholds. We can see that adaptively retrieving non-parametric memories is effective for larger models. The best performance on POPQA is using GPT-3 davinci-003 adaptively with GenRead and Contriever, yielding 46.5% accuracy, 5.3% higher than any non-adaptive method.  Figure 10: The proportion of questions for which various models use retrieval in the Adaptive Retrieval setup on POPQA. When using Adaptive Retrieval, small models must still rely on non-parametric memory for most questions, while larger models have more reliable parametric memories enabling them to use retrieval less often.
The threshold shifts with LM scale. While Adaptive Retrieval shows performance gains for larger models, smaller models do not realize the same benefits; as shown in Figure 9, the performance gain from Adaptive Retrieval is much smaller when we use models smaller than 10 billion. Why does this happen? Figure 10 shows that smaller LMs almost always retrieve, indicating that there are not many questions for which small LMs' parametric knowledge is more reliable than non-parametric memory. In contrast, large models typically retrieve much less. For example, GPT-3 davinci-003 only retrieves for 40% of questions when paired with BM25, and even the much smaller GPT-neox 20B does not retrieve documents on more than 20% of the questions. On EntityQuestions (Appendix Figure 15) all of the LMs retrieve much more, as the questions are mostly about less popular entities.

Adaptive Retrieval reduces inference-time costs.
We also found that Adaptive Retrieval improves efficiency; if we know we do not need to retrieve documents, we can skip retrieval components and the input length becomes shorter, which improves latency in both retrieval and language model components. Figure 11 shows the inference latency of GPT-J 6B and GPT-neox 20B, and API costs of GPT-3. Especially for larger LMs, concatenating retrieved context results in significantly increased latency (e.g., for GPT-J 6B, the inference time latency almost doubles). Adaptive retrieval enables reducing inference time up to 9% from standard  Figure 3), Adaptive Retrieval is able to reduce API costs by 15% while maintaining equivalent performance to retrieval only. retrieval. We also observe cost reduction on Enti-tyQuestions, as shown in Figure 12.

Discussion and Conclusions
This work conducts large-scale knowledge probing to examine the effectiveness and limitations of relying on LMs' parameters to memorize factual knowledge and to understand what factors affect factual knowledge memorization. Our results show that memorization has a strong correlation with entity popularity and that scaling up models on long-tail distributions may only provide marginal improvements. We also demonstrate that non-parametric memories can greatly aid LMs on these long-tail distributions, but can also mislead LMs on questions about well-known entities, as powerful LMs have already memorized them in their parameters. Based on those findings, we devise simple-yet-effective Adaptive Retrieval, which only retrieves when necessary, using a heuristic based on entity popularity and relationship types. Our experimental results show that this method is not only more powerful than LMs or previous retrieval-augmented LMs but also more efficient.

Limitations
This work focuses on entity-centric factual knowledge and demonstrates that LMs' memorization is heavily affected by the popularity of the entities and the aspect of the entities being asked in the questions. It is important to emphasize that for running controlled experiments, we have relied on two synthetic datasets, and the extent to which our results apply to naturally occurring factual knowledge has not been firmly established. While we can be fairly confident about the relationship between scaling, retrieval, popularity, relationship type, and performance for the kinds of knowledge studied here, the effectiveness of Adaptive Retrieval will depend on many details of the question answering pipeline. Moreover, our work depends on a definition of popularity that is time-dependent and may not perfectly reflect how frequently entities are discussed on the web. Wikipedia page views are one possible definition of popularity for which we observe our results, and we invite others to improve upon it in future work. Further research can expand upon this simple approach, perhaps drawing on insights from Kadavath et al. (2022) to improve the effectiveness of Adaptive Retrieval.
It is an open question if the same findings are applicable to other types of world knowledge such as commonsense. We conjecture that the concept of the subject topic (entity), as well as the aspect (relationship type), can be applied with some minor modifications, which future work can quantify memorization following our scheme.

Ethical Considerations
Recent work (Huang et al., 2022) shows that LMs memorize personal information available on the web, which has significant security issues. Our evaluation focuses on the memorization of general entity-centric knowledge, but our findings can be applicable to those areas. Our findings suggest that LMs are likely to have less reliable knowledge of minority groups. Parrish et al. (2022) established that models often rely on stereotypes to answer in uncertain cases, so our results indicate that LMs are likely to rely on stereotypes disproportionately for minority groups. Future work could investigate whether retrieval augmentation reduces bias in these cases. List of the relationship types and templates. In this work, we use the following 16 relationship types, and the authors of this paper manually annotated templates to verbalize knowledge triple to natural language questions. We show the final list of the templates used to create POPQA in Table 2. Figure 3 shows the distribution of subject popularity of POPQAand EntityQuestions versus the popular NQ benchmark. NQ may have multiple entities so the distribution of the least popular entity per question is shown. Subject entities from NQ were extracted using TagMe (Ferragina and Scaiella, 2010) on the NQ-open development set with a score threshold of 0.22. TagMe returns the title of a Wikidata entity which can be directly used to find popularity.  Knowledge triples sampling. In the construction of the POPQAdataset, knowledge triples are sampled with higher weight given to more popular entities, otherwise, the distribution would be dominated by the tail and we would not have enough high-popularity entities to complete our analysis. Specifically, when considering whether to sample a particular knowledge triple, we include the knowledge triple if and only if f > exp(8R − 6), where R ∼ U (0, 1) is a unit uniform pseudo-random number and f is the exact match term frequency of the subject entity's aliases in an 800 MB random sample of C4. To increase diversity, once 2000 knowledge triples of a particular relation type have been sampled, they are no longer sampled.
To run experiments using LMs larger than two billion parameters, we use a single V100 Volta GPU with 32GB GPU memories. We use int8bit (Zeng et al., 2022) quantization with OPT 13 billion and GPT-Neo 20 billion models to make them fit our GPUs. In our preliminary experiments using GPT-Neo 6 billion, we did not observe a notable performance drop by using the quantization.
Constructing few-shot contexts. For POPQA, we sample few-shot examples stratified by relationship type to diversify the samples: for each of the 15 relationship types other than the one in the test question, we sample one random question-answer pair to include in the context. For EntityQuestions, we take a simple random sample of 15 questionanswer pairs because there are more than 16 relationship types.
Details of deciding thresholds. We 75% of POPQAto determine a popularity threshold for each relation type. Using brute force search, we select the threshold to maximize the adaptive accuracy, which we define as the accuracy attained by taking the predictions of the retrieval-augmented system for questions below the popularity threshold and the predictions based on parametric knowledge for the rest.
We then evaluate adaptive accuracy using the learned thresholds on the remaining 25% of POPQA, and repeat with 100 different random splits and take the mean to obtain the reported adaptive accuracy measurement.  C Detailed Results

C.1 LM results
Full results of per-relationship type accuracy and correlation. Figure 16 shows the full result of per-relationship type accuracy for all relationship types in POPQA. Figure 17 shows the correlations for all relation types. Figures 19 and 18 show the same results for the EntityQuestions dataset.
Negative correlations of capital on EntityQuestions. As shown in Figure 19, the capital relationship types on in EntityQuestions, while on POPQA, this relationship shows relatively high correlations. We found that in EntityQuestions, this capital relationship type has many low-popularity questions whose answers are included in subject entity names (e.g., subject="canton of Marseille-Belsunce", ob-ject="Marseille"). This causes performance to have a U-shaped relationship with popularity for the capital relationship type, so if most of the questions sampled come from the top half of popularity, the linear correlation will be positive, and vice versa.

C.2 Retrieval-augmented LM results
Overall performance of retrieval-augmented LMs. Figure 13 shows the overall performance of 40 LMs and retrieval-augmented LMs on POPQA.
Retrieval-augmentation largely improves performance across different LMs, and much smaller models (GPT-Neo 1.3B) can perform on per with GPT-3. Figure 14 shows the results on EntityQuestions. Due to computational and time constraints, we were only able to run vanilla and Contriever results for most models.
Adaptive Retrieval for EntityQuestions. Figure 15 shows the proportion of questions above the retrieval threshold for various models using Adaptive Retrieval on EntityQuestions. Because Enti-tyQuestions has a large quantity of low-popularity questions, models (especially smaller ones) must rely heavily on retrieval.
Full results on all relationship types. Figure 20 shows the full results on POPQA of the retrievalaugmented LMs and unassisted LMs on 16 relationship types using three different LMs as backbones. Figure 21 shows these results for GPT-3 davinci-003 on EntityQuestions. Table 3 shows several examples on POPQA, where GPT-3 davinci-003 answers correctly while the Contriever-augmented version fails to answer. Along with the low recall@1 of 0.14 for this group, Table 3 suggests that the most common reason retrieval can be harmful is that it retrieves a document about a mistaken entity, such as a person with the same name as the subject, or an entity that simply is not relevant to the question (as in the case of "Noel Black").       Table 4 shows several examples on POPQA, where GPT-3 davinci-003 answers correctly only when augmented with Contriever. The recall@1 for this case is 0.88, which is significantly higher than the overall recall. Note that in the second example, the retrieval caused the LM to answer correctly, but only by coincidence: the subject entity "Pierre" actually refers to the city in South Dakota, not the Basketball player. Otherwise, retrieval appears to be helpful because it provides the relevant information directly.  (born August 17, 1959) is an American applied mathematician who works on the modeling and simulation of complex systems arising in physics and biology. This has included free-boundary problems in fluids and materials science... He is also the co-founder and co-director of the Courant Institute's Applied Mathematics Lab.

C.3 Qualitative Results
In what city was Zijah Sokolović born? (Sarajevo) Zijah Sokolović was born in Sarajevo ✓ Zijah Sokolović was born in Orahovac, Kingdom ✗ Ali Sokol ... (born 8 May 1921 in Orahovac, Kingdom of Serbs, Croats and Slovenes, died 23 September 1974) was a Yugoslav pulmonologist . Ali Sokol was born into an agricultural family. He was the third of four children of father and mother Hatixhes Solomon. It is unknown the exact date of birth but the most reliable date is May 8 year in 1921.

Lazar Ristovski ✓
In 1999 "The White Suit" an auteur film by Ristovski (director, writer, lead actor, and producer) was at the Cannes Film Festival in the Critics Week program. "The White Suit" was the Serbian entry for the 1999 Academy Awards. Lazar Ristovski is the sole owner of Zillion Film Company In 2006, he made a small appearance in the James Bond film "Casino Royale". He played Caruso in the 2004 movie "King of Thieves". He starred as Ðorde in the award-winning 2009 film "St. George Shoots the Dragon". Table 4: Qualitative examples of the questions where only retrieval-augmented LMs successfully answer correctly. The blue underlined text indicates the sub-strings matching the gold answers in the retrieved context.