Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models

While pre-trained language models (PLMs) have shown evidence of acquiring vast amounts of knowledge, it remains unclear how much of this parametric knowledge is actually usable in performing downstream tasks. We propose a systematic framework to measure parametric knowledge utilization in PLMs. Our framework first extracts knowledge from a PLM's parameters and subsequently constructs a downstream task around this extracted knowledge. Performance on this task thus depends exclusively on utilizing the model's possessed knowledge, avoiding confounding factors such as insufficient training signal. As an instantiation, we study the factual knowledge of PLMs and measure utilization across PLMs ranging from 125M to 13B parameters. We observe that: (1) PLMs exhibit two gaps, in acquired versus utilized knowledge; (2) they show limited robustness in utilizing knowledge under distribution shifts; and (3) larger models close the acquired-knowledge gap, but the utilized-knowledge gap remains. Overall, our study provides insights into PLMs' capabilities beyond their acquired knowledge.


Introduction
Recent research has demonstrated that language models pre-trained on vast amounts of internet data acquire a broad range of knowledge about linguistic structures (Tenney et al., 2019b; Blevins et al., 2022), encyclopedic relations (Petroni et al., 2019; Hao et al., 2022), commonsense (Zhou et al., 2020; Liu et al., 2022a), and even coding and reasoning rules (Chen et al., 2021; Wei et al., 2022b). Recent studies on behavioral parametric probing and prompting (Jiang et al., 2020; Qin and Eisner, 2021; Brown et al., 2020) have demonstrated that such knowledge, collectively referred to as "parametric knowledge," resides reliably within a subset of trained parameters in pre-trained language models (PLMs). Importantly, this knowledge can be identified without additional finetuning. For instance,
given the prompt "The capital of France is", a PLM can be queried to complete the input and extract the fact "Paris".
A common assumption about parametric knowledge is that if the model possesses a certain type of knowledge, it utilizes it when performing downstream tasks related to that knowledge. For example, if a model knows about X and Y (where X and Y are similar), and is taught to perform a task on X, the convention is that the model generalizes the application of the task to Y and all other similar knowledge. Such is the foundation for the recent interest in instruction tuning (Wei et al., 2022a; Chung et al., 2022) and the SFT-RLHF pipeline (Ouyang et al., 2022). In this paradigm, LLMs are finetuned to learn how to follow instructions on a few tasks the model is capable of, and are subsequently expected to generalize and follow instructions for novel tasks by utilizing their pre-training knowledge (residing in their parameters).
However, it is not clear to what extent this assumption holds in practice, giving rise to a central question: how much of parametric knowledge gets applied in downstream tasks? If the causal link between "identifiable knowledge" and its practical application in downstream tasks is not established (Kulmizev and Nivre, 2021), the mere presence of knowledge within a model's parameters does not necessarily guarantee its utilization in such tasks. This raises questions about the assertion of pre-trained language models (PLMs) as differentiable knowledge bases (Hao et al., 2022) and their overall capabilities. For instance, as demonstrated by Qin et al. (2023), ChatGPT's performance lags behind its foundational model, GPT-3.5, in areas including commonsense and logical reasoning tasks.
Previous studies have investigated this question within linguistic domains and have demonstrated that although PLMs have the capacity to encode linguistic knowledge, they may not effectively employ it in downstream tasks. For example, McCoy et al. (2019) illustrate that PLMs employ syntactic heuristics to solve NLI even though they are able to represent proper linguistic hierarchies (Tenney et al., 2019a), even after finetuning (Merchant et al., 2020; Zhou and Srikumar, 2022). Warstadt et al. (2020) provide evidence that RoBERTa requires data inoculation or pre-training with extensive data in order to effectively utilize its hierarchical linguistic knowledge. In a more recent study, Lovering et al. (2021) demonstrate that the quantity of "evidence" presented in the finetuning dataset influences the features that PLMs rely on during the finetuning process. Specifically, the model may resort to lexical heuristics when the finetuning signal toward linguistic features is insufficient.
In this work, we are interested in a more general sense of knowledge and propose XTRAEVAL - EXTRACT, TRAIN, AND EVALUATE - to systematically measure how much of parametric knowledge is utilized in downstream tasks. XTRAEVAL sidesteps potential confounders (such as shortcuts or insufficient signal) that arise from the nature of arbitrary crowd-sourced tasks used in prior work by carefully creating the downstream task from the model's own knowledge. Specifically, given a pre-trained language model, our framework first identifies and extracts knowledge residing in its parameters. Subsequently, using the extracted knowledge, we construct a downstream task on which we finetune the model. Finally, we measure knowledge utilization based on its performance on the downstream task. By constructing the task based on the model's pre-existing knowledge, we ensure that (1) the model is evaluated solely on its possessed knowledge, avoiding penalties for lacking information, and (2) successful task completion relies explicitly on utilizing the model's parametric knowledge, eliminating the insufficient-training-signal issue and dataset shortcuts.
In this paper, we provide the first instantiation of this paradigm based on encyclopedic knowledge facts and conduct an extensive study to measure knowledge utilization of PLMs across a wide range of parameter scales (ranging from 125M to 13B). We observe the following: • PLMs show two different but equally important gaps: (1) the gap in the acquired knowledge and (2) the gap in parametric knowledge that can be actively applied to downstream tasks (Section 3).
• PLMs are not robust to finetuning distribution shifts, and failure to utilize knowledge worsens with such shifts, further questioning their generalization capabilities (Section 4).
• Although scaling the number of parameters helps to close the first gap, the second remains even at larger sizes (Section 5).
In the next sections, we first describe our framework and its instantiation in detail (Section 2), and then present our experimental results in Sections 3 to 5.

EXTRACT, TRAIN, AND EVALUATE
Principles The primary objective of our evaluation framework is to measure how much of the knowledge present in the model's parameters is actually usable in downstream tasks. Ideally, downstream tasks must be designed in a way that solely attributes any success to the model's knowledge being used, while ensuring that failure in performing the task is not due to a lack of pre-training knowledge.

The Paradigm
To this end, we propose EXTRACT, TRAIN, AND EVALUATE, which consists of three main steps:

Step 1. Given a pre-trained model M θ with parameters θ and a diagnostic dataset D (e.g., a set of encyclopedic facts or coding problems), we first extract and identify parametric knowledge as the set of data instances x ∈ D the model can solve without further training (zero-shot). We denote this set as D θ, a realization of M θ's parametric knowledge w.r.t. D.

Step 2. We construct a downstream task K around the model's own knowledge D θ (e.g., fact retrieval or following instructions in coding) such that the model can only solve the task by utilizing the knowledge identified in the first step. More formally, we create K θ train and K θ test as the non-overlapping train and test sets of the downstream task K, where the model learns the task from K θ train.

Step 3. Finally, the performance on the test set K θ test is used as a measure of the model's ability to utilize its knowledge.

Constructing the downstream task based on the model's knowledge ensures that the model is not evaluated on knowledge it did not acquire during pre-training. Also, the I.I.D. nature of this paradigm (i.e., the model is only exposed to inputs it is already familiar with) allows us to measure whether the model can utilize its knowledge at all.
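The three steps above can be expressed as a minimal, model-agnostic sketch. All function names here (`model_predict`, `finetune`, `evaluate`) are illustrative stand-ins for the model-specific components, not part of an actual implementation:

```python
import random

def xtraeval(model_predict, finetune, evaluate, diagnostic_set,
             train_frac=0.6, seed=0):
    """Sketch of EXTRACT, TRAIN, AND EVALUATE for a generic model."""
    # Step 1: keep only the instances the model already solves zero-shot.
    extracted = [x for x in diagnostic_set if model_predict(x) == x["answer"]]

    # Step 2: build a downstream task from the model's own knowledge and
    # split it into non-overlapping train/test sets.
    rng = random.Random(seed)
    rng.shuffle(extracted)
    cut = int(train_frac * len(extracted))
    k_train, k_test = extracted[:cut], extracted[cut:]

    # Step 3: finetune on the train split; test performance is the
    # measure of knowledge utilization.
    finetuned = finetune(k_train)
    return evaluate(finetuned, k_test)
```

Because the task is constructed from `extracted` only, any failure on `k_test` cannot be blamed on missing pre-training knowledge.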

Encyclopedic Knowledge
Factual parametric knowledge in the form of encyclopedic facts is well-studied in PLMs (Petroni et al., 2019; Jiang et al., 2020) and allows for an objective and systematic evaluation of our framework (Figure 2). Therefore, in this paper, we instantiate XTRAEVAL to measure the utilization of parametric knowledge concerning encyclopedic facts. In this case, the diagnostic dataset D is a set of encyclopedic facts D = {⟨h, r, t⟩_i}_{i=1}^n acquired from an off-the-shelf knowledge base (e.g., Wikipedia). Each fact x_i ∈ D is a tuple of the form ⟨head, relation, tail⟩, such as ⟨Barack Obama, GraduatedFrom, Harvard⟩.
In the extraction phase, a pre-trained model M θ has to zero-shot predict the tail entity t given the head entity h and the relation r. We use soft-prompting (Qin and Eisner, 2021) to obtain the model's predictions, as it enhances prediction consistency compared to discrete prompts, particularly for moderate-sized models. The extracted knowledge D θ ⊂ D is the subset of tuples the model can predict correctly.
Our downstream task K is a standard document retrieval task (Karpukhin et al., 2020). Given a query q, the model retrieves the relevant document from a set of candidates. We construct K θ from the extracted knowledge in D θ by converting each fact x ∈ D θ into a retrieval instance k ∈ K θ. This conditions the downstream task on the model's knowledge. The conversion generates a query q by removing the tail entity t from x. It then generates relevant and irrelevant documents using a stochastic document generator P(d | ·) (see Table 1). We partition D θ randomly (60%-40%) to generate K θ train and K θ test, which serve as the training and test sets for the downstream task, respectively. We finetune the model on K θ train in a cross-encoder setup (Nogueira and Cho, 2020) with the InfoNCE objective (van den Oord et al., 2019):

L = -log [ exp(sim(q, d⁺)) / ( exp(sim(q, d⁺)) + Σ_{j=1}^{m} exp(sim(q, d⁻_j)) ) ].

The similarity score sim(·, ·) is computed as sim(q, d) = h(M θ([q; d])), where h is a randomly initialized value head that takes the representation of the [CLS] token (or the last token for decoder-only models) of the concatenated query and document and outputs a scalar as the similarity measure (Figure A.1). Finally, we evaluate the model on K θ test by measuring its accuracy in retrieving the relevant document d⁺ among {d⁺, d⁻_1, . . ., d⁻_m} for a given query q. The task design ensures that the association between a knowledge query q_i and its gold fact document d⁺_i relies solely on the parametric knowledge represented by x_i ∈ D θ. This is because other variables, like text overlap, are randomly sampled from the same distribution for both queries and documents.
Thus, the model can only solve the task by utilizing its internal knowledge. Finetuning on K θ train should only trigger the utilization of the parametric knowledge.
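As a concrete illustration, the InfoNCE objective for a single query can be sketched in a few lines of plain Python; the scores below stand in for the cross-encoder similarities sim(q, d):

```python
import math

def infonce_loss(pos_score, neg_scores):
    """InfoNCE loss for one query: negative log-softmax of the positive
    document's score against the scores of the m negative documents."""
    scores = [pos_score] + list(neg_scores)
    # log of the partition function: log sum_j exp(score_j)
    log_z = math.log(sum(math.exp(s) for s in scores))
    return -(pos_score - log_z)
```

The loss approaches zero as the positive score dominates the negatives, which is exactly what pushes the model to rely on its stored fact rather than surface overlap.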
Training The document generator P(d | ·) can generate various types of documents for each fact x ∈ D θ; Table 1 lists all the types. For training, we use three types for the negative documents d⁻ with uniform weights: (h, r, ·), (·, r, t), and (h, ·, t), as they are the hardest ones, differing from the query in only one entity. To keep GPU memory usage under control, we sample four documents per type (refer to Section 3.1 for the effect of the number of negatives on the results), which results in a total of m = 12 negatives. We resample the documents on each epoch to avoid overfitting and use a validation set to choose the best checkpoint. Also, we keep the learning rate low and use no weight decay to prevent forgetting. We use three seeds for the extraction phase, three seeds for splitting D θ into train and test sets, and three seeds for finetuning on the downstream task, resulting in 27 different runs per model.
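The per-type negative sampling described above can be sketched as follows. Here `generate_doc` is a hypothetical stand-in for the stochastic document generator P(d | ·):

```python
import random

def sample_negatives(fact, generate_doc,
                     types=("(h,r,.)", "(.,r,t)", "(h,.,t)"),
                     per_type=4, seed=0):
    """Sample hard negatives for one fact: for each corruption type,
    draw `per_type` documents that differ from the query in one entity,
    giving per_type * len(types) negatives in total (m = 12 by default)."""
    rng = random.Random(seed)
    negatives = []
    for doc_type in types:
        for _ in range(per_type):
            negatives.append(generate_doc(fact, doc_type, rng))
    return negatives
```

Resampling with a fresh seed each epoch gives the resampled-negatives behavior described in the text.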
Inference During inference, the model must identify the gold document d⁺ amidst distractor documents d⁻. To ascertain that the model genuinely recognizes the correct answer, we employ a varied assortment of distractors. Initially, we use document type (h, r, ·), ensuring all non-gold tails are included. Subsequently, we utilize the remaining non-gold document types listed in Table 1 as distractors, sampling 50 documents for each type. Lastly, we also sample 50 irrelevant but factually correct documents from the test set to assess the model's sensitivity to factual accuracy. We evaluate pre-trained models across various families: OPT (Zhang et al., 2022), GPT-Neo (Black et al., 2021), RoBERTa (Liu et al., 2019), and BERT (Devlin et al., 2019). Unless otherwise stated, we use the base size (125M) of these models. We investigate the scaling behavior of OPT and LLaMa (Touvron et al., 2023) in Section 5. We initialize the diagnostic dataset D from LAMA (Petroni et al., 2019), which has 34K facts over 40 relations. Our results are reported over 1134 finetuning runs. (Refer to Appendix A for detailed hyperparameters.)

Evaluating the Knowledge Utilization
We separately report the fraction of facts in D that can be extracted and the downstream performance of models in Figure 3. First, we find that, consistent with previous work (Qin and Eisner, 2021), there is a significant gap between the encyclopedic facts the models can correctly predict and the full set of facts present in the diagnostic dataset D (Figure 3a). Note that one can arbitrarily inflate the number of correctly predicted facts by considering a prediction correct if the gold entity is among the model's top-k predictions. However, we only consider k = 1, to focus exclusively on the facts the model can confidently predict. Nonetheless, we find that BERT and RoBERTa extract slightly more encyclopedic facts than GPT-Neo and OPT.
Critically, all models demonstrate a pronounced gap in downstream task performance, i.e., knowledge utilization (Figure 3b). This unexpected outcome occurs despite the downstream task being seemingly simple, since (1) models are trained and evaluated on examples built from their own accurate encyclopedic knowledge predictions, and (2) both K θ train and K θ test are sampled from the same distribution (I.I.D.), so the models only encounter seen entities. Notably, OPT and GPT-Neo manage to outperform BERT and RoBERTa by a small margin. This finding suggests that models struggle to utilize their entire parametric knowledge in downstream tasks. In the next sections, we investigate the potential causes of this gap.

Role of Downstream Training Data
The effect of initial knowledge D θ As we utilize D θ to create the downstream task, examining the impact of its size (|D θ|) on knowledge utilization is crucial. If consistent behavior is observed for different sizes, it implies that the utilization gap does not stem from the amount of initial knowledge and must be rooted in inductive biases (e.g., the model or the finetuning process), allowing us to measure and compare utilization with different initial knowledge.
To measure this effect, for each model, we first compute D θ, and then, instead of directly using it for K θ, we sub-sample smaller sets of it at various fractions and construct the downstream task using each sub-sampled D θ. In Figure 4a, we observe that knowledge utilization is fairly consistent (at least for fractions > 0.4) across different sizes of D θ for all models. Larger fractions also show less variance. This suggests that the utilization performance is intrinsic to the downstream knowledge transfer rather than the initial knowledge residing in the model.
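The sub-sampling protocol can be sketched as follows (the function name is illustrative; nested prefixes ensure that each smaller fraction is contained in the larger ones, so the comparison varies only the amount of initial knowledge):

```python
import random

def subsample_knowledge(extracted, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Build one downstream-task source set per fraction of D_theta,
    so knowledge utilization can be compared across initial-knowledge sizes."""
    rng = random.Random(seed)
    pool = list(extracted)
    rng.shuffle(pool)
    # Prefixes of one shuffled pool: each fraction's set nests inside the next.
    return {f: pool[: max(1, round(f * len(pool)))] for f in fractions}
```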
The effect of the number of negatives The model learns to apply its parametric knowledge by optimizing the retrieval objective. To ensure the training signal produced by the contrastive loss on K θ train is strong, we vary the number of negative documents used for creating K θ train. If the training signal is weak, we expect knowledge utilization to improve with more negatives.
To answer this question, we follow the same setup as described in Section 2 and increase the number of negative documents sampled per type from 4 to 10. We also consider reducing it to 2 negatives per type to better understand its effect. We keep the initial knowledge D θ fixed.
Figure 4b summarizes our findings. Knowledge utilization remains the same for all models as we increase the number of negatives. This pattern is observed even with as few as two negatives per type. This suggests that the training signal is strong enough across the board and that the gap in knowledge utilization is not rooted in the training objective.

Gap 1 vs. Gap 2
The findings in Section 3.1 show that the gap in knowledge utilization (i.e., accuracy on K θ test) does not depend on the size of D θ and is fairly consistent across different numbers of negatives. Moreover, we find that the variation across random splittings of D θ into the train and test sets of the downstream task is negligible.
The robustness to such design choices allows us to define Usable Knowledge, which indicates the portion of facts from D that the model can actually utilize in the downstream task. We compute this metric by multiplying the accuracy on K θ test by the fraction of correctly predicted facts in D. We report the results in Figure 5.
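The metric is the simple product of the two quantities. In the sketch below, only the 34% extraction rate is taken from our results (OPT-125M, Section 5); the 70% downstream accuracy is a hypothetical number for illustration:

```python
def usable_knowledge(extraction_fraction, downstream_accuracy):
    """Usable Knowledge: the fraction of the diagnostic set D that the
    model both knows (extraction) and can apply downstream (utilization)."""
    return extraction_fraction * downstream_accuracy

# A model that predicts 34% of facts zero-shot but retrieves at a
# hypothetical 70% accuracy can only put about 24% of D to use.
```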
These results clearly demonstrate that there exist two gaps in the models' knowledge. Gap 1 is in how many facts the model knows after pre-training. Gap 2 is in how many of the facts the model knows can be truly utilized in downstream tasks. Indeed, we see that although RoBERTa manages to extract more facts than GPT-Neo, due to Gap 2, it performs the same as GPT-Neo in downstream tasks.

Robustness of Knowledge Utilization
We intentionally design the downstream task K θ to be straightforward and free of any distributional shift, as we want to measure the maximum knowledge utilization of the model. However, in real-world applications, the model is likely to encounter samples that differ from the training distribution. In this section, we investigate the robustness of knowledge application in the presence of such distributional shifts. Recall that we randomly divide D θ into two sets as the data source for the creation of K θ train and K θ test. In this experiment, however, we split D θ such that the relation types (r) in K θ train and K θ test are disjoint. Specifically, we randomly select 60% of the relations and their corresponding facts for K θ train and the rest for K θ test. We repeat this process over three seeds to create three different splits. We still follow the same procedure for converting knowledge triples to document retrieval examples as explained in Section 2. In this way, we ensure we do not change the task's nature, i.e., the model still needs to apply its parametric knowledge to solve the task, but the distributional shift between K θ train and K θ test can represent real-world scenarios. If the model learns to systematically apply its knowledge, we expect its downstream performance to be similar or close to the I.I.D. setting (Section 3).
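The relation-disjoint split can be sketched as follows, a minimal version assuming facts are represented as ⟨h, r, t⟩ tuples (the function name is illustrative):

```python
import random

def relation_split(facts, train_ratio=0.6, seed=0):
    """OOD split: partition facts so train and test share no relation type.
    Each fact is a (head, relation, tail) triple."""
    relations = sorted({r for _, r, _ in facts})
    rng = random.Random(seed)
    rng.shuffle(relations)
    cut = max(1, round(train_ratio * len(relations)))
    train_rel = set(relations[:cut])
    # Facts follow their relation, so no relation appears in both sets.
    train = [f for f in facts if f[1] in train_rel]
    test = [f for f in facts if f[1] not in train_rel]
    return train, test
```

Running this with three different seeds yields the three OOD splits used in the experiment.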
We observe that downstream task performance drops significantly for all models when evaluated OOD (Figure 6). This indicates the models cannot use their knowledge on examples with unseen relation types, even though all relations and facts originate in D θ. Thus, knowledge usage in downstream tasks is sensitive to distribution shifts, suggesting that the failure to apply pre-training knowledge may be more severe in real-world applications.

Effect of Scaling on the Gaps
Recent NLP success has come from scaling up the parameters of pre-trained models (Brown et al., 2020).
With larger models and increased compute, capabilities such as in-context learning and chain-of-thought reasoning emerge (Wei et al., 2022b). The expanded capacity allows these models to absorb more knowledge from pre-training data, improving their usefulness as knowledge sources. However, it remains uncertain whether scaling boosts the proportion of pre-training knowledge applicable to downstream tasks. Ideally, we would like to see a narrowing gap in pre-training knowledge alongside superior knowledge utilization.
To investigate this, we evaluate XTRAEVAL on increasing sizes of OPT and LLaMa models. Specifically, at each scale, we first extract the model's parametric knowledge and then create the downstream task based on it using the same procedure as described in Section 2. Figure 1 reports the results of this experiment.
First, we confirm that a greater fraction of knowledge triples in D can be identified in larger models, suggesting they acquire more knowledge from pre-training data. Second, we find that the gap between identifiable and usable knowledge persists in larger models, and their ability to apply knowledge in downstream tasks does not improve with scaling. Figure 7 illustrates these gaps directly, demonstrating that while Gap 1 decreases in larger models, Gap 2 remains relatively unchanged.
The results suggest that while PLMs, even at small scales, possess considerable knowledge, extracting an equivalent amount of usable knowledge necessitates much larger models. For instance, OPT-125M accurately predicts 34% of encyclopedic facts, but only OPT-13B (approximately 100× larger) can reliably apply the same volume in downstream tasks. Enhanced pre-training routines, including the use of more or higher-quality data, can bolster knowledge acquisition, as clearly demonstrated by the LLaMa models. Notably, LLaMa-7B significantly outperforms OPT-13B. While LLaMa models possess a greater amount of usable knowledge due to superior initial knowledge, a gap in knowledge utilization remains discernible (Figure 7).

Discussion
Lately, pre-trained language models with chatbot interfaces have increasingly been used as knowledge bases (Ouyang et al., 2022). These chatbots typically employ the model's parametric knowledge to respond to queries and offer information. Our study examines the dependability of this knowledge and its impact on downstream task performance. We discover that, regardless of inductive biases, PLMs face difficulty utilizing their full knowledge in downstream tasks (Section 3). This unreliability of parametric knowledge could constrain the concept of "PLMs as differentiable knowledge bases." Additionally, our findings show that the utilization gap persists even with scaling (Section 5). Notably, while models at each scale capture more knowledge from pre-training data, obtaining the same amount of usable knowledge requires significantly larger models. This exposes a potential constraint in the recent trend of adopting mid-sized PLMs (Li et al., 2023).
Lastly, we discover that knowledge utilization depends on the peculiarities of the finetuning data for downstream tasks. Specifically, as seen in Section 4, PLMs struggle to apply their knowledge to relation types not encountered during finetuning, even if they accurately predicted such facts in step 1. This generalization gap could highlight challenges within the recent SFT-RLHF paradigm (Ouyang et al., 2022). For instance, the model may only adhere to instructions and excel at tasks resembling the finetuning data. Consequently, it might be necessary to meticulously craft finetuning data to activate all aspects of parametric knowledge in downstream tasks. However, elaborate studies are required to establish the systematic issues in knowledge application beyond encyclopedic knowledge, such as procedural and task knowledge.

Related Work
Parametric Knowledge Petroni et al. (2019) constructed a probing dataset to measure the factual knowledge present in PLMs. They showed that many encyclopedic facts can be extracted without further training of the model and proposed PLMs as a new type of knowledge base, which can be trained on unstructured text and queried using natural language. Follow-up work improves the methods for probing and extracting world knowledge from PLMs (Jiang et al., 2020; Shin et al., 2020; Qin and Eisner, 2021; Newman et al., 2022). Apart from encyclopedic facts, studies have explored PLMs' parametric knowledge in other areas, such as linguistic structures (Tenney et al., 2019b; Blevins et al., 2022) and commonsense (Zhou et al., 2020; Liu et al., 2022a). Recently, the emergent abilities of LLMs have shown that they acquire skills like coding (Chen et al., 2021), reasoning (Chowdhery et al., 2022), and in-context learning (Brown et al., 2020), in addition to the previously mentioned knowledge.
Using the Parametric Knowledge Roberts et al. (2020) finetuned a pre-trained T5 model for question answering in a closed-book setting and showed that it can perform on par with or better than models that use explicit knowledge bases. Wang et al. (2021) made a similar observation for the BART model. More recently, PLMs are being used to generate facts and documents for knowledge-intensive tasks (Li et al., 2022; Liu et al., 2022b; Yu et al., 2023). In this paradigm, in order to answer factual questions, instead of retrieving relevant documents, the model has to first generate the facts and then answer the question with those facts as context. This paradigm suggests that models may not be able to use their parametric knowledge on their own and need explicit grounding to do so. Furthermore, there is a plethora of work investigating whether models employ their linguistic knowledge when solving downstream language understanding tasks. McCoy et al. (2019) show that RoBERTa does not use its linguistic knowledge for solving NLI; instead, it relies on shallow heuristics. Lovering et al. (2021)'s observations align with this finding, showing that the training data used for the downstream task needs to contain enough evidence to trigger the model's linguistic knowledge. In our work, we use a more general notion of parametric knowledge and investigate utilization in cases where sufficient evidence is present in the finetuning data.

Conclusion
In this study, we presented EXTRACT, TRAIN, AND EVALUATE (XTRAEVAL), a framework designed to assess the parametric knowledge utilization of pre-trained language models. Employing XTRAEVAL, we identified a previously unnoticed gap between what models know and how much of it they can actually use. Our findings reveal that this gap exists not only in smaller models but also persists in larger ones. Additionally, we demonstrate that a distributional shift in the finetuning data can result in even larger gaps between the model's knowledge and its practical application in downstream tasks.

Limitations
Although XTRAEVAL is agnostic to the specific type of parametric knowledge, our work primarily focuses on encyclopedic facts as one type of world knowledge that PLMs can acquire. It is plausible that similar results would hold for other knowledge types; however, further work is needed for a precise investigation.
While there are various downstream tasks that could be evaluated, we primarily focus on document retrieval as it allows us to systematically demonstrate the key issue of knowledge application that we aim to highlight.We also acknowledge that our study was limited to a few model families and parameter scales due to compute constraints.However, our evaluation protocol is model agnostic, enabling future work to explore this phenomenon on other tasks and with different models.

Figure 1 :
Figure 1: Parametric knowledge of PLMs. Gap 1 represents the missing facts in the model's parametric knowledge (what the model knows). Gap 2 exists in how much of this knowledge can actually be utilized in downstream tasks (the usable knowledge). We find that although the first gap mostly shrinks, the second remains as we increase the model's size.

Figure 3 :
Figure 3: (a) The fraction of encyclopedic facts the pre-trained LM can predict correctly without any training, reported over three seeds (standard deviation σ ≤ 0.004 for all models). (b) The model performance on the downstream task (created based on correctly predicted facts), measured as top-1 retrieval accuracy and averaged over 27 runs per model (σ ≤ 0.011 for all models). Refer to Appendix B for detailed results.

Figure 4 :
Figure 4: (a) Knowledge utilization when using different fractions of parametric knowledge to create the downstream task. (b) The effect of the number of negative training documents (d⁻) used for creating the downstream task.

Figure 5 :
Figure 5: Gaps in parametric knowledge. Gap 1 represents the missing facts in parametric knowledge D θ (what the model knows). Gap 2 exists in how many of the known facts the model can actually utilize in downstream tasks (the usable knowledge).

Figure 6 :
Figure 6: Robustness to distributional shift. In the OOD setting, we produce a distributional shift (over the relation types) between the examples in the train and test sets of the downstream task K θ. All models fail to generalize to unseen relations. The IID setting is the same as the one described in Section 2 and is repeated from Figure 3b for comparison.
Encyclopedic Fact: x = ⟨h, r, t⟩ = ⟨Barack Obama, GraduatedFrom, Harvard⟩

Table 1: All possible inputs to the document generator P(d | H, R, T) per fact x, with examples of the corresponding sampled documents. A dot means that the corresponding entity or relation is not given, and the document generator will randomly choose it from D θ. The gray text provides an explanation of the sampled document. Note that we do not force the document generator to generate a factual document; the model itself has to predict the relevancy of each document. The generated document d depends on the head entity h, the relation r, and the tail entity t. The document generator P(d | ·) selects a template at random and fills in the blanks with the input entities. If H, R, or T is missing, the generator chooses a random entity from D θ to complete the input. Specifically, we generate the relevant document d⁺ by sampling from P(d | ·) with the gold entities in x as input, and create irrelevant documents d⁻ by omitting one or more entities. Each k comprises a tuple (q, {d⁺, d⁻_1, . . ., d⁻_m}), where m is the number of irrelevant documents.