Pre-training Language Models with Deterministic Factual Knowledge

Previous works show that Pre-trained Language Models (PLMs) can capture factual knowledge. However, some analyses reveal that PLMs fail to capture it robustly, e.g., being sensitive to changes of prompts when extracting factual knowledge. To mitigate this issue, we propose to let PLMs learn the deterministic relationship between the remaining context and the masked content. The deterministic relationship ensures that the masked factual content can be deterministically inferred from the existing clues in the context. That provides more stable patterns for PLMs to capture factual knowledge than random masking. Two pre-training tasks are further introduced to motivate PLMs to rely on the deterministic relationship when filling masks. Specifically, we use an external Knowledge Base (KB) to identify deterministic relationships and continuously pre-train PLMs with the proposed methods. The factual knowledge probing experiments indicate that the continuously pre-trained PLMs achieve better robustness in factual knowledge capturing. Further experiments on question-answering datasets show that trying to learn a deterministic relationship with the proposed methods can also help other knowledge-intensive tasks.

1 Introduction

Petroni et al. (2019); Jiang et al. (2020); Shin et al. (2020); Zhong et al. (2021) show that we can successfully extract factual knowledge from Pre-trained Language Models (PLMs) using cloze-style prompts such as "The director of the film Saving Private Ryan is [MASK]." Some recent works (Cao et al., 2021; Pörner et al., 2020) find that the PLMs may rely on superficial cues to achieve that and cannot respond robustly. Table 1 gives examples of inconsistent predictions exposed by changing the surface forms of prompts on the same fact.
This phenomenon questions whether PLMs can robustly capture factual knowledge through Masked Language Modeling (MLM) (Devlin et al.,

Cloze-style Prompt and Prediction | Is Correct?
War Horse is an American war film directed by Steven Spielberg. | ✓
The director of the American war film War Horse is Keanu Reeves. | ✗
Christopher Nolan is the director of the American war film War Horse. | ✗

Table 1: A PLM could give inconsistent results when probing the same fact with different prompts. The underlined words are the predictions.
2018) and further motivates us to inspect the masked contents in the pre-training samples. After reviewing several masking methods, we find that they focus on limiting the granularity of masked contents, e.g., restricting the masked content to be entities and then randomly masking the entities (Guu et al., 2020), and pay less attention to checking whether the obtained MLM samples are appropriate for factual knowledge capturing. For instance, when we want PLMs to capture the corresponding factual knowledge by recovering masked entities, we should check whether the remaining context provides sufficient clues to recover the missing entity. Inspired by the above analysis, we categorize MLM samples based on the relationship between the remaining context and the masked content:
• Non-deterministic samples: The clues in the remaining context are insufficient to constrain the value of the masked content. Multiple values are valid to fill in the masks.
• Deterministic samples: The remaining context holds deterministic clues for the masked content. We can get one and only one valid value for the masked content. For example, the first cloze in Table 1 masks the director of the film "War Horse." Since the film has only one director in the real world, we can get a unique answer deterministically, so it is a deterministic MLM sample. The crucial clues "War Horse" and "directed by" have a deterministic relationship with the missing entity "Steven Spielberg." For brevity, we refer to these clues as deterministic clues and the outcome "Steven Spielberg" as a deterministic span. In contrast, if the sample becomes "[MASK]s is an American war film directed by Steven Spielberg," multiple names can fill the masks because Steven Spielberg produced more than one American war film. We cannot tell which one is better based on the existing clues, so it is a non-deterministic sample.
The non-deterministic samples establish a multi-label problem (Zhang and Zhou, 2006) for MLM, where more than one ground-truth value for the output is associated with a single input. If we force the PLMs to promote one specified ground truth over others, the other ground truths become false negatives that could plague the training or cause a performance downgrade (Durand et al., 2019; Cole et al., 2021). The non-deterministic samples are competent for obtaining contextualized representations but become questionable for understanding the intrinsic relationship between factual entities. In contrast, the deterministic samples are less confusing since the answer is always unique, providing a stable relationship for PLMs to learn. Therefore, we propose deterministic masking, which always masks and predicts the deterministic spans in MLM pre-training to improve PLMs' ability to capture factual knowledge. The deterministic clues and spans are identified based on a KB. Two pre-training tasks, clue contrastive learning and clue classification, are introduced to make PLMs more aware of the deterministic clues when predicting the missing entities. The clue contrastive learning encourages PLMs to be more confident in prediction (Vu et al., 2019; Luo et al., 2021) when the deterministic clues are unmasked. The clue classification is to detect whether the remaining context contains deterministic clues. The experiments on the factual knowledge probing and question-answering tasks show the effectiveness of the proposed methods.
The contributions of this paper are: (1) We propose to model the deterministic relationship in MLM samples to improve the robustness (i.e., both consistency and accuracy) of factual knowledge capturing. (2) We design two pre-training tasks to enhance the deterministic relationship between entities and gain further improvements in robustness.
(3) The experiment results show that learning the deterministic relationship is also helpful for other knowledge-intensive tasks, such as question answering.

2 Methods
Section 2.1 details the deterministic masking, including how we align texts with triplets and identify deterministic clues and spans in texts. The clue contrastive learning and clue classification are described in Sections 2.2 and 2.3, respectively.

2.1 Deterministic Masking
In addition to masking only factual content, the deterministic masking also constrains the remaining context and the masked content to have a deterministic relationship: the remaining context should provide conclusive clues to predict the masked content, and the valid value to fill in the mask is unique.
To this end, we align each text with a KB triplet and match the spans in the text with its (subject, predicate, object) respectively. We select the spans aligned with objects as the candidates to be masked for pre-training. To further make the masked object deterministic, we query the KB with the aligned (subject, predicate) pair and check whether the valid object that exists in the KB is unique.
If the KB emits this object exclusively, i.e., only the aligned object can compose a valid triplet with the aligned subject and predicate, the object is deterministic. The object is non-deterministic if multiple objects suit the aligned subject and predicate in the KB. The span aligned with a deterministic object is a deterministic span, and it would be masked to construct a deterministic MLM sample. We pre-train PLMs on only the deterministic samples.
Figure 1 shows a deterministic sample aligned with the triplet ("War Horse," "directed by," "Steven Spielberg"). When querying the KB with "War Horse" as the subject and "directed by" as the predicate, the resulting object "Steven Spielberg" is unique because there is only one director who produced this film, so the first sample is deterministic. In contrast, when using "Steven Spielberg" and "director of" as the subject and the predicate, multiple valid objects exist in the KB, so the second sample is non-deterministic and is filtered out.
By dropping the non-deterministic samples, we prevent PLMs from fixating on one object while ignoring others that are also valid based on the existing clues. In the deterministic samples, by contrast, the relationship between the remaining clues and the missing span is more stable and unambiguous.
Training on the deterministic samples encourages PLMs to infer the missing object based on its deterministic factual clues. It helps PLMs grasp a more substantial relationship between entities to model the factual contents and could aid in accomplishing some knowledge-intensive tasks.
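As a minimal sketch of the determinism check, the KB can be thought of as a mapping from (subject, predicate) to the set of valid objects; the masked sample is kept only when that set is a singleton. The KB entries, function names, and the string-replace masking below are all illustrative assumptions, not the paper's actual data structures.

```python
# A minimal sketch of the determinism check, assuming the KB is a mapping
# from (subject, predicate) to the set of valid objects. The KB entries
# here are illustrative, not the paper's actual data.
kb = {
    ("War Horse", "directed by"): {"Steven Spielberg"},
    ("Steven Spielberg", "director of"): {"War Horse", "Saving Private Ryan"},
}

def is_deterministic(subject: str, predicate: str, obj: str) -> bool:
    """An aligned object is deterministic iff it is the only valid object
    for its (subject, predicate) pair in the KB."""
    return kb.get((subject, predicate), set()) == {obj}

def build_sample(text: str, subject: str, predicate: str, obj: str):
    """Mask the object span only when it is deterministic; otherwise drop
    the sample, mirroring the filtering described above."""
    if not is_deterministic(subject, predicate, obj):
        return None  # non-deterministic sample, filtered out
    return text.replace(obj, "[MASK]")
```

Only samples that survive this filter are used for the deterministic MLM pre-training.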

2.2 Clue Contrastive Learning
To stimulate PLMs to catch the deterministic relationship between entities, we design the pre-training task clue contrastive learning following this intuition: PLMs should be more confident in generating a masked span when its deterministic clues exist in the context. We introduce a contrastive objective accordingly and explain it with the pair of samples in Figure 2. Figure 2a shows a deterministic MLM sample that masks the span "Steven Spielberg" and keeps its deterministic clues. This task encourages PLMs to give the ground truth ô a higher probability when the deterministic clues exist in the context. It is somewhat conservative since we consider the noise in the data construction: the objective is still reasonable even when S, P, and R are randomly labeled. Raw words are always more informative than the ordinary [MASK]s and can reduce the uncertainty of the context (Cover, 1999), so the uncertainty of prediction degrades accordingly (Vu et al., 2019; Luo et al., 2021). On the other hand, this objective trains PLMs to react to changes in the context, i.e., it learns how to tune the output as the input changes. We employ a large-scale KB as the approximation of real-world knowledge (Reiter, 1981) to get the pre-training samples.
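The loss itself is not spelled out in this excerpt, so the following is only one plausible margin-style reading of the intuition above (an assumption, not the authors' exact objective): the ground-truth probability with the deterministic clues visible should exceed the probability with the clues masked.

```python
def clue_contrastive_loss(p_with_clues: float, p_clues_masked: float,
                          margin: float = 0.1) -> float:
    """Hypothetical hinge-style contrastive term: zero once the model is
    more confident, by at least `margin`, given the unmasked
    deterministic clues than given the clue-masked context."""
    return max(0.0, margin - (p_with_clues - p_clues_masked))
```

With a confident model (e.g., probabilities 0.9 vs. 0.3) the term vanishes; when the two contexts yield the same probability, the loss equals the margin and pushes the probabilities apart.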

2.3 Clue Classification
The PLMs that use the Transformer encoder (Vaswani et al., 2017) as the backbone emit a contextualized embedding for each input token. Every contextualized embedding can encode all the information in the context since all the input tokens are involved when computing the embedding. So we encode the clues in the remaining context using the contextualized embeddings of O. Formally, the clue representations for the above three samples are:

E^(a) = PLM(x^(a))[O],  E^(b) = PLM(x^(b))[O],  E^(c) = PLM(x^(c))[O]  (2)

where x^(a), x^(b), and x^(c) denote the inputs of the three samples. Each token-level embedding vector e in E^(a), E^(b), and E^(c) is fed into a three-way classifier:

y = softmax(W^⊤ e)  (3)

y is a three-element vector that gives the probabilities that e comes from E^(a), E^(b), and E^(c). W^⊤ is the weight matrix of the three-way classifier.
The number of masked tokens in samples (a) and (b) is different since the latter additionally masks the clues. It may become a shortcut for the proposed contrastive or classification tasks. So we introduce sample (c), which has the same number of masks as sample (b), to eliminate this shortcut. Some existing pre-training methods (Clark et al., 2019; Xiong et al., 2019; He et al., 2020) replace original tokens with fake tokens to build pseudo samples and then classify between original and pseudo samples, while clue classification uses [MASK]s in the replacement. We worry that fake tokens may make the intervened input state fake facts that conflict with the real world, leading the PLMs to capture wrong knowledge from the pseudo samples. [MASK] is a safer choice here.
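A minimal sketch of the three-way classifier above, with plain-Python linear algebra standing in for the real model; the embedding dimensionality and the weight matrix below are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_clue(e, W):
    """Three-way classification of a token embedding e (length d):
    returns the probabilities that e comes from E^(a) (clues kept),
    E^(b) (clues masked), or E^(c) (random spans masked).
    W is a hypothetical d x 3 weight matrix."""
    logits = [sum(e_i * W[i][k] for i, e_i in enumerate(e)) for k in range(3)]
    return softmax(logits)
```

In the real setup e comes from the PLM's contextualized embeddings of O; here a two-dimensional toy embedding suffices to show the mechanics.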

Pre-training Data
We use Wikipedia as the source of texts and Wikidata as the knowledge base. We split the Wikipedia texts into natural paragraphs and then align each paragraph with subject-predicate-object triplets in Wikidata. Each aligned object that is deterministic based on Wikidata is a deterministic span, and all the subjects and predicates that correspond to the deterministic span are deterministic clues. The paragraphs with identified deterministic spans and clues are then used for pre-training.
We first employ TREX (Elsahar et al., 2018), which provides alignments between texts and triplets, to construct a preliminary dataset named Partial data. About 46.1% of the triplets are non-deterministic and ignored in the partial data. TREX provides aligned triplets for only the abstract paragraph (first paragraph) in Wikipedia. We enlarge the data size by processing all the paragraphs in Wikipedia; the detailed process is in Appendix B. The dataset that involves all the paragraphs is referred to as Full data.

Evaluation Tasks and Datasets
We first evaluate the proposed method with cloze-style QA. Then we adopt two other knowledge-intensive tasks, closed-book QA and extractive QA, to evaluate the PLMs' ability to capture and understand factual knowledge. Every cloze-style question is obtained by instantiating an artificial template on a fact. For example, the question "Keanu Reeves is a citizen of [MASK]" is constructed from the template "[X] is a citizen of [Y]" and the triplet (Keanu Reeves, citizen of, Canada). Filling the mask with the correct token "Canada" is regarded as successfully capturing the corresponding fact.
We use the cloze-style questions and evaluation metrics from PARAREL (Elazar et al., 2021). PARAREL queries every fact with 8.64 different prompts on average. The prediction consistency is calculated by first grouping every two different prompts into a prompt pair (n(n − 1)/2 pairs can be obtained from n different prompts). The percentage of prompt pairs that obtain the same result then quantifies the consistency. The overall factual knowledge capture performance is measured by jointly considering the prediction accuracy and consistency. Table 3 shows the statistics of the data from PARAREL.
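The pairwise consistency computation can be sketched directly (a simplified reading of the PARAREL metric; the official implementation may differ in details):

```python
from itertools import combinations

def consistency(predictions):
    """Pairwise consistency over a PLM's predictions for the same fact
    under n different prompts: the fraction of the n(n-1)/2 prompt
    pairs that yield the same answer."""
    pairs = list(combinations(predictions, 2))
    return sum(1 for a, b in pairs if a == b) / len(pairs)
```

For example, three prompts answered ("Spielberg", "Spielberg", "Nolan") form three pairs, of which one agrees, giving a consistency of 1/3.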
We also employ four cloze-style datasets from (Cao et al., 2021). Table 4 shows the corresponding statistics. LAMA denotes the cloze-style dataset in (Cao et al., 2021) that is similar to (Petroni et al., 2019). The distribution of the ground-truth answers in LAMA is imbalanced, providing a shortcut for PLMs to achieve good performance by selecting high-frequency entities as output. Therefore, Cao et al. (2021) propose the WIKI-UNI dataset, where the ground-truth answers follow a uniform distribution. The ground-truth answer sometimes has a literal overlap with the question. For example, in "New York University is located in [MASK] city," the right answer "New York" is leaked verbatim in the question. So Cao et al. (2021) filter out the questions that overlap with their answers and obtain two more datasets based on LAMA and WIKI-UNI, indicated with the suffix "w/o leakage".
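A sketch of the leakage-filtering idea (the case-insensitive substring check below is an assumption; the actual criterion used by Cao et al. (2021) may be more elaborate):

```python
def has_leakage(question: str, answer: str) -> bool:
    """True when the answer string already appears in the question, as
    in the "New York" example above."""
    return answer.lower() in question.replace("[MASK]", "").lower()

def filter_leaky(dataset):
    """Keep only (question, answer) pairs without answer leakage."""
    return [(q, a) for q, a in dataset if not has_leakage(q, a)]
```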

Closed-book QA
We also use closed-book QA to test the ability of PLMs to capture factual knowledge. As proposed in (Wang et al., 2021; Roberts et al., 2020; Lewis et al., 2021), closed-book QA resembles a student taking a closed-book exam: the input of the model is the question only, and the model needs to generate the answer directly without seeing any other evidence. This task requires the model to generate arbitrary strings as answers, so we employ BART, whose decoder can generate text, as the base model for this task. We use the closed-book QA datasets from (Karpukhin et al., 2020), as shown in Table 5.

Extractive QA
Extractive QA is also known as machine reading comprehension (Liu et al., 2019a). The task is to search for and extract the answer span from the input passage for the input question. It evaluates the ability of PLMs to understand the facts provided in passages. We employ the six extractive QA datasets from MRQA (Fisch et al., 2019); Table 6 presents a summary of the datasets. Following the setting in (Ram et al., 2021a), we use the development sets for testing.

Baselines
We continuously pre-train RoBERTa-base (Liu et al., 2019b) and BART-base (Lewis et al., 2020) from their official checkpoints with different masking methods:
• Random token: Mask random tokens in the tokenized text (Devlin et al., 2018).
• Whole word: Mask random words. All the tokens in the randomly selected words are masked at once (Cui et al., 2019).
• Salient span: Mask a span aligned with the subject or object; both deterministic and non-deterministic samples are included (Guu et al., 2020).
• Deterministic: The proposed deterministic masking that masks a deterministic object, including only deterministic samples.
The above four models are trained with the mask-filling task only. The models pre-trained with clue contrastive learning and clue classification together with deterministic masking are denoted as "+ Con & Cls." To further explore the potential of the proposed methods, we train the model on the full data with all the proposed methods, denoted as "+ Full data". We also introduce KEPLER-base as a KB-enhanced baseline for comparison.
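The masking baselines above differ mainly in which positions get masked; a minimal sketch (the token lists and span indices are illustrative):

```python
import random

def mask_random_tokens(tokens, k, rng):
    """Random token masking: replace k randomly chosen tokens with [MASK]."""
    idx = set(rng.sample(range(len(tokens)), k))
    return ["[MASK]" if i in idx else t for i, t in enumerate(tokens)]

def mask_span(tokens, start, end):
    """Span masking: mask tokens[start:end]. Salient-span masking picks any
    aligned entity span; deterministic masking picks only a span aligned
    with a deterministic object."""
    return ["[MASK]" if start <= i < end else t for i, t in enumerate(tokens)]
```

Whole-word masking is the same span operation applied to whole-word boundaries instead of arbitrary token positions.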

Cloze-style QA
Masking strategies Tables 7 and 8 present the results on cloze-style QA. We can see that random token masking gains some performance improvement, as does whole word masking. We think this is because the input texts, which come from Wikipedia, are formal descriptions of facts. Training on such texts helps shift the domain of PLMs toward better generating factual words. The random token and whole word masking serve as solid baselines that focus the comparison on masking strategies, eliminating the confounders brought by extra pre-training on Wikipedia.
The salient span masking and deterministic masking both mask entity spans. The difference is that deterministic masking further requires the relationship between the remaining context and the masked span to be deterministic, driving PLMs to learn to infer based on the deterministic clues. The results show that the PLMs achieve much better results with deterministic masking, indicating that the deterministic relationship is valuable for recovering factual spans robustly.
The proposed pre-training tasks The clue contrastive learning and the clue classification, which aim to strengthen the deterministic relationship, provide further performance improvements (denoted as + Con & Cls). Finally, the full data with all the proposed methods brings the most significant improvement. The proposed pre-training models also outperform KEPLER-base.

Out-of-domain evaluation
To analyze the improvement in depth, we split the probing questions into in-domain and out-of-domain according to whether the pre-training corpus covers the corresponding triplets in the questions. As Table 7 shows, the three random-based masking methods (Random token, Whole word, and Salient span) boost performance on in-domain questions but get stuck on the out-of-domain questions. It is natural that the PLMs can answer the questions that are involved in pre-training. Surprisingly, although the out-of-domain questions are inaccessible in the pre-training corpus, the deterministic masking also gains a performance improvement (3-4%), indicating that the deterministic relationship could help PLMs to better recollect the facts learned implicitly.

Table 8: The results on cloze-style QA datasets from (Cao et al., 2021). The performance is measured by accuracy.

Closed-book QA
We fine-tune the continuously pre-trained BART-base on the closed-book QA task. The metrics are EM (Exact Match) and F1 from (Rajpurkar et al., 2016). Table 9 shows the comparison results of the different strategies.
Closed-book QA is more difficult than cloze-style QA since the models need to generate answers without any extra hints; e.g., the answer length is indicated by the number of [MASK]s in cloze-style QA, while the models need to predict the answer length in closed-book QA. The input-output format of closed-book QA differs from pre-training, so we need to fine-tune PLMs to recall facts based on natural questions to fit this format. Table 9 shows the evaluation results. Generally, the proposed methods outperform the baselines, demonstrating that the proposed methods can help a PLM that uses an encoder-decoder architecture to capture and recall factual knowledge.

Extractive QA
We fine-tune the models based on RoBERTa-base for extractive QA. Following (He et al., 2020; Joshi et al., 2020), we employ the MRQA data with two different settings: (a) Separate: the models are trained and tested on every QA dataset separately; (b) Combine: all the training samples from the six datasets are merged in training, and the fine-tuned models are then evaluated on each dataset respectively. Table 10 shows the evaluation results; the metrics are averaged over the six development sets.
In extractive QA, the input includes a question and the supporting evidence to answer it. So the models do not have to recollect the essential evidence but should put more effort into understanding the evidence.
Table 10: The performance on the extractive QA task.
Due to the difference in the input-output format between MLM and the span extraction task, changing the masking method has somewhat limited effects on the performance here. Averaging over six different MRC datasets and the hyperparameter search (in Appendix C) could further diminish the performance differences between the models. However, the proposed methods still show slight advantages in the comparison, demonstrating that learning the deterministic relationship could also help to comprehend factual knowledge.

Related Work
Pre-training on large-scale unlabeled text helps PLMs capture meaningful knowledge and benefits the downstream tasks accordingly (Brown et al., 2020; Radford et al., 2018). BERT (Devlin et al., 2018) proposes Masked Language Modeling (MLM), in which the model needs to recover some masked tokens based on the remaining context. The effectiveness of MLM makes BERT the starting point for fitting many downstream tasks (Chen et al., 2020). Afterward, several works (Cui et al., 2019; Joshi et al., 2020; Levine et al., 2020; Sun et al., 2019) have explored how masking methods affect performance and have obtained further improvements. These works push the limit of MLM and show the importance of designing better masking strategies.
On the other hand, some pre-training tasks other than MLM have been proposed. Clark et al. (2019) train the model to distinguish replaced words from the original words in the context. Xiong et al. (2019) and He et al. (2020) let factual spans be the replacement candidates. Qin et al. (2020) contrast the representations between different entities and relations. This paper views the masking methods from another perspective: whether the remaining context can uniquely determine the masked span. Accordingly, we propose a deterministic masking strategy that masks deterministic spans in MLM samples. Moreover, we design clue contrastive learning and clue classification as pre-training tasks to help PLMs identify the deterministic clues for the masked span and contrast them with the non-deterministic ones. Finally, we evaluate the performance of the proposed model on various downstream tasks.

Conclusion
This paper proposes to train PLMs to learn a deterministic input-output relationship in MLM to improve PLMs at capturing factual knowledge. The deterministic relationship ensures the masked content in MLM samples is deterministically predictable based on the remaining context. To further enhance the deterministic relationship, we design a pre-training task, clue contrastive learning, that encourages PLMs to give more confident predictions when the input keeps deterministic clues, and clue classification to train PLMs to predict whether the deterministic clues exist. Extensive experiments show that the proposed methods can improve the accuracy and consistency of factual knowledge capturing and boost the performance on two other knowledge-intensive tasks.

Limitations
We summarize this paper's main limitations as follows. First, this study focuses on enhancing the deterministic relationship but does not explore the non-deterministic relationships. The non-deterministic relationships also play essential roles in tasks such as semantic role labeling and emotion recognition, where the proposed methods may not be helpful. Second, due to the diversity and richness of natural language, we cannot perfectly recognize the deterministic clues and spans in texts; we have to account for the noise in this recognition when designing the pre-training tasks. Finally, we continuously pre-train PLMs on only Wikipedia text, somewhat narrowing down their domain. Constructing more pre-training samples by the proposed procedure (Procedure 1 in the Appendix) could be better. Moreover, we can use the current pre-training samples (based on Wikipedia) to train an "interpolation model" that can tag the deterministic clues and spans in input texts. The interpolation model can also be used to enlarge the pre-training data.
Table 11: The evaluation results on PARAREL. "Object" denotes the baseline that masks and predicts the objects. "Deterministic" denotes the MLM baseline that uses deterministic masking. "+ Con" is the baseline that uses the clue contrastive learning with deterministic masking.

Appendix A Ablation Study
How does masking objects help in factual knowledge capturing? As described in Section 3.2.1, the cloze-style questions that we use to probe the factual knowledge in PLMs are constructed by combining subject-predicate-object triplets with artificial templates. Due to the conventions in template construction, the objects have more opportunities to be the answer than the predicates and subjects. So focusing on recovering objects in pre-training may also benefit cloze-style QA. The proposed deterministic masking naturally masks more objects in pre-training because of the rules we designed to identify the deterministic span.
Both masking objects and learning the deterministic relationship could contribute to the improvements of deterministic masking.
To investigate and separate their contributions, we introduce a baseline "Object" that masks and predicts only object spans in pre-training. Table 11 shows the evaluation results. We can see that the Object baseline performs better than the Salient span baseline on factual knowledge capture, revealing that masking objects indeed improves performance. Nevertheless, the deterministic masking (denoted as "Deterministic") achieves better results, indicating that both masking objects and learning the deterministic relationship contribute positively to factual knowledge capture.
The effectiveness of the clue contrastive learning and clue classification To reveal the contributions of the proposed pre-training tasks separately, we introduce a baseline that only uses the clue contrastive learning, referred to as "+ Con" in Table 11. "+ Con & Cls" denotes the PLM that uses the clue classification in addition to the clue contrastive learning. We can see that the performance increases as we apply the proposed methods incrementally.
The improvements on deterministic and non-deterministic cloze-style questions The dataset from PARAREL includes only the N-1 relations, while the LAMA dataset from (Petroni et al., 2019) (Tables 4 and 8) includes both the N-1/1-1 and N-M relations. To reveal the improvements in terms of relation types, we separate the samples into N-1/1-1 and N-M based on the relation types and report the results separately, as shown in Table 12. Similar to the deterministic relationship we use, the ground-truth object is unique when the relation type is N-1 or 1-1. The results show that the improvement on the N-1/1-1 relations is more significant than on the N-M relations when using the proposed methods.

B Pre-training Data Construction
Procedure 1 shows how we construct the pre-training data, including entity linking, predicate matching, triplet aligning, and deterministic relationship checking. Each text piece t is a paragraph in Wikipedia. We use the entity linker provided in (van Hulst et al., 2020), represented as ENTITYLINKER, to identify all the entities in the paragraph. Wikidata defines 12,043 aliases for 8,529 predicates. The function PREDICATEMATCHER searches for the substring corresponding to a predicate by comparing the predicate's aliases with all the substrings in the text. The identified predicate span is the nearest match whose edit distance is less than two.
Table 12: The detailed results on the LAMA dataset in (Cao et al., 2021), reported separately with respect to relation types: N-1/1-1 or N-M.
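The predicate-matching step can be sketched as follows; the Levenshtein distance is standard, while scanning word n-grams of up to three words is an illustrative simplification of "all the substrings in the text":

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def match_predicate(text: str, aliases, max_dist: int = 1):
    """Return (alias, distance, span) of the nearest candidate substring
    whose edit distance to a predicate alias is below two. Candidate
    spans of up to 3 words are an assumption for this sketch."""
    words = text.split()
    best = None
    for n in range(1, 4):
        for i in range(len(words) - n + 1):
            cand = " ".join(words[i:i + n])
            for alias in aliases:
                d = edit_distance(cand.lower(), alias.lower())
                if d <= max_dist and (best is None or d < best[1]):
                    best = (alias, d, cand)
    return best
```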
After recognizing the entities and predicates, we combine every entity pair with every predicate as a triplet (entity, predicate, entity), enumerate all the possible combinations, and keep the ones that exist in the KB as the triplets aligned with the paragraph. Line 7 checks whether the object is deterministic by querying the KB. We record each obtained deterministic sample in the format (subject s, predicate p, object o, text t, entities E).
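The alignment step above amounts to an exhaustive enumeration over recognized entities and predicates; a minimal sketch (the set-of-tuples KB is an assumption):

```python
def align_triplets(entities, predicates, kb_triplets):
    """Enumerate every (entity, predicate, entity) combination recognized
    in the paragraph and keep the ones that exist in the KB.
    `kb_triplets` is a set of (subject, predicate, object) tuples."""
    aligned = []
    for s in entities:
        for o in entities:
            if s == o:
                continue  # a subject cannot be its own object here
            for p in predicates:
                if (s, p, o) in kb_triplets:
                    aligned.append((s, p, o))
    return aligned
```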
The baselines use the pre-training samples as follows (the Deterministic, Random, and Whole word items are described at the end of this appendix):
• Salient span (mask entities randomly): Get a sample (s, p, o, E, t) from D_ssm and randomly mask an aligned entity in E.
Although the masking granularity differs across the baselines, we keep the length of the masked content as similar as possible for a fair comparison.
Then we introduce how the two proposed pre-training tasks use the data. In the clue contrastive learning, s and p are the deterministic clues and are masked in the contrastive sample. If the same o has more than one deterministic clue in t, e.g., multiple deterministic clues for the same o given by different triplets, all the deterministic s and p are considered deterministic clues and masked in the contrastive sample (b). In the clue classification, the number of randomly masked tokens in sample (c) is the same as in the contrastive sample.
Procedure 1 is used to generate the full data (summarized in Table 2). We obtain the partial data similarly, except that we do not need the ENTITYLINKER (at Line 3 in Procedure 1) and directly use the entity-text alignments provided in TREX.

C Hyperparameters
Pre-training For the baselines trained on the partial data, the batch size is set to 512, the learning rate is 3 × 10⁻⁵, and the number of total training steps is 50,000. There are 200,000 training steps for the final model on the full data.
Extractive QA In the experiments on extractive QA, we find that the model's performance is sensitive to hyperparameters. We conduct a grid search over the learning rate and batch size. In the separate setting, the learning rate is searched over {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵} and the batch size over {16, 32, 64} when fine-tuning every PLM on every dataset, and the number of epochs is set to 4. In the combine setting, the learning rate ∈ {2 × 10⁻⁵, 3 × 10⁻⁵}, the batch size ∈ {32, 64}, and the number of epochs is set to 2. We save a model checkpoint every 5,000 steps. The best model is selected by evaluating all the checkpoints. We use the code released by Ram et al. (2021b).
Closed-book QA The learning rate is set to 5 × 10⁻⁵. The number of training steps is set to 100,000 for NaturalQuestions and TriviaQA and 40,000 for WebQuestions.

Figure 2b masks both the deterministic clues and the deterministic span. The remaining context in Figure 2a contains fewer [MASK]s and provides more information, naturally reducing the uncertainty in prediction. So PLMs should assign a higher probability to the ground truth given the context in Figure 2a than given Figure 2b. Formally, we use S and P to denote the deterministic clues (subject and predicate) and O to denote the masked deterministic span (object). R represents the random spans in the context other than S, P, and O. The objective function that needs to be maximized is:

P(O = ô | S = ŝ, P = p̂, R = r̂) − P(O = ô | S = [MASK], P = [MASK], R = r̂)  (1)

... War Horse is an American war film directed by [MASK]s ...
Deterministic sample: masks the deterministic span (object) and keeps the deterministic clues (subject and predicate).
(b) "... [MASK]s is an American war film [MASK]s [MASK]s ..." Contrastive sample without deterministic clues: masks both the deterministic clues and the deterministic span.

Figure 2: The two samples in clue contrastive learning. The first sample (a) has a more informative context, so the PLM should be more confident when predicting the masked object O. The texts with purple background denote the spans other than entities and relations.
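The exact objective function is truncated in this excerpt, so the following is only a minimal sketch of the intuition behind clue contrastive learning: the PLM's probability of the gold object should be higher when the deterministic clues are kept (sample a) than when they are masked (sample b). The margin-based hinge formulation and the function name are assumptions for illustration, not the paper's actual loss.

```python
def clue_contrastive_loss(p_keep, p_mask, margin=0.1):
    """Hinge-style penalty that is zero when the PLM already assigns a
    sufficiently higher gold-object probability to the clue-keeping
    sample (a) than to the clue-masking sample (b).

    p_keep: P(O = o_gold | deterministic clues kept)   -- sample (a)
    p_mask: P(O = o_gold | deterministic clues masked) -- sample (b)
    margin: required probability gap (illustrative value)
    """
    return max(0.0, p_mask - p_keep + margin)
```

For instance, a confident model with p_keep = 0.8 and p_mask = 0.3 incurs no loss, while the reversed ordering is penalized in proportion to the violation.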

S = [MASK] and P = [MASK] denote replacing the deterministic clues with [MASK]s. ŝ, p̂, ô, and r̂ are the ground-truth values of S, P, O, and R, respectively. P(O = ô | ·) denotes the probability that the PLM correctly predicts the masked span O, i.e., the average probability that the PLM assigns to the ground-truth tokens. It is calculated by a Language Model Head (LM Head) based on the embeddings of O from the PLM.
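The quantity P(O = ô | ·) above can be sketched as follows: softmax the LM Head logits at each masked position and average the probabilities assigned to the ground-truth tokens. This is a self-contained NumPy illustration; the function name and the toy inputs are ours, not from the paper's code.

```python
import numpy as np

def gold_span_probability(logits, gold_ids):
    """Average probability the LM head assigns to the ground-truth
    tokens of the masked span O.

    logits:   (span_len, vocab_size) scores, one row per masked position
    gold_ids: (span_len,) ground-truth token ids for the span
    """
    # numerically stable softmax over the vocabulary at each position
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # pick the gold token's probability at each position, then average
    return float(probs[np.arange(len(gold_ids)), gold_ids].mean())
```

With uniform logits over a vocabulary of size V, each position contributes 1/V, so the returned value is 1/V regardless of the gold ids.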

Figure 3: A sample that masks some random spans R (with purple background) in the context.

E^(a), E^(b), and E^(c) represent the inputs that (a) keep the deterministic clues S and P, (b) mask the deterministic clues, and (c) mask random spans R, respectively. PLM(·) denotes the PLM that outputs a contextualized embedding for each input token. PLM(·)[O] denotes grabbing the embedding vectors corresponding to O from the PLM's output. Since the three input samples share the same O, the number of token-level embedding vectors is the same in E^(a), E^(b), and E^(c).
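The gather operation PLM(·)[O] amounts to indexing the PLM's output matrix at the token positions of O. A minimal NumPy stand-in (the function name and toy embedding matrix are illustrative assumptions):

```python
import numpy as np

def gather_span_embeddings(token_embeddings, span_positions):
    """Mimic PLM(.)[O]: select the contextualized embedding vectors at
    the token positions belonging to the masked span O.

    token_embeddings: (seq_len, hidden) PLM output for one input
    span_positions:   token indices of O
    """
    return token_embeddings[np.asarray(span_positions)]
```

Because the three inputs share the same span positions for O, calling this on E^(a), E^(b), and E^(c) yields matrices of identical shape, which is what makes them directly comparable.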
3.2.1 Cloze-style QA
Following Elazar et al. (2021); Cao et al. (2021); Petroni et al. (2019), we use cloze-style questions to probe the factual knowledge in PLMs. The PLMs need to recall the captured factual knowledge to fill in the masks. Cloze-style QA uses the same input-output format as MLM, so we can evaluate the factual knowledge capturing performance directly, without fine-tuning the PLMs.
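Constructing such a probe from a fact is a simple template fill. The sketch below assumes LAMA-style templates with [X] and [Y] placeholders for the subject and object (the template convention and function name are our assumptions, not the paper's code):

```python
def make_cloze_question(subject, template, mask_token="[MASK]"):
    """Build a cloze-style probe for a (subject, predicate, object)
    fact: the subject fills [X] and the object slot [Y] is replaced by
    the mask token, which the PLM must fill in without fine-tuning.
    """
    return template.replace("[X]", subject).replace("[Y]", mask_token)
```

For example, the template "The director of the film [X] is [Y]." with subject "Saving Private Ryan" yields exactly the probe quoted in the introduction.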
• Deterministic (mask the deterministic object to train MLM): Get a sample (s, p, o, E, t) from D_d, mask the span corresponding to o, and train PLMs to predict o based on the remaining context.
• Random (mask tokens randomly): Get a sample from D_d, tokenize t, calculate the number of tokens in o, denoted as TOKENCOUNT(o), and randomly sample TOKENCOUNT(o) tokens to be masked in the MLM training.
• Whole word (mask whole words randomly): Get a sample from D_d, calculate the number of words in o (separated by spaces), denoted as WORDCOUNT(o), and randomly mask WORDCOUNT(o) words in t.
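The three masking strategies above can be sketched in a few lines. This is a toy illustration that assumes t is pre-split into tokens (or words) and the object span o is given as a half-open token-index interval; the function names and the MASK constant are ours:

```python
import random

MASK = "[MASK]"

def mask_deterministic(tokens, span):
    """Deterministic: mask exactly the tokens of the object span o."""
    lo, hi = span  # [lo, hi) token indices of o
    return [MASK if lo <= i < hi else tok for i, tok in enumerate(tokens)]

def mask_random_tokens(tokens, span, rng=random):
    """Random: mask TOKENCOUNT(o) tokens sampled anywhere in t."""
    n = span[1] - span[0]
    picked = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in picked else tok for i, tok in enumerate(tokens)]

def mask_whole_words(words, n, rng=random):
    """Whole word: mask WORDCOUNT(o) space-separated words in t."""
    picked = set(rng.sample(range(len(words)), n))
    return [MASK if i in picked else w for i, w in enumerate(words)]
```

All three variants mask the same number of units, so the comparison in the ablation isolates *where* the masks fall rather than *how many* there are.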

[Figure residue: a knowledge base triplet (subject, predicate, object) and the two prompts built from it: 1. "War Horse is an American war film directed by [MASK]s." 2. "[MASK]s is an American war film ..."]

Table 2: Statistics of the pre-training data.

Table 2 shows the statistics of the two pre-training datasets. For efficiency reasons, we use the partial data to train the baselines in the ablation study. The full data is for our final model.

Table 4: Statistics of the four cloze-style QA datasets.

Table 5: Statistics of the closed-book QA datasets.

Table 6: Summary of the extractive QA datasets.

Table 7: The factual knowledge capturing performance, evaluated on the cloze-style QA dataset PARAREL. Acc. is the accuracy, Consis. denotes the prediction consistency when changing the prompts, and Joint denotes the metric that jointly measures accuracy and consistency. Out-of-domain represents the set of questions whose triplets do not appear in the pre-training data.
Table 10 shows the evaluation results.

Table 9: The performance on the closed-book QA datasets.
Procedure 1 Deterministic Sample Construction
Require: Text collection T, Knowledge base K, ENTITYLINKER
Output: Deterministic sample collection D_d
Output: Salient span masking sample collection D_ssm
1: D_d ← {}, D_ssm ← {}
2: for all text piece t in T do
3:     E ← ENTITYLINKER(t)  ▷ link the entity mentions in t to K
       ...
       for all predicates r that connect (e_i, e_j) do  ▷ Triplet (e_i, r, e_j) exists in K
7:         if e_j has only one match when querying K with e_i and r as the subject and predicate then
               for all alias strings a of r in K do  ▷ WikiData holds an alias string collection for every predicate
       ...
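The uniqueness test at line 7 of Procedure 1 can be sketched with a toy dictionary standing in for the knowledge base (the helper name, the dict-based KB format, and the example triples are illustrative assumptions, not the paper's implementation, which queries WikiData):

```python
def is_deterministic(kb, subject, predicate):
    """Line-7 check of Procedure 1: the object qualifies as a
    deterministic span only if it is the single match when querying
    the KB with (subject, predicate).

    kb: dict mapping (subject, predicate) -> set of object entities,
        a toy stand-in for a real KB such as WikiData.
    """
    objects = kb.get((subject, predicate), set())
    return len(objects) == 1

# toy KB: the director is unique, but cast members are not
kb = {
    ("War Horse", "director"): {"Steven Spielberg"},
    ("War Horse", "cast member"): {"Jeremy Irvine", "Emily Watson"},
}
```

Under this check, "directed by" clues make the object deterministically inferable, while a one-to-many predicate like "cast member" is rejected, matching the paper's motivation for stable masking patterns.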