How Pre-trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis

Recently, there has been a trend of investigating the factual knowledge captured by Pre-trained Language Models (PLMs). Many works show the PLMs' ability to fill in missing factual words in cloze-style prompts such as "Dante was born in [MASK]." However, it remains unclear how PLMs generate the correct results: do they rely on effective clues or on shortcut patterns? We try to answer this question with a causal-inspired analysis that quantitatively measures and evaluates the word-level patterns that PLMs depend on to generate the missing words. We examine words that have three typical associations with the missing words: knowledge-dependent, positionally close, and highly co-occurred. Our analysis shows that: (1) PLMs generate the missing factual words more by relying on the positionally close and highly co-occurred words than on the knowledge-dependent words; (2) the dependence on the knowledge-dependent words is more effective than that on the positionally close and highly co-occurred words. Accordingly, we conclude that PLMs capture factual knowledge ineffectively because they depend on the inadequate associations.


Introduction
Do Pre-trained Language Models (PLMs) capture factual knowledge? The LAMA benchmark (Petroni et al., 2019) answers this question by quantitatively measuring the factual knowledge captured in PLMs: query PLMs with cloze-style prompts such as "Dante was born in [MASK]." Filling in the mask with the correct word "Florence" is considered a successful capture of the corresponding factual knowledge, and the percentage of correct fillings over all the prompts can be used to estimate the amount of factual knowledge captured.

Figure 1: The associations we investigated. The underlined words are the missing words that need to be generated. The bold words, which hold specific associations with the missing words, are considered as the word-level patterns that PLMs may use to generate the missing words.

Figure 2: The overview of the proposed analysis framework. The dependence measure quantifies how much the PLMs depend on each association to capture factual knowledge in pre-training. The effectiveness measure evaluates whether the dependence on an association is good for the factual knowledge performance in probing.
Definition 3 Highly Co-occurred (HC): The remaining words have a higher co-occurrence frequency with the missing words.
Question 1 investigates how much PLMs depend on a specific group of remaining words to predict the missing words in pre-training samples. We select the remaining words to be investigated according to their association with the missing words, and propose a causal-inspired method to quantify the word-level dependence in each sample. The average dependence on the remaining words that hold the same association with the missing words, taken over all the samples, indicates how much PLMs rely on this association to predict the missing words; we refer to this average as the dependence on the association. We name this analysis the dependence measure.
In Question 2, we reveal the effectiveness of the dependence through the correlation between the quantified dependence on an association and the factual knowledge capturing performance. The performance is probed with additionally crafted cloze-style prompts (Elazar et al., 2021a). The more positively the dependence on an association correlates with the probing performance, the more effective this association is. We refer to this second analysis as the effectiveness measure. According to the experimental results, we have the following observations:

Observation 1 The PLMs depend more on the positionally close and highly co-occurred associations than on the knowledge-dependent association to capture factual knowledge.
Observation 2 Depending on the knowledge-dependent association is more effective for factual knowledge capture than depending on the positionally close and highly co-occurred associations.
By connecting the two observations, we can answer the question of "how PLMs capture factual knowledge": the PLMs capture factual knowledge ineffectively, since they depend more on the PC and HC associations than on the KD association.
The contributions of this paper can be summarized as follows: (1) We quantify the word-level dependence for mask filling with a causal-inspired method, quantitatively revealing the word-level patterns that PLMs use to predict the missing words.
(2) We compare the effectiveness of the dependence on different associations, which provides direct insights for improving PLMs' factual knowledge capture. (3) This paper introduces causal theories into the analysis of PLMs by formulating the effect measurement process in masked language modeling, paving the way to measuring causal effects between entities or events described in natural language.

Overview
We take a quick overview of our two-fold analysis with a running example in Figure 2. Figure 2a illustrates how to measure the dependence on the remaining words "Columbus" and "died" when predicting the missing words "20 May 1506." We first let the PLM generate the missing words based on the original input, then mask the remaining words in the input and let the PLM generate again. The difference between these two predictions is quantified and used to measure the dependence. The remaining and missing words hold the knowledge-dependent association in this sample. We repeat this measure on all the samples whose remaining and missing words have the KD association; the dependence on the KD association can then be estimated by the average of the quantified differences.

Figure 2b measures the effectiveness of the dependence on each association by calculating the correlation coefficient between the dependence and the probing performance. Following Petroni et al. (2019) and Elazar et al. (2021a), the probing performance is indicated by the prediction accuracy and consistency when querying the same fact with different prompts. Since the dependence on each association is quantified in the dependence measure, we can calculate the correlation coefficient between the dependence and the performance over all the samples. This correlation measures whether the dependence on an association is harmful or beneficial to the performance, quantitatively showing the effectiveness of the dependence on that association.

Section Outline We organize the rest of this section as follows. Section 2.2.1 formalizes how we quantify the dependence with causal effect estimation. Section 2.2.2 details how we build the probing samples for different associations. Section 2.3.1 introduces the metrics used to indicate the performance of factual knowledge capture. Section 2.3.2 describes the details of the effectiveness measure of associations.

Causal Effect Estimation for PLMs
To study the causal effect of the different input words, we build a Structural Causal Model (SCM) for the missing-word generation process and apply interventions on some input words to estimate their effect quantitatively. We consider the missing words as outcome words and the remaining words that hold a certain association (e.g., positionally close) with the outcome words as treatment words. We separate the words in a sentence into three groups: treatment words W_t, outcome words W_o, and context words W_c (with values w_t, w_o, and w_c, respectively), and represent the word generation process with the following structural equations:

w_c = f(I), w_t = PLM(w_c), w_o = PLM(w_c, w_t). (1)

Equation 1 formulates the following data generation process: (1) Sample a sentence from the natural text space I and get the context words w_c using the function f. (2) Generate the treatment words w_t by the PLM based on w_c only. (3) Generate the outcome words w_o based on both w_c and w_t.
To obtain the quantitative causal effect of the treatment words W_t on the outcome words W_o, we apply the do-calculus do(·) (Pearl, 2009) on W_t to introduce interventions; do(·) denotes the operation of forcibly setting the value of W_t. The causal effect of W_t on W_o can then be estimated by the Average Treatment Effect (ATE) (Rubin, 1974):

ATE = E[W_o | do(W_t = ŵ_t)] − E[W_o | do(W_t = w_m)]. (2)

Accordingly, we define the ATE for PLMs as:

ATE = Σ_s P(s) [PLM(w_c, ŵ_t) − PLM(w_c, w_m)], (3)

where ŵ_t is the ground truth of the treatment words W_t (the original value of W_t without intervention), w_m is the intervention value (several [MASK]s) for W_t, which we use to simulate removing the ground-truth value ŵ_t from the input, and P(s) denotes the probability of selecting the sample s, consisting of w_t, w_o, and w_c, from I. PLM(·, ·) denotes the output of the PLM for a given input. Table 1 illustrates the interventions on the SCM for the different associations.
The raw output of a PLM is a probability distribution over a fixed vocabulary. We transform the output into a truncated reciprocal rank to quantify the differences:

PLM_k(w_c, w_t) = 1 / rank_ŵ_o if rank_ŵ_o ≤ k, else 0, (4)

where ŵ_o is the ground-truth outcome words and rank_ŵ_o is the rank position of ŵ_o according to the generation probability output by PLM(w_c, w_t). We set k to 100 and use PLM_100 in place of PLM in Equation 3 to calculate the ATE. Since the ATE reflects the effect of W_t on the prediction of W_o, it can be regarded as a quantitative estimation of how much the PLM depends on W_t when generating W_o.
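As a sketch, the dependence measure amounts to averaging the difference in truncated reciprocal rank (Equation 4) between the intact input and the input with the treatment words masked. The `plm_probs` callable and the sample layout below are our own illustrative assumptions, not the paper's code:

```python
# Minimal sketch of Equations 3 and 4, assuming a hypothetical
# `plm_probs(input)` that returns a probability distribution (a list
# indexed by vocabulary id) for the masked outcome position.

def reciprocal_rank_at_k(probs, gold_id, k=100):
    """PLM_k from Equation 4: truncated reciprocal rank of the gold word."""
    # rank of the gold word = 1 + number of words scored strictly higher
    rank = 1 + sum(1 for p in probs if p > probs[gold_id])
    return 1.0 / rank if rank <= k else 0.0

def ate(samples, plm_probs, k=100):
    """Average Treatment Effect over a set of probing samples.

    Each sample provides the input with the treatment words kept
    ("with_treatment") and with them replaced by [MASK]s ("masked"),
    plus the vocabulary id of the ground-truth outcome word.
    """
    effects = []
    for s in samples:
        kept = reciprocal_rank_at_k(plm_probs(s["with_treatment"]), s["gold_id"], k)
        removed = reciprocal_rank_at_k(plm_probs(s["masked"]), s["gold_id"], k)
        effects.append(kept - removed)
    return sum(effects) / len(effects)
```

Averaging `ate` over the sample set of one association yields the quantified dependence on that association.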

Mark words by Associations
Wikipedia is a rich source of knowledge (Thom et al., 2007; Hassanzadeh, 2021), and most PLMs nowadays have been pre-trained on Wikipedia (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019), so we take Wikipedia sentences as pre-training samples to construct the probing samples for the dependence measure. We probe the mask-filling on these sentences to analyze what PLMs rely on when capturing factual knowledge in pre-training.
The outcome we want to observe is the prediction of the factual words in the sentences. To locate the factual words, we align each sentence with a triplet (subject, predicate, object) in the KB. The words that correspond to the object serve as the outcome words W_o for observation, and the remaining words that hold an explicit association with W_o are marked as the treatment words W_t for intervention. For the different associations, W_t is identified as follows: 1. Knowledge-Dependent: all the remaining words that correspond to the subject and predicate in the same triplet as W_o.
2. Positionally Close: the remaining words positionally closest to W_o.
3. Highly Co-occurred: the remaining words that have the highest Pointwise Mutual Information (PMI) (Church and Hanks, 1990) with W_o. The PMI is calculated over all the Wikipedia sentences using the following equation:

PMI(ŵ_o, w_i) = log [ P(ŵ_o, w_i) / (P(ŵ_o) P(w_i)) ], (5)

where ŵ_o is a group of words (a span) and w_i is a single word.
4. Random: we further define a Random (R) association, where W_t is a set of randomly selected remaining words. It provides empirical support for how much arbitrary modifications of the context affect the mask-filling output.
Accordingly, one sentence yields four probing samples, one for each association. The four probing samples share the same W_o but use different words as W_t, showing the dependence on different associations when predicting the same W_o. We ensure that the number of words in W_t is the same across associations. For example, if two words serve as W_t for the KD association, we select the two words closest to W_o as W_t for PC, and the two words with the highest PMI with W_o as W_t for HC. We thus obtain a set of probing samples for each association; the sample sets for the different associations come from the same set of sentences and have the same size.
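The HC selection above can be sketched as follows, assuming a toy whitespace tokenization and sentence-level co-occurrence counts; the function names are ours, and the real implementation computes PMI over all of Wikipedia:

```python
# Illustrative sketch of building the HC probing sample: pick the k
# remaining words with the highest PMI with the outcome span.
import math
from collections import Counter

def pmi_table(sentences, outcome_span):
    """PMI between the outcome span and every other word, over a toy corpus.

    A sentence counts as a co-occurrence only if it contains every word
    of the outcome span (word order ignored, as in the paper).
    """
    n = len(sentences)
    word_count = Counter()   # sentence-level frequency of each word
    joint_count = Counter()  # co-occurrence with the full outcome span
    for sent in sentences:
        words = set(sent.split())
        word_count.update(words)
        if set(outcome_span).issubset(words):
            joint_count.update(words - set(outcome_span))
    p_span = sum(1 for s in sentences if set(outcome_span) <= set(s.split())) / n
    return {
        w: math.log((c / n) / (p_span * word_count[w] / n))
        for w, c in joint_count.items()
    }

def hc_treatment_words(sentence, outcome_span, pmi, k):
    """Select the k remaining words with the highest PMI as W_t."""
    remaining = [w for w in sentence.split() if w not in outcome_span]
    return sorted(remaining, key=lambda w: pmi.get(w, float("-inf")), reverse=True)[:k]
```

Here k is set to the number of treatment words the KD association uses for the same sentence, so the interventions stay comparable.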

Measure the Effectiveness of Associations
This section investigates which association leads to better performance on factual knowledge capture. We first define the metrics used to evaluate the performance, then measure the effectiveness of an association by relating the dependence on this association to the probing performance.

Metrics for Factual Knowledge Probing
Section 2.2 uses the original Wikipedia sentences as pre-training samples to quantify the dependence PLMs use to capture the corresponding facts in pre-training. The performance of capturing a fact is probed by having PLMs fill masks in crafted queries. We construct these queries by instantiating templates on triplets (Petroni et al., 2019). T_i(s) denotes the i-th query for the fact corresponding to s. The accuracy mrr of capturing this fact is obtained by averaging over the predictions for the different queries:

mrr(s) = (1/n) Σ_{i=1}^{n} PLM_k(T_i(s)), (6)

where PLM_k(T_i(s)) denotes the truncated reciprocal rank of the ground truth in the PLM's output for query T_i(s), as defined in Equation 4. The consistency of the capture is indicated by the percentage of query pairs that yield the same result (Elazar et al., 2021a):

consistency(s) = (1 / (n(n−1))) Σ_{i≠j} 1_{PLM(T_i(s)) = PLM(T_j(s))}, (7)

where there are n different queries for every fact, giving n(n−1) ordered pairs of predictions in total. PLM(T_i(s)) denotes the top-1 output for query T_i(s), and 1_{PLM(T_i(s)) = PLM(T_j(s))} is an indicator function that takes the value 1 if the PLM returns an identical top-1 prediction for T_i(s) and T_j(s), and 0 otherwise. A PLM scores higher on the consistency metric if it keeps its predictions consistent when queries vary only in surface form; e.g., the two queries "Dante was born in [MASK]" and "The birthday of Dante is [MASK]" should return the same result.
Finally, we evaluate the factual knowledge capture performance by jointly examining the accuracy and consistency (Elazar et al., 2021a):

test(s) = mrr(s) · consistency(s). (8)

test(s) measures the probing performance on the template-based queries. We also define a metric to measure how well the PLMs memorize the missing words in the pre-training samples (Wikipedia sentences):

train(s) = PLM_k(s). (9)
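The accuracy and consistency metrics (Equations 6 and 7) can be sketched as follows, assuming the per-query truncated reciprocal ranks and top-1 predictions have already been collected; the helper names are ours:

```python
# Sketch of the probing metrics for a single fact s.

def mrr(reciprocal_ranks):
    """Accuracy (Equation 6): mean truncated reciprocal rank over the queries."""
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

def consistency(top1_predictions):
    """Consistency (Equation 7): fraction of ordered query pairs that agree."""
    n = len(top1_predictions)
    agree = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and top1_predictions[i] == top1_predictions[j]
    )
    return agree / (n * (n - 1))
```

Both metrics lie in [0, 1], so they can be combined directly into an overall probing score per fact.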

Correlate Performance with Dependence
We have quantified the dependence on each association and defined the metrics for the probing performance in the sections above. We now calculate the Pearson correlation coefficient (Kirch, 2008) between the dependence and the probing performance to reveal the effectiveness of the different associations. An association is considered more effective if the probing performance correlates more positively with the dependence on it. Because only some of the facts have available templates, the samples in the dependence measure without templates are ignored in the calculation. The factual knowledge captured by different PLMs may vary significantly due to differences in model scale, pre-training data, or other settings. To make the correlation coefficients comparable between PLMs, we calculate the correlation only on the factual knowledge captured correctly by each PLM, i.e., only the pre-training samples with train(s) = PLM_k(s) = 1 are involved.

Probing Data and PLMs
We use the T-REx dataset (Elsahar et al., 2018), which aligns KB triplets with Wikipedia sentences, to construct the samples for the dependence measure, following the definitions in Section 2.2.2. We employ the templates from Elazar et al. (2021a) to construct the queries that probe the factual knowledge for the effectiveness measure. Table 2 shows the statistics for the data used in the dependence measure and the effectiveness measure. The PLMs we analyze include BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), SpanBERT (Joshi et al., 2020), and ALBERT (Lan et al., 2019).

Dependence on Associations
The dependence on an association is the average ATE over the probing samples whose treatment words hold that association with the outcome words. Table 3 shows the quantified dependence for the different associations. The accuracy in Table 3 represents the accuracy of recovering the masked factual words in the pre-training samples, revealing how well the PLMs memorize them; it is calculated by Equation 9 with k = 1. We find a general trend across all the picked PLMs: the Positionally Close (PC) association has the dominant effect on the prediction results, the Highly Co-occurred (HC) association comes second, and the Knowledge-Dependent (KD) association has the least effect. The trend does not change much with increasing model scale (large vs. base), additional training data (RoBERTa vs. BERT), or an improved masking strategy (SpanBERT vs. BERT). Consequently, the accuracy drops the most when perturbing the positionally close words and the least when perturbing the knowledge-dependent words. The results provide quantitative evidence for Question 1, "Which association do PLMs depend on to capture factual knowledge?": PLMs prefer the associations founded on positionally close or highly co-occurred words to the knowledge-based clues. This differs from how a conventional KB works, where, e.g., an object can be retrieved by the corresponding subject and predicate.

Correlations between Dependence and Performance
We show the correlation between the dependence on each association and the probing performance in Figure 3. Each point in the figure represents a piece of factual knowledge s; we refer to it as a fact for convenience. The horizontal axis indicates test(s) for the fact, showing the probing performance measured by the effectiveness measure. The vertical axis shows the dependence on the associations when capturing this fact, quantified by the causal effect estimation defined in Section 2.2.1. The straight lines are regression lines, and different associations are shown in different line styles. (We standardize the quantified dependence, denoted as Std. Dependence, and plot a bucket of facts as a single point to show the trends clearly; the correlations without standardization for more PLMs are in Table 8.)
As we can see from the results, the dependence on the KD association positively correlates with the probing performance. The dependence on the HC association has a slightly positive correlation, or sometimes almost no correlation (such as ALBERT in Figure 3d). The PC association holds a negative correlation with the performance.
These results can give an empirical answer to "Is the association on which PLMs depend effective in capturing factual knowledge?:" the more PLMs depend on the Knowledge-Dependent (KD) association, the better PLMs can capture the corresponding factual knowledge. Meanwhile, relying much on the positionally close association is harmful to the probing performance.
The dependence measure results reveal that the PLMs depend most on the positionally close association but least on the knowledge-dependent association. However, in the effectiveness measure, we find that the positionally close association is the most ineffective for factual knowledge capture, while the knowledge-dependent association is the most effective. By connecting the two results, we can conclude the answer to the question in the title: the PLMs do not capture factual knowledge ideally, since they depend more on the ineffective associations than on the effective one.

Case Study
To illustrate the analysis results intuitively, we show two cases with SpanBERT-large in Table 4. The MRR shows the probing performance on the template-based queries (calculated by Equation 6). In Case 1, the knowledge-dependent association has the largest effect, and the predictions are robust across all the template-based probes. In Case 2, however, the positionally close association has the main effect, and the PLM fails to recall the word "England" with the template-based queries.

Discussions
Generality of the Proposed Probing Method Generally, the dependence measure offers a way to measure how much word-level patterns cause the prediction of missing words in Masked Language Modeling (MLM). Because words are readable, directly visible, and can be manipulated directly from the input side, word-level patterns can provide more intuitive interpretations than numeric representation vectors (Elazar et al., 2021b) or neurons (Vig et al.). We use the proposed method to estimate the causal effect of three typical associations in this paper, but the method can be easily adapted to quantify the dependence on any word-level pattern.
Reconsidering "PLM as KB" If we want to use a PLM like a KB, it is worth considering whether the PLM has the same inner workflow as KBs. Prevalent KBs index knowledge as subject-predicate-object triplets and can perform inference with these triplets (Speer et al., 2017; Bollacker et al., 2008; Vrandečić and Krötzsch, 2014). However, we find that the knowledge-dependent association, which represents the process of inferring a missing object from the given subject and predicate, has the lowest dependence in the PLMs. This provides evidence that PLMs work quite differently from KBs and cannot serve stably as KBs for now.
Overfitting and Generalization Figure 4 shows the correlations between the dependence on the associations and the mask-filling accuracy on pre-training samples (referred to as memorizing accuracy). The memorizing accuracy increases most as the dependence on the PC association increases, demonstrating that the more PLMs depend on the positionally close words, the better they can recover the pre-training samples. However, there is an opposite trend in probing performance, as shown in Figure 3. The additionally crafted queries used to evaluate the probing performance are mostly unseen in pre-training. If we consider these queries as the test set and the pre-training samples as the training set, we can conclude that the dependence on the PC association makes the PLMs tend to overfit the training data and degrades generalization on the test set.

Factual Knowledge Capture in Pre-training We want to focus on the pre-training samples that help PLMs capture factual knowledge, so we reconstruct the pre-training samples in which some missing factual words (the object) are predicted based on the factual clues (subject, predicate). We conduct the dependence measure on these samples to investigate how factual knowledge is captured in pre-training. The mask-filling accuracy on these pre-training samples denotes how well PLMs memorize them; we name it "train" in Equation 9 and "Memorizing Accuracy" in Figure 4.

Overlap between Associations
The clues for different associations overlap sometimes, e.g., some remaining words may hold the KD and PC associations with the missing words at the same time.
The overlaps do not impair the estimations, because we use a set of samples to estimate the effect of each association. The samples that hold the same association stay in the same set, and the average causal effect over all these samples is the quantified dependence on this association. The sample sets are quite different for the different associations; Table 5 shows the corresponding overlap statistics.

Related Works
Probing Factual Knowledge in PLMs Factual knowledge probing in PLMs has attracted much attention recently. LAMA (Petroni et al., 2019) is a benchmark that probes the factual knowledge in PLMs with cloze-style prompts and shows PLMs' ability to capture factual knowledge. This ability can be further explored by tuning the prompts used for probing (Jiang et al., 2020; Shin et al., 2020; Zhong et al., 2021). Motivated by the probing results, some recent works analyze the captured factual knowledge from more perspectives. Cao et al. (2021) analyze the distribution of predictions and the answer leakage in probing. Poerner et al. (2020) propose that PLMs may predict based on correlations between surface forms rather than inferring according to facts. Elazar et al. (2021a) reveal that PLMs' outputs are inconsistent when the same fact is queried with different prompts.
This paper proposes a more fine-grained inspection of the word-level patterns in the input. In addition to constructing more challenging probing data or analyzing the outputs in more detail, we try to reveal the inner mechanism of PLMs by intervening on the input and observing the change in the output.

Causal-Inspired Interpretations in NLP A causal-inspired approach to explanation is to generate counterfactual examples and then compare the predictions (Feder et al., 2021a). Feder et al. (2021b) propose a framework for producing explanations for NLP models using counterfactual representations. Vig et al. analyze the effect of neurons (or attention heads) on gender bias using causal mediation analysis. In this paper, we revisit word-level post-hoc interpretation from a causal-effect perspective: we intervene on specified words in the input and measure the difference in the output to estimate the causal effect of these words. Furthermore, we evaluate the effectiveness of different causes by calculating the correlations between their effects and performance. To the best of our knowledge, our work is the first study to probe and evaluate word-level patterns in the factual knowledge capture task.

Conclusion
In this paper, we try to answer the question of how pre-trained language models capture factual knowledge by measuring and evaluating the different associations that PLMs use to capture it. We present three word-level associations in the analysis: knowledge-dependent, positionally close, and highly co-occurred. The analysis results show that PLMs rely more on the ineffective positionally close and highly co-occurred associations when capturing factual knowledge, and somewhat ignore the effective knowledge-dependent clues. These findings indicate that we should pay more attention to the knowledge-dependent association to let PLMs capture factual knowledge better.
We use the T-REx dataset to provide the initial alignment between KB triplets and Wikipedia sentences. We use the aliases in the KB as keys for fuzzy string matching (Levenshtein distance less than 2, stemming before matching, etc.) to align more subjects, predicates, and objects with spans in the sentences. Sentences that have no aligned triplet are filtered out.
Sometimes, the outcome words in a single sentence relate to multiple triplets that satisfy the rules for KD, e.g., there are two groups of remaining words that can each infer the outcome words deterministically based on the KB. We select them all as W_t when probing the KD association and keep the number of masked words the same in the interventions for the other associations. D_KD, D_PC, and D_HC denote the sample sets for the Knowledge-Dependent (KD), Positionally Close (PC), and Highly Co-occurred (HC) associations, respectively.
For the Highly Co-occurred (HC) association, the remaining words with the top-k PMI with the ground-truth outcome words ŵ_o are selected as W_t, where k is the number of words with the KD association in the same sentence. The PMI between words is calculated over all the Wikipedia sentences. If ŵ_o consists of multiple words, only occurrences together with all the words in ŵ_o are counted as co-occurrences, and the order of the words in ŵ_o is ignored for efficiency. Table 5 shows more details about the probing samples.

B More Probing Results
The Pearson correlation coefficients between the dependence on the associations (raw values, without standardization) and the performance are shown in Table 8. The three metrics, accuracy (defined in Equation 6), consistency (defined in Equation 7), and the overall performance metric (defined in Equation 8), are reported respectively. The correlation coefficients between the dependence and the performance are consistent with the slopes of the regression lines in Figure 3. Table 6 shows the accuracy decrease after masking the treatment words when generating the missing words in Wikipedia sentences.