Factual Probing Is [MASK]: Learning vs. Learning to Recall

Petroni et al. (2019) demonstrated that it is possible to retrieve world facts from a pre-trained language model by expressing them as cloze-style prompts, interpreting the model’s prediction accuracy as a lower bound on the amount of factual information it encodes. Subsequent work has attempted to tighten the estimate by searching for better prompts, using a disjoint set of facts as training data. In this work, we make two complementary contributions to better understand these factual probing techniques. First, we propose OptiPrompt, a novel and efficient method which directly optimizes in continuous embedding space. We find this simple method is able to predict an additional 6.4% of facts in the LAMA benchmark. Second, we raise a more important question: Can we really interpret these probing results as a lower bound? Is it possible that these prompt-search methods learn from the training data too? We find, somewhat surprisingly, that the training data used by these methods contains certain regularities of the underlying fact distribution, and all the existing prompt methods, including ours, are able to exploit them for better fact prediction. We conduct a set of control experiments to disentangle “learning” from “learning to recall”, providing a more detailed picture of what different prompts can reveal about pre-trained language models.


Introduction
* The first two authors contributed equally.
[1] The code is publicly available at https://github.com/princeton-nlp/OptiPrompt.

Figure 1: A linguistic probe is trained to predict linguistic annotations given the representations returned by a language model, and evaluated on a held-out set of sentences. A factual probe is trained to predict an object for a subject and a relation using a pre-trained language model, and evaluated on a held-out set of subject-object pairs that express the same relation.

Pre-trained language models like BERT are optimized to predict the distribution of words in an Internet corpus (Devlin et al., 2019). Naturally, this distribution encodes information about world facts. Recently, researchers have taken an interest in measuring how much factual information language models acquire from pre-training. Petroni et al. (2019) formally define this project in the LAMA
benchmark, which consists of (subject, relation, object) triples along with human-written templates that express each relation. They show that BERT can predict objects given cloze-style prompts, for example "Dante was born in [MASK]", and they present their result as a lower bound on the amount of factual information BERT encodes. Subsequent work has attempted to tighten this bound by finding better prompts. Jiang et al. (2020) use text mining and paraphrasing to find a set of candidates and select the prompts that lead to the highest accuracy on a training set. Shin et al. (2020) train a model to generate prompts automatically by searching for the sequence of tokens that maximizes the expected likelihood of the gold object label. Both of these methods collect additional triples from Wikidata to use for tuning their prompts.
In this paper, we first take a natural next step in the search for better prompts: rather than confining our search space to discrete input tokens, we directly optimize in the input embedding space, finding the real-valued input vectors that are most effective at eliciting facts. We also find that initializing with manual prompts can provide a better starting point for the search process. Our approach, OPTIPROMPT, is simple and compute-efficient, and improves accuracy on the LAMA benchmark from 42.2% to 48.6%, compared to previous discrete alternatives. On the more difficult LAMA-UHN split (Poerner et al., 2019), which filters out easy-to-guess entity names, OPTIPROMPT improves accuracy from 31.3% to 38.4%.
At the same time, we observe that prompts that are optimized on training data may exploit some regularities in the underlying distribution of facts. How can we make sure our prompts are recovering information solely from the language model? An analogous question has been explored recently in linguistic probing, which aims to explore the linguistic properties encoded in contextualized word representations (Belinkov et al., 2017; Tenney et al., 2019; Lin et al., 2019), for example by seeing if a classifier can predict that "chef" is the nominal subject of "made" given the representations returned from a language model (Figure 1). Recent work has attempted to disentangle the information encoded in the representations from the information learned by the probe (Hewitt and Liang, 2019; Pimentel et al., 2020; Voita and Titov, 2020; Zhu and Rudzicz, 2020). However, this question has not yet been explored in factual probing, in part because it is assumed that there is no way to predict a knowledge fact simply from observing a non-overlapping set of facts about other entities.[2] For example, learning that Dante was born in Florence should tell you nothing about the birthplace of John Donne.
We analyze our training data and find that this assumption is not warranted. Even though the training data was collected independently of the LAMA benchmark, there are sufficient regularities in the underlying distribution of Wikidata relations that a naive classifier fit to the training data can achieve surprisingly good performance. Furthermore, our experiments reveal that all the data-driven prompt-search methods, including previous methods and our proposed OPTIPROMPT, are able to exploit this information to achieve better prediction accuracy. Given some training data, a good search algorithm can find prompts that recover a non-trivial number of "facts" from a neural network with randomly initialized parameters, exploiting both simple class statistics and higher-order lexical regularities.

[2] In knowledge base completion or link prediction, researchers study how to predict a fact (Barack Obama, nationality, ?) from other triples such as (Barack Obama, place_of_birth, Honolulu) and (Honolulu, city_of, USA). In knowledge probing, the underlying assumption is that one can't predict facts from the other facts of the same relation.
This finding makes it challenging to interpret relative accuracy scores on the knowledge probing task. We show how our control experiments allow us to form a more detailed understanding of the behavior of different probes. For example, by partitioning the test set into "easy" examples, which can be predicted by random controls, and "hard" examples, we can form some conclusions about which facts are less likely to have been learned from training data. OPTIPROMPT outperforms prior methods in both subsets, suggesting it is both better at learning from training data and better at eliciting facts from a language model. We conclude with suggestions for future work that might be less susceptible to the confounding effect of training data.
Background: Prompting for Facts

LAMA

The factual probing setting was introduced by the LAMA benchmark (Petroni et al., 2019), which is designed to measure the amount of factual information encoded in a pre-trained language model (LM). In LAMA, a fact is defined as a triple ⟨s, r, o⟩, where s is a subject (e.g., Dante), r is a relation from a fixed set of relations R (e.g., place of birth), and o is an object (Florence). LAMA facts are drawn from a number of sources, including Wikidata, ConceptNet (Speer and Havasi, 2012), and SQuAD (Rajpurkar et al., 2016). We follow recent factual probing work (Jiang et al., 2020; Shin et al., 2020) in focusing on the T-REx split (Elsahar et al., 2018), which contains up to 1000 ⟨s, r, o⟩ triples for each of 41 Wikidata relation types. The relation types are divided into three categories: 1-1 includes relations like capital of; N-1 includes relations like place of birth; and N-M includes relations like shares border with. In the LAMA evaluation, each relation is associated with a human-written prompt that contains a single [MASK] token, for example "[X] was born in [MASK]." To accommodate masked language models such as BERT, LAMA is restricted to facts for which the object label is a single token in a predefined vocabulary

Method | Prompt | Data-driven?
LAMA (Petroni et al., 2019) | [X] is [MASK] citizen | No
LPAQA (Jiang et al., 2020) | [X] is a citizen of [MASK] | Yes
AUTOPROMPT (Shin et al., 2020) | [X] m 3 badminton pieces internationally representing [MASK] | Yes
OPTIPROMPT | [X] [V]_1 [V]_2 ... [V]_m [MASK] | Yes
OPTIPROMPT (manual) | [X] [V]_1:=is [MASK] [V]_2:=citizen | Yes

Table 1: Example prompts for the citizen-of relation. [V]_i := w indicates that the vector is learned but initialized with the pre-trained embedding of the word w, and OPTIPROMPT (manual) indicates that we use a manual prompt as initialization (see Section 3 for more details).
V.[3] Given a subject s, a relation prompt t_r, and a masked language model, we can identify the word ô ∈ V to which the LM assigns the highest probability P([MASK] = ô | t_r(s)), where t_r(s) represents the prompt template with the subject placeholder [X] replaced by s. If ô is the same as the gold object o, we conclude that the LM encodes information about the fact.

LAMA is an evaluation benchmark, so there is no training data. It is constructed so that a pre-trained language model can be evaluated "off-the-shelf" with no additional fine-tuning. Petroni et al. (2019) remark that their benchmark provides only a lower-bound estimate of the amount of factual information stored in an LM, because their manually written prompts might not be optimal for eliciting facts. Accordingly, subsequent work has focused on tightening this bound by using additional training data to find more optimal prompts.

LPAQA

Jiang et al. (2020) use a range of text-mining and paraphrasing techniques to generate a set of candidate prompts for each relation. They collect a training dataset from Wikidata, ensuring that there is no overlap with subject-object pairs in the LAMA benchmark, and select prompts by measuring accuracy on this training data. They consider a number of rules for selecting prompts, including top-K baselines and an "optimized ensemble", which consists of multiple prompts per relation with weights tuned on the training data. Their prompt dataset, LPAQA, is available online.[4]

[3] Subject names are usually longer, with an average length of 3.7 tokens using the BERT-base-cased vocabulary.
[4] https://github.com/jzbjyb/LPAQA
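The LAMA prediction rule above, ô = argmax_{o∈V} P([MASK] = o | t_r(s)), can be sketched as follows. This is a toy illustration: the vocabulary, template, and `toy_lm_logits` scoring function are invented stand-ins, not the real BERT interface.

```python
# Sketch of the LAMA cloze evaluation: fill a relation template with a
# subject, score every single-token object candidate, and predict the
# argmax. `toy_lm_logits` is a hypothetical stand-in for a masked LM's
# distribution at the [MASK] position.

VOCAB = ["Florence", "Paris", "Rome", "London"]

def fill_template(template: str, subject: str) -> str:
    """Instantiate t_r(s): replace the [X] placeholder with the subject."""
    return template.replace("[X]", subject)

def toy_lm_logits(prompt: str) -> dict:
    # Invented scores; a real probe would read P([MASK] = o | t_r(s)) from BERT.
    scores = {o: 0.0 for o in VOCAB}
    if "Dante" in prompt:
        scores["Florence"] = 5.0
    return scores

def predict_object(template: str, subject: str) -> str:
    """Return the candidate object with the highest score (the prediction ô)."""
    scores = toy_lm_logits(fill_template(template, subject))
    return max(scores, key=scores.get)

prediction = predict_object("[X] was born in [MASK].", "Dante")
```

A prediction counts as correct when ô matches the gold object, here "Florence".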

AUTOPROMPT
Shin et al. (2020) take prompt optimization one step further by training a statistical model, AUTOPROMPT, to search over the space of input tokens for prompts that elicit correct predictions. They collect 1000 ⟨s, r, o⟩ triples for each relation type, either from the original T-REx dataset (Elsahar et al., 2018) or from Wikidata, excluding any triples that appear in the LAMA benchmark. They define a prompt for a given relation r as the subject followed by a fixed number of "trigger" tokens:

t_r = "[X] [T]_1 [T]_2 ... [T]_m [MASK]",

where [X] is replaced by the subject, [T]_i represents a "trigger" token which can be any token in the vocabulary, and the number of [T] tokens is set to a pre-defined number m. The tokens are initialized as [MASK] tokens and then iteratively updated, at each step using a gradient-based search algorithm (Wallace et al., 2019) to replace one of the trigger tokens with the token that is estimated to maximize the likelihood of the gold label on the training set.
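A simplified sketch of this search loop follows. The real AUTOPROMPT uses gradients of the gold-label likelihood to propose candidate trigger tokens; this toy version scores every candidate exhaustively (feasible only because the vocabulary is tiny) and serves just to illustrate the greedy coordinate-update structure. The `toy_score` objective and vocabulary are invented for illustration.

```python
# Simplified sketch of AUTOPROMPT's greedy trigger search. The real
# algorithm uses gradient-based candidate selection (Wallace et al., 2019);
# here every vocabulary item is scored exhaustively instead.

def search_triggers(vocab, m, score, n_steps=10):
    """Greedily update m trigger tokens, one position per step."""
    triggers = ["[MASK]"] * m              # AutoPrompt initializes triggers to [MASK]
    best = score(triggers)
    for step in range(n_steps):
        pos = step % m                     # cycle through trigger positions
        for cand in vocab:
            trial = list(triggers)
            trial[pos] = cand
            if score(trial) > best:        # keep the best single-token replacement
                best, triggers = score(trial), trial
    return triggers, best

def toy_score(triggers):
    # Pretend prompts containing "citizen" (and then "of") fit the data better.
    return 2.0 * ("citizen" in triggers) + 1.0 * ("of" in triggers)

found, _ = search_triggers(["is", "citizen", "of", "a"], m=3, score=toy_score)
```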

Our Approach: OPTIPROMPT
Our approach is motivated by the view that restricting the search to the space of vocabulary tokens is a suboptimal and artificial constraint. In the case of AUTOPROMPT, optimizing over a discrete subspace is also inefficient: at each step we have to enumerate a set of candidate tokens, replace the selected trigger token, and re-run the model (Shin et al., 2020). The examples in Table 1 also illustrate that optimized textual prompts can be opaque, despite consisting of tokens from the English vocabulary. This undermines one argument in favor of natural language prompts, which is that they are human-readable and so might be easier to interpret.

Table 2: Accuracy on LAMA and LAMA-UHN (Poerner et al., 2019), a subset of LAMA where questions with helpful entity names were deleted. The LAMA results are broken down by relation category. Examples from each category are capital of (1-1), place of birth (N-1), and shares border with (N-M).
OPTIPROMPT In this view, we propose OPTIPROMPT, a method for continuous prompt optimization. Rather than limiting the search to the space of discrete tokens, OPTIPROMPT searches for optimal prompts directly, composing prompts using any vector in the embedding space. We first follow AUTOPROMPT and define a prompt in the following form:

t_r = "[X] [V]_1 [V]_2 ... [V]_m [MASK]",

where each [V]_i ∈ R^d is a dense vector with the same dimension as the LM's input embedding (e.g., 768 for BERT-base) and the number of [V] vectors is set to a pre-defined number m. Treating prompts as dense vectors allows us to search for optimal prompts much more efficiently. Given some initial values for [V]_i, we keep all other model parameters fixed and use gradient descent to minimize the negative log-likelihood of a training set:

L_r = -(1/|D_r|) Σ_{(s,o)∈D_r} log P([MASK] = o | t_r(s)),

where D_r is the set of (subject, object) pairs with relation r and t_r represents the prompt template for relation r with subject tokens s substituted for the placeholder [X].
In this basic form, we pick a fixed value for m (treated as a hyperparameter) and randomly initialize all the [V] tokens. We also consider a more sophisticated variant that uses manual prompts (we use the prompts provided in the LAMA benchmark) to decide the number as well as the position of the [V] tokens for each relation, initializing each [V]_i with the pre-trained input embedding of the corresponding token in the manual prompt. As shown in Table 1, we can convert the manual prompt "[X] is [MASK] citizen" and use the embeddings of is and citizen to initialize [V]_1 and [V]_2, respectively. Our motivation is that a good initialization is likely to be important in this challenging non-convex optimization problem.
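The following pure-Python sketch illustrates the core idea under heavy simplification: the "LM" is a single frozen matrix W, the prompt is one trainable vector v, and gradient descent minimizes the negative log-likelihood of a gold class with respect to v only. The dimensions, toy W, and learning rate are assumptions; the real method optimizes m vectors in BERT's 768-dimensional input space with the full model frozen.

```python
import math
import random

random.seed(0)
DIM, N_CLASSES = 4, 3
# Frozen "LM" parameters: never updated, mirroring how OptiPrompt keeps
# all BERT parameters fixed and trains only the prompt vectors.
W = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_CLASSES)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def forward(v):
    """Distribution over object classes given prompt vector v."""
    return softmax([sum(w * x for w, x in zip(row, v)) for row in W])

def optimize_prompt(gold, steps=500, lr=0.1):
    """Gradient descent on the prompt vector alone (W stays frozen)."""
    v = [0.0] * DIM
    for _ in range(steps):
        p = forward(v)
        # d(NLL)/dv = W^T (p - onehot(gold))
        err = [pi - (1.0 if i == gold else 0.0) for i, pi in enumerate(p)]
        grad = [sum(err[c] * W[c][d] for c in range(N_CLASSES)) for d in range(DIM)]
        v = [vd - lr * gd for vd, gd in zip(v, grad)]
    return v

v = optimize_prompt(gold=1)
```

After optimization the gold class probability is higher than at initialization, even though only v changed and the "LM" parameters did not.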
Setup We train OPTIPROMPT using the data collected by Shin et al. (2020), which contains 800 training examples with 200 held out for development. For our main experiments, we probe the BERT-base-cased model and we compare other pre-trained language models in Appendix C. We report top-1 micro-averaged accuracy:

Acc = (1/|R|) Σ_{r∈R} (1/|D_r|) Σ_{(s,o)∈D_r} 1[ô = o],

where R is the set of relations, D_r is the set of (subject, object) pairs with relation r, and ô = argmax_o P([MASK] = o | t_r(s)). More implementation details can be found in Appendix B.1.
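The metric (per-relation accuracy, averaged over relations) can be sketched as follows; the relation names and predictions are toy placeholders.

```python
# Top-1 micro-averaged accuracy as used above: compute accuracy within
# each relation, then average the per-relation accuracies over R.

def micro_accuracy(results):
    """results: {relation: [(prediction, gold_object), ...]}"""
    per_relation = []
    for rel, pairs in results.items():
        correct = sum(1 for pred, gold in pairs if pred == gold)
        per_relation.append(correct / len(pairs))
    return sum(per_relation) / len(per_relation)

toy = {
    "place_of_birth": [("Florence", "Florence"), ("Paris", "London")],  # 0.5
    "capital": [("Rome", "Rome")],                                      # 1.0
}
acc = micro_accuracy(toy)  # (0.5 + 1.0) / 2 = 0.75
```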
LAMA results Our results are in Table 2. Overall, OPTIPROMPT outperforms previously reported results in terms of accuracy on the LAMA benchmark. Compared to AUTOPROMPT, our models perform 5.4%-6.4% higher on LAMA and 6.2%-7.1% higher on the more difficult LAMA-UHN benchmark. The improvement is consistent across all categories, with the exception of the "1-1" category, which contains two relations, capital and its inverse, capital of. Interestingly, the prompt that yields the best results in this category is the manual prompt, with LPAQA and AUTOPROMPT prompts performing steadily worse. We speculate that there are very few prompts that elicit this relation with high accuracy and that they are difficult to find via stochastic, non-convex optimization.
We also find that initializing the prompt vectors using the manually written prompts improves performance consistently. This confirms our intuition that the manual initialization provides a good prior for finding a good solution in the non-convex optimization problem. The results are broken down by relation in Table 8 in the Appendix.

Can We Trust Optimized Prompts?
Our factual probing results confirm that OPTIPROMPT is an effective approach, outperforming the best previous method by 6.4% on the LAMA benchmark. However, can we conclude that BERT encodes 6.4% more facts than was previously known? Our prompts, like LPAQA and AUTOPROMPT, are optimized on in-distribution Wikidata relations, which raises the possibility that they exploit some regularities in the underlying fact distribution. In this section we aim to answer two questions. First, are there patterns in the Wikidata fact distribution that a statistical model could theoretically exploit to predict unseen facts? Second, are optimized prompts capable of exploiting these patterns in practice?

Facts can be predicted from training data
We first examine whether it is possible to predict any facts just by looking at the training data. The simplest pattern is the class prior P(o | r): if one or two object labels dominate the relation r, it is easier to guess them regardless of the subject entity. A more sophisticated pattern is to find a correlation between subject tokens and object labels, that is, to estimate P(o | r, w_1, ..., w_|s|), where w_1, ..., w_|s| ∈ V are the tokens of the subject name. To see whether such patterns exist, we fit two simple probabilistic models to the Wikidata training set collected by Shin et al. (2020). The first model always predicts the majority class, with class priors learned from the training data, and the second is a Naive Bayes classifier (bag-of-words) with add-one smoothing (see details in Appendix B.2).

Table 3: Results for simple classifiers (class prior and Naive Bayes) fit to the Wikidata training data and evaluated on the LAMA test set. We highlight two relations for which object labels are correlated with particular subject tokens: in the member of category, the model appears to learn that any subject with "football" in its name, such as Ghana Football Association, is likely to be a member of FIFA. In the manufacturer category, the model learns to predict that Chevrolet manufactures the Chevrolet Impala, BMW manufactures the BMW M Coupe, and so on.

Table 3 shows the accuracy of these models on the LAMA benchmark, averaged over relations. The majority-class model performs well because, on some relations, well over half of the examples are from the majority class. The Naive Bayes baseline performs even better in all categories by learning correlations between subject tokens and object labels. This analysis complements an observation of Poerner et al. (2019), who point out that BERT can exploit superficial information in a cloze prompt to "guess" the correct answer, for example predicting that people with stereotypically Italian names were likely born in Rome. Our results show that it is possible to learn these correlations even without prior information about entity names, and there might be other, subtler patterns in the Wikidata distribution.
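The two baselines can be sketched as follows. The training triples below are invented for illustration; the real models are fit to the Wikidata training set (details in Appendix B.2).

```python
import math
from collections import Counter, defaultdict

# Two simple models fit to (subject, object) training pairs for one
# relation: a majority-class model (the class prior P(o | r)) and a
# bag-of-words Naive Bayes model P(o | r, w_1..w_|s|) with add-one
# smoothing over subject tokens.

def fit_class_prior(pairs):
    """Always predict the most frequent object label."""
    return Counter(o for _, o in pairs).most_common(1)[0][0]

def fit_naive_bayes(pairs):
    prior = Counter(o for _, o in pairs)
    token_counts = defaultdict(Counter)          # object label -> subject token counts
    vocab = set()
    for subj, obj in pairs:
        for tok in subj.split():
            token_counts[obj][tok] += 1
            vocab.add(tok)
    total = sum(prior.values())

    def predict(subject):
        best, best_lp = None, float("-inf")
        for obj in prior:
            lp = math.log(prior[obj] / total)
            denom = sum(token_counts[obj].values()) + len(vocab)
            for tok in subject.split():
                lp += math.log((token_counts[obj][tok] + 1) / denom)  # add-one smoothing
            if lp > best_lp:
                best, best_lp = obj, lp
        return best

    return predict

train = [("Ghana Football Association", "FIFA"),
         ("Japan Football Association", "FIFA"),
         ("World Health Assembly", "WHO")]
nb = fit_naive_bayes(train)
```

The Naive Bayes model generalizes from the "Football" token: it predicts FIFA for an unseen subject like Brazil Football Confederation, mirroring the member of pattern highlighted in Table 3.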

Prompts can exploit training data
We have shown that the training data clearly encodes certain regularities, and that simple statistical models can learn to fit them. We now study whether a prompt optimization method built on a pre-trained language model is expressive enough to exploit these regularities in practice. We attempt to answer this question by means of two random controls, inspired by similar proposals from linguistic probing. In our Random Model (RM) baseline, we optimize prompts to elicit facts from a neural network with the same architecture as the pre-trained LM but with randomly initialized parameters. This is analogous to a control function (Pimentel et al., 2020), a function that removes information from a linguistic representation. Any successful predictions in this setting must be the result of optimizing on training data. We also consider a Random Embeddings (RE) baseline, where we reinitialize only the input embeddings. This is analogous to a control task (Hewitt and Liang, 2019), a variant of the probing task in which word types are associated with random labels. Our motivation is that the Random Model setting is more difficult to optimize, so it might underestimate the ways a prompt model could exploit information from the training data. Finally, we directly fine-tune a reinitialized BERT model on the training data with the goal of getting a better estimate of the number of LAMA facts that could be predicted from the training data.

Figure 2: Accuracy on LAMA obtained by prompting BERT-base-cased, either the pre-trained model, reinitializing the input embeddings, or reinitializing all parameters. Each bar represents total accuracy micro-averaged over relations and divided into two categories: accuracy obtained by predicting the training-set majority class label, and accuracy obtained by predicting other object labels. We also fine-tune BERT, which, in the random control settings, can be thought of as a better lower bound on the entropy of the task distribution.
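The two controls can be sketched as follows. A "model" here is just a dict of named parameter lists standing in for BERT's weight tensors; the parameter names are invented, and the init scale mimics BERT's (0.02).

```python
import random

# Random Embeddings (RE): reinitialize only the input embeddings, keeping
# the rest of the pre-trained weights. Random Model (RM): reinitialize
# every parameter. Any facts a prompt recovers from the RM control must
# come from the prompt-optimization training data.

def reinit(params, rng):
    return [rng.gauss(0.0, 0.02) for _ in params]

def random_embeddings_control(model, seed=0):
    rng = random.Random(seed)
    control = dict(model)                        # other parameters are shared as-is
    control["input_embeddings"] = reinit(model["input_embeddings"], rng)
    return control

def random_model_control(model, seed=0):
    rng = random.Random(seed)
    return {name: reinit(params, rng) for name, params in model.items()}

pretrained = {"input_embeddings": [0.1, 0.2], "encoder.layer0": [0.3, 0.4]}
re_ctrl = random_embeddings_control(pretrained)
rm_ctrl = random_model_control(pretrained)
```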
The results are shown in Figure 2 (see implementation details and more results in Appendix B.1 and Table 8). In the Random Embeddings setting, both AUTOPROMPT and OPTIPROMPT are capable of finding prompts that elicit some correct predictions. In the Random Model setting, AUTOPROMPT gets 0% of predictions correct, presumably because it is more difficult to optimize, but OPTIPROMPT is still capable of finding successful prompts. Most successful predictions are obtained by finding a prompt that elicits the majority class label, although OPTIPROMPT also makes a number of correct predictions that cannot be attributed to this strategy. Our qualitative analysis suggests that these prompts exploit both class statistics and correlations between objects and subject tokens (Appendix A.2).
Fine-tuning BERT results in even higher accuracy, indicating that there are patterns that prompts fail to exploit. The random controls represent a challenging setting for prompt optimization, and it is possible that the prompts are better at exploiting the training data when they have access to the full pre-trained BERT model. We find evidence that this is the case by calculating how often each prompt elicits the training-set majority class label on LAMA, plotting the results in Figure 3. Both AUTOPROMPT and OPTIPROMPT are prone to over-predicting the majority class label. For example, although AUTOPROMPT gets 0% accuracy in the RM setting, it finds a prompt that elicits the majority label more than 95% of the time for six relations when optimized on the pre-trained BERT model. LPAQA prompts predict the majority class less often, possibly because they are less effective at exploiting the training data.

How to Interpret Probing Results?
Our analysis in Section 4.2 shows that optimized prompts can predict new facts from training data. How can we interpret our factual probing results in this light? In order to get another perspective on the relative improvement, we partition LAMA into an easy subset and a hard subset (examples from each subset can be found in Table 5). The easy subset consists of the facts that can be correctly predicted by any of three models fit to the training data: the Naive Bayes model described in Section 4.2 and a fine-tuned BERT model with either token embeddings reinitialized or all parameters reinitialized. The easy subset serves as an estimate of the set of facts that can be predicted from training data. The hard subset consists of the remaining facts. Table 4 shows the results of each prompt on these two subsets of LAMA (the per-relation results are given in Table 9). First, we observe that all the probing methods achieve a much higher accuracy on the easy subset. Using more sophisticated prompt optimization techniques tends to result in big improvements on the easy subset of LAMA and smaller improvements on the hard subset. OPTIPROMPT outperforms AUTOPROMPT by 7.4% on the easy examples; on the hard examples, where we filtered out facts that we know can be predicted from the training data, OPTIPROMPT also yields a big improvement (+6.3%). This suggests that OPTIPROMPT is both better at learning from training data and better at eliciting facts from an LM.

[10] https://github.com/jzbjyb/LPAQA/blob/master/prompt/paraphrase/P106.jsonl
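The partition logic can be sketched as follows; the facts and the single stand-in predictor are toy examples (the real partition uses the Naive Bayes model and the two fine-tuned reinitialized BERT variants).

```python
# Partition facts into an "easy" subset (correctly predicted by at least
# one model fit only to training data) and a "hard" subset (everything else).

def partition(facts, predictors):
    """facts: [(subject, gold_object)]; predictors: callables subject -> object."""
    easy, hard = [], []
    for subject, gold in facts:
        if any(p(subject) == gold for p in predictors):
            easy.append((subject, gold))
        else:
            hard.append((subject, gold))
    return easy, hard

facts = [("Ghana Football Association", "FIFA"), ("Francis Hagerup", "Oslo")]
majority = lambda subject: "FIFA"      # toy stand-in for a training-data-only model
easy, hard = partition(facts, [majority])
```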
For a more qualitative analysis, we randomly sample ten facts from each subset, keeping only facts that are predicted correctly by at least one model and excluding examples that have the majority class label. The examples, shown in Table 5, give a better idea of the types of predictions elicited by different prompts. For example, both AUTOPROMPT and OPTIPROMPT appear to be exploiting the training data in some cases. In the easy subset, they elicit more accurate predictions in cases where the answer is a token in the subject name. In the hard subset, they show signs of having over-fit to the training distribution, incorrectly predicting the most common object labels for continent (Antarctica) and manufacturer (IBM). OPTIPROMPT performs better than the other prompts on some facts in both categories. On an easy profession example, while AUTOPROMPT incorrectly predicts the majority label (politician), OPTIPROMPT, along with our Naive Bayes model, apparently encodes a lexical correlation between some aspect of the subject's name and the correct label, actor. On the other hand, OPTIPROMPT outperforms the other prompts on two more difficult examples: "Francis Hagerup used to work in Oslo" and "William Lyon Mackenzie King used to work in Ottawa." In both cases, LPAQA predicts the training majority label (London), AUTOPROMPT gets geographically closer (Copenhagen and Montreal), and OPTIPROMPT predicts the correct city. We note that we cannot conclude that there is no way to predict these "hard" facts from training data. A more general limitation of this analysis is that it does not allow us to say which strategy a model uses to make a particular prediction. Many facts can be predicted either by learning the class prior; by learning a lexical correlation between subject tokens and objects; by exploiting lexical information from the LM; or because the LM genuinely encodes information about a particular entity.
Still, the qualitative examples reveal interesting patterns in the behavior of the different prompt models that could not be observed from the summary accuracy results on the LAMA benchmark, and looking at specific predictions across a number of prompts gives us more evidence for deciding what kind of information the LM encodes about a particular fact.

Discussion
Our experiments show that OPTIPROMPT is an effective optimization algorithm, outperforming prior work at the task of eliciting facts from a pretrained language model. However, our results are complicated by the fact that any data-driven optimization can find prompts that encode new information from the training data. This leaves open the question of which method we should select if we are interested in factual probing.
Continuous vs. discrete prompts We find that both continuous and discrete optimization are capable of finding prompts that exploit the training data. Even when the prompt is discrete, it is rarely clear why a prompt elicits a particular prediction. Hence, we believe that continuous prompting is preferable, because it is easier and more efficient to optimize and it makes better predictions (in both easy and hard subsets). On the other hand, one drawback of OPTIPROMPT (which is shared by AUTOPROMPT) is that we need white-box access to the LM to compute gradients. Discrete prompts will still be necessary in cases where the model parameters are not available, for example in the case of very large language models that are provided over an API.
Learning vs. learning to recall Regardless of how we choose to optimize prompts, it remains difficult to say why a model made a particular prediction-whether it was learned from training data or encoded in the LM. Some avenues for future work might be to consider techniques for attributing predictions to specific training instances, with the goal of developing a causal understanding of how facts are acquired during pre-training or prompt optimization. More generally, our real goal is to understand how pre-trained language models learn and represent information. Prompt-based probing might provide some insight into this question, but we hope that future research will eventually be able to provide more mechanistic explanations for neural network behavior. For example, it would be interesting to understand how information about entities is laid out in neural network parameters and later retrieved in response to an input prompt.

Related Work
Our work follows the line of factual probing experiments initiated by Petroni et al. (2019), who introduced the LAMA benchmark for cloze-style factual probing. Subsequent work on LAMA has introduced data-driven methods for optimizing prompts (Jiang et al., 2020; Shin et al., 2020). Poerner et al. (2019) point out that many facts in LAMA can be predicted using lexical clues, and they introduce a new benchmark, LAMA-UHN, that is less susceptible to these heuristics. Our work follows these projects by introducing (a) more effective techniques for optimizing prompts, and (b) a more comprehensive approach for accounting for the role of train/test overlap. Concurrently with this work, other authors explore continuous prompt optimization: Haviv et al. (2021) use an encoder to map a manually written prompt to a sequence of continuous vectors, which are then replaced with the discrete tokens that are nearby in embedding space; Li and Liang (2021) propose Prefix-Tuning, which fine-tunes the left-most hidden representations in auto-regressive language models; Liu et al. (2021) use an LSTM to generate a sequence of prompt vectors. Prompting has been explored more generally as a method for achieving "few-shot" learning with language models (Brown et al., 2020; Schick and Schütze, 2020; Gao et al., 2020).
Linguistic probing is an extensive area of research that we do not attempt to summarize here (see Rogers et al., 2020 for an overview). Our work is most related to recent proposals about how to measure whether a probe is extracting information from a representation or learning to predict the annotation from probe training data. These include random baselines (Hewitt and Liang, 2019) and information-theoretic measurements (Voita and Titov, 2020). We adopt the notion of control functions from Pimentel et al. (2020). Our study also relates to a larger category of work diagnosing "shortcut learning" (Geirhos et al., 2020) in neural NLP models. McCoy et al. (2019) discover that models like BERT are often "right for the wrong reason", exploiting shallow heuristics rather than underlying linguistic structure, and similar effects have been discovered in many other tasks (Sugawara et al., 2018; Wallace et al., 2019).

Conclusion
We introduce OPTIPROMPT, an effective continuous method for optimizing prompts. Applied to factual probing, OPTIPROMPT outperforms the best previous prompt method by 6.4% on the LAMA benchmark. We find that the typical training data used for prompt optimization reveals useful information about the underlying task distribution, to the point that search algorithms can find prompts that recover "facts" even from a randomly initialized model. By comparing the predictions of different prompt methods across our different controls we can form a more detailed understanding of how different prompts behave and what they can reveal about pre-trained language models.

Ethical Considerations
Our experiments illustrate that the "facts" recovered from a pre-trained language model should not be considered real facts. Optimizing any kind of statistical model for factual prediction is likely to devolve into stereotype-learning as the model learns lexical correlations between entity names and object labels. This problem is more pronounced if our training distribution comes from a source like Wikidata, which we find to be imbalanced. More generally, language models that are trained on the Internet will model the toxic and harmful language that is found there, a well-documented finding for pre-trained language models like BERT (e.g., Gehman et al., 2020; Nadeem et al., 2020). Using such models for factual prediction is liable to amplify those biases. OPTIPROMPT is intended to be a diagnostic tool and general-purpose optimization method, not a way to use BERT as a knowledge base.

A Detailed Results
A.1 Breakdown Accuracy for LAMA

Table 7 shows the per-relation accuracy for each prompting method. In many cases, we can better understand the probing results by examining the specific predictions each method makes.

A.2 Exploiting Training Data
Majority class baseline Figure 3 shows that all optimized prompts have a tendency to over-predict the majority class label. This behavior is most pronounced in the gradient-based methods (AUTOPROMPT and OPTIPROMPT). It is not always clear why a particular prompt elicits these predictions. For example, Shin et al. (2020) attempt to prevent AUTOPROMPT from "cheating" by filtering out prompts that contain proper nouns or gold object labels, but there are still six relations for which AUTOPROMPT elicits the majority label more than 95% of the time. The AUTOPROMPT prompts for these relations illustrate that even discrete search methods are capable of finding prompts that elicit a specific label from an LM, and the mechanism by which these prompts elicit the prediction is often obscure. Perhaps more surprisingly, even LPAQA occasionally finds prompts that are more likely to elicit the majority label than the manual prompt. The changes in these cases are often very subtle. For example, the manual prompt for the position of relation is "[X] has the position of [MASK]" and the LPAQA prompt is "[X] has the position of a [MASK]". Simply inserting the determiner "a" into the prompt leads BERT to predict the majority label, bishop, more than five times as often as the manual prompt does (50.9% vs. 9.5%), and almost twice as often as the true rate in the LAMA benchmark (27.3%). This suggests that even simple data-driven methods can find prompts that encode regularities in the training data and inflate estimates of the number of facts stored in the language model.

Table 8 shows the accuracy of optimized prompts under our random controls (Section 4.2), along with how much of that accuracy can be attributed to predicting the majority class label. AUTOPROMPT cannot predict any facts in the Random Model setting but performs decently on several relations in the Random Embeddings setting by predicting the majority class.
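As a concrete reference point, the class-prior baseline discussed here can be computed with a few lines of code. The function name and the fact format below are illustrative, not the paper's actual code: facts are assumed to be (subject, relation, object) triples.

```python
from collections import Counter


def majority_class_baseline(train_facts, test_facts):
    """Accuracy of always predicting, for each relation, the object label
    that appears most frequently for that relation in the training data.

    Facts are (subject, relation, object) triples; an illustrative format.
    """
    # Count object labels per relation in the training data.
    by_relation = {}
    for _, r, o in train_facts:
        by_relation.setdefault(r, Counter())[o] += 1

    # The majority object for each relation.
    majority = {r: counts.most_common(1)[0][0] for r, counts in by_relation.items()}

    # Accuracy of the constant per-relation prediction on the test facts.
    correct = sum(1 for _, r, o in test_facts if majority.get(r) == o)
    return correct / len(test_facts)
```

For instance, for the position of relation, where bishop dominates the training distribution, this baseline already recovers every test fact whose object happens to be bishop.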
For reasons we cannot entirely explain, there is one relation, occupation, for which AUTOPROMPT's performance cannot be attributed to the class prior. The correct predictions in this category are all a result of predicting actor, which AUTOPROMPT predicts 23.3% of the time. (The most frequent label in the training data is politician.) Other high-frequency predictions for this relation include jet, wool, and smart. Notably, even when AUTOPROMPT finds a prompt that can draw out the class prior, it typically does not elicit the class prior 100% of the time.

Control result details
OPTIPROMPT is more successful at exploiting the training data. In the Random Model setting, virtually all correct predictions can be attributed to the majority class, which OPTIPROMPT can frequently elicit for all inputs. One noteworthy exception is languages spoken, where OPTIPROMPT is able to successfully classify some subjects as speaking either English or French. It is not immediately clear what decision rule the model learns for these predictions: for example, it could be that the model predicts either English or French at random, in rough proportion to the training distribution, or that the model is able to use correlations between names and spoken languages. In any case, the results illustrate that optimized prompts can learn more sophisticated strategies than simply predicting the majority class, even given a Transformer that contains no prior information at all.
A.3 LAMA-easy and LAMA-hard

Table 9 shows the accuracy of different prompts on the easy and hard subsets of LAMA described in Section 5. All of the optimized models tend to perform better on LAMA-easy than on LAMA-hard, and OPTIPROMPT outperforms AUTOPROMPT on both subsets.

B Implementation Details

B.1 Training Details

We implement OPTIPROMPT based on the HuggingFace Transformers library (Wolf et al., 2020). During training, we use an Adam optimizer and a linear scheduler with a warmup ratio of 0.1. We train our OPTIPROMPT model for 10 epochs with a learning rate of 3e-3 and a batch size of 16. For fine-tuning, we also use an Adam optimizer and a linear scheduler with a warmup ratio of 0.1; we fine-tune the language models for 10 epochs with a learning rate of 2e-6 and a batch size of 2.
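The linear schedule with warmup corresponds to a learning-rate multiplier of roughly the following shape. This is a minimal pure-Python sketch for illustration; in practice one would use a library scheduler such as HuggingFace's `get_linear_schedule_with_warmup`.

```python
def linear_warmup_lr(step, total_steps, warmup_ratio=0.1):
    """Learning-rate multiplier for a linear schedule with warmup:
    ramps from 0 to 1 over the first warmup_ratio of training steps,
    then decays linearly back to 0 by the end of training."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: linear ramp from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay phase: linear decay from 1 down to 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

The actual learning rate at each step is this multiplier times the peak rate (e.g., 3e-3 for OPTIPROMPT, 2e-6 for fine-tuning).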
We report AUTOPROMPT's performance based on the prompts released by Shin et al. (2020). When we apply AUTOPROMPT to a control task (e.g., the Random Embeddings model), or compare AUTOPROMPT with different language models on a different dataset (see Appendix C), we run AUTOPROMPT for 1000 iterations for each model to search for the prompt of a relation.

B.2 LAMA Classifiers
In Section 4.2 we fit two simple probabilistic models to the Wikidata training data collected by Shin et al. (2020). Given a relation r and a subject s consisting of tokens w_1, . . . , w_|s| ∈ V, the Class Prior model predicts ô = arg max_o P(o | r), the object label that is most frequently associated with relation r in the training data. The Naive Bayes model predicts ô = arg max_o P(o | s, r), with

P(o | s, r) ∝ P(o | r) ∏_{i=1}^{|s|} P(w_i | o).

The probabilities are estimated from the corpus with add-one smoothing:

P(w | o) = (count(o, w) + 1) / Σ_{w′∈V} (count(o, w′) + 1).
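These two classifiers can be sketched in a few lines; the function names and the (subject_tokens, relation, object) data format below are ours for illustration, not the released code.

```python
import math
from collections import Counter, defaultdict


def fit_models(train_facts):
    """train_facts: iterable of (subject_tokens, relation, object) triples.
    Returns object counts per relation (for P(o | r)) and
    subject-token counts per object (for count(o, w))."""
    obj_counts = defaultdict(Counter)  # relation -> Counter over objects
    tok_counts = defaultdict(Counter)  # object -> Counter over subject tokens
    for subj, r, o in train_facts:
        obj_counts[r][o] += 1
        for w in subj:
            tok_counts[o][w] += 1
    return obj_counts, tok_counts


def class_prior_predict(obj_counts, r):
    """Predict the most frequent object label for relation r."""
    return obj_counts[r].most_common(1)[0][0]


def naive_bayes_predict(obj_counts, tok_counts, vocab, subj, r):
    """Predict arg max_o P(o | r) * prod_i P(w_i | o), with add-one smoothing."""
    best, best_lp = None, -math.inf
    total = sum(obj_counts[r].values())
    for o, c in obj_counts[r].items():
        lp = math.log(c / total)  # log P(o | r)
        denom = sum(tok_counts[o][w] + 1 for w in vocab)  # add-one smoothing
        for w in subj:
            lp += math.log((tok_counts[o][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = o, lp
    return best
```

On a toy corpus where the subject token correlates with the object (e.g., French names with French), the Naive Bayes model can override the class prior, mirroring the behavior discussed in Appendix A.2.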

C Comparing Pre-trained Language Models
We compare different pre-trained language models (BERT, RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019)) with different probing methods. We collect at most 1000 training samples for each relation from the T-REx dataset and constrain the object of each sample to be a single token for all the models. During testing, we downsample the LAMA test set to make sure that the object in each sample is a single token for all the models. Table 6 shows the results of different probing methods applied to four pre-trained language models, along with our Random Model baseline. We make the following observations:

• Base vs. Large: The larger version of BERT performs better on LAMA than BERT-base in the OPTIPROMPT probe. We might hypothesize that BERT-large is simply more capable of finding patterns in the training data, but our baseline result does not indicate that this is the case: on the contrary, BERT-large performs marginally worse on the Random Model baseline. This could lead us to believe that BERT-large truly does store information about 1 or 2% more LAMA facts compared to BERT-base.
• BERT vs. RoBERTa vs. ALBERT: Shin et al. (2020) find that RoBERTa performs significantly worse on LAMA than BERT. We find this is true for our prompts as well (comparing with BERT-large), but the magnitude of the difference decreases in the fine-tuning setting. Our baseline result gives a possible hint as to why: RoBERTa performs better in the Random Model setting with fine-tuning, indicating that part of the difference between OPTIPROMPT and fine-tuning might be due to better exploitation of training data. This change is even more dramatic for ALBERT. Perhaps these models store less factual information due to pre-training on a wider variety of genres.
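The single-token constraint used above to build a shared evaluation set across models can be sketched as follows. The helper function is ours, and the tokenizers are stand-ins for the real BERT/RoBERTa/ALBERT tokenizers.

```python
def filter_single_token_objects(samples, tokenizers):
    """Keep only samples whose object string maps to exactly one token
    under every tokenizer, so that each model can predict the object
    with a single [MASK] slot.

    `samples` is a list of dicts with an "object" field (illustrative);
    `tokenizers` is a list of callables str -> list of tokens.
    """
    return [
        s for s in samples
        if all(len(tokenize(s["object"])) == 1 for tokenize in tokenizers)
    ]
```

Applying the filter with each model's tokenizer in turn yields the intersection of objects that every model can express as one token, which is the test set used in Table 6.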
We believe that further comparisons along these lines are a promising area for future work: for example, if probing results could be shown to correlate with downstream task performance, probes could be used to guide model selection.

Table 8: Control result details. The value in each cell is Maj./Acc., where Acc. is the percentage of facts of relation r that the model predicts correctly and Maj. is the percentage of facts (s, r, o) such that (a) the model predicts o correctly, and (b) o is the most frequent object for relation r in the training data. We probe the BERT-base-cased model, reinitializing either the token embeddings or all of the parameters.