SPE: Symmetrical Prompt Enhancement for Fact Probing

Pretrained language models (PLMs) have been shown to accumulate factual knowledge during pretraining (Petroni et al. 2019). Recent works probe PLMs for the extent of this knowledge through prompts either in discrete or continuous forms. However, these methods do not consider symmetry of the task: object prediction and subject prediction. In this work, we propose Symmetrical Prompt Enhancement (SPE), a continuous prompt-based method for factual probing in PLMs that leverages the symmetry of the task by constructing symmetrical prompts for subject and object prediction. Our results on a popular factual probing dataset, LAMA, show significant improvement of SPE over previous probing methods.


Introduction
Prompt-based learning proposes to formulate different NLP tasks into language modeling problems (Schick and Schütze, 2021).It is a novel paradigm that effectively uses Pretrained Language Models (PLMs) (Liu et al., 2022), and achieves comparable or better performance than fine-tuning (Lester et al., 2021).Prompt-based learning has also been used for the task of factual knowledge probing in PLMs.In this task, the goal is to predict the (masked) object of factual tuples of type (subject, relation, object) using PLMs.Prompting methods assume that PLMs gather and store factual knowledge during their pre-training, and cloze-style prompts can be used to probe PLMs to gauge how much knowledge they contain (Petroni et al., 2019).The prompts are either handcrafted (Petroni et al., 2019;Bouraoui et al., 2020) or automatically generated (Shin et al., 2020;Haviv et al., 2021).For example, to probe PLMs about their knowledge of geographic location of Luxembourg, a prompt can be formed by filling Luxembourg in the first blank of the following template: "____ is located in ____.".An Figure 1: Example of factual probing: Given a subject and relation, predict the object.SPE uses a fixed template to generate a prompt for predicting object given subject (green box) as well as several symmetrical prompts for predicting the subject given object candidates (yellow boxes).The final prediction is obtained using the likelihoods of the object candidates and of the given subject as obtained using the symmetrical prompts.Bars represent probabilities from BERT.SPE is a continuous prompt method but we use natural language prompts and template here for illustration.effective prompt will probe the PLM to output Europe as the most likely prediction for the second blank.Such methods are promising but brittle.Minor changes in the template can lead to significant difference in the performance (Jiang et al., 2020).Recent works have shown that continuous prompts obtained via gradient-based learning, are more effective and robust than discrete prompts since there are less restrictions on the search space (Liu et al., 2021;Qin and Eisner, 2021;Zhong et al., 2021;Liu et al., 2022;Newman et al., 2022).
Existing methods for learning prompts do not leverage the symmetry inherent in the task's definition.For example, while Luxembourg is located in Europe, Europe contains Luxembourg.Similar ideas have been used for learning prompts for relation classification (Han et al., 2021) and other NLP tasks (Crawford et al., 1996;Kiddon and Domingos, 2015;He et al., 2017;Tanchip et al., 2020).
In this work, we propose Symmetrical Prompt Enhancement (SPE)-a continuous prompting method that learns prompt that incorporates the above mentioned symmetry.Specifically, in addition to generating a prompt to predict the object given the subject, SPE also generates an additional symmetrical prompt to predict the subject given the object.Using the first prompt (see green box in Fig. 1), SPE obtains a few high-probability candidate objects like Germany, France, and Europe.Thereafter, for each object candidate, it generates a symmetrical prompt (see yellow boxes), and obtains the likelihood of the subject, Luxembourg.At the heart of SPE is a prompt generation model that is trained by maximizing the joint likelihood of both the candidates as well as the subject (given the candidates).Our experiments on the factual probing dataset LAMA (Petroni et al., 2019) show that SPE achieves significant improvement over previous approaches and our analysis points to sources of this performance gain.These experiments demonstrate that like SPE, probing methods should learn prompts that leverage the symmetry of the task because that can help PLMs in producing better answers when they are being probed for stored factual knowledge.

Symmetrical Prompt Enhancement
The goal of factual probing via prompt generation is to output object O for given subject I and relation R by constructing a prompt P. Most methods operate by assuming a template T , and generating the prompt P from T , I and R. Fig. 1 shows an example of Subject (Luxembourg), Relation (location), Object (Europe), Template (____ is located in ____.), and the corresponding Prompt (Luxembourg is located in ____.).The figure shows a natural language template and prompts for readability.However, for continuous prompt methods like ours, the template is a sequences of vectors like We refer to the two blanks as B O and B I .The prompt, P orig , is typically generated by learning these vectors and filling the (representation of) I in B I .The prompts are relation-specific (P R orig ) but here we refer to them as P orig for simplicity.The model's prediction, Ô, is the most likely object candidate for the B O as determined by the PLM using P orig .
Our proposed approach, Symmetrical Prompt Enhancement (SPE), leverages the inherent symmetry of the task.Specifically, in addition to learning the original prompt P orig for predicting the object given the subject, SPE also generates several symmetrical prompts, P sym , for predicting the subject given the object.Like P orig , P sym is also generated from T except that this time B O is filled by the (representation of) O.The prompt is used for probing the PLM which outputs prediction for B I .
Here p(v|P) is the probability distribution of word or phrases v in PLM given prompt P as input.The model is trained by optimizing a linear combination of the cross-entropy objectives of predicting the object O and the subject I: where λ is a hyperparameter.θ, the parameters of the prompt generation model, are learned.
For inference, SPE selects top K predictions C K : and uses each prediction c k ∈ C K as a candidate to generate the symmetrical prompt P k sym .Finally, the model's prediction Ô is: (5) In practice, L and P PLM are normalized by input length to account for inputs with multiple tokens.

Implementation Details
We conduct experiments on the fact retrieval part of LAMA dataset (Petroni et al., 2019), which consists of fact triples with single-token objects from 41 relations in Wikidata (Vrandečić and Krötzsch, 2014).We use the training set extended by Shin et al. (2020).We choose masked language models BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as PLMs, which are fixed during training to serve as static knowledge bases.For implementation, we use PLMs in Huggingface library of Transformers (Wolf et al., 2020).We follow Liu et al. (2021) for designing templates and the prompt generation component of our model.In particular, we use BiLSTM (Graves et al., 2013) with multilayer perceptron (MLP) for prompt generation Model BERT-base BERT-large RoBERTa-base P@1 P@10 MRR P@1 P@10 MRR P@1 P@10 MRR Manual (Petroni et al., 2019)  and use the following generic and relation-agnostic format for template, T : The model and the template are randomly initialized.
For I with multiple tokens, we mask them one token at a time to generate P sym , and use the average of pseudo likelihoods from all P sym s to represent log p(v ′ = I|P sym ).In practice, we find that masking one token at a time is better than masking the entire phrase at once, and averaging the pseudolikelihood has better performance.The training batch size is 8.We set K to be 15 during inference, and λ to be 0.8 based on our experiments on the development set.The results are evaluated by accuracy at top 1 (P@1) and top 10 (P@10) predictions, and Mean Reciprocal Rank (MRR) as in Qin and Eisner (2021).Appendix A includes more setup details and discussion on choice of λ.

Results
We compare our results with both discrete and continuous prompt methods.Discrete prompt methods include prompts from manually designed templates (Petroni et al., 2019); LPAQA (Jiang et al., 2020), which uses text mining based prompts; and Au-toPrompt (Shin et al., 2020), which uses discrete lexicalized trigger tokens for prompt generation.Continuous prompt methods include P-tuning (Liu et al., 2021), which uses a neural network to generate prompts; OptiPrompt (Zhong et al., 2021), which uses manually initialized prompts; and Soft-Prompt (Qin and Eisner, 2021), which ensembles multiple prompts initialized with mined templates.
Quantitative Results: Table 1 shows the performance of SPE and all baselines.The results show that SPE outperforms all previous methods.Note that, unlike OptiPrompt and SoftPrompt, SPE does not make use of manually designed templates for initialization.We also find that SPE outperforms the baselines when the PLM parameters are updated jointly with the prompt tokens on the training data.See Table 5 in the Appendix B.2 for detailed results.For the rest experiments, we consider Ptuning as primary baseline since it is the best performing model that is directly comparable to SPE.
Effect of candidates pool size: Table 2 shows how SPE performs with different candidate pool sizes.Comparing the first two rows we can see that SPE outperforms our primary baseline, P-tuning, even without reranking (K=1).Increasing the size of the candidate pool leads to further improvements.However, expanding the candidate pool has a trade-off between performance and memory usage.Meanwhile, applying reranking on the discrete prompt methods mentioned does not introduce performance gain, mainly because their prompt templates are selected or mined in favor of object prediction only.We leave the investigation of constructing discrete prompts that benefits from the symmetry as future work.LAMA test set has also been split into LAMA-Easy and LAMA-Hard where objects in the LAMA-Easy split can be "guessed" by naive methods (Zhong et al., 2021).We observe that SPE outperforms the baselines in P@1 for both splits and its gain over P-tuning for LAMA-Hard (4.2%) is larger than LAMA-Easy (1.5%) (see  3 from P-tuning (top half of each row) and SPE (bottom half of each row).The correct answers are underlined, and their ranks in the predicted lists are in the last column.We observe that SPE's top predictions are in the correct domain.

Performance on Easy and Hard examples: The
For example, SPE outputs BBC for the employer of British-Irish actor Spike Milligan (as opposed to Microsoft, IBM, and Google) as outputted by P-tuning), and Hindi along with other Indian languages, when asked about the original language of an Indian movie Baaz, rather than Turkish, a non-Indian Language.Moreover, SPE correctly identifies the country of citizenship for Rubens Barrichello as Brazil.Identifying objects for relations like country of citizenship for individuals are challenging because documents with the individual's names in the pretraining corpus of PLMs might contain mentions of multiple places he/she has worked or lived or received education in.Therefore, these co-occurrences might confuse PLMs.In Appendix C, we identify such confusing relations and conduct a close analysis on them.
We also find that SPE's predictions (e.g.opera for the field of work of Richard Wagner) are more precise that P-tuning's (music).In general, PLMs predictions for a relation can get affected by related high-frequency but incorrect object candidates.Previous prompt methods are found to suffer from bias of the prompt and object distribution in the dataset (Cao et al., 2021).To investigate this, we identify a set of relations that are prone to such spurious frequency-related associations (see Appendix C for the list of relations) and find that SPE especially performs well on such relations (see Figure 3 and Appendix C.1).We also plot the mean (log 10 ) token frequencies of top predictions of different methods as well as the oracle for these relations in Figure 2 (using word frequencies from Speer et al. (2018)).We observe that SPE's predictions (green bars) have lower frequencies than most baselines including P-tuning (red bars).Meanwhile, the frequencies of SPE's predictions are in general more similar to that of the oracle (blue bars) than most baselines.This indicates that even though the correct (and more precise) answers have lower frequencies, SPE can output them as answers while the baselines output the more frequent alternatives as answers (see Appendix C.2 for examples).We further extend this analysis to the top M predictions and observe similar behavior (see Appendix C.3). Lastly, the outputs of SPE are less affected by the most frequently occurring objects in the dataset (see Appendix C.4).

Limitations
We note that SPE may not help if the correct objects are broad concepts (e.g."mathematics" vs "algebra", "river" vs "tributary", "FIFA" vs "UEFA").Typical relations with such objects include P279 subclass of, P361 part of and P463 member of.The top 5 predictions by SPE (and also P-tuning) for the subclass of relation are shown in Table 3.The correct answer, river, is ranked 3rd by SPE and an incorrect answer, tributary, is the top prediction.P-tuning outputs the correct answer.
Also, in general, SPE can get affected by error propagation because of its two-step inference process that first predicts object candidates and then ranks them.
Though the proposed symmetrical prompt method improves knowledge probing, the utility of the technique in other NLP tasks is not yet investigated.Besides, the experiments are only conducted for masked language models but there has been recent progress in other types of language models which are not explored in the paper.Lastly, the proposed method requires additional computational cost compared the baselines.

Conclusion
This work introduces Symmetrical Prompt Enhancement (SPE) -a continuous prompt-learning method for factual probing of PLMs by learning prompts that utilize the inherent symmetry of the task.Our experiments show that SPE outperforms existing SOTA methods thereby helping us know more about how much knowledge is stored in a PLM.Future work could explore this idea of using task symmetry for other NLP tasks. 1 1 Code is available here.

Ethical Consideration
In this work, we propose SPE, which incorporates the symmetrical nature of factual knowledge in prompt methods.Our result shows the effectiveness of SPE over several previous prompt baselines.Even though we work on the factual knowledge dataset, we notice that current PLMs does not have the awareness to distinguish between publiclyavailable factual knowledge and private information (which is not considered as knowledge) either during the pre-training or inference, while the memorizing information of PLMs in latter lead to potential risk of privacy leakage (Carlini et al., 2021).All the experiments are conducted on the publicly available dataset, which is mainly based on Wikidata.).The experiments require 20 hours to finish on a single Tesla V100 GPU.Also, during our experiments, we experiment with having separate prompt generation models for generating P orig and P sym .However, we find that training one prompt generation model for both P orig and P sym led to better results.Choice of λ: In our preliminary experiments on the development set, we find λ = 0.8 to be the best choice among [0, 1].However, we observe that the performance is not very sensitive to λ and λ > 0.4 generally gives a reasonable performance.

B.1 Easy and Hard LAMA Examples
Zhong et al. ( 2021) points out that during factual probing, a PLM's predictions can be based on shallow patterns in the training data instead of the knowledge stored in the PLM.To study this phenomena, they propose an easy (LAMA-Easy) and a hard (LAMA-Hard) split of the LAMA dataset where objects in the LAMA-Easy subset can be "guessed" by naive or non-pretrained models.We compare SPE with the baselines on these two subsets and report results in Table 4.We observe that, in general, all methods achieves better performance in LAMA-Easy that the complete testset but SPE has the highest P@1.It is outperformed on P@10 and MRR only by Softprompt and Optiprompt but they use manually designed templates.More importantly, SPE shows higher improvement in LAMA-Hard compared to baselines especially with respect to P-tuning (4.2% in P@1).This shows that the improvement of SPE does not simply come from native language (N-1).The first, second and third row represents the P@1 improvement SPE has over Optiprompt, SoftPrompt and P-tuning respectively.SPE outperforms all three probing methods for most relations.
shallow pattern matching and it is better at handling more challenging knowledge probing cases.

B.2 Finetuning PLMs
In the experiments reported in the paper, the PLMs are fixed during training and only the prompt generation model is being trained.We now experiment with also finetuning the PLMs.We use BERTlarge-cased for this experiment.The results are reported in Table 5 where we compare SPE with the comparable gradient-based baselines.We can see that SPE outperforms those baselines in this setting also.The drop in P@10 of AutoPrompt compared to its P@10 when PLM is fixed (see Table 1) may be related to its discrete token substitution (non-gradient-descent) design, which is harder to optimize.

C Analysis on Relations with Spurious Associations
Recent works have shown that frequency bias exists in maximal likelihood estimation training of language models (Ott et al., 2018;Jiang et al., 2021) and how a PLM's learning of a word is related to its frequency (Chang and Bergen, 2022).Cao et al. (2021) observed that prompts for fact probing overfit the object distribution more than the relation.As a result, in factual probing, PLM's output might get affected by the frequencies of output candidates.This is especially true for relations that are prone to spurious associations of the subject with candidate objects or over-representation of candidate objects.Below, we identify some such relations and then analyze performance of SPE with respect to the baselines on these relations.
R1 Relations with scope associations (P101 field of work).When probing factual knowledge from PLMs, the object of a subject-relation pair forms the correct answer.While there can be multiple reasonable answers, some are more precise and so more desirable than others.For instance, for describing the field that Richard Wagner worked in (see Table 3), both opera and music seem to be reasonable answers but opera is the more precise one.
In such relations, different object-candidates may entail similar meanings but be of different scope.
R2 Relations with entity-type associations (P19 place of birth, P20 place of death, P27 country of citizenship, P364 original language of film or TV show, P495 country of origin, P1412 language spoken, P937 work location).Some relations are about objects with specific constraints.For example, place of birth and place of death are the first  3).

C.1 Comparison of SPE with Baselines on Relations with Spurious Associations
We observe that for R1, R2 and R3 category relations, SPE especially outperformed the baselines in most cases (see Figure 3).The first, second and third rows of the figure represent the corresponding P@1 improvement (scale of 100) of SPE over Optiprompt, SoftPrompt and P-tuning respectively and a darker color means larger improvement.
To further investigate these improvements, we explored the correlation between predictions of different prompt approaches and their token frequencies.We analyzed relations affected by co-occurences of subjects with spurious object candidates, i.e. relations of type R1 and R2.We acquired word frequencies from Speer et al. (2018)  ple, in the field of work of Richard Wagner on Table 3, the log word frequency of opera is -4.73 but the log frequencies of music, history, philosophy and psychology are -3.48,-3.61, -4.51 and -4.69 respectively, which have higher word frequencies than opera (especially, the frequency of P-tuning's output music is 17 times higher).Similarly, in the case of original language of Bazz, the log frequencies of Hindi, Urdu, Punjabi are -5.18,-5.74, -5.84, while for non-Indian languages like Turkish, English, and French they are -4.71,-3.81 and -3.91, which means these frequencies are at least 10 times higher than the Indian languages.Yet, SPE outputs the correct, even though less frequent answers.

C.3 Investigating Token Frequencies of
Top-M Predictions.
We now extend the above-mentioned analysis from top predictions to top M predictions and analyze if SPE can help the PLMs output less frequent tokens as answers.In particular, for different prompting approaches, we consider their top M predictions and compute the Rank Weighted Frequency (RWF) using the following formula, where C M n is the n-th candidates among the top M predictions.A lower RWF indicates less association between token frequencies and top predictions.Results are shown in Figure 4 with M=10.We can see that for most relations, SPE has a lower RWF than baselines, especially P-tuning.These experiments indicate that SPE can mitigate the frequency bias inherently contained in PLMs and avoid answers with spurious associations with the subjects.

C.4 Investigating Percentage of Majority
Label in Predictions.
The analyses shown in Appendix C and C.3 focus on relations of type R1 and R2.We now focus on relations of type R3, i.e. relations affected by label imbalance.Results in Figure 5 show that SPE predicts majority training labels less frequently than the baselines in most relations, demonstrating that it is less affected by the imbalances in the label distribution.

Figure 2 :
Figure 2: Frequencies of predictions of different methods.SPE can output answers that have low frequencies.

Figure 5 :
Figure 5: Comparison of different methods in percentage of majority training label in test predictions for R3 relations.A lower value means the method is less affected by the label distribution in the training set.Invisible bars are overlapped with the red ones.SPE (green) has lower values than the baselines in most relations.

Table 1 :
SPE outperforms state-of-the-art discrete and continuous prompt approaches on the LAMA dataset.
Table 2: Effect of varying size of candidate pool on SPE's performance.SPE outperforms P-tuning even without reranking (K=1).A larger candidate pool helps the model even further.

Table 3 :
Sample outputs of P-tuning (PT) and SPE.The ranks of correct answers (underlined) are in the last column.

Table 4 :
SPE outperforms the baselines on both hard and easy examples with greater improvement on the hard ones.
are Oceania (minority class).P-tuning is probably affected by this imbalance and outputs the majority label, Antartica, as the continent that contains Marshall Islands while Oceania, the correct answer, appears at rank 4 (see Table