Towards Alleviating the Object Bias in Prompt Tuning-based Factual Knowledge Extraction

Many works have employed prompt tuning methods to automatically optimize prompt queries and extract the factual knowledge stored in pretrained language models. In this paper, we observe that the optimized prompts, including both discrete and continuous prompts, exhibit undesirable object bias. To handle this problem, we propose a novel prompt tuning method called MeCoD, consisting of three modules: Prompt Encoder, Object Equalization and Biased Object Obstruction. Experimental results show that MeCoD can significantly reduce object bias and at the same time improve the accuracy of factual knowledge extraction.


Introduction
Pretrained language models (PLMs) have become a standard practice in NLP and achieved strong performance on many downstream tasks (Qiu et al., 2020; Liu et al., 2021a). A recognized reason why PLMs are so powerful is the knowledge learned from large amounts of public corpora (Liu et al., 2019a). Recently, researchers have taken interest in measuring and extracting the factual knowledge in PLMs. Petroni et al. (2019) first formally proposed the LAMA benchmark, which employs handcrafted prompts to retrieve factual knowledge in the form of <subject, relation, object> triples. For example, regarding the factual knowledge triple <Douglas Adams, native language, English>, LAMA can query PLMs with "The native language of Douglas Adams is [MASK]" to extract the native language of Douglas Adams, where "The native language of [X] is [MASK]" is a manual prompt for the relation "native language" and "[MASK]" is a placeholder for the object to predict.
In order to extract factual knowledge more effectively, many works take a step toward automatically tuning prompts with an additional training set. Shin et al. (2020) proposed AutoPrompt to generate discrete prompts automatically based on gradient optimization, maximizing the expected likelihood of the ground-truth object. Instead of searching discrete prompts, a more flexible research line tunes continuous prompts directly in the input embedding space. For example, Liu et al. (2021b) proposed P-tuning to optimize a continuous prompt for each factual relation, and achieved SOTA performance. Li and Liang (2021) proposed a semi-automatic method called Prefix-tuning to learn a prefix added to manual prompts. Newman et al. (2022) applied Prefix-tuning to improve the robustness of factual knowledge extraction.
Although the above prompt tuning methods achieve good performance, we show in this paper that they suffer from a severe object bias problem. As illustrated in Figure 1 (a), we construct subject-masked prompts for different prompt-based knowledge extraction methods; Figure 1 (b) shows the derived logits of the top-k retrieved objects in descending order. Since the subject is masked in the issued prompt template, no context is provided and an even logit distribution over different object candidates is expected. Taking the fact <Douglas Adams, native language, English> for example, objects like "French", "English" and "Russian" should be treated equally when Douglas Adams is masked. However, we observe non-trivial slopes (|w| in Figure 1 (b)) of the regression lines in the 4 examined knowledge extraction methods, i.e., they all exhibit bias towards specific objects. Notably, the 3 prompt tuning methods, AutoPrompt, P-tuning and Prefix-tuning, have steeper slopes and thus exhibit more severe object bias. More object bias measurement results are available in Section 2.1. Given the observed object bias in prompt tuning methods, we further design analysis experiments and find a negative influence of object bias on knowledge extraction accuracy (detailed in Section 2.2). This motivates us to develop solutions that both alleviate the object bias problem and contribute to more accurate factual knowledge extraction.
In this paper, to address the object bias problem at the prompt tuning stage, we propose MeCoD (Maximum entropy and Contrastive learning for object Debiasing) towards unbiased factual knowledge extraction.1 The basic idea is to derive equalized object predictions with the subject-masked prompt, and at the same time discourage the biased objects with the original prompt. These goals are realized by a maximum entropy-based Object Equalization module and a contrastive learning-based Biased Object Obstruction module, respectively. Figure 1 (c) illustrates the intuitive effect of object bias alleviation.

1 Since continuous prompts are more effective and widely adopted, MeCoD is designed to improve continuous prompt tuning methods, e.g., P-tuning and Prefix-tuning.

Contributions. We summarize the main contributions of this paper as follows:
• We position the object bias problem in prompt tuning-based factual knowledge extraction.
The influence of object bias on knowledge extraction accuracy is also discussed.
• We propose an object debiasing method at the prompt tuning stage to alleviate the object bias and improve accuracy of factual knowledge extraction.
• The effectiveness of the proposed method is validated with sufficient qualitative and quantitative experiments.

Data Analysis
Object Bias Definition. Factual knowledge can be represented in the form of <subject, relation, object> triples. Object bias in factual knowledge extraction refers to the phenomenon that a pretrained language model with prompts retrieves object candidates unequally when the subject is not assigned, e.g., preferring "French" over "English" when predicting a person's native language even though the person is not specified.

Object Bias Measurement.
Object bias inherently concerns the uncertainty of the objects retrieved with subject-masked prompt queries. We thus employ entropy (Shannon, 1948) in this work to measure object bias. Specifically, we define object bias entropy in terms of the relation R as:

entropy(R) = − Σ_{i=1..k} p_R(i) log p_R(i),   (1)

where p_R is obtained by selecting the top-k subject-masked logit values and normalizing with the softmax function, and k denotes the number of logit values used to calculate the entropy. In our subsequent analysis, we set k to 10, and entropy(R) achieves its maximum value of about 2.305 when the object logits are equal. The smaller the value, the more significant the object bias. We measure the 4 typical factual knowledge extraction methods on the LAMA benchmark according to Eqn. 1. The averaged result over 41 relations is shown in Table 2. It is easy to see that the observation in terms of object bias entropy is consistent with that of the slopes in Figure 1: (1) All 4 methods exhibit object bias; their entropy values noticeably decrease from 2.305 by 9%, 17%, 23% and 13%, respectively. (2) The object bias entropy of the prompt tuning methods, including AutoPrompt, Prefix-tuning and P-tuning, is smaller than that of the manual prompts of LAMA. This observation further demonstrates that prompt tuning methods suffer more serious object bias than manual prompts. Note that the object bias entropy of Prefix-tuning falls in between that of the manual prompts (LAMA) and the fully automatic prompts (AutoPrompt and P-tuning). A possible reason is that the manual template in Prefix-tuning limits the learning of object bias.
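As a concrete illustration of Eqn. 1, the measurement can be sketched in a few lines. The logit vector here is a hypothetical stand-in for the PLM's output distribution at the object "[MASK]" of a subject-masked prompt:

```python
import numpy as np

def object_bias_entropy(logits, k=10):
    """Entropy over the top-k subject-masked object logits (Eqn. 1).
    `logits` is a hypothetical vocabulary-sized logit vector produced by
    the PLM for the object [MASK] of a subject-masked prompt."""
    top_k = np.sort(np.asarray(logits, dtype=float))[::-1][:k]  # k largest logits
    p = np.exp(top_k - top_k.max())
    p = p / p.sum()                                             # softmax-normalize
    return float(-(p * np.log(p)).sum())

# Equal logits give the maximum entropy log(k); a peaked (biased)
# distribution scores lower.
uniform = object_bias_entropy(np.zeros(50))             # ~ log(10)
biased = object_bias_entropy([10.0] + [0.0] * 49)       # close to 0
assert uniform > biased
```

A lower score for a tuned prompt than for a manual one is exactly the Table 2 observation that prompt tuning amplifies object bias.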

Influence on Knowledge Extraction
In order to investigate the influence of object bias on knowledge extraction, we compare the object candidates retrieved from original prompts and from subject-masked prompts. Specifically, we examine the rank of the ground-truth object in the retrieved object lists and calculate the Pearson correlation coefficient between the rank under the original prompt and the rank under the subject-masked prompt.
According to the results on all testing samples, we find that the correlations of the prompt tuning methods are higher than that of LAMA (Table 3). This illustrates that the predictions of prompt tuning methods are influenced more significantly by object bias than those of manual prompts. Furthermore, comparing the results on all testing samples with those on incorrectly predicted samples, it is easy to observe that the correlation coefficient of LAMA remains almost unchanged, while those of the 3 prompt tuning methods increase obviously on incorrectly predicted samples. This suggests that some of the incorrect predictions can be attributed to the bias towards specific objects, and motivates us to address the object bias problem to further improve knowledge extraction accuracy.
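The rank-correlation analysis can be sketched as follows; the per-sample rank lists are hypothetical inputs, not the paper's actual data:

```python
import numpy as np

def rank_correlation(ranks_original, ranks_masked):
    """Pearson correlation between the ground-truth object's rank under the
    original prompt and under the subject-masked prompt (Table 3 analysis).
    Each list holds one rank per test fact."""
    return float(np.corrcoef(ranks_original, ranks_masked)[0, 1])

# A high correlation means the subject-specific prediction largely follows
# the subject-independent (i.e., biased) prediction.
assert rank_correlation([1, 2, 3, 4], [1, 2, 3, 4]) > 0.99
```

Computing this separately on all samples and on incorrectly predicted samples reproduces the comparison described above.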
The above observations demonstrate the necessity of object debiasing. Object bias arises both at the pretraining stage of PLMs and at the prompt tuning stage. The observed object bias of LAMA mainly originates from the pretraining stage. While some pilot studies (Guo et al., 2022) are devoted to reducing bias in pretrained language models, we found in the above analysis that the object bias introduced at the prompt tuning stage is more severe than that from the pretraining stage, and thus focus on object debiasing at the prompt tuning stage in this paper. The straightforward cause of object bias at the prompt tuning stage is imbalanced training data for optimizing prompts. Taking P-tuning as an example, we make a preliminary attempt by retraining it with an undersampled, balanced training set. In order not to excessively reduce the number of samples, we first group the training samples according to their objects, and then randomly undersample the two groups with the largest numbers of samples so that their sizes match that of the third largest group. Table 4 summarizes the performance of P-tuning trained with the different training sets.
We observe that object bias is alleviated on the undersampled training set, which validates the attribution of object bias to the imbalanced tuning set. However, we find that P@1 drops clearly due to the insufficient use of data; that is, accuracy is sacrificed for debiasing. This inspires us to design an effective prompt tuning method, instead of simply balancing the training data, to alleviate object bias as well as improve accuracy.
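A minimal sketch of the undersampling procedure described in Section 2.2, assuming training facts are given as hypothetical (subject, object) pairs:

```python
import random
from collections import defaultdict

def undersample_two_largest(samples, seed=0):
    """Group training facts by object, then randomly undersample the two
    largest groups down to the size of the third largest group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for subj, obj in samples:
        groups[obj].append((subj, obj))
    by_size = sorted(groups, key=lambda o: len(groups[o]), reverse=True)
    if len(by_size) >= 3:
        cap = len(groups[by_size[2]])        # size of the third largest group
        for obj in by_size[:2]:
            groups[obj] = rng.sample(groups[obj], cap)
    return [s for g in groups.values() for s in g]
```

As the Table 4 discussion notes, this balances the object distribution at the cost of discarding training data, which is what motivates MeCoD.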

Methodology
We present the overall framework, MeCoD, as illustrated in Figure 2. The basic idea is to improve the prompt encoder so that issuing the resultant embeddings to popular PLMs no longer exhibits object bias. The goals are two-fold: (1) objects should have equal opportunities to be extracted from PLMs by the subject-masked prompt; (2) biased objects should be prevented from being extracted by the original prompt with the specified subject. Correspondingly, MeCoD includes three modules. The first module is the Prompt Encoder to be optimized, which takes the original and subject-masked prompts as inputs and issues the resultant embeddings to PLMs to obtain the mask hidden state h_o and the subject-masked logits p_m, respectively (see Section 3.1). The second module is Object Equalization, which takes the subject-masked logits p_m as input and forces the model to treat objects equally when the subject is masked (see Section 3.2). The third module is Biased Object Obstruction, which further prevents the biased objects from being extracted by forcing the mask hidden state h_o away from biased object embeddings and close to the ground-truth object embedding (see Section 3.3). We will take P-tuning as an example to elaborate the details of each module below.

Prompt Encoder
In this module, we first construct the subject-masked input by replacing "[X]" with "[MASK]", as shown in Table 1. The number of "[MASK]" tokens is set to the number of tokens in the tokenized subject. As shown in Figure 2 (left), given the original prompt I_o and the subject-masked prompt I_m, we use the Prompt Encoder to get input embeddings e_o and e_m for the PLMs. Then, the mask hidden state h_o, the subject-masked logits p_m and the MLM (Masked Language Modeling) loss L_mlm can be obtained from the PLMs as (reconstructed here from context, as the original equations were lost in extraction):

h_o = PLM(e_o),  p_m = PLM(e_m),  L_mlm = − log p_o(y_i),   (2)

where p_o denotes the object distribution predicted from the original prompt and y_i denotes the ground truth. p_m will be used to equalize the objects with respect to the subject-masked prompt in Section 3.2. Both h_o and p_m will be employed to obstruct the influence of biased objects in Section 3.3.
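The subject-masked input construction can be sketched as follows; the template string and tokenized subject are illustrative stand-ins for the real tokenizer output:

```python
def subject_masked_prompt(template, subject_tokens):
    """Replace the subject placeholder "[X]" with as many "[MASK]" tokens
    as the tokenized subject has (Section 3.1)."""
    return template.replace("[X]", " ".join(["[MASK]"] * len(subject_tokens)))

masked = subject_masked_prompt(
    "The native language of [X] is [MASK] .", ["Douglas", "Adams"]
)
assert masked == "The native language of [MASK] [MASK] is [MASK] ."
```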

Object Equalization
According to the data analysis in Section 2, we consider that the probabilities of object candidates should be equalized when issuing the subject-masked prompt to PLMs. In this subsection, we introduce a method based on maximum entropy to force objects to be treated equally when the subject is not given. Note that only the fact-related objects need to be considered; e.g., regarding relation P103 (native language), the objects that risk biasing the prediction results are ones like "English" and "French", not "apple". To filter out the unrelated objects, we first sort the subject-masked logits p_m in descending order to get p_d, and empirically reserve the top-300 objects c_o. Then, we further employ a linear layer as a binary classifier, named the Object Selector, to identify the objects to be equalized. Specifically, the Object Selector takes the object embeddings as input and returns a binary vector v ∈ {0, 1}^300 with gumbel softmax (Jang et al., 2017) as follows:

v = gumbel_softmax(Linear(E(c_o))),   (3)

where E denotes the embedding layer of the PLMs. The objects corresponding to v(i) = 1, i = 1, 2, ..., 300 are selected to be equalized. Finally, we construct the loss L_me based on maximum entropy (reconstructed here as the negative entropy of the selected objects' distribution, so that minimizing L_me maximizes their entropy):

L_me = Σ_{i: v(i)=1} p_d(i) log p_d(i).   (4)
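A simplified NumPy sketch of the Object Selector and the maximum entropy loss. The differentiable Gumbel-softmax of the real module is approximated here with hard sampling, and the scores, logits and top-300 truncation are assumed inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_binary_select(scores, tau=1.0):
    """Hard binary selection mimicking binary Gumbel-softmax sampling
    (Object Selector). `scores` are per-object logits from the hypothetical
    linear layer; object i is kept (v[i] = 1) when its "select" logit beats
    its "drop" logit after adding Gumbel noise."""
    g = rng.gumbel(size=(len(scores), 2))            # Gumbel(0, 1) noise
    two_class = np.stack([scores, -scores], axis=1)  # [select, drop] logits
    return (np.argmax((two_class + g) / tau, axis=1) == 0).astype(int)

def max_entropy_loss(masked_logits, v):
    """L_me (Eqn. 4 as reconstructed): negative entropy of the
    subject-masked distribution restricted to the selected objects, so
    minimizing it pushes their probabilities toward equality."""
    selected = masked_logits[v == 1]
    p = np.exp(selected - selected.max())
    p = p / p.sum()
    return float((p * np.log(p)).sum())              # equals -entropy(p)
```

Minimizing `max_entropy_loss` drives the selected objects' subject-masked probabilities toward uniform, which is the equalization goal; the real module backpropagates through a soft Gumbel-softmax rather than the hard sampling used here.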

Biased Object Obstruction
This module further reduces the probability of retrieving biased objects when issuing the prompt with the specified subject, based on contrastive learning. The key idea is to simultaneously minimize the representation gap between "[MASK]" and the ground-truth object and maximize that between "[MASK]" and irrelevant biased objects. Specifically, we regard the objects corresponding to v(i) = 1, i = 1, 2, ..., 300, except for the ground-truth object, as biased objects. We formalize it as a contrastive learning problem and propose to minimize the following loss (van den Oord et al., 2018):

L_cl = − log [ e^{sim(h_o, e_g)/τ} / ( Σ_{e_b} e^{sim(h_o, e_b)/τ} + e^{sim(h_o, e_g)/τ} ) ],   (5)

where sim(•) calculates the cosine similarity between representations, and e_g and e_b denote the word embeddings of the ground-truth object and the biased objects, respectively. τ is the temperature, controlling the difficulty of distinguishing between positive and negative samples. Intuitively, the contrastive loss forces the model to push the mask hidden state away from the embeddings of the biased objects and pull it towards the ground-truth object embedding.
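The contrastive loss of Eqn. 5 can be sketched as follows; the hidden state and embeddings are tiny hypothetical vectors rather than real PLM representations:

```python
import numpy as np

def obstruction_loss(h_o, e_g, biased_embs, tau=0.5):
    """InfoNCE-style loss of Eqn. 5 (sketch): pull the [MASK] hidden state
    h_o toward the ground-truth object embedding e_g and push it away from
    the biased object embeddings."""
    def sim(a, b):                                   # cosine similarity
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(sim(h_o, e_g) / tau)
    neg = sum(np.exp(sim(h_o, e_b) / tau) for e_b in biased_embs)
    return float(-np.log(pos / (pos + neg)))

# The loss is low when h_o aligns with the ground-truth embedding and high
# when it aligns with a biased object's embedding.
e_g = np.array([1.0, 0.0])
biased = [np.array([0.0, 1.0])]
assert obstruction_loss(np.array([1.0, 0.0]), e_g, biased) \
     < obstruction_loss(np.array([0.0, 1.0]), e_g, biased)
```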
During training, the model is optimized by jointly minimizing the loss L as follows:

L = L_mlm + λ_1 L_me + λ_2 L_cl,   (6)

where λ_1 and λ_2 are the coefficients that balance the three training losses.
Experiments

Evaluation metrics and baselines. In addition to object bias entropy, we also evaluate knowledge extraction performance with the metrics of precision-at-1 (P@1) and mean reciprocal rank (MRR). We report average performance over 41 relations. In order to evaluate the effectiveness of the proposed MeCoD, we implement LAMA, Prefix-tuning, P-tuning and the undersampling-based solution (see Section 2) as baselines. Specifically, LAMA provides hand-crafted prompts that are less object-biased than tuned prompts. P-tuning and Prefix-tuning are representatives of continuous prompts, which are more effective and widely used.
Implementation details. For the PLMs in our experiments, we investigate BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b). The prompt encoder is initialized with the standard training prompt encoder, e.g., the LSTM-based encoder of P-tuning (Shi et al., 2015). We use the Adam optimizer (Kingma and Ba, 2014) with its default configuration and set the learning rate to 1e-5 to jointly optimize the Prompt Encoder and the Object Selector. We set λ_1 and λ_2 to 0.2 and 0.1, respectively. As for training time, taking MeCoD on P-tuning as an example, it takes about 40 minutes per relation on a single RTX A4000 GPU.

Quantitative Results
Table 5 shows the performance of each method on the two selected PLMs. Briefly, MeCoD outperforms the baselines on both object bias and knowledge extraction performance. Taking P-tuning as an example, when enhanced with MeCoD, object bias entropy increases by 22% and 27%, and P@1 increases by 2% and 4%, for BERT and RoBERTa respectively. This demonstrates that alleviating bias contributes to improving accuracy. Similar results are observed for Prefix-tuning, which demonstrates the generality and effectiveness of our method. The results of the undersampling-based methods illustrate that accuracy is sacrificed for debiasing. The improvement is also reflected in the evaluation results on the WIKI-UNI dataset, as shown in Table 9 and Table 10 in Appendix A.1.

Ablation Study
In order to clarify the source of the performance improvement of MeCoD, we take P-tuning as an example and conduct ablations by removing particular modules from MeCoD. The ablation results are shown in Table 6. We can conclude that: (1) Object Equalization plays a crucial role in alleviating bias, as removing the module causes the object bias entropy to decrease. The accompanying accuracy drop shows that the Object Equalization module helps improve accuracy by alleviating object bias.
(2) Biased Object Obstruction is useful for ensuring accuracy, because the contrastive loss forces the model not to be affected by biased objects. However, it does not clearly alleviate object bias, which may cause other unexpected problems, so it is not recommended to be used alone.

Case Study
In order to better understand how MeCoD contributes to alleviating object bias, we visualize in Figure 3 the regression lines of the subject-masked logits of P-tuning and Prefix-tuning on relation P178 (developer) and relation P103 (native language), respectively. The lines corresponding to MeCoD are relatively flatter; that is, the object bias is clearly alleviated by our method, consistent with the quantitative results in Table 5. Furthermore, we investigate the influence of alleviating object bias on knowledge extraction by observing the top-k candidates extracted by the original prompt and the subject-masked prompt on two fact samples, as shown in Table 7. Taking the fact about the developer of B-17 Flying Fortress for example, by observing the results extracted by the subject-masked prompt, we find that the candidate logits of MeCoD are more even, which means less object bias. In contrast, P-tuning shows obvious bias towards objects like "Atari" and "IBM". Correspondingly, for the results extracted by the original prompt, MeCoD correctly predicts "Boeing", while P-tuning incorrectly predicts the biased object "Atari" and ranks the correct object second. Similar results are observed for Prefix-tuning. We therefore conclude that object bias is responsible for such incorrect predictions.

Discussions
Our experiments show that the object bias of prompt tuning methods is undesirable. This inspires us to investigate how prompt tuning methods extract factual knowledge, and to explore the potential cause of object bias. Specifically, we take P-tuning as an example and base the following discussion on relation P19 (place-of-birth).
Finding Nearest Neighbors. In order to figure out the implication of the prompt token embeddings, we follow Lester et al. (2021) and find their nearest neighbors in the frozen model's vocabulary. As shown in Table 11, we observe that prompt tokens in close positions exhibit similar patterns. This indicates that the tokens in different positions probably play different roles, but we cannot recover their concrete meaning from the nearest neighbors alone. Therefore, we further analyze the candidate words for the prompt tokens returned by the masked language modeling (MLM) head of the PLMs.
Checking MLM Candidate Words. Table 8 illustrates the MLM candidate words of the prompt tokens. Specifically, the subject is set to "Claude Arrieu", a prolific French composer born in "Paris". Two interesting observations: (1) The MLM candidate words of the front and back prompt tokens are mostly punctuation, while the front tokens also include articles. This indicates that prompt tuning methods mimic human linguistic expression to some extent; for example, we often start a sentence with the article "the" and end it with the punctuation ".". (2) The results of the middle prompt tokens exhibit literal correlations between the subject and "[MASK]". Specifically, it is easy to find that the candidate words of Middle-1 are related to the object, for example "London" and even the ground-truth object "Paris"; note that Middle-1 is the token closest to "[MASK]". As for Middle-3, some candidate words can form the names of famous people with the first token of the subject, e.g., "Albert Claude", "Victor Claude", "Robert Claude"; note that Middle-3 is the token closest to the subject. The candidate words of Middle-2 can be seen as a mixture of the above two types. Combining these two observations, we conclude that prompt tuning methods extract factual knowledge by depending on shallow literal correlations rather than factual relations. Furthermore, we consider these shallow correlations to be one potential cause of object bias. This conjecture is intuitive and needs to be verified by rigorous experiments and analysis in future work.
Related Work

Conclusion
In this work, we position the object bias problem in prompt tuning-based factual knowledge extraction, and propose MeCoD, a framework for alleviating the object bias and improving the accuracy of factual knowledge extraction. Experimental results demonstrate the usefulness and generality of MeCoD. Besides, we argue that the shallow associations learned by prompt tuning are one potential cause of object bias. In the future, we plan to explore the mechanism behind object bias, design more reliable prompt tuning methods for factual knowledge extraction, and investigate these problems on generative pretrained models, e.g., GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020).

Limitations
In this paper, we focus on masked language models, which have been shown to be very effective and are widely used. One limitation of the present study is that we do not investigate the other representative category of language models, generative pretrained models (e.g., GPT-2 (Radford et al., 2019); GPT-3 (Brown et al., 2020)). We leave this for future work.

Figure 1:
Figure 1: Object bias in different prompt-based knowledge extraction methods: LAMA, AutoPrompt, Prefix-tuning and P-tuning. (a) demonstrates how to construct subject-masked prompts. (b) and (c) show the derived logits of top-retrieved objects for the original and our proposed prompt tuning methods, respectively.

Table 7:
Case Study: Top-k object candidates extracted by the original prompt and the subject-masked prompt for two fact samples. The first (middle row) is the result for "the developer of B-17 Flying Fortress". The second (bottom row) is the result for "the native language of Douglas Adams". Bold font indicates the ground truth, the numbers in parentheses are the ranks of object candidates, and the numbers below objects are the corresponding logit values.

Figure 3:
Figure 3: Case Study: The regression lines of subject-masked logits. (a) shows the result predicted by P-tuning and MeCoD on relation P178. (b) shows the result predicted by Prefix-tuning and MeCoD on relation P103.

Table 1:
Example of original and subject-masked prompt templates for relation P103. Original: "The native language of Pierre Messmer is [MASK] ." Subject-masked: "The native language of [MASK] is [MASK] ." "[T]" and "[P]" indicate discrete and continuous optimizable prompt tokens, respectively; the number of [P] and [T] can be customized. "[MASK]" in bold is the placeholder for the object to predict.

Table 3:
The Pearson correlation coefficient between the rank corresponding to the original prompt and the subject-masked prompt.

Table 4:
Results for P-tuning trained on the original dataset and on the undersampled dataset.

Table 5:
Results on the LAMA benchmark using the BERT-base-cased and RoBERTa-base models.
Newman et al. (2022) questioned the previous conclusion by investigating the behaviors of MLMs, namely whether current MLMs can potentially serve as reliable factual knowledge bases. In this paper, instead of the knowledge storage mechanism of PLMs, we mainly investigate the object bias problem in factual knowledge extraction during the prompt tuning stage. Factual Knowledge Extraction. In addition to manual prompts, many researchers have explored more effective methods for factual knowledge extraction. Jiang et al. (2020) mined prompts through text mining and paraphrasing. Recently, researchers have engaged in prompt tuning methods (Haviv et al., 2021; Qin and Eisner, 2021). For example, Shin et al. (2020) trained a model to generate prompts automatically based on gradient optimization. Liu et al. (2021b) proposed P-tuning, which completely abandons natural language forms and optimizes a continuous prompt for each factual relation. Li and Liang (2021) proposed a semi-automatic method called Prefix-tuning to learn a prefix added to manual prompts. Newman et al. (2022) applied Prefix-tuning to improve the robustness of factual knowledge extraction. In this paper, we observe that prompt tuning methods suffer serious object bias, and propose a framework, MeCoD, to alleviate it.