Co2PT: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning

Pre-trained language models are widely used in many important real-world applications. However, recent studies show that these models can encode social biases from large pre-training corpora and even amplify biases in downstream applications. To address this challenge, we propose Co2PT, an efficient and effective debias-while-prompt-tuning method for mitigating biases via counterfactual contrastive prompt tuning on downstream tasks. Our experiments on three extrinsic bias benchmarks demonstrate the effectiveness of Co2PT at mitigating bias during prompt tuning and its adaptability to existing upstream debiased language models. These findings indicate the strength of Co2PT and provide promising avenues for further enhancement of bias mitigation on downstream tasks.


Introduction
Pre-trained language models (PLMs) are widely used in many real-world applications, demonstrating remarkable performance (Devlin et al., 2019; Brown et al., 2020). However, it has been demonstrated that PLMs encode unfair social biases in their parameters, acquired during pre-training over large-scale text corpora (May et al., 2019). Furthermore, these biases (for example, based on gender, race, or religion) can easily propagate to the downstream tasks that use these PLMs (Kaneko and Bollegala, 2021). For example, "She is a nurse" can have a higher conditional likelihood than "He is a nurse" in the language modeling task, and "nurse" can have a higher coreference score with "she" than with "he" in the coreference resolution task (Lu et al., 2020). Considering that NLP applications like machine translation systems, resume filtering systems, dialogue systems, and speech recognition (Tatman, 2017) are widely used by millions of users globally, it is crucial to mitigate the social biases present in PLMs and strive for models that will not propagate discriminatory predictions or offensive outputs towards specific groups before being deployed.
Much prior effort has focused primarily on debiasing the representations learned during pre-training, e.g., through projection (Dev et al., 2020; Liang et al., 2020; Ravfogel et al., 2020; Kaneko and Bollegala, 2021), further pre-training on unbiased external corpora (Webster et al., 2020; Lauscher et al., 2021; He et al., 2022), or fine-tuning to debias (Cheng et al., 2021; Guo et al., 2022). The effectiveness of such debiasing efforts is typically measured on intrinsic benchmarks like SEAT (Sentence Encoder Association Test), which computes the association between demographic terms (e.g., woman, man) and stereotype terms (e.g., science, art). An unbiased model should display no difference in the similarity between the representations of these terms (May et al., 2019).
While these existing approaches help reduce social biases under intrinsic measures, such debias-then-finetune methods rest on the hypothesis that if an upstream model is unbiased, it will also preserve its fairness on downstream tasks during the fine-tuning process. However, recent research investigating the relationship between intrinsic and extrinsic benchmarks (which evaluate fairness in downstream applications) finds that the two correlate only weakly (Kaneko et al., 2022). Furthermore, models, even after being debiased, tend to re-acquire or even amplify biases (e.g., instance-related and label-related biases) during fine-tuning on downstream tasks (Zhao et al., 2017; Leino et al., 2019). This mismatch leads to our motivating research question: how can we develop an efficient and effective method to mitigate bias on downstream tasks?
To answer the aforementioned question, we propose Co2PT, a debias-while-prompt-tuning approach through Counterfactual Contrastive Prompt Tuning. In this method, we first freeze all parameters of the PLM and add tunable continuous prompts to every layer. Unlike previous debias-then-finetune methods that require expensive re-training of the original PLM and risk knowledge forgetting, this deep prompt tuning framework saves computational and memory resources while preserving the original pre-trained knowledge and language modeling ability (Li and Liang, 2021; Liu et al., 2022). To ensure that a fair system generates unbiased results regardless of the demographic terms used, we construct counterfactual pairs directly from the training data, eliminating the need for external corpora, whose quality debiasing would otherwise heavily depend on. Specifically, we replace demographic terms associated with either the dominant or minoritized group in the training data with terms representing the opposite group. Then, we integrate the ability to mitigate bias into the prompt parameters through a contrastive objective between counterfactual pairs while keeping the parameters of the PLM frozen. Co2PT can be integrated into existing debiased models to help them mitigate biases on downstream tasks and offers flexibility in addressing different kinds of bias. These advantages establish Co2PT as an efficient and effective method for mitigating bias in downstream tasks.
In conclusion, the proposed Co2PT mitigates bias on downstream tasks through prompt tuning, making the following contributions:
• Co2PT achieves time and memory efficiency without requiring access to an external corpus or retraining the entire model.
• Over three extrinsic bias benchmarks, we show that Co2PT effectively mitigates bias amplified during the prompt tuning process on downstream tasks.
• Furthermore, Co2PT can be extended to existing debiased language models, effectively bridging the gap between debiased upstream models and downstream tasks.

Related Work
Several approaches have been proposed for debiasing pre-trained language models, such as projection-based methods (Dev et al., 2020; Liang et al., 2020; Ravfogel et al., 2020; Kaneko and Bollegala, 2021), post-hoc text generation techniques (Schick et al., 2021), adversarial methods (Han et al., 2021), fine-tuning on biased prompts (Guo et al., 2022), with a contrastive objective (Cheng et al., 2021) or with augmented data (Zhao et al., 2018), additional pre-training on re-balanced corpora through counterfactual data augmentation (Webster et al., 2020; Lauscher et al., 2021; Meade et al., 2022) or with a contrastive objective on gender-balanced entailment pairs (He et al., 2022), dropout regularization (Webster et al., 2020), and parameter-efficient methods (Lauscher et al., 2021; Yang et al., 2022; Xie and Lukasiewicz, 2023), possibly with a contrastive objective (Li et al., 2023). While some of these works do not require access to an external corpus or retraining of the entire model, most prior methods primarily focus on mitigating bias within the model's intrinsic characteristics and evaluate the effectiveness of bias mitigation through intrinsic bias benchmarks, e.g., SEAT (May et al., 2019), StereoSet (Nadeem et al., 2021), and CrowS-Pairs (Nangia et al., 2020). Subsequently, they fine-tune the debiased models on downstream tasks and demonstrate that the debiased models retain language modeling ability and performance on downstream tasks or extrinsic bias benchmarks, which evaluate fairness in downstream tasks by testing whether models perform differently across different populations.
Nevertheless, recent research shows that these debias-then-finetune methods re-acquire or even amplify biases during fine-tuning on downstream tasks, and that intrinsic and extrinsic bias benchmarks correlate poorly (Goldfarb-Tarrant et al., 2021; Cao et al., 2022; Kaneko et al., 2022). These studies encourage researchers to focus directly on extrinsic measures of bias in specific applications when addressing bias mitigation (Goldfarb-Tarrant et al., 2021).
Thus, in this paper we focus on mitigating bias on downstream tasks and evaluate directly with extrinsic evaluation benchmarks. In addition, different from previous methods that require further pre-training on counterfactually augmented sentences from an external corpus, e.g., English Wikipedia (Zmigrod et al., 2019; Webster et al., 2020; Meade et al., 2022), BookCorpus (Lauscher et al., 2021), News-Commentary v15 (Yang et al., 2022), or NLI data (He et al., 2022), our method achieves time and memory efficiency by eliminating the need for external corpus access or model retraining.

Co2PT: Debiasing via Counterfactual Contrastive Prompt Tuning
We propose Co2PT, a parameter-efficient debias-while-prompt-tuning method for mitigating biases on downstream tasks via counterfactual contrastive prompt tuning, presented in Figure 1. Concretely, Co2PT mitigates bias in PLMs by leveraging counterfactual pairs built from the training data to produce debiased representations during prompt tuning.
Deep Prompt Tuning. First, we introduce the backbone framework of Co2PT: deep prompt tuning. We incorporate continuous prompts as prefix tokens in every layer of the PLM. By doing this, we obtain more tunable task-specific parameters to enhance per-task capacity while maintaining parameter efficiency (Li and Liang, 2021; Liu et al., 2022; Wang et al., 2022; Dong et al., 2023a). Moreover, deep prompt tuning achieves performance comparable to fine-tuning, outperforming methods that only add trainable continuous prompts to the input embedding layer (Lester et al., 2021; Liu et al., 2021), which underperform fine-tuning, especially when the model size is not large (Liu et al., 2022). The prompt tuning loss of Co2PT on the downstream task is denoted L_pt, e.g., cross-entropy loss for a classification task.
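As a rough functional sketch of this idea, trainable prompt vectors can be pictured as being prepended to the hidden states at every frozen layer. This is illustrative only: real deep prompt tuning prepends prompts to each layer's key/value inputs inside attention, and the toy layer below is a stand-in, not a transformer layer.

```python
def forward_with_deep_prompts(tokens, prompts, layers):
    """tokens: list of token vectors; prompts[k]: trainable prefix vectors
    for layer k; layers[k]: a frozen layer mapping a sequence to a sequence.
    Only the vectors in `prompts` would receive gradients."""
    hidden = tokens
    for k, layer in enumerate(layers):
        prefixed = prompts[k] + hidden              # prepend prefix at this layer
        hidden = layer(prefixed)[len(prompts[k]):]  # keep the real-token positions
    return hidden

# Toy frozen "layer": shifts every vector by the sequence mean, so the
# prefix vectors influence the outputs of the real tokens.
def toy_layer(seq):
    mean = [sum(v[d] for v in seq) / len(seq) for d in range(len(seq[0]))]
    return [[v[d] + mean[d] for d in range(len(v))] for v in seq]
```

Changing the prompt vectors changes the representations of the real tokens while the layer functions themselves stay untouched, which is the core of the parameter-efficiency argument.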
Counterfactual Pair Construction. The first key question is: how do we inject the debiasing capability into the continuous prompts? An unbiased model should make the same predictions independent of the bias-attribute term, so we apply counterfactual data augmentation to generate counterparts of training examples during prompt tuning. Concretely, let S represent the training corpus and let W = {(w_1, w_2, ..., w_m)_i}_{i=1}^{N} be a set of N bias-attribute term pairs. For each sentence s_i in S and each pair in W, we replace any bias-attribute term appearing in s_i with the term along the opposite bias direction. Take the binary-gender debiasing task shown in Figure 1 for example: the bias-attribute term pairs are {(man, woman), (he, she), ...}. The term "man" appears in the input sentence "The man is playing the piano". We replace it with "woman" while leaving non-attribute words unchanged, yielding the counterfactually augmented sentence "The woman is playing the piano", and vice versa. The counterfactual counterpart of the original sentence s_i is denoted s'_i.
Counterfactual Contrastive Learning. The counterfactual pair construction allows us to balance inputs containing bias-attribute terms. However, how can we ensure that the model generates consistent predictions for both s_i and s'_i, which share similar semantic meaning but differ in bias direction? To make the model generate predictions independent of biased attributes, it is important for sentences with similar semantics but along different bias directions to be closer in representation space (Cheng et al., 2021; He et al., 2022). We apply contrastive learning, whose objective is to obtain meaningful representations by bringing semantically similar neighbors closer and pushing apart dissimilar ones (Gao et al., 2021; Dong et al., 2023b; Li et al., 2023). In this work, the input sentence s_i and its counterpart s'_i are semantically related but lie in opposite bias directions. We let h_i and h'_i denote the representations of s_i and s'_i, concatenate each with the continuous prompt representation p, and treat the results as positive pairs. We then use the cross-entropy objective with in-batch negatives (Gao et al., 2021).
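The counterfactual swap described above can be sketched as follows. The term list here is an abbreviated, hypothetical example; a real pipeline also needs proper tokenization and care with morphologically ambiguous terms (e.g., "her" as object vs. possessive).

```python
import re

# Abbreviated, illustrative bias-attribute term pairs (w1, w2).
BIAS_ATTRIBUTE_PAIRS = [("man", "woman"), ("he", "she"), ("his", "her")]

# Bidirectional map: each term points to the term along the opposite
# bias direction, so a single pass swaps both groups at once.
SWAP = {}
for w1, w2 in BIAS_ATTRIBUTE_PAIRS:
    SWAP[w1], SWAP[w2] = w2, w1

def counterfactual(sentence: str) -> str:
    """Replace every bias-attribute term with its counterpart,
    leaving non-attribute words unchanged."""
    def repl(match):
        token = match.group(0)
        swapped = SWAP.get(token.lower())
        if swapped is None:
            return token
        # Preserve the capitalization of the original token.
        return swapped.capitalize() if token[0].isupper() else swapped
    return re.sub(r"[A-Za-z]+", repl, sentence)
```

For instance, `counterfactual("The man is playing the piano.")` produces the augmented counterpart "The woman is playing the piano." used as s'_i.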
The training objective for (h_i, h'_i) with a mini-batch of N pairs is:

L_cl = -(1/N) Σ_{i=1}^{N} log [ exp(sim(p ⊕ h_i, p ⊕ h'_i) / τ) / Σ_{j=1}^{N} exp(sim(p ⊕ h_i, p ⊕ h'_j) / τ) ]   (1)

where sim(x_i, y_i) is the cosine similarity of x_i and y_i, ⊕ is the concatenation of two representations, and τ is a temperature hyperparameter.
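This in-batch objective can be sketched in pure Python as below. Representations and the prompt vector are plain lists here for illustration; a real implementation would operate on encoder outputs in a tensor framework.

```python
import math

def _normalized_concat(p, h):
    """Concatenate prompt representation p with sentence representation h,
    then L2-normalize so the dot products below are cosine similarities."""
    v = p + h
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def contrastive_loss(H, H_prime, p, tau=0.05):
    """In-batch contrastive loss: (h_i, h'_i) are positives, and the other
    counterfactuals h'_j in the batch serve as negatives."""
    A = [_normalized_concat(p, h) for h in H]
    B = [_normalized_concat(p, h) for h in H_prime]
    total = 0.0
    for i, a in enumerate(A):
        sims = [sum(x * y for x, y in zip(a, b)) / tau for b in B]
        m = max(sims)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total -= sims[i] - log_denominator  # cross-entropy, target on diagonal
    return total / len(A)
```

The loss is small when each h_i is closest to its own counterfactual h'_i and grows when positives are misaligned within the batch.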
For counterfactual pairs (s_i, s'_i) in a single-sentence classification task, s_i is the original sentence from the training data and s'_i is the augmented sentence that has the same semantic meaning as s_i but lies in a different bias direction. For sentence-pair classification, as in the SNLI task with x_i as the premise and y_i as the hypothesis, s_i is the original premise-hypothesis pair (x_i, y_i) while s'_i is the counterfactually augmented premise-hypothesis pair (x'_i, y'_i). Similarly, the sentence representations are concatenated with the continuous prompts to calculate the contrastive loss through Equation 1.
Learning Objectives. Finally, the continuous prompts learn to debias by simultaneously optimizing the prompt tuning loss L_pt on the downstream task and the contrastive loss L_cl between the counterfactual pairs:

L = L_pt + α L_cl   (2)

where α is a tunable coefficient hyperparameter. As stated before, we only tune the parameters of the debiasing continuous prompts while keeping the parameters of the PLM frozen throughout training. After counterfactual contrastive prompt tuning, the debiasing knowledge is stored in the prompt parameters. This approach not only retains the knowledge within the original parameters of the PLM but is also flexible and adaptable to different downstream tasks. For example, we can train different prompts for different bias dimensions such as gender, race, and religion; these prompts can then be combined and applied to downstream tasks. Moreover, considering that prior research primarily concentrates on binary gender, this flexibility makes it efficient to extend to non-binary gender without re-training new debiased models.
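The joint objective in Equation 2 is a simple weighted sum; a minimal sketch (hypothetical function name — in practice l_pt comes from the task head and l_cl from Equation 1, and only prompt parameters receive gradients):

```python
def co2pt_objective(l_pt, l_cl, alpha=1.0):
    """Eq. (2): combine the downstream prompt tuning loss with the
    counterfactual contrastive loss; alpha weights the latter."""
    return l_pt + alpha * l_cl
```

Setting alpha = 0 recovers plain prompt tuning, while larger alpha puts more weight on pulling counterfactual pairs together.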

Experimental Setup
We design experiments to test the effectiveness of our proposed Co2PT approach by answering four questions: RQ1: Does Co2PT mitigate bias on downstream tasks effectively? RQ2: How do existing intrinsically debiased methods perform on downstream tasks when combined with Co2PT? RQ3: What impact do different modules have on the design of Co2PT? RQ4: How do hyperparameters affect Co2PT?

Bias Evaluation
Extrinsic bias benchmarks assess bias via the performance gap between different groups in downstream tasks. In this work, we evaluate Co2PT on three widely used extrinsic bias benchmarks: Bias-STS-B, Bias-NLI, and Bias-in-Bios.
Bias-STS-B (Webster et al., 2020) is adapted from the STS-B task, which requires models to predict the semantic similarity between pairs of sentences, to evaluate gendered correlations. Specifically, 276 sentences are collected from the test set as templates, and gendered terms (man, woman) and professional terms from Rudinger et al. (2018) are inserted into each template, forming 16,980 sentence pairs. For instance, if the template is "A man is walking", the sentence pairs are ("A man is walking", "A nurse is walking") and ("A woman is walking", "A nurse is walking"). A model unbiased with respect to gender terms should assign equal similarity scores to both pairs. We calculate the average absolute difference between the similarity scores of sentence pairs containing male and female terms, and how often the difference between "male" and "female" sentence pairs exceeds a threshold τ, reporting results for τ = 0.1 and τ = 0.3 (Webster et al., 2020). Lower values indicate less bias.
Bias-NLI (Dev et al., 2020) is a natural language inference dataset consisting of neutral sentence pairs to evaluate gender-occupation bias. It is constructed by populating the template "The subject verb a/an object", leading to 1,936,512 instances.
Concretely, the verb and object slots are filled with activities, e.g., "ate a bagel". Neutral entailment pairs are then created by filling the subject slot with an occupation term that has a strong gender correlation, e.g., "nurse", for the hypothesis and a gendered term, e.g., "woman", for the premise, resulting in the instance: "The woman ate a bagel; The nurse ate a bagel; neutral". Bias is defined as deviation from neutrality and measured by three metrics: (1) Net Neutral (NN): the average probability that the model assigns to the neutral label over all instances; (2) Fraction Neutral (FN): the percentage of instances for which the model predicts the neutral label; and (3) Threshold:τ (T:τ): the fraction of examples whose neutral probability is above τ. We report results for τ = 0.5 and τ = 0.7, following Lauscher et al. (2021) and He et al. (2022). All three metrics attain 1 for a bias-free model.
Bias-in-Bios (De-Arteaga et al., 2019) is a large-scale English dataset for studying gender bias in occupation classification, drawn from the Common Crawl corpus. We report the overall accuracy of the task as well as the accuracy breakdown by gender.
To quantify gender bias, we compute the difference in true positive rates (TPR) between genders across occupations, denoted GAP^TPR_g and defined as:

GAP^TPR_g = |TPR_g − TPR_~g|

where TPR_g represents the proportion of individuals correctly predicted given their gender g, and g and ~g are binary genders. Following Romanov et al. (2019), Ravfogel et al. (2020), and He et al. (2022), we also calculate the root mean square of the per-occupation TPR gender gap GAP^TPR_g,o over all occupations o:

GAP^RMS = sqrt( (1/|O|) Σ_{o∈O} (GAP^TPR_{g,o})² )

A value closer to 0 indicates a lower degree of bias.
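These two gaps can be sketched as below; the occupation labels and 0/1 gender encoding are illustrative, not the benchmark's actual format.

```python
def tpr_gaps(y_true, y_pred, gender):
    """Compute the overall TPR gender gap and the per-occupation RMS gap.
    y_true, y_pred: occupation labels; gender: 0/1 binary gender codes."""
    def tpr(indices):
        hits = [y_pred[i] == y_true[i] for i in indices]
        return sum(hits) / len(hits) if hits else 0.0

    idx = range(len(y_true))
    by_gender = {g: [i for i in idx if gender[i] == g] for g in (0, 1)}
    # Overall gap: |TPR_g - TPR_~g| across all predictions.
    gap_tpr = abs(tpr(by_gender[0]) - tpr(by_gender[1]))

    # RMS of the per-occupation gaps GAP^TPR_{g,o}.
    per_occ = []
    for o in sorted(set(y_true)):
        tprs = {g: tpr([i for i in by_gender[g] if y_true[i] == o]) for g in (0, 1)}
        per_occ.append(abs(tprs[0] - tprs[1]))
    gap_rms = (sum(d * d for d in per_occ) / len(per_occ)) ** 0.5
    return gap_tpr, gap_rms
```

Both values approach 0 when the classifier's true positive rate is the same for both genders, overall and within each occupation.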

Datasets and Setup
STS-B and SNLI. We fine-tune the models on the STS-B and SNLI training sets, pick the checkpoint that performs best on the validation set, and then evaluate bias using Bias-STS-B and Bias-NLI, respectively.
Bias-in-Bios. We use the same data as Ravfogel et al. (2020).

Baseline Models
We compare Co2PT with six upstream debiased models fine-tuned on the downstream tasks, and three baselines that fine-tune or prompt-tune BERT on the downstream tasks. ZariCDA (Webster et al., 2020) is pre-trained from scratch over counterfactual data augmented from English Wikipedia, and ZariDO (Webster et al., 2020) is additionally pre-trained with an increased dropout rate.

Implementation Details
For fine-tuning debiased baselines and vanilla BERT, we use the models released by the authors. We set the maximum sentence length to 128, the learning rate to 2e-5, and the batch size to 64, and train for 10 epochs. For AT+CDA, the learning rate is 2e-5 and the batch size is 128. For PT and Co2PT, the backbone model is bert-base-uncased with the learning rate set to 1e-2 and the prompt length set to 20, trained for 30 epochs with a batch size of 32. For the hyperparameters τ and α in Equations 1 and 2, we set τ = 0.05 and α = 1.0. All experiments are run on a single NVIDIA RTX A5000 24GB GPU. For each run, we save the model that performs best on the development set and evaluate it on the extrinsic benchmarks. We report the average results across three runs. All code and data are available at https://github.com/dongxiangjue/Co2PT; additional experimental details and standard deviations are in Appendix A and Appendix C, respectively.

Debiasing Effectiveness (RQ1)
We now investigate the effectiveness of Co2PT in mitigating bias on the three extrinsic bias benchmarks.
Bias-STS-B. First, we focus on the Bias-STS-B benchmark. As Table 2 indicates, Co2PT shows the lowest bias scores across all metrics and achieves downstream task performance similar to the other debiased baselines. We observe that some debiased models exhibit higher bias scores than the original BERT, indicating that debiased language models can relearn biases during fine-tuning on downstream tasks. For example, Auto-Debias, one of the state-of-the-art debiased models, demonstrates strong fairness on intrinsic benchmarks such as SEAT, yet scores 0.312 in average absolute difference, a higher bias level than the original BERT and most of the other baselines. On the other hand, MABEL, which shows strong performance in debiasing downstream tasks, achieves a competitive score of 0.081. Furthermore, compared to fine-tuning the original BERT model on the STS-B training set, PT results in a higher bias score of 0.321 versus 0.282. This suggests that while tuning only the prompt tokens is parameter-efficient, it can result in increased bias in the presence of an unbalanced dataset. For Co2PT, we observe a significant reduction in bias, with the average absolute difference decreasing from 0.321 to 0.058, the fraction of differences exceeding 0.1 from 0.749 to 0.167, and the fraction exceeding 0.3 from 0.369 to 0.005. These findings indicate a substantial improvement in the ability to mitigate bias.
Bias-NLI. Next, we focus on the Bias-NLI extrinsic benchmark shown in Table 3. Co2PT achieves better scores than the original BERT across all metrics, while the other baseline methods amplify biases during fine-tuning. Similarly, Auto-Debias performs well on the SEAT benchmark but experiences an increase in bias when applied to downstream tasks, mirroring the trend observed on the Bias-STS-B extrinsic benchmark. Moreover, ADELE, another parameter-efficient method, performs poorly in both bias mitigation and model accuracy. As on the Bias-STS-B extrinsic benchmark, PT amplifies biases during the tuning process, resulting in a decline in the NN score from 0.824 to 0.741 and the FN score from 0.868 to 0.812. By employing Co2PT, we observe significant improvements, with the NN score rising to 0.877 (from 0.741) and the FN score reaching 0.965 (from 0.812), indicating the effectiveness of Co2PT for bias mitigation.
Bias-in-Bios. Next, we show performance on the Bias-in-Bios benchmark in Table 4. Among all the baselines, ZariCDA achieves the lowest GAP^TPR score of 2.667 while BERT+CDA achieves the lowest GAP^RMS score of 0.113. Furthermore, PT exacerbates bias, increasing the GAP^TPR score from 2.822 to 3.171 and GAP^RMS from 0.119 to 0.129. In contrast, Co2PT reduces the GAP^TPR score from 3.171 to 2.537 and GAP^RMS from 0.129 to 0.123, demonstrating its effectiveness in mitigating bias in the occupation classification task.
Integrating Co2PT with Existing Debiased Models (RQ2)
One benefit of a debias-while-prompt-tuning method like Co2PT is that it can be easily integrated with existing upstream debiasing methods. Here we investigate the applicability of Co2PT to three existing debiased models to bridge the gap in utilizing upstream debiased models for downstream tasks.
Based on the comparison results before and after applying Co2PT, shown in Table 5, Co2PT significantly reduces the bias scores for Context-Debias and Auto-Debias: 0.088 versus 0.332, and 0.068 versus 0.312, respectively. For MABEL, which already achieves low bias scores, there is no significant effect on the bias score. Additionally, Co2PT improves model performance on the downstream tasks. These results clearly demonstrate the effectiveness of integrating Co2PT into established debiased models for downstream tasks, enabling existing debiased models to achieve strong performance while simultaneously maintaining a low bias level.
Impact of Design (RQ3)
We perform an extensive ablation study to show how different components affect Co2PT in Table 6. We use Bias-STS-B as the representative task for computational efficiency.
Impact of the counterfactual module. First, we perform counterfactual data augmentation on the training data containing bias-attribute terms and then conduct prompt tuning on these augmented pairs only (denoted PT+CDA). PT+CDA reduces the bias score from 0.321 in PT to 0.291, showing the effectiveness of the straightforward counterfactual data augmentation approach. However, the improvement is smaller than that of Co2PT, implying the necessity of the contrastive learning module.
Impact of the contrastive module. To investigate the impact of the contrastive module, instead of employing the constructed counterfactual sentence pairs as positive pairs for the contrastive loss, we use an unsupervised contrastive loss that encodes the same input twice to obtain two embeddings z, z' with different dropout masks (Gao et al., 2021) (denoted Co2PT+SCL_n). The bias score of 0.117 achieved by Co2PT+SCL_n is higher than that of Co2PT, indicating that incorporating a contrastive loss over non-augmented inputs in the training set is unnecessary.
Compare Co2PT with task-agnostic counterfactual pairs. To investigate whether integrating task-agnostic neutral entailment pairs can benefit debiasing on the task, we use the 142,158 gender-balanced entailment pairs augmented from the SNLI and MNLI datasets in He et al. (2022) as task-agnostic entailment pairs for the STS-B task instead of the task-specific counterfactual pairs augmented from the training set (denoted PT+NLI+CL). We notice that although PT+NLI+CL does not outperform Co2PT, it shows a strong ability to mitigate bias compared to other baseline methods. Thus, when working with a moderate amount of training data, it is better to use counterfactually augmented pairs from the training data.
Compare Co2PT with another contrastive objective. For sentence-pair classification tasks, we also explore a contrastive loss that encourages the inter-association of entailment pairs (He et al., 2022). For the original input pair (s_i1, s_i2) and its augmented pair (s'_i1, s'_i2), s_i1 and s_i2 are treated as positive pairs while s_i1 and s'_i2 and other in-batch s_j2 are negatives, and vice versa (denoted CL_p). When using task-specific counterfactual pairs, PT+CDA+CL_p decreases the bias score to 0.271. Similarly, using task-agnostic counterfactual pairs, PT+NLI+CL_p also reduces the bias score to 0.271. However, the bias mitigation effect is not as significant as that achieved by Co2PT, which indicates the effectiveness of the contrastive module in Co2PT.

Impact of Hyperparameters (RQ4)
Finally, we investigate the impact of three hyperparameters: (i) the continuous prompt length; (ii) the temperature τ of the contrastive loss L_cl; and (iii) the coefficient α of the total learning objective L.
Impact of prompt length. First, we experiment with the prompt length varying in {10, 20, 50}, as illustrated in Figure 2. Generally speaking, with more tunable prompt parameters, the model performs better on downstream tasks. In addition, when the prompt length is 10, Co2PT shows a higher increase in bias score compared to prompt lengths of 20 and 50. This indicates that a larger prompt length enables the model to reach better downstream performance more rapidly while still maintaining a lower bias score. However, using larger prompt lengths means tuning more parameters, posing a trade-off.
Impact of τ. Then, we vary the temperature τ in {0.005, 0.05, 0.5}. Figure 3 shows similarly significant bias mitigation when τ is set to 0.005 or 0.05, but less effectiveness when τ is 0.5. This observation implies that a higher temperature value corresponds to less weight on the cosine similarity calculation, resulting in decreased effectiveness in bias mitigation.
Impact of α. Last, we study the impact of the coefficient α, varying its value in {0.1, 0.5, 1.0}. Figure 4 shows that reducing the value of α at a constant τ, thus assigning less weight to the contrastive module, leads to decreased bias mitigation. This analysis underscores the importance of carefully selecting appropriate hyperparameters.

Conclusion and Future Work
We propose Co2PT, an efficient and effective debiasing method for mitigating bias in downstream tasks. We evaluate its effectiveness at bias mitigation and its applicability to existing debiased upstream models, and investigate how the design of each component and the selection of hyperparameters impact both its bias reduction capabilities and downstream task performance.
Mitigating non-gender and intersectional bias.
Mitigating non-gender biases is challenging, as some debiasing methods work well at reducing gender biases but generalize poorly to biases beyond gender (Meade et al., 2022). Without re-training the model, Co2PT can be flexibly applied to mitigate different bias types in downstream applications. One can train different debiasing prompts to tackle different bias dimensions such as gender, race, and religion. Furthermore, these debiasing prompts can be applied to mitigate intersectional bias by simply combining the corresponding prompts in downstream tasks.

Limitations
While this work primarily addresses bias in English, we acknowledge the presence of more complicated bias cases in other languages. Therefore, future exploration of existing methods, or the development of new techniques, to mitigate bias in other languages would be valuable. Furthermore, despite the efficiency and comparable performance of deep prompt tuning relative to fine-tuning, it still underperforms fine-tuning on certain datasets when the model size is small, which may also limit the performance of our method.

Ethics Statement
In this work, when investigating gender bias in pre-trained language models, we focus on the binary definition of gender as the targeted attribute of discrimination. However, it is important to acknowledge that future research should also consider non-binary genders and other multi-class scenarios to comprehensively address bias.
Figure 1: The overview of Co2PT. First, we construct counterfactual pairs from the training data. Then, we learn debiased continuous prompts by simultaneously optimizing the prompt tuning loss L_pt on downstream tasks and the contrastive loss L_cl between the counterfactual pairs.
PT adds continuous prompts to each layer of the model and then tunes the prompts on downstream tasks. The backbone models for ZariCDA and ZariDO are BERT-large-uncased, whereas the other baselines use BERT-base-uncased (Devlin et al., 2019).

Table 2: Evaluation on Bias-STS-B. †: results are reported from the ADELE model in the original paper; ⋆: backbone model is BERT-large-uncased.

Table 3: Evaluation on Bias-NLI. †: results are fine-tuned on MNLI and reported from the ADELE-TA model in the original paper; ‡: adapter tuning on counterfactually augmented data; ⋆: backbone model is BERT-large-uncased. Other baselines are fine-tuned on SNLI.

Table 4: Evaluation on Bias-in-Bios. ⋆: backbone model is BERT-large-uncased. The results of ADELE on this benchmark are not reported in the original paper.

Table 5: Performance of integrating Co2PT with debiased models on Bias-STS-B.

Table 6: Impact of different components.