Conditional Supervised Contrastive Learning for Fair Text Classification



Introduction
Recent progress in natural language processing (NLP) has led to its increasing use in various domains, such as machine translation, virtual assistants, and social media monitoring. However, studies have demonstrated societal bias in existing NLP models (Bolukbasi et al., 2016; Zhao et al., 2017; May et al., 2019; Bordia and Bowman, 2019; Hutchinson et al., 2020; Webster et al., 2020; de Vassimon Manela et al., 2021; Sheng et al., 2021). In one major NLP application, text classification, bias refers to the performance disparity of trained classifiers over different demographic groups, such as gender and ethnicity (Sun et al., 2019; Weidinger et al., 2021). Such bias poses potential risks: for example, toxicity classification models in online social media platforms show disparate performance across social groups, leading to increased silencing of underserved groups (Dixon et al., 2018; Blodgett et al., 2020).
Meanwhile, an increasing line of work in contrastive learning (CL) has led to significant advances in representation learning (Hadsell et al., 2006; Logeswaran and Lee, 2018; He et al., 2020; Henaff, 2020; Chen et al., 2020; Khosla et al., 2020; Gao et al., 2021b). The general idea of contrastive learning in these works is to learn representations such that similar examples stay close to each other while dissimilar ones are far apart. Inspired by those works, recent works (Shen et al., 2021; Tsai et al., 2021, 2022) also propose to leverage contrastive learning to learn fair representations in classification. However, these works either lack theoretical justifications for the proposed approaches or adopt demographic parity (Dwork et al., 2012) as the fairness criterion, which eliminates the perfect classifier in the common scenario where the base rates differ among demographic groups (Hardt et al., 2016; Zhao and Gordon, 2019).
In this work, we aim to mitigate bias in text classification models via contrastive learning. In particular, we adopt the fairness notion, equalized odds (EO) (Hardt et al., 2016), which asks for equal true positive rates (TPRs) and false positive rates (FPRs) across different demographic groups (Zhao et al., 2019a). Based on information-theoretic concepts, we bridge the problem of learning fair representations with equalized odds constraint with contrastive learning objectives. We then propose an algorithm, called conditional supervised contrastive learning, to learn fair text classifiers.
Empirically, we conduct experiments on two text classification datasets (e.g., toxic comment classification and biography classification) to show the proposed methods (1) can flexibly tune the trade-offs between main task performance and the fairness constraint; (2) achieve the best tradeoffs between main task performance and equalized odds compared to the existing bias mitigation approaches in text classification; (3) are stable to different hyperparameter settings, such as data augmentations, temperatures, and batch sizes. To the best of our knowledge, our work is the first to both theoretically and empirically study how to ensure the EO constraint via contrastive learning in text classification.

Background
We use X ∈ X and Y ∈ Y to denote the random variables for the input text and the categorical label for the main task, respectively. Furthermore, A ∈ A is the sensitive attribute (protected group) associated with the input text X (e.g., the gender information in the occupation classification task). The corresponding lowercase letters denote instantiations of the random variables. Given a text encoder f : X → Z (e.g., BERT (Devlin et al., 2019)) and a classifier g : Z → Y, we first transform the input text X into a latent representation Z via f, and Z is used to give a prediction Ŷ via g (i.e., X −f→ Z −g→ Ŷ). In the context of contrastive learning, data augmentation strategies have been widely adopted. Let T be a set of data augmentations and X′ be the augmented input given the data augmentation t(·): X′ = t(X), t ∼ T, where we assume that the augmentation t is sampled uniformly at random from T. Similarly, we have X′ −f→ Z′ −g→ Ŷ′. Let H denote the entropy and I denote the mutual information, e.g., H(Z | Z′, Y) is the conditional entropy of Z given Z′ and Y, and I(Z′; Z | Y) is the conditional mutual information of Z′ and Z given Y. Due to the space limit, we refer readers to Cover (1999) for more background on the related notions (entropy and mutual information) in information theory.
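Since conditional mutual information is the key quantity throughout the paper, a small self-contained numerical illustration may help; the function and the toy distributions below are our own, not from the paper:

```python
import numpy as np

def cond_mutual_info(joint):
    """I(X; Y | Z) in nats from a joint pmf p(x, y, z) given as a 3-D array."""
    p_z = joint.sum(axis=(0, 1))          # p(z)
    p_xz = joint.sum(axis=1)              # p(x, z)
    p_yz = joint.sum(axis=0)              # p(y, z)
    cmi = 0.0
    for x in range(joint.shape[0]):
        for y in range(joint.shape[1]):
            for z in range(joint.shape[2]):
                p = joint[x, y, z]
                if p > 0:
                    cmi += p * np.log(p * p_z[z] / (p_xz[x, z] * p_yz[y, z]))
    return cmi

# X and Y conditionally independent given Z: p(x,y,z) = p(z) p(x|z) p(y|z)
p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[0.9, 0.1], [0.2, 0.8]])  # rows indexed by z
p_y_given_z = np.array([[0.3, 0.7], [0.6, 0.4]])
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_given_z, p_y_given_z)
print(round(cond_mutual_info(joint), 10))  # → 0.0
```

Conditional independence (the analogue of Z ⊥ A | Y later in the paper) is exactly the case where this quantity vanishes.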
We assume there is a joint distribution over X, Y, and A from which the data are sampled. Figure 1 shows the graphical model of the dependencies between the input variables and outputs. We also assume that the sensitive attribute A is available only during model training and not during the testing phase. As a result, any post-processing methods that leverage sensitive attributes for bias mitigation during the testing phase are not feasible in our setting.

Figure 1: Graphical model of the dependencies between input variables and outputs. Note that we only assume there is a joint distribution over X, Y, and A from which the data are sampled, so the figure only shows one case of the dependencies over X, Y, and A.

In this work, we use equalized odds, a more refined fairness criterion for classification problems. At a high level, EO asks the model prediction to be independent of the sensitive attribute conditioned on the task label. If a model perfectly satisfies equalized odds, the differences in true positive rates and false positive rates across demographic groups will be 0. Equivalently, it also implies I(Ŷ; A | Y) = 0. Consider online comment toxicity classification as a real-world example motivating the use of EO as a notion of fairness. In this case, false positive cases (benign text comments marked as toxic) can be seen as unintentional censoring, and false negative cases (toxic text comments marked as benign) might result in debates and discomfort (Baldini et al., 2021).
In contrast to another well-known group fairness definition, i.e., demographic parity, EO does not require positive prediction rates to be the same across different demographic groups, which could possibly severely downgrade the model performance when the sensitive attribute is correlated to the task label (Hardt et al., 2016;Zhao and Gordon, 2019).
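As an illustration, the per-group TPR/FPR gaps that equalized odds constrains can be computed directly from model predictions. This is a minimal sketch for the binary task; the function and variable names are ours:

```python
import numpy as np

def eo_gaps(y_true, y_pred, groups):
    """Per-group TPR/FPR gaps relative to the overall rates (binary task)."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    def tpr(mask): return y_pred[mask & (y_true == 1)].mean()
    def fpr(mask): return y_pred[mask & (y_true == 0)].mean()
    all_mask = np.ones_like(y_true, dtype=bool)
    tpr_all, fpr_all = tpr(all_mask), fpr(all_mask)
    return {a: (tpr(groups == a) - tpr_all, fpr(groups == a) - fpr_all)
            for a in np.unique(groups)}
```

Perfect equalized odds corresponds to every gap in the returned dictionary being 0.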

Our Method
In this section, we first theoretically connect learning fair representations with contrastive learning (Sec. 3.1). In particular, we first show that learning fair representations for equalized odds requires the minimization of I(Z′; Z | Y) and the simultaneous maximization of I(Z′; Z | A, Y). To this end, we provide an upper bound of I(Z′; Z | Y) and a lower bound of I(Z′; Z | A, Y) to relax the original objective and then establish a relationship between the bounds and the (conditional) supervised contrastive learning objectives. Finally, inspired by our theoretical analysis, we design two practical methods for learning fair representations (Sec. 3.2). Due to the space limit, we defer all detailed proofs to Appendix A.

Connections between Contrastive Learning and Learning Fair Representations
In order to learn a model (text encoder followed by classifier) to satisfy equalized odds, we aim to learn a latent representation Z such that Z ⊥ A | Y . From an information-theoretic perspective, it suffices to minimize the conditional mutual information I(Z; A | Y ) to ensure EO due to the celebrated data-processing inequality. We identify a connection between contrastive learning and learning fair representations when the representations enjoy certain benign structures. Next, we formally state the assumptions to characterize such a structure.
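The data-processing step alluded to above can be written out in one line; this is our sketch of the standard argument, not a derivation taken verbatim from the paper:

```latex
% Conditioned on Y, \hat{Y} = g(Z) is a (deterministic) function of Z, so
% A \to Z \to \hat{Y} forms a Markov chain given Y, and the data-processing
% inequality gives
I(\hat{Y}; A \mid Y) \;\le\; I(Z; A \mid Y).
% Hence I(Z; A \mid Y) = 0, i.e., Z \perp A \mid Y, implies
% I(\hat{Y}; A \mid Y) = 0, which is exactly equalized odds.
```

This is why it suffices to constrain the representation Z rather than the prediction Ŷ directly.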
Assumption 3.1. Let Z and Z′ be the corresponding features from X and X′, respectively. We assume that there exists a small positive constant ε such that max{H(Z | Z′, Y), H(Z′ | Z, Y)} ≤ ε.

At a high level, Assumption 3.1 says that the learned features from the contrastive learning procedure are well conditionally aligned (Wang and Isola, 2020). Specifically, given the label of a feature and its corresponding augmented feature, it is relatively easy to infer the corresponding positive pair used in the contrastive learning procedure. Note that the conditional entropy can be understood as the minimum inference error from this perspective (Farnia and Tse, 2016). Under Assumption 3.1, we provide the following lemma to characterize the relationship between Z, Z′, A, and Y in terms of (conditional) mutual information.
Lemma 3.1. Under Assumption 3.1, given a set of data augmentations T, let X′ be the augmented input data, where X′ = t(X), t ∼ T. Assuming the Markov chains X −f→ Z −g→ Ŷ and X′ −f→ Z′ −g→ Ŷ′, we have

I(Z; A | Y) ≤ I(Z′; Z | Y) − I(Z′; Z | A, Y) + 2ε.

Lemma 3.1 indicates that we can minimize I(Z; A | Y) by simultaneously minimizing I(Z′; Z | Y) and maximizing I(Z′; Z | A, Y). In what follows, we present an upper (lower) bound used to minimize (maximize) the respective term and connect the bounds with contrastive learning objectives. We first provide an upper bound of I(Z′; Z | Y).

Proposition 3.1. Given the assumptions in Lemma 3.1, we have

I(Z′; Z | Y) ≤ E_{p(z′,z,y)}[log p(z′ | z, y)] − E_{p(y)} E_{p(z′|y)} E_{p(z|y)}[log p(z′ | z, y)].

In order to better interpret the right-hand side in Proposition 3.1, we define a similarity function s(z′, z; y) between z′ and z for each y and assume s(z′, z; y) ∝ p(z′ | z, y) (i.e., the more similar z′ and z are in the latent space given task label y, the more likely z′ is generated from z via data augmentation) 2 . With this assumption, the upper bound provided in Proposition 3.1 implies that I(Z′; Z | Y) can be minimized by encouraging similarity between any latent representations given the same task label, which is consistent with the goal of the supervised contrastive loss (Khosla et al., 2020). Formally, given a batch of augmented examples {(x_i, y_i, a_i)}_{i=1}^{2N}, where the last half of the examples are the augmented views of the first half and share the same task labels (as well as the same sensitive attributes), i.e., x_{N+i} = t(x_i) and t ∼ T, let N_{y_i} be the total number of examples in the batch that have the same task label as y_i. The supervised contrastive loss then takes the following form:

L_sup = Σ_{i=1}^{2N} (1 / N_{y_i}) Σ_{j=1}^{2N} 1{i ≠ j, y_i = y_j} ℓ_ij,

and ℓ_ij is defined as

ℓ_ij = −log [ exp(s_f(x_i, x_j) / τ) / Σ_{k=1}^{2N} 1_{i≠k} exp(s_f(x_i, x_k) / τ) ],

where τ is the temperature parameter, 1_{i≠k} = 1{i ≠ k} and 1{·} is the indicator function, and the similarity function is s_f(x_i, x_j) = s(f(x_i), f(x_j)).

Next, we provide a lower bound of I(Z′; Z | A, Y).

Proposition 3.2. Given the assumptions in Lemma 3.1, we have

I(Z′; Z | A, Y) ≥ sup_s E_{(z′_i, z_i) ∼ p(z′, z | a, y), z_{j≠i} ∼ p(z | a, y)^{⊗(N−1)}} [ log ( exp(s(z′_i, z_i)) / ((1/N) Σ_{j=1}^{N} exp(s(z′_i, z_j))) ) ],

where p(·)^{⊗N} denotes the probability distribution of N independent examples and s(·, ·) is any similarity function that measures the similarity of z′_i and z_i.
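A minimal numpy sketch of a supervised contrastive loss in the spirit of Khosla et al. (2020) may make L_sup concrete. This is our own illustration (normalization over positives is averaged here; the paper's exact constants may differ):

```python
import numpy as np

def supcon_loss(z, y, tau=0.5):
    """Supervised contrastive loss over a batch of features z (shape [B, d])
    with task labels y (shape [B]); positives are all other batch examples
    sharing the same task label."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize features
    sim = z @ z.T / tau                               # s_f(x_i, x_k) / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude k = i
    log_denom = np.log(np.exp(sim).sum(axis=1))       # log sum_k exp(...)
    loss, n = 0.0, z.shape[0]
    for i in range(n):
        pos = (y == y[i]) & (np.arange(n) != i)       # j != i with y_j = y_i
        if pos.any():
            loss += np.mean(log_denom[i] - sim[i, pos])  # average ell_ij
    return loss / n
```

The loss decreases as same-label representations are pulled together and different-label representations are pushed apart.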
Then, given a batch of augmented examples with size 2N (constructed as above), we can formulate the conditional supervised contrastive objective as

L_CS-InfoNCE = −(1 / 2N) Σ_{i=1}^{2N} ℓ_i,

and ℓ_i is defined as

ℓ_i = log [ exp(s_f(x_i, x′_i) / τ) / ((1 / N_{a_i,y_i}) Σ_{j=1}^{2N} 1{a_j = a_i, y_j = y_i} exp(s_f(x_i, x_j) / τ)) ],

where N_{a_i,y_i} is the total number of examples in the batch that have the same task label and sensitive attribute as y_i and a_i, and x_i and x′_i are the different views of the same example.
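A companion numpy sketch of a group-conditional InfoNCE objective: the positive is the augmented view, and the normalizing pool is restricted to examples sharing both the task label and the sensitive attribute. This reflects our reading of the text, not the authors' exact implementation:

```python
import numpy as np

def cs_infonce_loss(z, z_aug, y, a, tau=0.5):
    """Conditional supervised InfoNCE: for each example, the positive is its
    augmented view z_aug[i]; the denominator averages over batch examples
    sharing BOTH the same task label y and sensitive attribute a."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_aug = z_aug / np.linalg.norm(z_aug, axis=1, keepdims=True)
    n, loss = z.shape[0], 0.0
    for i in range(n):
        grp = (y == y[i]) & (a == a[i])            # same-(a, y) pool
        pos = np.exp(z[i] @ z_aug[i] / tau)        # augmented view of x_i
        neg = np.exp(z[i] @ z[grp].T / tau).mean() # group-conditional denom
        loss -= np.log(pos / neg)
    return loss / n
```

When all same-(a, y) representations coincide with their augmented views, the ratio inside the log is 1 and the loss is 0, matching the interpretation below.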
Interpretation of L_sup and L_CS-InfoNCE in learning fair representations. In learning fair representations, the role of L_sup is to learn aligned and uniform representations (Wang and Isola, 2020) for each task label, while the role of L_CS-InfoNCE is to encourage the dissimilarity of different examples that share the same task labels and sensitive attributes. In an ideal case where L_sup = 0, all data points sharing the same task label collapse to a single point in the latent space, and the perfect representations are learned. In this case, L_CS-InfoNCE = 0 as well. In practice, the combined effect of L_sup and L_CS-InfoNCE encourages the similarity of examples having the same task label but belonging to different groups. Thus, our theory could also explain why the slightly different contrastive objectives proposed in concurrent work (Park et al., 2022) can also mitigate equalized odds violations. In Appendix C.2, we provide a T-SNE visualization (Van der Maaten and Hinton, 2008) of the text embeddings under different training objectives to help better understand our methods.

Practical Implementations
The existing contrastive representation learning approaches fall into two categories: two-stage methods (Khosla et al., 2020; Chen et al., 2020) and one-stage methods (Gunel et al., 2021; Cui et al., 2021). Two-stage methods first pretrain the encoder using the contrastive objective, then fix the encoder and fine-tune the classifier using cross-entropy (CE) loss in the second stage. One-stage methods train both encoder and classifier end-to-end using CE loss and contrastive loss. Following these settings, we also implement our methods in both ways. For the two-stage CL method, we first pretrain the text encoder using the following loss function in the first stage:

L = L_sup + λ L_CS-InfoNCE,   (3)

then we fix the pretrained encoder and fine-tune the classifier using CE loss. Note that λ ≥ 0 controls the intensity of L_CS-InfoNCE. For the one-stage CL method, similar to Gunel et al. (2021), we formulate the loss function as:

L = (1 − γ) L_CE + γ (L_sup + λ L_CS-InfoNCE),   (4)

where γ ∈ [0, 1] controls the relative weight of L_sup compared to L_CE. The major advantage of our approach is that it can be directly substituted into existing NLP pipelines that use the "pretrain-and-finetune" paradigm popularized by large language models such as BERT. NLP practitioners can swap the fair CL finetuner into these pipelines to boost model fairness at low cost, with robust behavior against hyperparameter choices (see Sec. 4.2). Whereas large language models made it simple to build models with high performance, fair CL makes it simple to build models with high performance and fairness.
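The two weighting schemes can be sketched as plain functions. The exact form of Eq. (4) here follows our reading of the text (γ trading off the contrastive terms against cross-entropy, as in Gunel et al. (2021)); the individual loss values are assumed to be computed elsewhere:

```python
def two_stage_pretrain_loss(l_sup, l_cs_infonce, lam):
    """Eq. (3): encoder pre-training objective of two-stage CL; lam >= 0
    controls the intensity of the conditional supervised InfoNCE term."""
    return l_sup + lam * l_cs_infonce

def one_stage_loss(l_ce, l_sup, l_cs_infonce, lam, gamma):
    """Eq. (4), as we read it: joint objective of one-stage CL; gamma in
    [0, 1] trades off the contrastive terms against cross-entropy."""
    return (1.0 - gamma) * l_ce + gamma * (l_sup + lam * l_cs_infonce)
```

With γ = 0 the one-stage objective reduces to plain CE training; with γ = 1 it reduces to the two-stage pre-training objective.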

Experiments
In this section, we conduct experiments to investigate the following research questions: RQ 1. How can we control the trade-offs between model classification performance and fairness via conditional supervised contrastive learning?
RQ 2. How do conditional supervised contrastive learning methods perform in terms of tradeoffs between model performance and fairness compared to other in-processing bias mitigation methods in text classification?
RQ 3. Is conditional supervised contrastive learning sensitive to hyperparameter changes?

Experimental Setup
Datasets. We perform experiments using the following two datasets (see Appendix B for more details of the datasets and the data preprocessing pipelines): • Jigsaw-toxicity 3 is a dataset for online comment toxicity classification. The main task of the dataset is to determine whether an online comment is toxic, and we use "race and ethnicity" as the sensitive attribute (e.g., whether the "black" identity is mentioned in the comment text or not).
• Biasbios (De-Arteaga et al., 2019) is a dataset for occupation classification. The main task of the dataset is to determine the people's occupations given their biographies.
Evaluation metrics. We measure EO fairness via the gaps ∆_TPR^a and ∆_FPR^a, where TPR_a (FPR_a) is the true positive rate (false positive rate) for sensitive attribute a and TPR_overall (FPR_overall) is the overall true positive rate (false positive rate). Following Pruksachatkun et al.
Implementations and Baselines. In our experiments, we use BERT (Devlin et al., 2019) (bert-base-uncased as the text encoder followed by a two-layer MLP as the classifier) 4 . As suggested by previous works (Khosla et al., 2020; Gao et al., 2021b), the performance of contrastive learning is closely related to the choice of the following hyperparameters: (1) temperature, (2) (pre-training) batch size, and (3) data augmentation strategy. Thus, we conduct a grid-based hyperparameter search over temperature τ ∈ {0.1, 0.5, 1.0, 2.0}, (pre-training) batch size bsz ∈ {32, 64, 128, 256}, and data augmentation strategy t ∈ {EDA, back translation, CLM insert, CLM substitute} (see Appendix B for detailed descriptions of the augmentation strategies) for both two-stage CL and one-stage CL. We also conduct a grid search over γ ∈ {0.1, 0.3, 0.7, 0.9} in Eq. (4) for one-stage CL. In Appendix B, we provide the remaining hyperparameter details (e.g., learning rate, training epochs, optimizer). Since it is not feasible to train large language models with large batch sizes via contrastive objectives given limited GPU memory, we use the gradient cache technique (Gao et al., 2021a) to adapt our implementations to limited GPU memory settings.
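The search space above can be enumerated directly; `train_and_eval` (fitting the model and returning validation metrics) is a hypothetical stand-in, not shown:

```python
from itertools import product

# Grid over the three CL-sensitive hyperparameters described in the text.
taus = [0.1, 0.5, 1.0, 2.0]
batch_sizes = [32, 64, 128, 256]
augmentations = ["EDA", "back_translation", "CLM_insert", "CLM_substitute"]

grid = list(product(taus, batch_sizes, augmentations))
# for tau, bsz, aug in grid:
#     f1, eo_gap = train_and_eval(tau=tau, batch_size=bsz, augmentation=aug)
```

This yields 4 × 4 × 4 = 64 configurations per λ value, which the gradient cache technique makes tractable at the larger batch sizes.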
We compare our methods with the following baselines, which have been empirically demonstrated to be effective for bias mitigation in text classification: (1) Adversarial training (Elazar and Goldberg, 2018): Following the encoder + classifier setting, adversarial training leverages a discriminator to learn latent representations oblivious to the sensitive attribute. Note that the original adversarial training method is tailored for demographic parity, and it is well known that demographic parity and equalized odds are incompatible given different base rates (Kleinberg et al., 2017; Ball-Burack et al., 2021).
To this end, we use the conditional learning techniques (Madras et al., 2018;Zhao et al., 2019a) to adapt adversarial training for equalized odds.
(2) Adversarial training with diverse adversaries (diverse adversaries) (Han et al., 2021): Adversarial training with diverse adversaries improves adversarial training by using an ensemble of discriminators and encourages the discriminators to learn orthogonal representations. As with adversarial training, we also apply the conditional learning techniques when learning the adversarial discriminators.
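To make the conditional adversarial baseline concrete, here is a hedged PyTorch sketch (our own illustration, not the authors' code) of gradient reversal with one discriminator per task label, which is one common way to condition the adversary on Y:

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on the
    backward pass, so the encoder is trained to FOOL the discriminator."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

# One discriminator per task label y approximates the conditional criterion
# I(Z; A | Y) = 0; the exact conditioning of Madras et al. / Zhao et al.
# may differ from this sketch.
num_classes, dim = 2, 16
discriminators = nn.ModuleList(
    [nn.Linear(dim, 2) for _ in range(num_classes)])  # predict A from Z

def adversarial_loss(z, y, a, lamb=1.0):
    loss = 0.0
    for c in range(num_classes):
        mask = (y == c)
        if mask.any():
            logits = discriminators[c](GradReverse.apply(z[mask], lamb))
            loss = loss + nn.functional.cross_entropy(logits, a[mask])
    return loss
```

Adding this loss to the main-task loss pushes the encoder toward representations from which A is hard to predict within each class.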
(3) Iterative null-space projection (INLP) (Ravfogel et al., 2020): Given a pretrained text encoder (we use CE loss to pretrain the text encoder and drop the prediction head using the validation set), INLP learns a linear guarding layer on top of the pretrained text encoder to filter out the sensitive information, after which the classifier is fine-tuned on top of the pretrained encoder and the guarding layer. INLP learns the linear guarding layer by iteratively projecting the parameter matrices of linear classifiers (e.g., SVM) onto their null spaces. The training data for these linear classifiers are the latent representations of the input texts together with the sensitive attributes. In order to tailor INLP for equalized odds, Ravfogel et al. (2020) learn the linear classifier on the data from the same class in each round.
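A toy numpy sketch of the INLP idea (our illustration; the paper uses SVMs, while least squares stands in here): repeatedly fit a linear predictor of the sensitive attribute and project the representations onto its null space:

```python
import numpy as np

def nullspace_projection(w):
    """Orthogonal projection onto the null space of direction w (shape (d,))."""
    w = w / np.linalg.norm(w)
    return np.eye(w.size) - np.outer(w, w)

def inlp(Z, A, n_iters=5):
    """Toy INLP: each round fits a linear predictor of the sensitive
    attribute A from the current representations Z @ P and composes the
    projection onto its null space into P."""
    P = np.eye(Z.shape[1])
    for _ in range(n_iters):
        Zp = Z @ P
        # least-squares direction predicting (centered) A, standing in
        # for the SVM weight vector used in the paper
        w, *_ = np.linalg.lstsq(Zp, A - A.mean(), rcond=None)
        if np.linalg.norm(w) < 1e-8:
            break
        P = P @ nullspace_projection(w)
    return P
```

After applying the returned projection, linear predictability of A from the representations is (weakly) reduced each round, which is the guarding-layer effect INLP relies on.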
We also use training with CE loss as a baseline. Except for INLP, all methods we test in our experiments train the text encoder (e.g., BERT) directly, while INLP is a post-hoc debiasing method given a text encoder. In a sense, INLP is orthogonal to the other methods, since it tries to remove group-specific information after the representations are learned, while the other methods learn fair representations directly. We run each experiment with five different seeds and report the mean and standard deviation for each evaluation metric.

Results and Analysis
RQ 1. In order to control the trade-offs between model classification performance and EO fairness, we vary the values of λ in Eq. (3) and Eq. (4). Figure 2 shows the classification performance and EO fairness of one-stage and two-stage CL as λ changes. Overall, as λ increases, the equalized odds gaps shrink at the cost of model classification performance. Compared to one-stage CL, two-stage CL achieves more flexible trade-offs in general. Given the same range of λ, the change in equalized odds gaps for two-stage CL is more significant than for one-stage CL, while the corresponding model classification performance is comparable or remains better.

RQ 2.
We study the trade-offs between model performance and EO fairness of our proposed methods compared to the baselines. Figure 3 displays the performance and fairness of these methods under different hyperparameter settings for the jigsaw and biasbios datasets (trade-off parameters for all methods are described in more detail in Appendix B). Among all methods, we find that two-stage CL and INLP achieve the best performance-fairness trade-offs. In the biasbios dataset, two-stage CL and INLP achieve similar trade-offs, with two-stage CL achieving more consistent results (i.e., lower variance). In the jigsaw dataset, two-stage CL achieves more flexible trade-offs, as it reaches the highest model performance. Moreover, when F1 scores are around 0.58, two-stage CL also achieves more consistent results and a lower EO gap, while INLP performs better when F1 scores are between 0.62 and 0.64. We note that the effectiveness of INLP depends heavily on its pretrained encoder (see Appendix C.3 for the effects of different pre-training strategies for the text encoder in INLP), and a slight change in the text encoder can lead to a significant difference in the results, whereas CL-based methods train the text encoder directly to ensure EO fairness, and we demonstrate that they are stable under hyperparameter changes (see RQ 3 below).
In comparison, the adversarial-training-based methods are relatively more unstable and consistently perform worse than CL-based methods and INLP, especially in the biasbios dataset. Furthermore, both adversarial-training-based methods and INLP introduce additional model components (e.g., adversarial networks in adversarial-training-based methods and linear guarding layer in INLP) during training or inference, which complicates the actual implementation of the whole pipeline. In contrast, CL-based methods are well-suited to pre-training and fine-tuning paradigms in NLP applications.
RQ 3. We have shown that two-stage CL performs better than one-stage CL in RQ 1 and RQ 2. Thus, we choose two-stage CL to test its sensitivity to key hyperparameter changes. As mentioned above, the performance of contrastive learning is closely related to the temperature, the (pre-training) batch size, and the data augmentation strategy, so we study whether two-stage CL is sensitive to these hyperparameters. Figure 4 shows the model performance and EO fairness of two-stage CL under different hyperparameter settings with λ ∈ {0.0, 5.0} in the biasbios dataset (see Figure 9 in Appendix C.1 for the jigsaw dataset). We see that two-stage CL is stable under a wide range of parameter settings: the equalized odds gaps consistently decrease when λ = 5.0, while the F1 scores remain relatively high. Compared to previous work, ours uses equalized odds as the fairness criterion. To the best of our knowledge, our work is the first to connect the problem of learning fair representations with contrastive learning to ensure the EO constraint and to explore its effectiveness for bias mitigation in text classification with large language models (e.g., BERT).

Conclusion
In this paper, we theoretically and empirically study how to leverage contrastive learning for fair text classification. Inspired by our theoretical results, we propose conditional supervised contrastive objectives to learn aligned and uniform representations while mixing the representation of different examples that share the same sensitive attribute for every task label. We conduct experiments to demonstrate the effectiveness of our algorithms in learning fair representations for text classification and show that our methods are stable in different hyperparameter settings. In the future, we plan to extend our algorithms to the settings of intersectional bias (Kearns et al., 2018;Yang et al., 2020).

Limitations
Like most prior work (Ravfogel et al., 2020; Tsai et al., 2022), we conduct experiments on a binary sensitive attribute. In particular, we acknowledge that, due to the limitations of the dataset, our analysis of gender bias only considers binary gender, which is not ideal (Dev et al., 2021). One interesting future direction is to extend our method to ensure fairness for intersectional groups (Kearns et al., 2018; Yang et al., 2020). In principle, our theory also holds for intersectional bias. However, the disproportionate distributions of intersectional sensitive attributes might pose challenges in sampling negative examples for L_CS-InfoNCE. One possible solution is to use a memory bank to sample negative examples (Wu et al., 2018). We leave this analysis as future work, as it is an important question that warrants an independent study.

Ethics Statement
This work aims at bias mitigation for text classification. Like other bias mitigation methods, it could help increase people's trust in NLP models. For example, our methods could help reduce unintentional censoring (false positive cases) and debates or discomfort (false negative cases) in online comment toxicity classification (see Section 2 for more details). Our study targets equalized odds and does not capture all notions of bias (e.g., individual fairness) in text classification. These issues are universal to bias mitigation techniques and not particular to our use case.

A.1 Proof of Lemma 3.1

Proof. By the definition of conditional mutual information:

Next, we prove the opposite side, which completes the proof.
A.2 Proof of Proposition 3.1

Proof.
where the second line follows from the fact that the marginal of a joint distribution can be expressed as the expectation of the corresponding conditional distribution, and the third line follows from Jensen's inequality.
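The chain of steps described above can be written out explicitly; this is our hedged reconstruction of the omitted derivation, consistent with the bound in Proposition 3.1:

```latex
\begin{aligned}
I(Z'; Z \mid Y)
&= \mathbb{E}_{p(z',z,y)}\!\left[\log \frac{p(z' \mid z, y)}{p(z' \mid y)}\right] \\
&= \mathbb{E}_{p(z',z,y)}\!\left[\log p(z' \mid z, y)\right]
 - \mathbb{E}_{p(z',y)}\!\left[\log \mathbb{E}_{p(z \mid y)}\, p(z' \mid z, y)\right] \\
&\le \mathbb{E}_{p(z',z,y)}\!\left[\log p(z' \mid z, y)\right]
 - \mathbb{E}_{p(y)}\, \mathbb{E}_{p(z' \mid y)}\, \mathbb{E}_{p(z \mid y)}\!\left[\log p(z' \mid z, y)\right].
\end{aligned}
```

The second line rewrites p(z′ | y) as an expectation of the conditional, and the inequality is Jensen's: log E ≥ E log, applied with a negative sign.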

A.3 Proof of Proposition 3.2
The proof techniques used in Proposition 3.2 follow those of Proposition 2.4 in Tsai et al. (2021), which can be traced back to Oord et al. (2018) and Poole et al. (2019). To make the paper self-contained, we include all the details of the lemmas needed to reach the final result.
The proof of Proposition 3.2 depends on Lemmas A.1-A.5, shown in Figures 5 and 6, as well as Proposition A.1 in Figure 7. Finally, we present the proof of Proposition 3.2 in Figure 8.

B.1 Datasets

Jigsaw. Our first dataset, which we refer to as jigsaw, is a corpus of comments from an online forum, each associated with a toxicity rating. jigsaw's main task is binary classification: given a "toxicity" score in the range [0, 1] assigned to each comment, we determine whether the score is greater than or equal to 0.5. Each comment is also annotated with "identity" labels, indicating whether identities belonging to specific demographic groups are mentioned in the comment. We focus on the identity labels related to "race or ethnicity" and binarize the identity labels into black and non-black. Note that there are other sensitive attributes in the Jigsaw-Toxicity dataset; we constrain the scope of our study to the "race" attributes present in text classification datasets. We follow Koh et al. (2021) to perform the train/val/test splits. The data with "race or ethnicity" identity labels are split into training, validation, and test sets, summarized in Table 1.
Bias-in-Bios. To measure model fairness and performance in the multi-class classification setting, we use the professional biographies dataset of De-Arteaga et al. (2019), which we refer to as the biasbios dataset. The data consist of nearly 400,000 online biographies collected from the Common Crawl corpus. Each biography is annotated with one of 28 professions to which its subject belongs. The data are mapped to a binary gender based on the occurrence of gendered pronouns and are scrubbed to exclude the authors' names and pronouns. It is worth noting that mapping gender to binary labels is a strongly simplifying assumption made to map data cleanly to a demographic label; it ignores people who do not identify as female or male, as well as the complexity of gender identity more generally. We refer readers to the original work (De-Arteaga et al., 2019) for further discussion of these issues. For our experiments, we predict the profession as our task label while protecting the gender attribute. We replicate the splits of biasbios used by Ravfogel et al. (2020), which are summarized in Table 2.

B.2 Detailed Implementations and Hyperparameter Settings
In this section, we provide more details on our implementations and give the hyperparameters we use in our experiments. We first detail how we tune each method's performance and fairness trade-offs.
• One-stage / Two-stage CL. For one- and two-stage CL, once we determine the best classification performance by conducting a grid search over temperature, (pre-training) batch size, and data augmentation strategy (as well as γ in one-stage CL), we only tune the parameter λ described in Sec. 3.2, which controls the trade-off between the supervised contrastive loss L_sup and the conditional supervised InfoNCE loss L_CS-InfoNCE. For two-stage CL, we set the pre-training epochs to 15 and 25 for jigsaw and biasbios, respectively, and stop pre-training early if there is no improvement on the validation set for three consecutive epochs.
• Diverse adversarial training. Following Han et al. (2021), we use an ensemble of three adversarial discriminators and the same adversarial network architecture. There are two hyperparameters of interest: λ_diff and λ_adv. λ_diff is a difference-loss hyperparameter that encourages the discriminators to learn orthogonal representations; λ_adv controls the trade-off between task performance and fairness. We first do a grid search over λ_diff ∈ {0, 100, 1000, 5000} and then vary the values of λ_adv to determine the best hyperparameter configuration.
• Adversarial training. The implementation is nearly identical to diverse adversarial training, except that there is just one adversarial discriminator.
• INLP. Following Ravfogel et al. (2020), we use the weights of an SVM classifier as the parameters of the linear guarding layer.

Lemma A.1 (Nguyen et al., 2010). Let Z be the sample space for Z′ and Z, s : Z × Z → R be any function, and P and Q be probability measures over Z × Z. We have

D_KL(P ∥ Q) = sup_s E_{(z′,z)∼P}[s(z′, z)] − E_{(z′,z)∼Q}[exp(s(z′, z))] + 1.

Proof. We first take the second-order functional derivative of the objective, −exp(s(z′, z)) · dQ, which is negative and implies that the objective admits a supremum. Next, we set the first-order functional derivative of the objective to zero: dP − exp(s(z′, z)) · dQ = 0.
Reorganizing the equation above, we get the optimal similarity function s*(z′, z) = log(dP/dQ). Plugging it into the original objective yields D_KL(P ∥ Q), completing the proof.

Lemma A.2 (Four-variable variant of Lemma A.1). Let Z be the sample space for Z′ and Z, Y be the sample space for Y, A be the sample space for A, s : Z × Z × Y × A → R be any function, and P and Q be probability measures over Z × Z × Y × A. Then the analogous variational identity holds.

Proof. The proof technique is identical to that of Lemma A.1; the only difference is that the similarity function takes four variables as input.
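A quick numerical sanity check of the variational form in Lemma A.1 for discrete distributions (our own check, not from the paper): with the optimal critic s*(x) = log(p(x)/q(x)), the objective E_P[s*] − E_Q[exp(s*)] + 1 recovers D_KL(P ∥ Q) exactly, since E_Q[exp(s*)] = Σ q · (p/q) = 1:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))           # D_KL(P || Q)
s_star = np.log(p / q)                   # optimal critic from the proof
variational = np.sum(p * s_star) - np.sum(q * np.exp(s_star)) + 1.0

assert np.isclose(kl, variational)
```

Any suboptimal critic gives a strictly smaller value, which is what makes the supremum a lower bound usable for contrastive estimation.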
where the first line follows from the fact that D_KL(P ∥ Q) is a constant, the second line follows from Lemma A.1, and the third line follows from the fact that (z′, z_1) and (z′, z_{2:N}) are interchangeable when sampling from Q. Thus, for any similarity function s, we have

sup_s E_{(z′,z_1)∼P, (z′,z_{2:N})∼Q^{⊗(N−1)}} [ log ( exp(s(z′, z_1)) / ((1/N) Σ_{j=1}^{N} exp(s(z′, z_j))) ) ] ≤ D_KL(P ∥ Q).

Lemma A.4.
Proof. We use Lemma A.1 and substitute P and Q with P Z ′ ,Z and E P A,Y [P Z ′ |A,Y P Z|A,Y ], respectively.
Proof. We use Lemma A.2 and substitute P and Q with P_{Z′,Z,A,Y} and P_{A,Y} P_{Z′|A,Y} P_{Z|A,Y}, respectively.

Proposition A.1.
Proof. We have the claimed chain of equations, where the first equation follows from Lemma A.4. Let s*(z′, z) be the function for which the supremum is achieved and let ŝ*(z′, z, a, y) = s*(z′, z), ∀ (a, y); then the last equation follows from Lemma A.5.

Proof. Define two probability measures P = P_{Z′,Z} and Q = E_{P_{A,Y}}[P_{Z′|A,Y} P_{Z|A,Y}]; then the second equation follows from Lemma A.3 and the last equation follows from Proposition A.1.

Table 4 summarizes the trade-off hyperparameter choices for the biasbios dataset. The remaining hyperparameters for all methods are listed in Table 5.

B.3 Data Augmentation Strategies
In this section, we provide a description of the data augmentation strategies used in CL-based methods 5 .
• Easy data augmentation (EDA)(Wei and Zou, 2019). EDA consists of four simple operations: synonym replacement, random insertion, random swap, and random deletion. Following the suggestions provided by the original paper, we choose the augmentation ratio to be 0.1 and create four augmented examples per example.
• Back translation (Edunov et al., 2018). It first translates the input example to another language and back to English. We use the machine translation model wmt19-en-de in our experiment.
• Word replacement using a contextual language model (CLM substitute) (Kobayashi, 2018). It replaces words based on a language model that leverages contextual word embeddings to find the most similar word for augmentation. We use the RoBERTa-base language model and an augmentation rate of 0.1.
• Word insertion using a contextual language model (CLM insert) (Kobayashi, 2018). It inserts words based on a language model that leverages contextual word embeddings to find the most similar word for augmentation. We use the RoBERTa-base language model and an augmentation rate of 0.1.

Table 3: Trade-off hyperparameters tested for RQ2 (Figure 3) for the jigsaw dataset.
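Two of the four EDA operations described above are simple enough to sketch without any external resources; this toy function (ours, not the EDA reference implementation) shows random swap and random deletion only:

```python
import random

def eda_swap_delete(tokens, alpha=0.1, seed=0):
    """Toy sketch of two of the four EDA operations (random swap and random
    deletion); full EDA (Wei and Zou, 2019) additionally performs synonym
    replacement and random insertion, which require a synonym resource."""
    rng = random.Random(seed)
    toks = list(tokens)
    for _ in range(max(1, int(alpha * len(toks)))):   # random swaps
        i, j = rng.randrange(len(toks)), rng.randrange(len(toks))
        toks[i], toks[j] = toks[j], toks[i]
    kept = [t for t in toks if rng.random() > alpha]  # random deletion
    return kept if kept else [rng.choice(toks)]       # never return empty
```

With the augmentation ratio of 0.1 used in our experiments, roughly 10% of tokens are perturbed per augmented copy.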

C.1 More Comments for CL-based Methods
Our method achieves highly consistent results w.r.t. fairness and performance compared to the baseline methods. Figure 9 visualizes the model performance and EO gaps of two-stage CL under different hyperparameter settings when λ ∈ {0.0, 2.0} in the jigsaw dataset.

C.2 Visualization of the BERT Embeddings using Different Objectives
In Figure 10, we show the T-SNE visualization (Van der Maaten and Hinton, 2008) of text embeddings learned with different training objectives. We can see that both CE-trained and CL-trained embeddings capture the class information well (points with the same markers form their own clusters). However, points with the same sensitive attributes (the same colors) within the same class are more likely to form small clusters. When we introduce L_CS-InfoNCE, those points tend to become more aligned.
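The visualization step itself is standard; here is a hypothetical sketch with stand-in embeddings (the paper uses BERT features, with colors/markers encoding sensitive attribute and class), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

# Stand-in "text embeddings" for two classes; in practice these would be
# the encoder outputs on validation examples.
rng = np.random.default_rng(0)
emb = np.concatenate([rng.normal(0.0, 1.0, (30, 16)),   # stand-in class 0
                      rng.normal(3.0, 1.0, (30, 16))])  # stand-in class 1
emb_2d = TSNE(n_components=2, perplexity=10,
              random_state=0).fit_transform(emb)
```

The resulting 2-D coordinates are then scatter-plotted with per-class markers and per-group colors, as in Figure 10.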

C.3 How Do Different Pretrained Text Encoders Affect the Performance of INLP?
To provide the clearest comparison between our proposed methods and the baselines, we used the best settings for the baseline methods we could attain. Nonetheless, we observed that the performance of INLP was highly sensitive to the encoder training settings, which could be an important practical consideration for practitioners selecting between different ways of improving model fairness. Figure 11 compares the performance and fairness of INLP using different encoder pre-training strategies. We see that in both datasets, the classification and fairness performance of INLP changes drastically even with the same value of the trade-off parameter. Even when training with the same objective (CE loss), the text encoders obtained in different epochs after convergence greatly affect its performance. For example, the CE-trained encoder obtained in the last epoch of training shows nearly no effect on bias mitigation. If we do not train the text encoder on our datasets and directly use the parameters of bert-base-uncased (the experimental setting of the previous work (Ravfogel et al., 2020)), the model performance drastically decreases as the training iterations of INLP increase. Lastly, INLP does not perform well when the text encoder is trained using the supervised contrastive loss. In comparison, our methods are more robust to hyperparameter changes.

Figure 10: T-SNE visualization of text embeddings using different training objectives (zoom in for better visualization) in the biasbios dataset: CE-trained embedding, two-stage CL-trained embedding with λ = 0.0, and two-stage CL-trained embedding with λ = 5.0. Different colors indicate different sensitive attributes (e.g., red for males and green for females), and different markers indicate different classes. CE-trained and CL-trained embeddings capture the class information well (points with the same markers form their own clusters). However, points with the same sensitive attributes within the same class are more likely to form small clusters. When we introduce L_CS-InfoNCE, those points tend to become more aligned.