A Novel Information-Theoretic Objective to Disentangle Representations for Fair Classification

One of the pursued objectives of deep learning is to provide tools that learn abstract representations of reality from the observation of multiple contextual situations. More precisely, one wishes to extract disentangled representations which are (i) low dimensional and (ii) whose components are independent and correspond to concepts capturing the essence of the objects under consideration (Locatello et al., 2019b). One step towards this ambitious project consists in learning disentangled representations with respect to a predefined (sensitive) attribute, e.g., the gender or age of the writer. Perhaps one of the main applications of such disentangled representations is fair classification. Existing methods extract the last layer of a neural network trained with a loss composed of a cross-entropy objective and a disentanglement regularizer. In this work, we adopt an information-theoretic view of this problem, which motivates a novel family of regularizers that minimizes the mutual information between the latent representation and the sensitive attribute conditionally to the target. The resulting set of losses, called CLINIC, is parameter-free and thus easier and faster to train. CLINIC losses are studied through extensive numerical experiments involving over 2k trained neural networks. We demonstrate that our methods offer a better disentanglement/accuracy trade-off than previous techniques, and generalize better than training with the cross-entropy loss alone, provided that the disentanglement task is not too constraining.


Introduction
There has been a recent surge of interest in disentangled representation techniques in deep learning (Mathieu et al., 2019; Locatello et al., 2019a, 2020; Gabbay and Hoshen, 2019). Learning disentangled representations from high dimensional data ultimately aims at separating a few explanatory factors (Bengio et al., 2013) that contain meaningful information on the objects of interest, regardless of specific variations or contexts. A disentangled representation has the major advantage of being less sensitive to accidental variations (e.g., style) and thus generalizes well. In this work, we focus on a specific disentanglement task which aims at learning a representation independent from a predefined attribute S. Such a representation will be called disentangled with respect to S. This task can be seen as a first step towards the ideal goal of learning a perfectly disentangled representation. Moreover, it is particularly well-suited for fairness applications, such as fair classification, which are nowadays increasingly sought after. When a learned representation Z is disentangled from a sensitive attribute S such as the age or the gender, any decision rule based on Z is independent of S, making it fair in some sense. Learning disentangled representations with respect to a sensitive attribute is challenging, and previous works in the Natural Language Processing (NLP) community were based on two types of approaches. The first one consists in training an adversary (Elazar and Goldberg, 2018; Coavoux et al., 2018). Despite encouraging results during training, Lample et al. (2018) show that a new adversary trained from scratch on the obtained representation is able to infer the predefined attribute, suggesting that the representation is in fact not disentangled. The second one consists in training a variational surrogate (Cheng et al., 2020; Colombo et al., 2021c; John et al., 2018) of the mutual information (MI) (Cover and Thomas, 2006) between the learned representation and the variable from which one wishes to disentangle it. One of the major weaknesses of both approaches is the presence of an additional optimization loop designed to learn extra parameters during training (the parameters of the adversary for the first method, and the parameters required to approximate the MI for the second one), which is time-consuming and requires careful tuning (see Alg. 1).
Our contributions. We introduce a new method to learn disentangled representations, with a particular focus on fair representations in the context of NLP. Our contributions are two-fold: (1) We provide new perspectives on the problem of disentanglement with respect to a predefined attribute using information-theoretic concepts. Our analysis motivates the introduction of new losses tailored for classification, called CLINIC (Conditional mutuaL InformatioN mInimization for fair ClassifIcAtioN). CLINIC is faster than previous approaches as it does not require learning additional parameters. One of its main novelties is to minimize the MI between the latent representation and the sensitive attribute conditionally to the target, which leads to high disentanglement capability while maintaining high predictive power.
(2) We conduct extensive numerical experiments which illustrate and validate our methodology. More precisely, we train over 2K neural models on four different datasets and conduct various ablation studies using both Recurrent Neural Networks (RNN) and pre-trained transformer (PT) models. Our results show that the CLINIC objective is better suited than existing methods: it is faster to train and requires less tuning as it does not have learnable parameters. Interestingly, in some scenarios, it can increase both disentanglement and classification accuracy, thus overcoming the classical disentanglement-accuracy trade-off. From a practical perspective, our method is well-suited for fairness applications and is compliant with the fairness through unawareness principle, which obliges one to fix a single model that is later applied across all groups of interest (Gajane and Pechenizkiy, 2017; Lipton et al., 2018). Indeed, our method only requires access to the sensitive attribute during its learning phase, but the resulting learned prediction function does not take the sensitive variable as an input.

Related Works
In order to describe existing works, we begin by introducing some useful notations. From an input textual data represented by a random variable (r.v.) X ∈ X, the goal is to learn a parameter θ ∈ Θ of an encoder f_θ : X → Z ⊂ R^d so as to transform X into a latent vector Z = f_θ(X) of dimension d that summarizes the useful information of X. Additionally, we require the learned embedding Z to be guarded (following the terminology of Elazar and Goldberg (2018)) from an input sensitive attribute S ∈ S associated to X, in the sense that no classifier can predict S from Z better than a random guess. The final decision is made through a predictor g_ϕ that outputs a prediction Ŷ = g_ϕ(Z) ∈ Y, where ϕ ∈ Φ refers to the learned parameters. We will consider classification problems where Y is a discrete finite set.

Disentangled Representations
The main idea behind most previous works on learning disentangled representations consists in adding a disentanglement regularizer to a learning task objective. The resulting mainstream loss takes the form

L(θ, ϕ) = L_task(θ) + λ · R(θ, ϕ),     (1)

where L_task is the objective of the main task, R is the disentanglement regularizer weighted by λ > 0, and ϕ denotes the trainable parameters of the regularizer. Let us describe the two main types of regularizers used in the literature, both having the flaw of requiring a nested loop, which adds extra complexity to the training procedure (see Alg. 1). Adversarial losses. They rely on fooling a classifier (the adversary) trained to recover the sensitive attribute S from Z. As a result, the corresponding disentanglement regularizer is trained by relying on the cross-entropy (CE) loss between the predicted sensitive attribute and the ground truth label S. Despite encouraging results, adversarial methods are known to be unstable both in terms of training dynamics (Sridhar et al., 2021; Zhang et al., 2019) and initial conditions (Wong et al., 2020).
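To make the nested loop of Alg. 1 concrete, the sketch below shows one training step for a loss of the form of Eq. 1 with an adversarial regularizer. The module names, the optimizer split and the number of inner steps are illustrative assumptions rather than the exact setup of the cited works.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(encoder, classifier, adversary,
                              enc_opt, adv_opt, x, y, s,
                              lam=1.0, inner_steps=5):
    """One update of a loss of the form L_task + lam * R (Eq. 1).

    The adversary must first be (re-)fitted in an inner loop so that its
    cross-entropy reflects how much information about S is left in Z;
    this inner loop is the extra optimization that CLINIC avoids.
    enc_opt is assumed to optimize the encoder and classifier parameters,
    adv_opt the adversary parameters.
    """
    # Inner loop: train the adversary to recover S from the current Z.
    for _ in range(inner_steps):
        z = encoder(x).detach()               # freeze the encoder
        adv_loss = F.cross_entropy(adversary(z), s)
        adv_opt.zero_grad()
        adv_loss.backward()
        adv_opt.step()

    # Outer step: train encoder + classifier while fooling the adversary.
    z = encoder(x)
    task_loss = F.cross_entropy(classifier(z), y)      # L_task
    regularizer = -F.cross_entropy(adversary(z), s)    # R: maximize adversary error
    loss = task_loss + lam * regularizer
    enc_opt.zero_grad()
    loss.backward()
    enc_opt.step()
    return loss.item()
```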

Losses based on Mutual Information (MI).
These losses, which also rely on learned parameters, aim at minimizing the MI between Z and S, defined by

I(Z; S) = E_{p_ZS} [ log ( p_ZS(Z, S) / ( p_Z(Z) p_S(S) ) ) ],     (2)

where the joint probability density function (pdf) of the tuple (Z, S) is denoted p_ZS and the respective marginal pdfs are denoted p_Z and p_S. Recent MI estimators include MINE (Belghazi et al., 2018), NWJ (Nguyen et al., 2010), CLUB (Cheng et al., 2020), DOE (McAllester and Stratos, 2020), I_α (Colombo et al., 2021c) and SMILE (Song and Ermon, 2019).

Parameter Free Estimation of MI
The CLINIC objective can be seen as part of this second type of losses, although it does not involve additional learnable parameters. MI estimation can be done using contrastive learning surrogates (Chopra et al., 2005), which offer satisfactory approximations with theoretical guarantees (we refer the reader to Oord et al. (2018) for further details). Contrastive learning is connected to the triplet loss (Schroff et al., 2015) and has been used to tackle various problems including self-supervised or unsupervised representation learning (e.g., audio (Qian et al., 2021), image (Yamaguchi et al., 2019), text (Reimers and Gurevych, 2019; Logeswaran and Lee, 2018)). It consists in bringing closer pairs of similar inputs, called positive pairs, and pushing apart dissimilar ones, called negative pairs. The positive pairs can be obtained by data augmentation techniques (Chen et al., 2020) or using various heuristics (e.g., similar sentences belonging to the same document (Giorgi et al., 2020), backtranslation (Fang et al., 2020) or more complex techniques (Qu et al., 2020; Gillick et al., 2019; Shen et al., 2020)). For a deeper dive into mining techniques used in NLP, we refer the reader to Rethmeier and Augenstein (2021).
One novelty of CLINIC is to provide an information-theoretic objective tailored for fair classification. It incorporates both the sensitive and target labels in the disentanglement regularizer.

Fair Classification and Disentanglement
The increasing use of machine learning systems in everyday applications has raised many concerns about the fairness of the deployed algorithms. Works addressing fair classification can be grouped into three main categories, depending on the step at which the practitioner performs a fairness intervention in the learning process: (i) pre-processing (Brunet et al., 2019; Kamiran and Calders, 2012), (ii) in-processing (Colombo et al., 2021c; Barrett et al., 2019) and (iii) post-processing (d'Alessandro et al., 2017) techniques (we refer the reader to Caton and Haas (2020) for an exhaustive review).
When the attribute for which we want to disentangle the representation is a sensitive attribute (e.g., gender, age, race), our method can be considered an in-processing fairness technique.

Model and Training Objective
In this section, we introduce CLINIC, the new set of losses designed to learn disentangled representations. We begin with information-theoretic considerations which allow us to derive a training objective, and then discuss the relation to existing losses relying on MI.
When learning disentangled representations, the goal is to obtain a representation Z that contains no information about a sensitive attribute S but preserves the maximum amount of information between Z and the target label Y. We use Venn diagrams in Fig. 1 to illustrate the situation. Notice that, for a given task, the MI between Y and S is fixed (i.e., corresponding to C_Y ∩ C_S). Therefore, any representation Z that maximizes the MI with Y (i.e., corresponding to C_Y ∩ C_Z) cannot hope to have a mutual information with S lower than I(Y; S).
Informally, recalling that Z = f_θ(X), we would like to solve

max_{θ∈Θ}  I(Z; Y ) − λ · I(Z; S),     (3)

where λ > 0 controls the magnitude of the penalization. Existing works (Elazar and Goldberg, 2018; Barrett et al., 2019; Coavoux et al., 2018) rely on the CE loss to maximize the first term, i.e., the area of C_Z ∩ C_Y in Fig. 1, and either on adversarial or contrastive methods to minimize the second term, i.e., the area of C_Z ∩ C_S in Fig. 1. The ideal objective is to maximize the area of (C_Z ∩ C_Y) \ C_S. We refer to Colombo et al. (2021c) for connections between adversarial learning and MI, and to Oord et al. (2018) for connections between contrastive learning and MI.
Fig. 1: Venn diagram visualization.

Limitations of Previous Methods
Since I(Z; S) = I(Z; S|Y ) + I(Z; S; Y ), when minimizing I(Z; S), previous works actually minimize both terms I(Z; S|Y ) and I(Z; S; Y ). This can be problematic since minimizing I(Z; S; Y ) tends to decrease the MI between Z and Y, a phenomenon we would like to avoid in order to keep high performance on our target task. Our method bypasses this issue by minimizing the conditional mutual information I(Z; S|Y ) solely (this amounts to minimizing the area of (C_Z ∩ C_S) \ C_Y in Fig. 1).
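The decomposition above is the standard identity defining the interaction information, which can be written in several equivalent ways:

I(Z; S; Y ) := I(Z; S) − I(Z; S|Y ) = I(Z; Y ) − I(Z; Y |S) = I(S; Y ) − I(S; Y |Z).

The second equality makes the issue explicit: pushing I(Z; S; Y ) down also pushes I(Z; Y ) down unless I(Z; Y |S) increases accordingly, which is why minimizing the full I(Z; S) can degrade the main task, whereas minimizing I(Z; S|Y ) alone does not force this trade-off.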

CLINIC
Motivated by the previous analysis, CLINIC aims at maximizing the following new ideal objective:

max_{θ∈Θ}  I(Z; Y ) − λ · I(Z; S|Y ).     (4)

Minimizing I(Z; S|Y ) in Eq. 4 instead of I(Z; S) in Eq. 3 alleviates the previously identified flaws.

Estimation of I(Z; S|Y )
The estimation of MI related quantities is known to be difficult (Pichler et al., 2020;Paninski, 2003).
As a result, to estimate I(Z; S|Y ), we develop a tailor-made contrastive learning objective that yields a parameter-free estimator. Let us describe the general form of the loss we adopt on a given batch {x_i, y_i, s_i}_{1≤i≤B} of size B. Recall that z_i = f_θ(x_i) is the output of the encoder for input x_i. For each 1 ≤ i ≤ B, in the fair classification task we have access to two subsets P(i), N(i) ⊂ {1, . . . , B} \ {i} corresponding respectively to the positive and negative indices of examples. More precisely, P(i) (resp. N(i)) corresponds to the set of indices j ≠ i such that z_j is similar (resp. dissimilar) to z_i. Then, CLINIC consists in minimizing a loss of the form of Eq. 1, with L_task given by the CE between the predictions g_ϕ(z_i) and the ground-truth labels y_i, and with R given by an aggregation over the batch of per-sample contrastive contributions C_i (Eq. 5), each C_i contrasting the positives in P(i) against the negatives in N(i). As emphasized in Eq. 5, the term R depends on several hyperparameters: the choice of positive and negative examples (P(i), N(i)), the associated temperatures τ_p, τ_n > 0 and the batch size B. Compared to Eq. 1, the proposed regularizer does not require any additional trainable parameters ϕ.
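Since the precise form of C_i is given in Eq. 5, the following sketch should only be read as one plausible instantiation of the regularizer R: an InfoNCE-style ratio contrasting P(i) against N(i) with two temperatures. The cosine similarity, the averaging over the batch and the exact weighting are assumptions made for illustration, not the definitive implementation.

```python
import torch
import torch.nn.functional as F

def clinic_regularizer(z, pos_sets, neg_sets, tau_p=0.5, tau_n=0.5):
    """Parameter-free contrastive regularizer in the spirit of Eq. 5.

    z        : (B, d) batch of encoder outputs z_i = f_theta(x_i)
    pos_sets : list of index lists, pos_sets[i] = P(i)
    neg_sets : list of index lists, neg_sets[i] = N(i)
    The InfoNCE-style ratio below is an assumption about Eq. 5: the text
    only specifies that C_i depends on P(i), N(i), tau_p and tau_n.
    """
    z = F.normalize(z, dim=-1)          # cosine similarities
    sim = z @ z.t()
    contributions = []
    for i, (P, N) in enumerate(zip(pos_sets, neg_sets)):
        if not P or not N:
            continue
        pos = torch.exp(sim[i, P] / tau_p).sum()
        neg = torch.exp(sim[i, N] / tau_n).sum()
        contributions.append(-torch.log(pos / (pos + neg)))
    if not contributions:
        return z.new_zeros(())
    return torch.stack(contributions).mean()
```

In a training loop, this term would simply be added to the CE task loss with weight λ, as in Eq. 1, without any inner optimization loop.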

Choice of Hyperparameters
Sampling strategy for P and N. The choice of positive and negative samples is instrumental for contrastive learning (Wu et al., 2021; Karpukhin et al., 2020; Chen et al., 2020; Zhang and Stratos, 2021; Robinson et al., 2020). In the context of fair classification, the input data take the form (x_i, s_i, y_i) and we consider two natural strategies to define the subsets P and N. For any given y (resp. s), we denote by ȳ (resp. s̄) a uniformly sampled label in Y \ {y} (resp. in S \ {s}).
Remark 1. It is usual in fairness applications to consider that the sensitive attribute is binary. In that case, s̄ is deterministic.
The first strategy (S_1) is to take [...]. The second strategy (S_2) is to set [...] (a hypothetical instantiation of these sets is sketched after the temperature discussion below).

Influence of the temperature. As discussed in Wang and Liu (2021) and Wang and Isola (2020) in the case where τ_p = τ_n, a good choice of the temperature parameter is crucial for contrastive learning.
Our method offers additional versatility by allowing one to fine-tune two temperature parameters, τ_p and τ_n, respectively corresponding to the weight one wishes to put on positive and negative examples.
For instance, a choice of τ_n ≪ 1 tends to focus on hard negative pairs, while τ_n ≫ 1 makes the penalty uniform among the negatives. We investigate this effect in Ssec. 6.2.
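To make the role of P(i), N(i) and the two temperatures concrete, here is the hypothetical construction referenced above. The specific choice shown (positives share the target label but differ on the sensitive attribute, negatives share both) is an illustrative assumption and not necessarily the exact definition of S_1 or S_2; conditioning both sets on the target label is what ties the regularizer to I(Z; S|Y ) rather than I(Z; S).

```python
def build_pairs(y, s):
    """Hypothetical construction of P(i) and N(i) from labels (y_i, s_i).

    Positives: same target label y, different sensitive label s.
    Negatives: same target label y, same sensitive label s.
    The exact sets used by strategies S_1 and S_2 may differ from this
    illustrative choice; only the structure (index sets per sample) is
    what the regularizer of Eq. 5 consumes.
    """
    B = len(y)
    pos_sets, neg_sets = [], []
    for i in range(B):
        P = [j for j in range(B) if j != i and y[j] == y[i] and s[j] != s[i]]
        N = [j for j in range(B) if j != i and y[j] == y[i] and s[j] == s[i]]
        pos_sets.append(P)
        neg_sets.append(N)
    return pos_sets, neg_sets
```

The resulting index sets can be fed directly to a contrastive term with temperatures τ_p and τ_n, as in the sketch of the previous section.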
Influence of the batch size. Previous works on contrastive losses (Henaff, 2020; Oord et al., 2018; Bachman et al., 2019; Mitrovic et al., 2020) argue for using large batch sizes to achieve good performance. In practice, hardware limits the maximum number of samples that can be stored in memory. Although several works (He et al., 2020; Gao et al., 2021) have been conducted to go beyond this memory limitation, every experiment we conducted was performed on a single GPU. Nonetheless, we provide an ablation study with respect to admissible batch sizes in Sssec. A.4.1.

Theoretical Guarantees
There exists a theoretical bound relating the contrastive loss of Eq. 5 to the mutual information between two probability laws in the latent space, defined according to the sampling strategy for P and N. CLINIC's training objective thus offers theoretical guarantees when approximating I(Z; S|Y ) in Eq. 4. Formally, strategies S_1 and S_2 aim at minimizing the distance between [...] and between [...], respectively.
We prove the following result in ??.
Theorem 1. For ϵ ∈ {0, 1}, denote by [...]. Then, it holds that [...].

Remark 2. Th. 1 offers theoretical guarantees that our adaptation of contrastive learning in Eq. 5 is a good approximation of I(Z; S|Y ) in Eq. 4.
Remark 3. To simplify the exposition, we restricted ourselves to binary Y and S. The general case would involve quantities L_{y,s} with y ∈ Y and s ∈ S.

Experimental Setting
In this section, we describe the experimental setting, which includes the datasets, the metrics and the different baselines we consider. Due to space limitations, details on hyperparameters and neural network architectures are gathered in Ap. C. To ensure fair comparisons, we re-implement all the models in a unified framework.

Datasets
We use the DIAL dataset (Blodgett et al., 2016) to ensure backward comparison with previous works (Colombo et al., 2021c; Xie et al., 2017; Barrett et al., 2019). We additionally report results on TrustPilot (TRUST) (Hovy et al., 2015), which has also been used in Coavoux et al. (2018). Tweets from the DIAL corpus have been automatically gathered, and labels for both polarity (is the expressed sentiment positive or negative?) and mention (is the tweet conversational?) are available. The sensitive attribute related to race (is the author non-Hispanic black or non-Hispanic white?) has been inferred from both the author's geo-location and the vocabulary used. For TRUST, the main task consists in predicting a sentiment rating on a five-point scale. The dataset is filtered so that only examples containing both the author's birth date and gender are kept, and the splits follow Coavoux et al. (2018). These variables are used as sensitive information. To obtain binary sensitive attributes, we follow Hovy and Søgaard (2015), where age is binned into two categories (i.e., age under 35 and age over 45).
A word on the sensitive attribute inference. Notice that these two datasets are balanced with respect to the chosen sensitive attributes (S), which implies that a random guess has 50% accuracy.

Metrics
Previous works on learning disentangled representations rely on two metrics to assess performance.
Disentanglement is measured by reporting the accuracy of an adversary trained from scratch to predict the sensitive labels from the latent representation. Since both datasets are balanced, a perfectly disentangled representation corresponds to an adversary accuracy of 50%. Success on the main classification task is measured with accuracy (higher is better). As we are interested in controlling the desired degree of disentanglement (Colombo et al., 2021c), we report the trade-off between these two metrics for different models when varying the λ parameter, which controls the magnitude of the regularization (see Eq. 1). For our experiments, we choose λ ∈ {0.001, 0.01, 0.1, 1, 10}. GAP: To assess fairness, we adopt the approach introduced by Ravfogel et al. (2020), which involves computing the root mean square of GAP_TPR across all main classes. For GAP, lower is better.
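For concreteness, the following sketch reflects our reading of the GAP metric of Ravfogel et al. (2020): the per-class gap is the absolute difference in true positive rate between the two sensitive groups, aggregated by root mean square over the main classes. The function and variable names are ours, not part of the original implementation.

```python
import numpy as np

def tpr_gap_rms(y_true, y_pred, s):
    """RMS over main classes of the TPR gap between two sensitive groups.

    y_true, y_pred : arrays of gold and predicted main-task labels
    s              : array of binary sensitive attributes (0/1)
    """
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    gaps = []
    for y in np.unique(y_true):
        tprs = []
        for group in (0, 1):
            mask = (y_true == y) & (s == group)
            if mask.sum() == 0:
                break
            tprs.append((y_pred[mask] == y).mean())  # TPR for class y in this group
        if len(tprs) == 2:
            gaps.append(abs(tprs[0] - tprs[1]))
    return float(np.sqrt(np.mean(np.square(gaps)))) if gaps else 0.0
```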

Baselines
Losses. To compare CLINIC with previous works, we compare against adversarial training (ADV) (Elazar and Goldberg, 2018; Coavoux et al., 2018) and the recently introduced mutual information upper bound I_α (Colombo et al., 2021c), which has been shown to offer more control over the degree of disentanglement than previous estimators. We also compare CLINIC with the approach of Chi et al. (2022); Shen et al. (2021); Gupta et al. (2021); Shen et al. (2022), which estimates I(Z; S) (see Eq. 3). Note that this baseline, denoted S_0, does not incorporate information on Y. Encoders. To provide an exhaustive comparison, we work with both RNN-based and PT-based (e.g., BERT (Devlin et al., 2018)) architectures. Contrary to previous works that use frozen PT (Ravfogel et al., 2020), we fine-tune the encoder during training and evaluate our methods on various types of encoders (e.g., DISTILBERT (DIS.) (Sanh et al., 2019), ALBERT (ALB.) (Lan et al., 2019), SQUEEZEBERT (SQU.) (Iandola et al., 2020)). These models are selected based on efficiency.

Numerical Results
In this section, we gather experimental results on the fair classification task. Because of space constraints, additional results can be found in Ap. A.

Overall Results
We report in Tab. 1 the best model on each dataset for each of the considered methods. In Tab. 1, each row corresponds to a single λ, which controls the weight of the regularizer (see Eq. 1). Global performance. For each dataset, we report the performance of a model trained without disentanglement regularization (CE rows in Tab. 1).
Results indicate that this model relies on the sensitive attribute S to perform the classification task. In contrast, all disentanglement techniques reduce the predictability of S from the representation Z. Among these techniques, we observe that I_α improves upon ADV, as already pointed out in Colombo et al. (2021c). Our CLINIC-based methods outperform both the ADV and I_α baselines, suggesting that contrastive regularization is a promising line of research for future work in disentanglement.
Comparing the strategies of CLINIC. Among the three considered sampling strategies for positives and negatives, S_1 and S_2 are the best and always improve upon S_0. This is because S_1 and S_2 incorporate knowledge of the target task to construct positive and negative samples, which is crucial to obtain good performance. Dataset difficulty. From Tab. 1, we observe that some sensitive/main label pairs are more difficult to disentangle than others. In this regard, TRUST is clearly easier to disentangle than DIAL. Indeed, every model except CE achieves perfect disentanglement. This suggests we are in the case C_Y ∩ C_S = ∅ of Fig. 1, meaning that the sensitive attribute S contains little information about the target Y. Within DIAL, the sentiment label is the hardest to disentangle, but CLINIC with strategy S_1 achieves a good trade-off between accuracy and disentanglement. RNN vs BERT encoder. Interestingly, the BERT encoder is always harder to disentangle than the RNN encoder. This observation can be seen as additional evidence that BERT may exhibit gender, age and/or race biases, as already pointed out in Ahn and Oh (2021).

Controlling the Level of Disentanglement
A major challenge when disentangling representations is to be able to control the desired level of disentanglement (Feutry et al., 2018). We report performance on the main task and on the disentanglement task for both BERT (see Fig. 2) and RNN (see Fig. 3) for varying λ. Notice that we measure the performance of the disentanglement task by reporting the accuracy of a classifier trained on the learned representation.
Results on DIAL. We report a different behavior when working with RNN or BERT. From Fig. 3b, we observe that, when working with the RNN-based encoder, CLINIC (especially S_1 and S_2) allows us both to learn perfectly disentangled representations and to achieve fine-grained control over the desired degree of disentanglement. On the other hand, previous methods (e.g., ADV or I_α) either fail to learn disentangled representations (see I_α on Fig. 2a) or do so while losing the ability to predict Y (see ADV on Fig. 2a). As already pointed out in the analysis of Tab. 1, we observe again that it is both easier to learn a disentangled representation and to control the desired degree of disentanglement with RNN compared to BERT. Results on TRUST. This dataset exhibits an interesting behaviour known as spurious correlations (Yule, 1926; Simon, 1954; Pearl et al., 2000). This means that, without any disentanglement penalty, the encoder learns information about the sensitive features that hurts the classification performance on the test set. We also observe that learning disentangled representations with CLINIC (using S_1 or S_2) outperforms a model trained with the CE loss alone. This suggests that CLINIC can go beyond the standard disentanglement/accuracy trade-off.

Superiority of the CLINIC's Objective
In this experiment, we assess the relevance of using I(Z; S|Y ) (Eq. 4) instead of I(Z; S) (Eq. 3). For all the considered λ, all datasets and all the considered checkpoints, we display in Fig. 6 the disentanglement/accuracy trade-off.

Analysis. Each point in Fig. 6 corresponds to a trained model (with a specific λ). The further a point lies to the bottom right, the better it is for our purpose.
Notice that the points stemming from our strategies S_1 and S_2 (orange/green) lie further down on the right than the points stemming from S_0 (blue). For instance, the use of S_1 or S_2 for the RNN provides many models exhibiting perfect sensitive accuracy while maintaining high main accuracy, which is not the case for S_0. For BERT, models trained with S_0 either have high sensitive and main accuracy or low sensitive and main accuracy. On the contrary, there are points stemming from S_1 or S_2 that lie on the bottom right of Fig. 6. Overall, Fig. 6 validates the use of I(Z; S|Y ) instead of I(Z; S).

Speed up Gain
In contrast with CLINIC, both previous methods ADV and I_α rely on an additional network for the computation of the disentanglement regularizer in Eq. 1. These extra parameters need to be learned with an additional loop during training: at each update of the encoder f_θ, several updates of the network are performed to ensure that the regularizer computes the correct value. This loop is both time-consuming and involves extra hyperparameters (e.g., new learning rates for the additional network) that must be tuned carefully. This makes ADV and I_α more difficult to implement on large-scale datasets. Tab. 2 illustrates both the parameter reduction and the speed-up induced by CLINIC.
Tab. 2: Runtime for one epoch (using DIAL-S, B = 64 and relying on a single NVIDIA V100 with 32GB of memory). The model sizes are given in thousands. We compute the relative improvement with respect to the strongest baseline I_α from Colombo et al. (2021c).

Ablation Study
In this section, we conduct an ablation study on CLINIC with the best sampling strategy (S_1) to better understand the importance of its components. We focus on the effect of (i) the choice of PT model, (ii) the batch size and (iii) the temperature. This ablation study is conducted on both DIAL-S and TRUST-A, where we recall that the former is harder to disentangle than the latter. Results on TRUST-A can be found in Ap. A.

Changing the PT Model
Setting. As recently pointed out by Bommasani et al. (2021), PT models play a central role in NLP; thus, understanding their effects is crucial. We test CLINIC with PT models that are lighter and require less computation time to fine-tune than BERT. Analysis. The results of CLINIC trained with SQU., DIS. and ALB. are given in Fig. 4. Overall, we observe that CLINIC consistently achieves better results on all the considered models. Interestingly, for λ > 0.1, we observe that ADV degenerates: the main task accuracy is around 50% and the sensitive task accuracy is either 50% or reaches a high value. This phenomenon has also been reported in Barrett et al. (2019) and Colombo et al. (2021c).

Effect of the Temperature
Recall that CLINIC uses two different temperatures (see Eq. 5), denoted by τ_p and τ_n, corresponding to the weight one wishes to put on positive and negative examples. In this experiment, we study their relative importance on the disentanglement/accuracy trade-off. Analysis. Fig. 5 gathers the performance of CLINIC for different (τ_p, τ_n). We observe that low values of τ_p (i.e., focusing on easy positives) lead to uninformative representations (i.e., low accuracy for Y). As τ_p increases, the choice of τ_n becomes relevant. Previously introduced supervised contrastive losses (Khosla et al., 2020) only use one temperature and can thus only explore the diagonal scores of Fig. 5. Since the chosen trade-off depends on the final application, we believe this ablation study validates the use of two temperatures.

Summary and Concluding Remarks
We introduced CLINIC, a set of novel losses tailored for fair classification. CLINIC both outperforms existing disentanglement methods and can go beyond the traditional accuracy/disentanglement trade-off. Future works include (1) improving CLINIC to enable finer control over the disentanglement degree, and (2) developing a way to measure the accuracy/disentanglement trade-off, which appears to differ for each dataset.

Limitations
This paper proposes a novel information-theoretic objective to learn disentangled representations. While the results hold for English and the studied pretrained encoders, we observed different behaviors depending on the disentanglement difficulty. Overall, predicting for which attribute or which data a positive trade-off can be obtained while disentangling remains an open question.
Additionally, similarly to previous work in the same line of research, we assumed access to S, which might not be the case in various practical applications. Note that although the main paper focuses on binary attributes for S, we report additional results in Ssec. A.1. In general, we believe that our embeddings could be utilized for diverse applications, such as sentence generation and large language models. However, evaluating their performance on these specific tasks falls beyond the scope of this paper. Future research should address these aspects.

Fig. 4: Ablation study on PT on DIAL-S. Figures from left to right correspond to the performance of ALB., DIS. and SQU..
Tab. 1: Overall results on the fair classification task: the columns Y and S stand for the main task and the sensitive task accuracy, respectively. ↓ means lower is better whereas ↑ means higher is better. The best model is bolded and the second best is underlined. CE refers to a model trained with the CE loss solely (case λ = 0 in Eq. 1).