A Novel Estimator of Mutual Information for Learning to Disentangle Textual Representations

Learning disentangled representations of textual data is essential for many natural language tasks such as fair classification, style transfer and sentence generation, among others. The dominant approaches in the context of text data either rely on training an adversary (discriminator) that aims at making attribute values difficult to infer from the latent code, or on minimising variational bounds of the mutual information between the latent code and the attribute value. However, the available methods cannot provide fine-grained control of the degree (or force) of disentanglement. Adversarial methods are remarkably simple, but although the adversary seems to perform well during the training phase, once training is completed a fair amount of information about the undesired attribute still remains. This paper introduces a novel variational upper bound on the mutual information between an attribute and the latent code of an encoder. Our bound aims at controlling the approximation error via the Renyi divergence, leading to both better disentangled representations and, in particular, more precise control of the desired degree of disentanglement than state-of-the-art methods proposed for textual data. Furthermore, it does not suffer from the degeneracy of other losses in multi-class scenarios. We show the superiority of this method on fair classification and on textual style transfer tasks. Additionally, we provide new insights illustrating various trade-offs in style transfer between learning disentangled representations and the quality of the generated sentences.


Introduction
Learning disentangled representations holds a central place in building rich embeddings of high-dimensional data. For a representation to be disentangled implies that it factorizes some latent cause or causes of variation, as formulated by Bengio et al. (2013). For example, if there are two causes for the transformations in the data that do not generally happen together and are statistically distinguishable (e.g., factors occur independently), a maximally disentangled representation is expected to present a sparse structure that separates those causes. Disentangled representations have been shown to be useful for a large variety of data, such as video (Hsieh et al., 2018), image (Sanchez et al., 2019), text (John et al., 2018), audio (Hung et al., 2018), among others, and applied to many different tasks, e.g., robust and fair classification (Elazar and Goldberg, 2018), visual reasoning (van Steenkiste et al., 2019), style transfer (Fu et al., 2017), conditional generation (Denton et al., 2017; Burgess et al., 2018), few-shot learning (Kumar Verma et al., 2018), among others.
In this work, we focus our attention on learning disentangled representations for text, a setting that remains comparatively overlooked (John et al., 2018). Perhaps the most popular applications of disentanglement in textual data are fair classification (Elazar and Goldberg, 2018; Barrett et al., 2019) and sentence generation tasks such as style transfer (John et al., 2018) or conditional sentence generation (Cheng et al., 2020b). For fair classification, perfectly disentangled latent representations can be used to ensure fairness, as decisions are taken based on representations which are statistically independent from, or at least carry limited information about, the protected attributes. However, there exists a trade-off between fully disentangled representations and performance on the target task, as shown by Feutry et al. (2018), among others. For sequence generation, and in particular for style transfer, learning disentangled representations aims at allowing an easier transfer of the desired style. To the best of our knowledge, an in-depth study of the relationship between disentangled representations, based either on adversarial losses alone or on vCLUB-S, and the quality of the generated sentences is still missing. Most previous studies have focused either on trade-offs between metrics computed on the generated sentences (Tikhonov et al., 2019) or on performance evaluation of the disentanglement as part of (or convoluted with) more complex modules. This highlights the need to provide a fair evaluation of disentanglement methods by isolating their individual contributions (Yamshchikov et al., 2019; Cheng et al., 2020b). Methods to enforce disentangled representations can be grouped into two categories. The first category relies on an adversarial term in the training objective that aims at ensuring that sensitive attribute values (e.g., race, sex, style) are as statistically independent as possible from the encoded latent representation.
Interestingly enough, several works (John et al., 2018; Elazar and Goldberg, 2018; Bao et al., 2019; Yi et al., 2020; Jain et al., 2019; Zhang et al., 2018; Hu et al., 2017) have recently shown that even though the adversary seems to perform remarkably well during training, once the training phase is over, a fair amount of information about the sensitive attributes still remains and can be extracted from the encoded representation. The second category aims at minimising the Mutual Information (MI) between the encoded latent representation and the sensitive attribute values, i.e., without resorting to an adversarial discriminator. MI acts as a universal measure of dependence since it captures non-linear statistical dependencies of high orders between the involved quantities (Kinney and Atwal, 2014). However, estimating MI has been a long-standing challenge, in particular when dealing with high-dimensional data (Paninski, 2003; Pichler et al., 2020). Recent methods rely on variational upper bounds. For instance, Cheng et al. (2020b) study vCLUB-S (Cheng et al., 2020a) for sentence generation tasks. Although this approach improves on previous state-of-the-art methods, it does not allow fine-tuning of the desired degree of disentanglement, i.e., it enforces either light or strong levels of disentanglement, where only few features relevant to the input sentence remain (see Feutry et al. (2018) for further discussion).

Our Contributions
We develop new tools to build disentangled textual representations and evaluate them on fair classification and two sentence generation tasks, namely, style transfer and conditional sentence generation. Our main contributions are summarized below: • A novel objective to train disentangled representations. To overcome some of the limitations of both adversarial losses and vCLUB-S, we derive a novel upper bound on the MI which aims at correcting the approximation error via either the Kullback-Leibler (Ali and Silvey, 1966) or Renyi (Rényi, 1961) divergences. This correction term appears to be a key feature for fine-tuning the degree of disentanglement compared to vCLUB-S.
• Applications and numerical results. First, we demonstrate that the aforementioned surrogate is better suited than the widely used adversarial losses as well as vCLUB-S, as it provides better disentangled textual representations while allowing fine-tuning of the desired degree of disentanglement. In particular, we show that our method offers a better accuracy-versus-disentanglement trade-off for fair classification tasks. We additionally demonstrate that our surrogate outperforms both methods when learning disentangled representations for style transfer and conditional sentence generation, while not suffering (or degenerating) when the number of classes is greater than two, which is an apparent limitation of adversarial training. By isolating the disentanglement module, we identify and report existing trade-offs between different degrees of disentanglement and the quality of generated sentences. The latter includes content preservation between input and generated sentences and accuracy on the generated style.

Main Definitions and Related Works
We introduce notations, tasks, and closely related work. Consider a training set of n sentences x_i ∈ X paired with attribute values y_i ∈ Y ≡ {1, . . . , |Y|}, which indicate a discrete attribute to be disentangled from the resulting representations. We study the following scenarios. Disentangled representations. Learning disentangled representations consists in learning a model M : X → R^d that maps feature inputs X to a vector of dimension d that retains as much information as possible about the original content of the input sentence, but as little as possible about the undesired attribute Y. In this framework, content is defined as any relevant information present in X that does not depend on Y.
Applications to binary fair classification. The task of fair classification through disentangled representations aims at building representations that are independent of selected discrete (sensitive) attributes (e.g., gender or race). This task consists in learning a model M : X → {0, 1} that maps any input x to a label l ∈ {0, 1}. The goal of the learner is to build a predictor that assigns each x to either 0 or 1 "obliviously" of the protected attribute y. Recently, much progress has been made on devising appropriate notions of fairness, e.g., (Zemel et al., 2013; Zafar et al., 2017; Mohri et al., 2019). In particular, (Xie et al., 2017; Barrett et al., 2019; Elazar and Goldberg, 2018) approach the problem with adversarial losses. More precisely, these approaches consist in learning an encoder that maps x into a representation vector h_x, a critic C_θc which attempts to predict y, and an output classifier f_θd used to predict l based on the observed h_x. The classifier is said to be fair if no statistical information about y is present in h_x (Xie et al., 2017; Elazar and Goldberg, 2018).
Applications to conditional sentence generation. The task of conditional sentence generation consists in taking an input text containing specific stylistic properties and generating a realistic (synthetic) text containing potentially different stylistic properties. It requires learning a model M : X × Y → X that maps a pair of inputs (x, y_t) to a sentence x_g, where the output sentence should retain as much as possible of the original content of the input sentence while carrying the (potentially new) attribute y_t. Proposed approaches to tackle textual style transfer (Zhang et al., 2020; Xu et al., 2019) can be divided into two main categories. The first category (Prabhumoye et al., 2018; Lample et al., 2018) uses cycle losses based on back translation (Wieting et al., 2017) to ensure that the content is preserved during the transformation, whereas the second category seeks to explicitly separate attributes from the content. This constraint is enforced using either adversarial training (Fu et al., 2017; Hu et al., 2017; Zhang et al., 2018; Yamshchikov et al., 2019) or MI minimisation using vCLUB-S (Cheng et al., 2020b). Traditional adversarial training is based on an encoder that aims to fool the adversarial discriminator by removing attribute information from the content embedding (Elazar and Goldberg, 2018). As we will observe, the more the representations are disentangled, the easier it is to transfer the style, but at the same time the less the content is preserved. To approach the sequence generation tasks, we build on the Style-embedding Model (StyleEmb) of John et al. (2018), which uses adversarial losses introduced in prior work for these dedicated tasks. During the training phase, the input sentence is fed to a sentence encoder, namely f_θe, while the input style is fed to a separate style encoder, namely f^s_θe. During the inference phase, the desired style, potentially different from the input style, is provided as input along with the input sentence.

Model and Training Objective
This section describes the proposed approach to learn disentangled representations. We first review MI along with the model overview and then, we derive the variational bound we will use, and discuss connections with adversarial losses.

Model Overview
The MI is a key concept in information theory for measuring high-order statistical dependencies between random quantities. Given two random variables Z and Y, the MI is defined by

I(Z; Y) ≜ KL(p_ZY ‖ p_Z · p_Y) = E_{p_ZY}[ log ( p_ZY(Z, Y) / (p_Z(Z) p_Y(Y)) ) ],

where p_ZY is the joint probability density function (pdf) of the pair (Z, Y), with p_Z and p_Y representing the respective marginal pdfs. MI is related to the entropy H(Y) and the conditional entropy H(Y|Z) as follows:

I(Z; Y) = H(Y) − H(Y|Z).

Our models for fair classification and sequence generation share a similar structure. They rely on an encoder that takes as input a random sentence X and maps it to a random representation Z using a deep encoder denoted by f_θe. Then, classification and sentence generation are performed using either a classifier or an auto-regressive decoder denoted by f_θd. We aim at minimizing the MI between the latent code, represented by the Random Variable (RV) Z = f_θe(X), and the attribute, represented by the RV Y. The objective of interest L(f_θe) is defined as:

L(f_θe) ≜ L_down. + λ · I(Z; Y),     (3)

where L_down. represents a downstream-specific (target task) loss and λ is a meta-parameter that controls the sensitive trade-off between disentanglement (i.e., minimizing MI) and success in the downstream task (i.e., minimizing the target loss). In Sec. 5, we illustrate these different trade-offs. Applications to fair classification and sentence generation. For fair classification, we follow standard practices and optimize the cross-entropy between predictions and ground-truth labels. In the sentence generation tasks, L_down. represents the negative log-likelihood over individual tokens.
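The two identities above can be checked numerically on a small discrete example. The sketch below uses a made-up 2x2 joint distribution (the values are purely illustrative) and verifies that the KL-based definition of MI agrees with the entropy decomposition I(Z;Y) = H(Y) − H(Y|Z):

```python
import math

# Hypothetical joint distribution over a discrete code Z (rows) and attribute Y
# (columns); the numbers are made up only to illustrate the identities.
p_zy = [[0.3, 0.1],
        [0.1, 0.5]]

p_z = [sum(row) for row in p_zy]                               # marginal of Z
p_y = [sum(p_zy[z][y] for z in range(2)) for y in range(2)]    # marginal of Y

def mutual_information(p_zy, p_z, p_y):
    """I(Z;Y) = sum_{z,y} p(z,y) log[ p(z,y) / (p(z) p(y)) ]."""
    return sum(p_zy[z][y] * math.log(p_zy[z][y] / (p_z[z] * p_y[y]))
               for z in range(len(p_z)) for y in range(len(p_y)))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def conditional_entropy(p_zy, p_z):
    """H(Y|Z) = -sum_{z,y} p(z,y) log p(y|z)."""
    return -sum(p_zy[z][y] * math.log(p_zy[z][y] / p_z[z])
                for z in range(len(p_zy)) for y in range(len(p_zy[0])))

mi = mutual_information(p_zy, p_z, p_y)
# The decomposition I(Z;Y) = H(Y) - H(Y|Z) used throughout the paper:
assert abs(mi - (entropy(p_y) - conditional_entropy(p_zy, p_z))) < 1e-12
```

Minimizing the objective in Eq. 3 therefore trades off pushing I(Z;Y) towards zero against the downstream loss, with λ setting the balance.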

A Novel Upper Bound on MI
Estimating the MI is a long-standing challenge, as the exact computation (Paninski, 2003) is only tractable for discrete variables, or for a limited family of problems where the underlying data distribution satisfies smoothness properties; see recent work by Pichler et al. (2020). Different from previous approaches leading to variational lower bounds (Belghazi et al., 2018; Hjelm et al., 2018; Oord et al., 2018), in this paper we derive an estimator based on a variational upper bound on the MI which controls the approximation error via the Kullback-Leibler and Renyi divergences (Daudel et al., 2020).
Theorem 1 (Variational upper bound on MI) Let (Z, Y) be an arbitrary pair of RVs with (Z, Y) ∼ p_ZY according to some underlying pdf, and let q_Y|Z be a conditional variational distribution on the attributes satisfying p_ZY ≪ p_Z · q_Y|Z, i.e., absolute continuity. Then, we have that

I(Z; Y) ≤ log |Y| − CE(Ŷ|Z) + KL(p_ZY ‖ p_Z · q_Y|Z),     (4)

where CE(Ŷ|Z) ≜ −E_{p_ZY}[log q_Y|Z(Y|Z)] and KL(p_ZY ‖ p_Z · q_Y|Z) denotes the KL divergence. Similarly, we have that, for any α > 1,

I(Z; Y) ≤ log |Y| − CE(Ŷ|Z) + D_α(p_ZY ‖ p_Z · q_Y|Z),     (5)

where the Renyi divergence of order α can be expressed as D_α(p_ZY ‖ p_Z · q_Y|Z) = (1/(α−1)) log E_{p_ZY}[R(Z, Y)^(α−1)], with the density-ratio R(z, y) ≜ p_ZY(z, y) / (p_Z(z) q_Y|Z(y|z)), for (z, y) ∈ Supp(p_ZY).
Proof: The upper bound on H(Y ) is a direct application of the the (Donsker and Varadhan, 1985) representation of KL divergence while the lower bound on H(Y |Z) follows from the monotonicity property of the function: α → D α p ZY p Z ·q Y |Z . Further details are relegated to Appendix A.
Remark: It is worth emphasising that the KL divergence in Eq. 4 and the Renyi divergence in Eq. 5 control the approximation error between the exact entropy and its corresponding bound.
From theoretical bounds to trainable surrogates to minimize MI: It is easy to check that the inequalities in Eq. 4 and Eq. 5 are tight provided that p_ZY ≡ p_Z · q_Y|Z almost surely, for some adequate choice of the variational distribution. However, evaluating these bounds requires an estimate of the density-ratio R(z, y). Density-ratio estimation has been widely studied in the literature (see (Sugiyama et al., 2012) and references therein), and confidence bounds have been reported by Kpotufe (2017) under some smoothness assumptions on the underlying data distribution p_ZY. In this work, we estimate this ratio using a critic C_θR, which is trained to differentiate between a balanced dataset of positive i.i.d. samples coming from p_ZY and negative i.i.d. samples coming from q_Y|Z · p_Z. Then, for any pair (z, y), the density-ratio can be estimated by R(z, y) ≈ σ(C_θR(z, y)) / (1 − σ(C_θR(z, y))), where σ(·) denotes the sigmoid function and C_θR(z, y) is the unnormalized output of the critic. It is worth mentioning that, after estimating this ratio, the previous upper bounds may no longer be strict bounds, so we refer to them as surrogates.
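The critic-to-ratio mapping above has a convenient closed form: σ(C)/(1 − σ(C)) is exactly exp(C), so a well-calibrated critic outputs the log density-ratio directly. A minimal sketch (no learned critic here, just the algebra):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def density_ratio(critic_logit):
    """Estimate R(z, y) from the critic's unnormalised output C as
    sigma(C) / (1 - sigma(C)); algebraically this equals exp(C)."""
    s = sigmoid(critic_logit)
    return s / (1.0 - s)

# For any logit value, sigma(C)/(1 - sigma(C)) == exp(C), so a calibrated
# critic effectively predicts log R(z, y).
for c in (-2.0, 0.0, 1.5):
    assert abs(density_ratio(c) - math.exp(c)) < 1e-9
```

In practice the logit would come from the trained critic C_θR(z, y); the identity explains why the surrogate only needs the critic's raw output.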

Comparison to existing methods
Adversarial approaches: To better understand why the proposed approach, based on the minimization of the MI using our variational upper bound in Th. 1, may lead to a better training objective than previous adversarial losses, we discuss below the explicit relationship between MI and the cross-entropy loss. Let Y ∈ Y denote a random attribute and let Z be a possibly high-dimensional representation that needs to be disentangled from Y. Then,

I(Z; Y) = H(Y) − CE(Ŷ|Z) + KL(p_ZY ‖ p_Z · q_Y|Z) ≥ H(Y) − CE(Ŷ|Z),     (6)

where CE(Ŷ|Z) denotes the cross-entropy corresponding to the adversarial discriminator q_Y|Z; H(Y) is an unknown constant, since Y comes from an unknown distribution on which we have no influence; and the approximation error satisfies KL(p_ZY ‖ p_Z · q_Y|Z) = CE(Ŷ|Z) − H(Y|Z). Eq. 6 shows that the cross-entropy loss leads to a lower bound (up to a constant) on the MI. Although the cross-entropy can lead to good estimates of the conditional entropy, the adversarial approaches to classification and sequence generation of (Barrett et al., 2019; John et al., 2018), which consist in maximizing the cross-entropy, induce a degeneracy (unbounded loss) as λ increases in the underlying optimization problem. As we will observe in the next section, our variational upper bound in Th. 1 overcomes this issue, in particular for |Y| > 2.
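The decomposition behind Eq. 6 can be verified numerically: for any joint p(z, y) and any (imperfect) variational q(y|z), the MI equals H(Y) − CE + KL exactly, and dropping the non-negative KL term yields the adversarial lower bound. A sketch with made-up distribution values:

```python
import math

# Hypothetical joint p(z, y) and an arbitrary, imperfect adversary q(y|z);
# all numbers are illustrative only.
p_zy = [[0.25, 0.15],
        [0.05, 0.55]]
q_y_given_z = [[0.7, 0.3],
               [0.2, 0.8]]

p_z = [sum(row) for row in p_zy]
p_y = [p_zy[0][y] + p_zy[1][y] for y in range(2)]

H_y = -sum(p * math.log(p) for p in p_y)
mi = sum(p_zy[z][y] * math.log(p_zy[z][y] / (p_z[z] * p_y[y]))
         for z in range(2) for y in range(2))
# Adversary's cross-entropy: CE = -E_{p(z,y)}[log q(y|z)]
ce = -sum(p_zy[z][y] * math.log(q_y_given_z[z][y])
          for z in range(2) for y in range(2))
# Approximation error: KL(p_ZY || p_Z * q_{Y|Z})
kl = sum(p_zy[z][y] * math.log(p_zy[z][y] / (p_z[z] * q_y_given_z[z][y]))
         for z in range(2) for y in range(2))

# Exact decomposition: I(Z;Y) = H(Y) - CE + KL, hence I >= H(Y) - CE.
assert abs(mi - (H_y - ce + kl)) < 1e-12
assert mi >= H_y - ce - 1e-12
```

Maximizing the adversary's cross-entropy therefore only pushes down a lower bound on the MI, which is consistent with the observation that information about the attribute can survive adversarial training.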
vCLUB-S: Different from our method, Cheng et al. (2020a) introduce I_vCLUB, an upper bound on the MI defined by

I_vCLUB ≜ E_{p_ZY}[log q_Y|Z(Y|Z)] − E_{p_Z} E_{p_Y}[log q_Y|Z(Y|Z)].

It is worth mentioning that this bound follows a similar approach to the bound previously introduced in (Feutry et al., 2018).
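When the variational q(y|z) equals the true posterior, Cheng et al. (2020a) show that I_vCLUB is a provable upper bound on the MI. A toy check on a made-up discrete joint (illustrative values only):

```python
import math

# Hypothetical joint distribution; values chosen only for illustration.
p_zy = [[0.3, 0.1],
        [0.1, 0.5]]
p_z = [sum(row) for row in p_zy]
p_y = [p_zy[0][y] + p_zy[1][y] for y in range(2)]
# Use the true posterior p(y|z) as the variational q(y|z), the case in which
# CLUB is guaranteed to upper-bound the MI.
q = [[p_zy[z][y] / p_z[z] for y in range(2)] for z in range(2)]

mi = sum(p_zy[z][y] * math.log(p_zy[z][y] / (p_z[z] * p_y[y]))
         for z in range(2) for y in range(2))
# I_vCLUB = E_{p(z,y)}[log q(y|z)] - E_{p(z)}E_{p(y)}[log q(y|z)]
club = (sum(p_zy[z][y] * math.log(q[z][y]) for z in range(2) for y in range(2))
        - sum(p_z[z] * p_y[y] * math.log(q[z][y])
              for z in range(2) for y in range(2)))

assert club >= mi - 1e-12  # vCLUB upper-bounds the true MI here
```

Unlike our surrogates, the vCLUB objective contains no explicit term correcting for the mismatch between q and the true posterior, which is the gap our bounds control via the KL and Renyi divergences.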

Datasets
Fair classification task. We follow the experimental protocol of Elazar and Goldberg (2018). The main task consists in predicting a binary label representing either the sentiment (positive/negative) or the mention; the mention task aims at predicting whether a tweet is conversational. The considered protected attribute is race. The dataset has been automatically constructed from the DIAL corpus (Blodgett et al., 2016), which contains race annotations over 50 million tweets. Sentiment labels are extracted using a list of predefined emojis, and mentions are identified using @mention tokens. The final dataset contains 160k tweets for training and two splits of 10k tweets for validation and testing. Splits are balanced so that a random estimator achieves 50% accuracy.
Style transfer. For our sentence generation tasks, we conduct experiments on three different datasets extracted from Yelp restaurant reviews. The first dataset, referred to as SYelp, contains 444,101, 63,483, and 126,670 labelled short reviews (at most 20 words) for train, validation, and test, respectively. Each review is assigned a binary label depending on its polarity. Following Lample et al. (2018), we use a second version of Yelp, referred to as FYelp, with longer reviews (at most 70 words). It contains five coarse-grained restaurant category labels (Asian, American, Mexican, Bars and Dessert). The multi-category FYelp is used to assess the generalization capabilities of our methods in a multi-class scenario.

Metrics for Performance Evaluation
Efficiency measure of the disentanglement methods. Barrett et al. (2019) report that offline classifiers (trained post hoc) clearly outperform adversarial discriminators. We therefore retrain a classifier on the latent representations learnt by each model and report its accuracy. Measure of performance within the fair classification task. In the fair classification task we aim at maximizing accuracy on the target task, and so we report the corresponding accuracy.
Measure of performance within sentence generation tasks. Sentences generated by the model are expected to be fluent, to preserve the input content, and to carry the desired style. For style transfer, the desired style differs from the input style, while for conditional sentence generation, input and output styles should be the same. Nevertheless, automatic evaluation of generative models for text is still an open problem. We measure the style of the output sentence using a fastText classifier (Joulin et al., 2016b). For content preservation, we follow John et al. (2018) and compute both: (i) the cosine similarity between source and generated sentence embeddings, which are the concatenation of the min, max, and mean of word embeddings (sentiment words removed); and (ii) the BLEU score between the generated text and the input, using SacreBLEU (Post, 2018). Motivated by previous work, we evaluate the fluency of the language with the perplexity given by a GPT-2 (Radford et al., 2019) model fine-tuned on the training corpus. We choose to report the log-perplexity since we believe it better reflects the uncertainty of the language model (a small variation in the model loss would induce a large change in the perplexity due to the exponential term). Besides the automatic evaluation, we further test the effectiveness of our disentangled representations with human evaluation; results are presented in Tab. 1. Conventions and abbreviations. Adv refers to a model trained using the adversarial loss; vCLUB-S and KL refer to models trained using the vCLUB-S and KL surrogates (see Eq. 14), respectively; and D_α refers to a model trained with the α-Renyi surrogate (Eq. 15), for α ∈ {1.3, 1.5, 1.8}.
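The log-perplexity choice above is easy to see in code: log-perplexity is the mean per-token negative log-likelihood, so a small additive change in the loss becomes a multiplicative change in perplexity. A sketch with made-up token NLLs:

```python
import math

def log_perplexity(token_nlls):
    """Log-perplexity is the mean negative log-likelihood per token."""
    return sum(token_nlls) / len(token_nlls)

def perplexity(token_nlls):
    return math.exp(log_perplexity(token_nlls))

# Hypothetical per-token NLLs for two generated sentences whose mean losses
# differ by only 0.5 nat.
nlls_a = [3.0, 3.2, 2.8, 3.0]
nlls_b = [3.5, 3.7, 3.3, 3.5]

# A +0.5 shift in the loss multiplies the perplexity by e^0.5 (~1.65x),
# which is why the paper reports log-perplexity instead.
assert abs(log_perplexity(nlls_b) - log_perplexity(nlls_a) - 0.5) < 1e-12
assert abs(perplexity(nlls_b) / perplexity(nlls_a) - math.exp(0.5)) < 1e-9
```

In the experiments the per-token NLLs would come from the fine-tuned GPT-2 model; the arithmetic above is model-agnostic.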

Numerical Results
In this section, we present our results on the fair classification and binary sequence generation tasks, see Ssec. 5.1 and Ssec. 5.2, respectively. We additionally show that our variational surrogates to the MI-contrarily to adversarial losses-do not suffer in multi-class scenarios (see Ssec. 5.3).

Applications to Fairness
Upper bound on performances. We first examine how much of the protected attribute can be recovered from an unfair classifier (i.e., one trained without adversarial loss) and how well such a classifier performs. Results are reported in Fig. 1. We achieve scores similar to the ones reported in previous studies (Barrett et al., 2019; Elazar and Goldberg, 2018). This experiment shows that, when trained to solve the main task, the classifier learns information about the protected attribute, i.e., the attacker's accuracy is better than random guessing. In the following, we compare the different proposed methods to disentangle representations and obtain a fairer classifier.
Methods comparison. Fig. 1 shows the results of the different models and illustrates the trade-offs between disentangled representations and target-task accuracy. Results are reported on the test set for both the sentiment and mention tasks, with race as the protected attribute. We observe that the classifier trained with an adversarial loss degenerates for λ > 5, since the adversarial term in Eq. 3 influences the global gradient much more than the downstream term (i.e., the cross-entropy loss between the predicted and gold distributions). Remarkably, models trained to minimize either the KL or the Renyi surrogate do not suffer from this problem. For both tasks, we observe that the KL and Renyi surrogates can offer better disentangled representations than those induced by adversarial approaches. In this task, both the KL and Renyi surrogates achieve perfectly disentangled representations (i.e., random-guessing accuracy on protected attributes) with a 5% drop in the accuracy of the target task when perfectly masking the protected attributes. As a matter of fact, we observe that vCLUB-S provides only two regimes: either a "light" protection (attacker accuracy around 60%), with almost no loss in task accuracy (λ < 1), or a strong protection (attacker accuracy around 50%), where only a few features relevant to the target task remain. On the sentiment task, we can draw similar conclusions; however, the Renyi surrogate achieves slightly better-disentangled representations. Overall, our proposed surrogates enable good control of the degree of disentanglement. Additionally, we do not observe degenerate behaviour, as is the case with adversarial losses, when λ increases. Furthermore, our surrogates simultaneously allow better disentangled representations while preserving the accuracy of the target task.

Applications to binary polarity transfer
In the previous section, we showed that the proposed surrogates do not suffer from the limitations of adversarial losses and achieve better disentangled representations than existing methods relying on vCLUB-S. Disentanglement modules are a core block of a large number of style transfer and conditional sentence generation algorithms (Tikhonov et al., 2019; Yamshchikov et al., 2019; Fu et al., 2017) that place explicit constraints to force disentangled representations. First, we assess the disentanglement quality and the control over the desired level of disentanglement when changing the downstream term, which for the sentence generation task is the cross-entropy loss on individual tokens. Then, we exhibit the existing trade-offs between the quality of generated sentences, measured by the metrics introduced in Ssec. 4.2, and the resulting degree of disentanglement. The results are presented for SYelp. Fig. 2a shows the adversary accuracy of the different methods as a function of λ. Similarly to the fair classification task, a fair amount of information can be recovered from the embedding learnt with the adversarial loss. In addition, we observe a clear degradation of its performance for values λ > 1. In this setting, the Renyi surrogates achieve consistently better results in terms of disentanglement than the model minimizing the KL surrogate. The curves for the Renyi surrogates show that exploring different values of λ allows good control of the disentanglement degree: as λ increases, the level of disentanglement increases, and the proposed methods using both the KL (KL) and Renyi (D_α) divergences clearly offer better control than existing methods. The Renyi surrogate thus generalizes well to sentence generation. Similarly to the fairness task, vCLUB-S only offers two regimes: "light" disentanglement with very little polarity transfer, and "strong" disentanglement.

Disentanglement in Polarity Transfer
The quality of generated sentences is evaluated using fluency (see Fig. 3c), content preservation (see Fig. 3a; additional results using cosine similarity are given in Appendix D), and polarity accuracy (see Fig. 3b). For style transfer, and for all models, we observe trade-offs between disentanglement and content preservation (measured by BLEU) and between fluency and disentanglement. Learning disentangled representations leads to poorer content preservation; similar conclusions can be drawn when measuring content with the cosine similarity (see Appendix D). For polarity accuracy, in non-degenerate cases (see below), we observe that the model better transfers the sentiment in the presence of disentangled representations. Transferring style is easier with disentangled representations; however, there is no free lunch here, since disentangling also removes important information about the content. It is worth noting that even in the "strong" disentanglement regime, vCLUB-S struggles to transfer the polarity.
Figure 3: Numerical experiments on binary style transfer. The quality of generated sentences is evaluated using BLEU (Fig. 3a), style transfer accuracy (Fig. 3b), and sentence fluency (Fig. 3c). We report existing trade-offs between disentanglement and sentence generation quality. Human evaluation is reported in Tab. 1.
In Fig. 2b we report the adversary accuracy of our different methods for the values of λ using the FYelp
dataset with category labels. In the binary setting, for λ ≤ 1, models using the adversarial loss can learn disentangled representations, while in the multi-class setting the adversarial loss degenerates for small values of λ (i.e., sentences are no longer fluent, as shown by the increase in perplexity in Fig. 4c).
Minimizing MI based on our surrogates mitigates this problem and offers better control of the disentanglement degree across values of λ than vCLUB-S. Further results are gathered in Appendix G.

Summary and Concluding Remarks
We devised a new alternative to adversarial losses, capable of learning disentangled textual representations. Our method does not require adversarial training and hence does not suffer in multi-class setups. A key feature of this method is that it accounts for the approximation error incurred when bounding the mutual information. Experiments show better trade-offs than both adversarial training and vCLUB-S on two fair classification tasks and demonstrate the efficiency of learning disentangled representations for sequence generation.
As a matter of fact, there is no free lunch for sentence generation tasks: although transferring style is easier with disentangled representations, disentangling also removes important information about the content. Since our method allows more fine-grained control over the amount of disentanglement, we expect it to be easier to tune when combined with more complex models.

A.1 Proof of Eq. 6

In this section, we provide a formal proof of Eq. 6. Let (Z, Y) be an arbitrary pair of RVs with (Z, Y) ∼ p_ZY according to some underlying pdf, and let q_Y|Z be a conditional variational probability distribution on the discrete attributes satisfying p_ZY ≪ p_Z · q_Y|Z, i.e., absolute continuity.

Proof: We start from the definition of the MI and use the fact that, for a discrete variable, the maximum entropy distribution is the uniform law (see (Cover and Thomas, 2006)):

I(Z; Y) = H(Y) − H(Y|Z) ≤ log |Y| − H(Y|Z).

We then need to relate the cross-entropy to the conditional entropy:

CE(Ŷ|Z) ≜ −E_{p_ZY}[log q_Y|Z(Y|Z)] = H(Y|Z) + KL(p_ZY ‖ p_Z · q_Y|Z).

We know that KL(p_ZY ‖ p_Z · q_Y|Z) ≥ 0, thus CE(Ŷ|Z) ≥ H(Y|Z), which gives the result. The underlying hypothesis made when approximating the MI with an adversarial loss is that the contribution of the gradient from KL(p_ZY ‖ p_Z · q_Y|Z) to the bound is negligible.

A.2 Proof of Th. 1
Let (Z, Y) be an arbitrary pair of RVs with (Z, Y) ∼ p_ZY according to some underlying pdf, and let q_Y|Z be a conditional variational probability distribution satisfying p_ZY ≪ p_Z · q_Y|Z, i.e., absolute continuity. To obtain an upper bound on the MI, we need to upper-bound the entropy H(Y) and to lower-bound the conditional entropy H(Y|Z).
Upper bound on H(Y). Since the KL divergence is non-negative, we have

H(Y) = log |Y| − KL(p_Y ‖ u_Y) ≤ log |Y|,

where u_Y denotes the uniform distribution over Y.
Lower bounds on H(Y|Z). We have the following inequality:

H(Y|Z) ≥ CE(Ŷ|Z) − KL(p_ZY ‖ p_Z · q_Y|Z),     (14)

where KL(p_ZY ‖ p_Z · q_Y|Z) denotes the KL divergence. Furthermore, for arbitrary values α > 1,

H(Y|Z) ≥ CE(Ŷ|Z) − D_α(p_ZY ‖ p_Z · q_Y|Z).     (15)

The proof of Eq. 14 is given in Ssec. A.1. In order to show Eq. 15, we remark that the Renyi divergence α → D_α(p_ZY ‖ p_Z · q_Y|Z) is a non-decreasing function of α ∈ [0, +∞) (the reader is referred to (Van Erven and Harremos, 2014) for a detailed proof). Thus, for all α > 1,

KL(p_ZY ‖ p_Z · q_Y|Z) = D_1(p_ZY ‖ p_Z · q_Y|Z) ≤ D_α(p_ZY ‖ p_Z · q_Y|Z).

Therefore, from Eq. 14 we obtain the desired result.
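The monotonicity of the Renyi divergence used above can be checked numerically on a toy pair of distributions (illustrative values only), using the closed form D_α(p ‖ q) = (1/(α−1)) log Σ_i p_i^α q_i^(1−α):

```python
import math

# Two hypothetical discrete distributions on three symbols.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def renyi(p, q, alpha):
    """D_alpha(p || q) = 1/(alpha-1) * log sum_i p_i^alpha * q_i^(1-alpha)."""
    return (1.0 / (alpha - 1.0)) * math.log(
        sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q)))

# Monotonicity in alpha (Van Erven and Harremos, 2014): KL = D_1 <= D_alpha
# for alpha > 1, which is what makes Eq. 15 a valid (looser) bound.
prev = kl(p, q)
for alpha in (1.3, 1.5, 1.8):
    d = renyi(p, q, alpha)
    assert d >= prev - 1e-12
    prev = d
```

The values of α checked here match those used for the D_α surrogate in the experiments (1.3, 1.5, 1.8).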

A.3 Optimization of the Surrogates on MI
In this section, we give details to facilitate the practical implementation of our methods.
A.3.1 Computing the cross-entropy term
Over a mini-batch {(z_1, y_1), . . . , (z_m, y_m)}, the cross-entropy term is estimated as

CE(Ŷ|Z) ≈ −(1/m) Σ_{i=1}^m log C_θc(z_i)_{y_i},     (17)

where C_θc(z_i)_{y_j} is the y_j-th component of the normalised output of the classifier C_θc.

A.3.2 Computing the lower bound on H(Y |Z)
The corresponding bound holds for any α > 1; evaluating it requires an estimate of the density-ratio R(z, y).
Estimating the density-ratio R(z, y). In what follows we apply the so-called density-ratio trick to our specific setup. Suppose we have a balanced dataset of samples, half drawn from p ≜ p_ZY and half drawn from q ≜ q_Y|Z · p_Z. The density-ratio trick consists in training a classifier C_θR to distinguish between these two distributions: samples coming from p are labelled u = 1, and samples coming from q are labelled u = 0. Thus, we can rewrite R(z, y) as

R(z, y) = p_U|YZ(u = 1|y, z) / p_U|YZ(u = 0|y, z).

Obviously, the true posterior distribution p_U|YZ is unknown. However, if C_θR is well trained, then p_U|YZ(u = 1|y, z) ≈ σ(C_θR(y, z)), where σ(·) denotes the sigmoid function. A detailed procedure for training is given in Algorithm 1.
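The identity underlying the density-ratio trick is just Bayes' rule under a balanced prior, and can be verified exactly on a toy discrete example (the distribution values below are made up for illustration):

```python
# Density-ratio trick on a toy discrete example: with a balanced mixture
# (P(u=1) = P(u=0) = 1/2), Bayes' rule gives
#   p(z, y) / q(z, y) = P(u=1 | z, y) / P(u=0 | z, y),
# so the Bayes-optimal binary classifier recovers R(z, y) exactly.
p = {('z0', 'y0'): 0.3, ('z0', 'y1'): 0.2, ('z1', 'y0'): 0.1, ('z1', 'y1'): 0.4}
q = {('z0', 'y0'): 0.2, ('z0', 'y1'): 0.3, ('z1', 'y0'): 0.25, ('z1', 'y1'): 0.25}

for key in p:
    # Posterior of the Bayes-optimal classifier under the balanced prior:
    post_u1 = 0.5 * p[key] / (0.5 * p[key] + 0.5 * q[key])
    ratio_from_classifier = post_u1 / (1.0 - post_u1)
    assert abs(ratio_from_classifier - p[key] / q[key]) < 1e-12
```

In practice the posterior is replaced by σ(C_θR(y, z)) from the trained critic, so the quality of the ratio estimate tracks how close the critic is to the Bayes-optimal classifier.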

B Additional Details on the Model

B.1 Baseline Schemas
We report in Fig. 7 the schema of the proposed approach as well as the baselines.
Algorithm 1 Our method for the fair classification task.
INPUT: training dataset for the encoder D_n = {(x_1, y_1, l_1), ..., (x_n, y_n, l_n)}, batch size m, training dataset for the classifiers and decoder D'_n = {(x'_1, y'_1, l'_1), ..., (x'_n, y'_n, l'_n)}.
Initialization: parameters (θ_e, θ_R, θ_c, θ_d) of the encoder f_{θ_e}, the classifiers C_{θ_R} and C_{θ_c}, and the decoder f_{θ_d}.
Optimization:
1: while (θ_e, θ_R, θ_c, θ_d) not converged do
2: Update θ_e with B using Eq. 3 with θ_d.
10: end while
OUTPUT: f_{θ_e}, f_{θ_d}

a dimension of 128 (as already reported by (Garcia et al., 2019), experiments with higher dimensions produce marginal improvements). The style embedding is set to a dimension of 8. The attribute classifiers are 3-layer MLPs with 128 hidden units and LeakyReLU (Xu et al., 2015) activations; the dropout (Srivastava et al., 2014) rate is set to 0.1. All models are optimised with AdamW (Kingma and Ba, 2014; Loshchilov and Hutter, 2017) with a learning rate of 10^-3, and the gradient norm is clipped to 1.0. Our model's hyperparameters have been set by a preliminary training on each downstream task: a simple classifier for the fair classification and a vanilla seq2seq (Sutskever et al., 2014; Colombo et al., 2020) for the conditional generation task. The models re-

C Additional Details on the Experimental Setup
In this section, we provide additional details on the metric used for evaluating the different models.

C.1 Content Preservation: BLEU & Cosine Similarity
Content preservation is an important aspect of both conditional sentence generation and style transfer. We provide here the implementation details of the metrics used.
BLEU. To compute the BLEU score we use the corpus-level method provided in the sacrebleu (Post, 2018) Python library (https://github.com/mjpost/sacrebleu.git). It produces the official WMT scores while working with plain text.
Cosine Similarity. For the cosine similarity, we follow the definition of John et al. (2018) by taking the cosine between the source and generated sentence embeddings. To compute the embeddings we rely on a bag-of-words model

Figure 7: Baseline methods; these models use an adversarial loss for disentanglement: (a) classifier with adversarial loss from (Elazar and Goldberg, 2018); (b) sequence decoder from (John et al., 2018). f_{θ_e} represents the input sentence encoder; f^s_{θ_e} denotes the style encoder (only used for sentence generation tasks); C_{θ_c} represents the adversarial classifier; f_{θ_d} represents the decoder, which can be either a classifier (Fig. 7a) or a sequence decoder (Fig. 7b). Schemes of our proposed models are given in ??.
and take the mean pooling of word embeddings. We use the pre-trained word vectors provided at https://fasttext.cc/docs/en/pretrained-vectors.html. They are trained on Wikipedia using fastText; these 300-dimensional vectors were obtained using the skip-gram model.

In Tab. 1, we report the performances of the systems when evaluated by humans on the polarity transfer task. 100 sentences are generated by each system, and 3 English native speakers are asked to annotate each sentence along 3 dimensions (i.e., fluency, sentiment and content preservation). Turkers assign binary labels to fluency and sentiment (following the protocol introduced in Jalalzai et al. (2020)), while content is evaluated on a Likert scale from 1 to 5. For content preservation, both the input sentence and the generated sentence are provided to the turker. The annotator agreement is measured by the Krippendorff Alpha (Krippendorff, 2018), which measures inter-rater reliability in [0, 1] (0 is perfect disagreement, 1 perfect agreement). We obtain α = 0.54 for sentiment classification, α = 0.20 for fluency and α = 0.18 for content preservation.

D.2 Content preservation using Cosine Similarity
Fig. 8 reports content preservation measured using cosine similarity for the sentence generation task using sentiment labels. As with the BLEU score, we observe that as the learnt representation becomes more disentangled (λ increases), less content is preserved. Similarly to BLEU, the model using the KL bound outperforms the other models in terms of content preservation for λ > 5.
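The content-preservation metric described above (mean-pooled word embeddings followed by cosine similarity) can be sketched as follows; the toy 3-dimensional embedding table and helper names are ours, standing in for the 300-dimensional fastText vectors:

```python
import math

# Toy word-embedding table (stand-in for pre-trained fastText vectors).
EMB = {"the": [0.1, 0.0, 0.2], "food": [0.9, 0.3, 0.1],
       "was": [0.0, 0.2, 0.1], "great": [0.2, 0.8, 0.0],
       "awful": [0.3, -0.7, 0.1]}

def sentence_embedding(tokens):
    """Bag-of-words sentence embedding: mean pooling of word vectors."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

src = sentence_embedding("the food was great".split())
gen = sentence_embedding("the food was awful".split())
sim = cosine(src, gen)  # content-preservation score between source and output
```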

D.3 Example of generated sentences
Tab. 2 gathers some sentences generated by the different systems for different values of λ.
Style transfer. From Tab. 2, we can observe the impact of disentanglement from a qualitative point of view. For small values of λ the models struggle to perform the style transfer (see example 2 for instance). As λ increases, disentanglement becomes easier; however, the content becomes more generic, which is a known problem (see (Li et al., 2015) for instance).
Example of "degeneracy" for large values of λ. For sentences generated with the baseline model, a repetition phenomenon appears for larger values of λ. For certain sentences, the models ignore the style token (i.e., the sentence generated with a positive sentiment is the same as the one generated with the negative sentiment). We attribute this degeneracy to the fact that the model is only trained with pairs (x_i, y_i) sharing the same sentiment, which appears to be an intrinsic limitation of the model introduced by (John et al., 2018).
Analysis of the performances of vCLUB-S. Similarly to what can be observed with the automatic evaluation, Tab. 2 shows that the system based on vCLUB-S has only two regimes: "light" disentanglement and strong disentanglement. With light disentanglement the decoder fails at transferring the polarity, and with strong disentanglement few content features remain and the system tends to output generic sentences.

F Additional Results on Sentence Generation Using Gender Labels

F.1 Degree of Disentanglement

In Fig. 9, we report the adversary accuracy of the different methods for different values of λ. It is worth noting that gender labels are noisier than sentiment labels (Lample et al., 2018). We observe that the adversarial loss saturates at 55%, whereas a model trained on MI bounds can achieve better disentanglement. Additionally, the models trained with MI bounds allow better control of the desired degree of disentanglement.

F.2 Quality of Generated Sentences
Results on the sentence generation tasks are reported in Fig. 10 and Fig. 11. We observe that for λ > 1 the adversarial loss degenerates, as observed in the sentiment experiments. Compared to the sentiment setting we observe lower BLEU scores, which can be explained by the length of the reviews in the FYelp dataset. On the other hand, we observe a similar trade-off between style transfer accuracy and content preservation in the non-degenerate case: as style transfer accuracy increases, content preservation decreases. Overall, we remark a behaviour similar to the one observed in the sentiment experiments.

G Additional Results on Multi-class Sentence Generation

Results on the multi-class style transfer and on conditional sentence generation are reported in Fig. 4b and ??. Similarly to the binary case, there exists a trade-off between content preservation and style transfer accuracy. We observe that the BLEU score in this task is in a similar range to the one in the gender task, which is expected because the data come from the same dataset where only the labels changed.

Figure 11: Numerical experiments on conditional sentence generation using gender labels. Results include: BLEU (Fig. 11a); cosine similarity (Fig. 11d); style transfer accuracy (Fig. 11b); sentence fluency (Fig. 11c).

Figure 12: Numerical experiments on multi-class conditional sentence generation. Results include: BLEU (Fig. 12a); cosine similarity (Fig. 12d); style transfer accuracy (Fig. 12b); sentence fluency (Fig. 12c).