Linear Guardedness and its Implications

Methods for erasing human-interpretable concepts from neural representations that assume linearity have been found to be tractable and useful. However, the impact of this removal on the behavior of downstream classifiers trained on the modified representations is not fully understood. In this work, we formally define the notion of linear guardedness as the inability of an adversary to predict the concept directly from the representation, and study its implications. We show that, in the binary case, under certain assumptions, a downstream log-linear model cannot recover the erased concept. However, we constructively demonstrate that a multiclass log-linear model can indirectly recover the concept in some cases, pointing to the inherent limitations of linear guardedness as a downstream bias mitigation technique. These findings shed light on the theoretical limitations of linear erasure methods and highlight the need for further research on the connections between intrinsic and extrinsic bias in neural models.

A common instantiation of concept erasure is removing a concept (e.g., gender) from a representation (e.g., the last hidden representation of a transformer-based language model) such that it cannot be predicted by a log-linear model. Then, one fits a secondary log-linear model for a downstream task over the erased representations. For example, one may fit a log-linear sentiment analyzer to predict sentiment from gender-erased representations. The hope behind such a pipeline is that, because the concept of gender was erased from the representations, the predictions made by the log-linear sentiment analyzer are oblivious to gender. Previous work (Ravfogel et al., 2020; Elazar et al., 2021; Jacovi et al., 2021; Ravfogel et al., 2022a) has implicitly or explicitly relied on the assumption that erasing concepts from representations would also result in a downstream classifier that is oblivious to the target concept.
In this paper, we formally analyze the effect concept erasure has on a downstream classifier. We start by formalizing concept erasure using Xu et al.'s (2020) V-information. We then spell out the related notion of guardedness as the inability to predict a given concept from concept-erased representations using a specific family of classifiers. Formally, if V is the family of distributions realizable by a log-linear model, then we say that the representations are guarded against gender with respect to V. The theoretical treatment in our paper specifically focuses on log-linear guardedness, which we take to mean the inability of a log-linear model to recover the erased concept from the representations. We prove that when the downstream classifier is binary-valued, such as a binary sentiment classifier, its prediction indeed cannot leak information about the erased concept (§ 3.2) under certain assumptions. On the contrary, in the case of multiclass classification with a log-linear model, we show that predictions can potentially leak a substantial amount of information about the removed concept, thereby recovering the guarded information completely. The theoretical analysis is supported by experiments on commonly used linear erasure techniques (§ 5). While previous authors (Goldfarb-Tarrant et al. 2021, Orgad et al. 2022, inter alia) have empirically studied concept erasure's effect on downstream classifiers, to the best of our knowledge, we are the first to study it theoretically. Taken together, these findings suggest that log-linear guardedness may have limitations when it comes to preventing information leakage about concepts and should be assessed with extreme care, even when the downstream classifier is merely a log-linear model.

Information-Theoretic Guardedness
In this section, we present an information-theoretic approach to guardedness, which we couch in terms of V-information (Xu et al., 2020).

Preliminaries
We first explain the concept erasure paradigm (Ravfogel et al., 2022a), upon which our work is based. Let X be a representation-valued random variable. In our setup, we assume representations are real-valued, i.e., they live in R^D. Next, let Z be a binary-valued random variable that denotes a protected attribute, e.g., binary gender.2 We denote the two binary values of Z by Z def= {⊥, ⊤}. We assume the existence of a guarding function h : R^D → R^D that, when applied to the representations, removes the ability of a specific family of models to predict the concept Z from them. Furthermore, we define the random variable Y = t(h(X)) where t : R^D → Y def= {1, . . ., |Y|} is a function3 that corresponds to a linear classifier for a downstream task. For instance, t may correspond to a linear classifier that predicts the sentiment of a representation.
Our discussion in this paper focuses on the case when the function t is derived from the argmax of a log-linear model. In the binary case, we define Y's conditional distribution given h(X) as

p(Y = y | h(X) = h(x)) = 1{y = 1[σ(θ⊤h(x) + φ) > 1/2]}    (1)

where θ ∈ R^D is a parameter column vector, φ ∈ R is a scalar bias term, and σ is the logistic function.

2 Not all concepts are binary, but our analysis in § 2 makes use of this simplifying assumption.
3 The elements of Y are denoted y.
In the multiclass case, we define Y's conditional distribution given h(X) as

p(Y = y | h(X) = h(x)) = 1{y = y*},  where y* = argmax_{y'∈Y} (Θ⊤h(x) + φ)_{y'}    (3)

Θ_y ∈ R^D denotes the y-th column of Θ ∈ R^{D×K}, a parameter matrix, and φ ∈ R^K is the bias term. Note that K is the number of classes.
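To make the two labelers concrete, the following sketch implements t for both cases in NumPy (the function names t_binary and t_multiclass are ours, not notation from the paper):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def t_binary(hx, theta, phi):
    # binary labeler: threshold the log-linear model's probability at 1/2,
    # which is equivalent to taking the sign of theta^T h(x) + phi
    return int(sigmoid(theta @ hx + phi) > 0.5)

def t_multiclass(hx, Theta, phi):
    # multiclass labeler: y* = argmax over the K logits; the softmax
    # normalization does not change the argmax, so it can be skipped here
    return int(np.argmax(Theta.T @ hx + phi))
```

Here hx plays the role of the guarded representation h(x); both functions are deterministic, matching the fact that Y = t(h(X)).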

V-Information
Intuitively, a set of representations is guarded if it is not possible to predict a protected attribute z ∈ Z from a representation x ∈ R^D using a specific predictive family. As a first attempt, we naturally formalize predictability in terms of mutual information. In this case, we say that Z is not predictable from X if and only if I(X; Z) = 0. However, the focus of this paper is on linear guardedness, and, thus, we need a weaker condition than simply having the mutual information I(X; Z) = 0. We fall back on Xu et al.'s (2020) framework of V-information, which introduces a generalized version of mutual information. In their framework, they restrict the predictor to a family of functions V, e.g., the set of all log-linear models.
We now develop the information-theoretic background needed to discuss V-information. The entropy of a random variable Z is defined as

H(Z) = E_{z∼p(Z)} [−log p(z)]    (4)

Xu et al. (2020) analogously define the conditional V-entropy as

H_V(Z | X) = inf_{f∈V} E_{(x,z)∼p(X,Z)} [−log f[x](z)]    (5)

The V-entropy is a special case of Eq. (5) without conditioning on another random variable, i.e.,

H_V(Z) = inf_{f∈V} E_{z∼p(Z)} [−log f[∅](z)]    (6)

Xu et al. (2020) further define the V-information, a generalization of mutual information, as

I_V(X → Z) = H_V(Z) − H_V(Z | X)    (7)

In words, Eq. (7) is the best approximation of the mutual information realizable by a classifier belonging to the predictive family V. Furthermore, in the case of log-linear models, Eq. (7) can be approximated empirically by calculating the negative log-likelihood loss of the classifier on a given set of examples, as H_V(Z) is the entropy of the label distribution and H_V(Z | X) is the minimum achievable value of the cross-entropy loss.
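This empirical approximation can be sketched as follows. The estimator below is a toy written for this exposition: the helper names are ours, and a plain gradient-descent logistic regression stands in for a tuned solver.

```python
import numpy as np

def xent(p, z):
    # mean negative log-likelihood in nats of predicted probabilities p for labels z
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(z * np.log(p) + (1 - z) * np.log(1 - p))

def fit_logreg(X, z, lr=0.5, steps=2000):
    # plain gradient descent on the logistic loss
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - z
        w -= lr * (X.T @ g) / len(z)
        b -= lr * g.mean()
    return w, b

def v_information_estimate(X, z):
    # H_V(Z): the best constant predictor is the label marginal
    h_z = xent(np.full_like(z, z.mean(), dtype=float), z)
    # H_V(Z | X): cross-entropy loss of the fitted log-linear model
    w, b = fit_logreg(X, z)
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    h_z_given_x = xent(p, z)
    return max(0.0, h_z - h_z_given_x)
```

On data where z is linearly predictable from x, the estimate approaches H_V(Z) ≈ log 2; on random labels it stays close to zero.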

Guardedness
Having defined V-information, we can now formally define guardedness as the condition where the V-information is small.

Definition 2.1 (V-Guardedness). Let X be a representation-valued random variable and let Z be an attribute-valued random variable. Moreover, let V be a predictive family. A guarding function h ε-guards X with respect to Z over V if I_V(h(X) → Z) < ε.

Definition 2.2 (Empirical V-Guardedness). Let D = {(x_n, z_n)}_{n=1}^N where (x_n, z_n) ∼ p(X, Z). Let X and Z be random variables over R^D and Z, respectively, whose distribution corresponds to the marginals of the empirical distribution over D. We say that a function h(·) empirically ε-guards D with respect to the family V if I_V(h(X) → Z) < ε.

In words, according to Definition 2.2, a dataset is log-linearly guarded if no linear classifier can perform better than the trivial classifier that completely ignores X and always predicts Z according to the proportions of each label. The commonly used algorithms that have been proposed for linear subspace erasure can be seen as approximating the condition we call log-linear guardedness (Ravfogel et al., 2020, 2022a,b). Our experimental results focus on empirical guardedness, which pertains to practically measuring guardedness on a finite dataset. However, determining the precise bounds and guarantees of empirical guardedness is left as an avenue for future research.

Theoretical Analysis
In the following sections, we study the implications of guardedness for subsequent linear classifiers. Specifically, if we construct a third random variable Y = t(h(X)) where t : R^D → Y is a function, to what degree can Y reveal information about Z? As a practical instance of this problem, suppose we impose ε-guardedness on the last hidden representations of a transformer model, i.e., X in our formulation, and then fit a linear classifier t over the guarded representations h(X) to predict sentiment. Can the predictions of the sentiment classifier indirectly leak information on gender? For expressive V, the data-processing inequality (Cover and Thomas, 2006, § 2.8) applied to the Markov chain X → Y → Z tells us the answer is no: in this case, V-information is equivalent to mutual information, and the data-processing inequality tells us such leakage is not possible. However, the data-processing inequality does not generally apply to V-information (Xu et al., 2020). Thus, it is possible to find such a predictor t for less expressive V. Surprisingly, when |Y| = 2, we are able to prove that constructing such a t that leaks information is impossible under a certain restriction on the family of log-linear models.

Problem Formulation
We first consider the case where |Y| = 2.

A Binary Downstream Classifier
We begin by asking whether the predictions of a binary log-linear model trained over the guarded set of representations can leak information on the protected attribute. Our analysis relies on the following simplified family of log-linear models.
Definition 3.1 (Discretized Log-Linear Models). The family of δ-discretized binary log-linear models with parameter δ ∈ (0, 1) consists of models of the form ρ_δ(σ(θ⊤x + φ)), with σ being the logistic function, and where we define the δ-discretization function as

ρ_δ(p) = 1 − δ if p > 1/2, and δ if p < 1/2.

In words, ρ_δ is a function that maps the probability value to one of two possible values. Note that ties must be broken arbitrarily in the case that p = 1/2 to ensure a valid probability distribution.
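A δ-discretized model can be sketched directly from the definition (the function names are ours; ties at p = 1/2 are broken toward δ here, one of the arbitrary choices the definition allows):

```python
import numpy as np

def rho_delta(p, delta):
    # delta-discretization: map a probability to one of the two values
    # {delta, 1 - delta}; the tie at p = 1/2 is broken arbitrarily (downward)
    return 1.0 - delta if p > 0.5 else delta

def discretized_loglinear(hx, theta, phi, delta):
    # a member of V_delta: a logistic model whose output is discretized
    p = 1.0 / (1.0 + np.exp(-(theta @ hx + phi)))
    return rho_delta(p, delta)
```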
Our analysis is based on the simple observation (see Lemma A.1 in the Appendix) that the composition of two δ-discretized log-linear models is itself a δ-discretized log-linear model. Using this fact, we show that when |Y| = |Z| = 2 and the predictive family is the set of δ-discretized binary log-linear models, ε-guarded representations h(X) cannot leak information through a downstream classifier.
Theorem 3.2. Let V_δ be the family of δ-discretized log-linear models, and let X be a representation-valued random variable. Define Y as in Eq. (1). Then

I_Vδ(Y → Z) ≤ I_Vδ(h(X) → Z).

Proof. Define the hard thresholding function τ(p) = 1{p > 1/2}. Assume, by contradiction, that I_Vδ(Y → Z) > I_Vδ(h(X) → Z). We start by algebraically manipulating I_Vδ(Y → Z): unfolding the definition and making use of a change of variable in Eq. (1), where θ and φ stem from the definition of t in Eq. (1), the best predictor of Z from Y can be rewritten as a δ-discretized log-linear model applied directly to h(X). This chain of equalities gives us I_Vδ(h(X) → Z) ≥ I_Vδ(Y → Z), a contradiction.

A Multiclass Downstream Classifier
The above discussion shows that when both Z and Y are binary, ε-log-linear guardedness with respect to the family of discretized log-linear models (Definition 3.1) implies limited leakage of information about Z from Y. It was previously implied (Ravfogel et al., 2020; Elazar et al., 2021) that linear concept erasure prevents information leakage about Z through the labeling Y of a log-linear classifier, i.e., it was assumed that Theorem 3.2 in § 3.2 generalizes to the multiclass case. Specifically, it was argued that a subsequent linear layer, such as the linear language-modeling head, would not be able to recover the information because it is linear. In this paper, however, we note a key flaw in this argument. If the data is log-linearly guarded, then it is easy to see that the logits, which are a linear transformation of the guarded representation, cannot encode the information. However, multiclass classification is usually performed by a softmax classifier, which adds a non-linearity. Note that the decision boundary of the softmax classifier for every pair of labels is linear, since class i will have higher softmax probability than class j if, and only if, (Θ_i − Θ_j)⊤h(x) + φ_i − φ_j > 0. Next, we demonstrate that this is enough to break guardedness. We start with an example. Consider the data in R^2 presented in Fig. 1(a), where the distribution p(X, Z) has 4 distinct clusters, each with a different label from Z, corresponding to Voronoi regions (Voronoi, 1908) formed by the intersection of the axes. The red clusters correspond to Z = ⊤ and the blue clusters correspond to Z = ⊥. The data is taken to be log-linearly guarded with respect to Z. Importantly, we note that knowledge of the quadrant (i.e., the value of Y) renders Z recoverable by a 4-class log-linear model.

Assume the parameter matrix Θ = α [θ_1 θ_2 θ_3 θ_4], whose columns θ_1 = (1, 1)⊤, θ_2 = (−1, 1)⊤, θ_3 = (−1, −1)⊤, θ_4 = (1, −1)⊤ point into the four quadrants, for some α > 0. These directions encode the quadrant of a point: when the norm of the parameter vectors is large enough, i.e., for a large enough α, the probability of class i under a log-linear model will be arbitrarily close to 1 if, and only if, the input is in the i-th quadrant, and arbitrarily close to 0 otherwise. Given the information on the quadrant, the data is rendered perfectly linearly separable. Thus, the labels Y predicted by a multiclass softmax classifier can recover the linear separation according to Z.
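The quadrant construction is easy to verify numerically. The sketch below (our own toy data, standing in for Fig. 1(a)) samples four Gaussian clusters with XOR-style labels, applies the 4-class parameter matrix above, and checks that the predicted class, i.e., the quadrant, determines Z exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Four clusters, one per quadrant; Z alternates so that no single linear
# classifier separates the Z = 1 quadrants (I, III) from the Z = 0 ones (II, IV).
centers = np.array([[2, 2], [-2, 2], [-2, -2], [2, -2]], dtype=float)
z_of_quadrant = np.array([1, 0, 1, 0])
X = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])
quad = np.repeat(np.arange(4), 50)
z = z_of_quadrant[quad]

# Column i is alpha * theta_i, where theta_i points into quadrant i; for large
# alpha the argmax of the logits is exactly the quadrant of the input.
alpha = 10.0
Theta = alpha * np.array([[1, -1, -1, 1],
                          [1, 1, -1, -1]], dtype=float)
y_hat = np.argmax(X @ Theta, axis=1)

# knowing the predicted class (the quadrant) makes Z exactly recoverable
z_hat = z_of_quadrant[y_hat]
accuracy = (z_hat == z).mean()
```

No single linear boundary separates Z here, yet the multiclass argmax recovers it perfectly.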
This argument can be generalized to a separation that is not axis-aligned (Fig. 1(b)).
Definition 3.3. Let θ_1, . . ., θ_K be column vectors orthogonal to corresponding linear subspaces, and let R_1, . . ., R_M be the Voronoi regions formed by their intersection (Fig. 1(b)). Let p(X, Z) be any data distribution such that any two points in the same region have the same value of Z: z_i = z_j whenever x_i, x_j ∈ R_k, for all (x_i, z_i), (x_j, z_j) ∼ p(X, Z) and for all Voronoi regions R_k. We call such a distribution a K-Voronoi distribution.
Theorem 3.4. Fix ε > 0. Let p(X, Z) be a K-Voronoi distribution, and let h linearly ε-guard X against Z with respect to the family V of log-linear models. Then, for every η > 0, there exists a K-class log-linear model such that I_V(Y → Z) > 1 − η.

Proof. By assumption, the support of p(X) is divided up into K Voronoi regions, each with a label from Z. See Fig. 1 for an illustrative example. For each region j, define the region identifier ι_j(k) ∈ {−1, +1} as the sign that θ_k⊤x must take for a point x to belong to R_j. We make the simplifying assumption that points x that lie on a hyperplane θ_i⊤x = 0 for any i occur with probability zero. Consider a K-class log-linear model with a parameter matrix Θ ∈ R^{D×K} that contains, in its j-th column, the vector α Σ_k ι_j(k) θ_k, i.e., we sum over all θ_k and give positive weight to a vector θ_k if a positive dot product with it is a necessary condition for a point x to belong to the j-th Voronoi region. Additionally, we scale the parameter vector by some α > 0. Let x ∈ R_j and let R_m be a Voronoi region such that j ≠ m. We next inspect the ratio r(α) = p(Y = j | x) / (p(Y = j | x) + p(Y = m | x)). We now show that α(Σ_k (ι_j(k) − ι_m(k)) θ_k)⊤x > 0 through the consideration of the following three cases:

• Case 1: ι_j(k) = ι_m(k). In this case, the same sign of θ_k⊤x is required for belonging to both regions j and m. Thus, the summand is zero.
• Case 2: ι_j(k) = +1 and ι_m(k) = −1. As x ∈ R_j, we know that θ_k⊤x > 0, and the summand is positive.
• Case 3: ι_j(k) = −1 and ι_m(k) = +1. As x ∈ R_j, we know that θ_k⊤x < 0, and the summand is, again, positive.
Since j ≠ m, a summand corresponding to case 2 or case 3 must occur. Thus, the sum is strictly positive. It follows that lim_{α→∞} r(α) = 1.
Finally, for Y defined as in Eq. (3), we have p(Y = j | x ∈ R_j) = 1 for large α. Now, because all points in each R_j have a distinct label from Z, it is trivial to construct a binary log-linear model that places arbitrarily high probability on R_j's label, which gives us I_V(Y → Z) > 1 − η for all small η > 0. This completes the proof.
This construction demonstrates that one should be cautious when arguing about the implications of log-linear guardedness when multiclass softmax classifiers are applied over the guarded representations. When log-linear guardedness with respect to a binary Z is imposed, there may still exist a set of K > 2 linear separators that separate Z.

Accuracy-Based Guardedness
We now define an accuracy-based notion of guardedness and discuss its implications. Note that the information-theoretic notion of guardedness described above does not directly imply that the accuracy of the log-linear model is damaged. To see this, consider a binary log-linear model on balanced data that always assigns a probability of 1/2 + ∆ to the correct label and 1/2 − ∆ to the incorrect label. For small enough ∆, the cross-entropy loss of such a classifier will be arbitrarily close to the entropy log(2), even though it has perfect accuracy. This disparity motivates an accuracy-based notion of guardedness.
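The disparity is easy to check numerically; a small sketch with ∆ = 0.01 (our choice for illustration):

```python
import numpy as np

delta = 0.01
# a balanced binary classifier that always assigns probability 1/2 + delta
# to the correct label: perfect accuracy, yet nearly trivial cross-entropy
p_correct = 0.5 + delta
cross_entropy = -np.log(p_correct)   # loss in nats
accuracy = 1.0                       # the correct label is always ranked first
entropy = np.log(2.0)                # loss of the constant majority predictor

information_gap = entropy - cross_entropy  # vanishes as delta -> 0
```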
We first define the accuracy of a predictor f ∈ V as

A(f, X, Z) = E_{(x,z)∼p(X,Z)} 1{argmax_{z'} f[x](z') = z}

The conditional V-accuracy is defined as

A_V(Z | X) = sup_{f∈V} A(f, X, Z)    (16)

The V-accuracy is a special case of Eq. (16) when no random variable is conditioned on:

A_V(Z) = sup_{f∈V} A(f, Z)

where we have overloaded A to take only two arguments. We can now define an analogue of Xu et al.'s (2020) V-information for accuracy as the gap between the conditional V-accuracy and the unconditional V-accuracy:

I^A_V(X → Z) = A_V(Z | X) − A_V(Z)

Note that, for a binary, globally balanced Z, this quantity is bounded below by 0 and above by 1/2.

Definition 4.1 (Accuracy-Based V-Guardedness). Let X be a representation-valued random variable and let Z be an attribute-valued random variable. Moreover, let V be a predictive family. A guarding function h ε-guards X against Z with respect to V in terms of accuracy if I^A_V(h(X) → Z) < ε.

Definition 4.2 (Empirical Accuracy-Based V-Guardedness). Let D = {(x_n, z_n)}_{n=1}^N where (x_n, z_n) ∼ p(X, Z). Let X and Z be random variables over R^D and Z, respectively, whose distribution corresponds to the marginals of the empirical distribution over D. A guarding function h empirically ε-guards D with respect to V if I^A_V(h(X) → Z) < ε.

When focusing on accuracy in predicting Z, it is natural to consider the independence (also known as demographic parity; Feldman et al., 2015) of the downstream classifiers that are trained over the representations.
where p(X | Z) is the conditional distribution over representations given the protected attribute.
In Prop. 4.4, we prove that if the data is linearly ε-guarded and globally balanced with respect to Z, i.e., if p(Z = ⊥) = p(Z = ⊤) = 1/2, then the predictions of any binary linear downstream classifier are independent of Z up to an L1 gap of at most 4ε. Note that this is true regardless of any imbalance of the protected attribute Z within each class y ∈ Y in the downstream task: the data only needs to be globally balanced.
Proposition 4.4. Let V be the family of binary log-linear models, and assume that p(X, Z) is globally balanced, i.e., p(Z = ⊥) = p(Z = ⊤) = 1/2. Furthermore, let h be a guarding function that ε-guards X against Z with respect to V in terms of accuracy (Definition 4.2). Then, the L1 independence gap (Definition 4.3) of any binary log-linear classifier trained over h(X) is at most 4ε.

Proof. See App. A.2 for the proof.
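For a finite sample, the L1 independence gap of Definition 4.3 can be estimated as below (a sketch with our own function name, using empirical label frequencies in place of expectations):

```python
import numpy as np

def l1_independence_gap(y_pred, z):
    # sum over predicted labels y of |P(Y = y | Z = 1) - P(Y = y | Z = 0)|,
    # estimated from empirical frequencies
    gap = 0.0
    for y in np.unique(y_pred):
        p1 = np.mean(y_pred[z == 1] == y)
        p0 = np.mean(y_pred[z == 0] == y)
        gap += abs(p1 - p0)
    return gap
```

A predictor that copies Z attains the maximal gap of 2, while a predictor independent of Z attains 0.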

Experimental Evaluation
In the empirical portion of our paper, we evaluate the extent to which our theory holds in practice.
Data. We perform experiments on gender bias mitigation on the Bias in Bios dataset (De-Arteaga et al., 2019), which is composed of short biographies annotated by both gender and profession.
We represent each biography with the [CLS] representation in the final hidden representation of pre-trained BERT, which creates our representation random variable X.We then try to guard against the protected attribute gender, which constitutes Z.
Approximating log-linear guardedness. To approximate the condition of log-linear guardedness, we use RLACE (Ravfogel et al., 2022a). The method is based on a minimax game between a log-linear predictor that aims to predict the concept of interest from the representation and an orthogonal projection matrix that aims to guard the representation against prediction. The process results in an orthogonal projection matrix P ∈ R^{D×D} which, empirically, prevents log-linear models from predicting gender after the linear transformation P is applied to the representations. This process constitutes our guarding function h_R. Our theoretical result (Theorem 3.2) only holds for δ-discretized log-linear models. RLACE, however, guards against conventional log-linear models. Thus, we apply δ-discretization post hoc, i.e., after training.
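As an illustration of projection-based guarding (not RLACE itself), the sketch below projects out the class-mean difference direction, a much simpler linear erasure in the spirit of Bolukbasi et al. (2016); all names and the toy data are ours:

```python
import numpy as np

def mean_difference_projection(X, z):
    # P = I - v v^T for the unit class-mean difference direction v;
    # a much-simplified stand-in for the projection RLACE learns adversarially
    v = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    v = v / np.linalg.norm(v)
    return np.eye(X.shape[1]) - np.outer(v, v)

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=500)
# the protected-attribute signal lives along the first coordinate in this toy
X = rng.normal(size=(500, 8))
X[:, 0] += 3.0 * z

gap_before = np.linalg.norm(X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0))
P = mean_difference_projection(X, z)
Xg = X @ P  # guarded representations h_R(x) = P x
# after projection the class means coincide along every direction
residual = np.linalg.norm(Xg[z == 1].mean(axis=0) - Xg[z == 0].mean(axis=0))
```

RLACE instead finds the projection adversarially, but the resulting guarding function has the same form h_R(x) = P x.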

Quantifying Empirical Guardedness
We test whether our theoretical analysis of leakage through binary and multiclass downstream classifiers holds in practice, on held-out data. Profession prediction serves as our downstream prediction task (i.e., our Y), and we study binary and multiclass variants of this task. In both cases, we measure three V-information estimates:

• Evaluating I_V(X → Z). To compute an empirical upper bound on the information about the protected attribute which is linearly extractable from the representations, we train a log-linear model to predict z_n from x_n, i.e., from the unguarded representations. In other words, this is an upper bound on the information that could be leaked through the downstream classifier Y.

• Evaluating I_V(Y_p → Z). To measure the leakage we observe in practice, we train a log-linear model to predict the profession from the guarded representations, and a second log-linear model to predict gender from the first model's predicted labels.

• Evaluating I_V(Y_a → Z). In addition to the standard scenario estimated by I_V(Y_p → Z), we also ask: what is the maximum amount of information that a downstream classifier could leak about gender? I_V(Y_a → Z) estimates this quantity with a variant of the setup of I_V(Y_p → Z). Namely, instead of training the two log-linear models separately, we train them together to find the Y_a that is adversarially chosen to predict gender well. However, the argmax operation is not differentiable, so we remove it during training.
In practice, this means Y_a does not predict profession, but instead predicts a latent labeling which is adversarially chosen so as to best enable the prediction of gender.7 While a high I_V(Y_a → Z) indicates that there exists an adversarial log-linear model that leaks information about Z, it does not necessarily mean that common classifiers like those used to compute I_V(Y_p → Z) would leak that information. Across all three conditions, we explore how different values of the threshold δ (applied after training) affect the V-information. Refer to App. A.3 for comprehensive details regarding our experimental setup.
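The adversarial probe can be sketched as a composition of two log-linear models. In the sketch below (all names ours) we hand-set the weights to the quadrant construction of § 3.3 rather than training them, so it only illustrates the forward pass and the argmax-at-inference detail:

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def composition_forward(X, Theta1, theta2, hard=False):
    # inner K-class log-linear model over the (guarded) representations
    inner = softmax(X @ Theta1)
    if hard:
        # inference only: replace the soft distribution by its argmax (one-hot),
        # which is how Y_a is read off as a discrete label; the argmax is
        # dropped during training because it is not differentiable
        inner = np.eye(Theta1.shape[1])[inner.argmax(axis=-1)]
    # outer binary log-linear model that predicts the protected attribute
    return 1.0 / (1.0 + np.exp(-(inner @ theta2)))

# toy check on the four-quadrant data of Section 3.3
rng = np.random.default_rng(0)
centers = np.array([[2, 2], [-2, 2], [-2, -2], [2, -2]], dtype=float)
z_of_quadrant = np.array([1, 0, 1, 0])
X = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])
z = z_of_quadrant[np.repeat(np.arange(4), 50)]

Theta1 = 10.0 * np.array([[1, -1, -1, 1],
                          [1, 1, -1, -1]], dtype=float)
theta2 = np.array([5.0, -5.0, 5.0, -5.0])
p = composition_forward(X, Theta1, theta2, hard=True)
recovered = ((p > 0.5).astype(int) == z).mean()
```

In the actual experiments the two layers are trained end to end with the inner softmax kept soft; the hard=True path is used only at inference.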

Binary Z and Y
We start by evaluating the case where both Z and Y take only two values each.
Experimental Setting. To create our set Y for the binary classification task, we randomly sample 15 pairs of professions from the Bias in Bios dataset; see App. A.3. We train a binary log-linear model to predict each profession from the representation after the application of the RLACE guarding function, h_R(x_n). Empirically, we observe that our log-linear models achieve no more than 2% above majority accuracy on the protected data. For each pair of professions, we estimate the three forms of V-information.

7 Only at inference time do we apply the argmax over the first log-linear model to get a prediction Y_a = y. We find that the loss of the composition model is not increased by the argmax operation.
Results. The results are presented in Fig. 2 for the 15 pairs of professions we experimented with (each curve is the mean over all pairs), the three quantities listed above, and different values of the threshold δ on the x-axis. Unsurprisingly, we observe that the V-information estimated from the original representations (the red curve) has high values for some thresholds, indicating that BERT representations do encode gender distinctions. The blue curve, corresponding to I_V(Y_a → Z), measures the ability of the adversarially constructed binary downstream classifier to recover the gender information. It is lower than the red curve but is nonzero, indicating that the solution found by RLACE does not generalize perfectly. Finally, the orange curve, corresponding to I_V(Y_p → Z), measures the amount of leakage we get in practice from downstream classifiers that are trained on profession prediction. In that case, the numbers are significantly lower, showing that RLACE does provide decent guardedness in practice.

Binary Z, Multiclass Y
Empirically, we have shown that RLACE provides good, albeit imperfect, protection against binary log-linear model adversaries. This finding is in line with the conclusions of Theorem 3.2. We now turn to experiments on multiclass classification, i.e., where |Y| > 2. According to § 3.3, to the extent that the K-Voronoi assumption holds, we expect guardedness to be broken with a large enough |Y|.
Experimental Setting. Note that, since |Y| > 2, Y is a multiclass log-linear classifier over Y, but the logistic classifier that predicts gender from the argmax over these labels remains binary. We consider different values of |Y| = 2, 4, 8, 16, 32, 64.
Results. The results are shown in Fig. 3. For all professions, we find a log-linear model whose predicted labels are highly predictive of the protected attribute. Indeed, softmax classifiers with 4 to 8 entries (corresponding to hidden neurons in the network formed by the composition of two log-linear models) perfectly recover the gender information. This indicates that there are labeling schemes of the data using 4 or 8 labels that recover almost all information about Z.

Discussion
Even if a set of representations is log-linearly guarded, one can still adversarially construct a multiclass softmax classifier that recovers the information. These results stem from the disparity between the manifold in which the concept resides and the expressivity of the (linear) intervention we perform: softmax classifiers can access information that is inaccessible to a purely linear classifier. Thus, interventions that are aimed at achieving guardedness should consider the specific adversary against which one aims to protect.

Related Work
Techniques for information removal are generally divided into adversarial methods and post-hoc linear methods. Adversarial methods (Edwards and Storkey, 2016; Xie et al., 2017; Chen et al., 2018; Elazar and Goldberg, 2018; Zhang et al., 2018) use a gradient-reversal layer during training to induce representations that do not encode the protected attribute. However, Elazar and Goldberg (2018) have shown that these methods fail to exhaustively remove all the information associated with the protected attribute. Linear methods have been proposed as a tractable alternative, where one identifies a linear subspace that captures the concept of interest and neutralizes it using algebraic techniques. Different methods have been proposed for the identification of the subspace, e.g., PCA and variants thereof (Bolukbasi et al., 2016; Kleindessner et al., 2023), orthogonal rotation (Dev et al., 2021), classification-based approaches (Ravfogel et al., 2020), spectral methods (Shao et al., 2023a,b), and adversarial approaches (Ravfogel et al., 2022a). Different definitions have been proposed for fairness (Mehrabi et al., 2021), but they are mostly extrinsic: they concern themselves only with the predictions of the model, and not with its representation space. Intrinsic bias measures, which focus on the representation space of the model, have also been extensively studied. These measures quantify, for instance, the extent to which the word representation space encodes gender distinctions (Bolukbasi et al., 2016; Caliskan et al., 2017; Kurita et al., 2019; Zhao et al., 2019). The relation between extrinsic and intrinsic bias measures is understudied, but recent works have demonstrated empirically either a relatively weak or inconsistent correlation between the two (Goldfarb-Tarrant et al., 2021; Orgad et al., 2022; Cao et al., 2022; Orgad and Belinkov, 2022; Steed et al., 2022; Shen et al., 2022; Cabello et al., 2023).

Conclusion
We have formulated the notion of guardedness as the inability to directly predict a concept from the representation. We show that log-linear guardedness with respect to a binary protected attribute does not prevent a subsequent multiclass linear classifier trained over the guarded representations from leaking information on the protected attribute. In contrast, when the main task is binary, we can bound that leakage. Altogether, our analysis suggests that the deployment of linear erasure methods should carefully take into account the manner in which the modified representations are used later on, e.g., in classification tasks.

Limitations
Our theoretical analysis targets a specific notion of information leakage, and it likely does not apply to alternative ones. While the V-information-based approach seems natural, future work should consider alternative extrinsic bias measures as well as alternative notions of guardedness. Additionally, our focus is on the linear case, which is tractable and important but limits the generality of our conclusions. We hope to extend this analysis to other predictive families in future work.

Ethical Considerations
The empirical experiments in this work involve the removal of binary gender information from a pretrained representation. Beyond the fact that gender is a non-binary concept, this task may have real-world applications, in particular ones that relate to fairness. We would thus like to remind readers to take the results with a grain of salt and to be extra careful when attempting to deploy methods such as the one discussed here. Regardless of any theoretical result, care should be taken to measure the effectiveness of bias mitigation efforts in the context in which they are to be deployed, considering, among other things, the exact data to be used and the exact fairness metrics under consideration.
A.2 Proof of Proposition 4.4
Proof. In the following proof, we use the notation X = x for the guarded variable h(X) = h(x) to avoid notational clutter. Assume, by way of contradiction, that the L1 independence gap (Eq. (19)) exceeds 4ε. Then, there exists a y ∈ Y for which the gap term is larger than 2ε. We will show that we can build a classifier q ∈ V that breaks the assumption I^A_V(h(X) → Z) < ε. Next, we define the random variable Z_q for convenience as the argmax prediction of q. In words, Z_q is a random variable that ranges over possible predictions, derived from the argmax, of the binary log-linear model q. Now, consider the following two cases.
• Case 1: There exists a y such that the gap term for y is positive. Let Y be defined as in Eq. (1), and consider a random variable Z_r that predicts Z from Y accordingly. The key step follows from Eq. (37c) to Eq. (37d) because, despite the nuisance variable Y, the decision boundary of p(Z_r = ⊤ | X = x) is linear and, thus, there exists a binary log-linear model in V which realizes it.

• Case 2: There exists a y such that the gap term for y is negative. Let Y be defined as in Eq. (1), consider the analogously defined random variable, and proceed by the same algebraic manipulation.

In both cases, we have A_V(Z | X = x) > 1/2 + ε. Thus, E_{x∼p(X)} A_V(Z | X = x) = A_V(Z | h(X)) ≥ 1/2 + ε. Because the distribution p(Z, X) is globally balanced, we have A_V(Z) = 1/2. Thus, I^A_V(h(X) → Z) ≥ ε. However, this contradicts the assumption that I^A_V(h(X) → Z) < ε. This completes the proof.

A.3 Experimental Setting
In this appendix, we give additional information necessary to replicate our experiments ( § 5).
Data. We use the same train-dev-test split of the biographies dataset used by Ravfogel et al. (2020), resulting in training, evaluation, and test sets of sizes 255,710, 39,369, and 98,344, respectively. We reduce the dimensionality of the representations to 256 using PCA. The dataset is composed of short biographies, annotated with both gender and profession. We randomly sampled 15 pairs of professions from the dataset: (professor, attorney), (journalist, surgeon), (physician, nurse), (professor, physician), (psychologist, teacher), (attorney, teacher), (physician, journalist), (professor, dentist), (teacher, surgeon), (psychologist, surgeon), (photographer, surgeon), (attorney, psychologist), (physician, teacher), (professor, teacher), (professor, psychologist).

Optimization. We run RLACE (Ravfogel et al., 2022a) with simple SGD optimization, with a learning rate of 0.005, a weight decay of 1e-5, and a momentum of 0.9, chosen by experimenting with the development set. We use a batch size of 128. The algorithm is based on an adversarial game between a predictor that aims to predict gender and an orthogonal projection matrix adversary that aims to prevent gender classification. We choose the adversary which yielded the highest classification loss. All training is done on a single NVIDIA GeForce GTX 1080 Ti GPU.
Estimating V-information. After running RLACE, we get an approximately linearly-guarded representation by projecting x_n ← P x_n, where P is the orthogonal projection matrix returned by RLACE. We validate guardedness by training log-linear models over the projected representations; they achieve accuracy less than 2% above the majority accuracy. Then, to estimate I_V(Y_a → Z), we fit a simple neural network in the form of a composition of two log-linear models. The inner model has either a single hidden neuron with a logistic activation (in the binary experiment) or K = 2, 4, 8, 16, 32, 64 hidden neurons with softmax activations (in the multiclass experiment, § 5.3). The networks are trained end to end to recover binary gender for 25,000 batches of size 2048. Optimization is done with Adam with the default parameters. We use the loss of the second log-linear model to estimate I_V(Y_a → Z), according to Definition 2.2.
(a) Log-linearly guarded data in R^2 with axis-aligned clusters. (b) Log-linearly guarded data in R^2 with clusters that are not axis-aligned.

Figure 1: Construction of a log-linear model that breaks log-linear guardedness.
Definition 4.3 (L1 Independence Gap). The L1 independence gap measures the difference between the distribution of the model's predictions on the examples for which Z = ⊥ and the examples for which Z = ⊤. It is formally defined as

GAP(Y, Z) = Σ_{y∈Y} | E_{x∼p(X|Z=⊤)} p(Y = y | x) − E_{x∼p(X|Z=⊥)} p(Y = y | x) |

Figure 2: Results for § 5.2. Estimate of V-information between the protected attribute and (1) the original representations (red); (2) the labels induced by the inner model within a composition of two log-linear models, trained to adversarially recover gender (blue); (3) labels for the downstream task (the predictions of profession classifiers; orange). The curve is the mean over different pairs of professions, and the shaded area represents 1 standard deviation. The x-axis presents results for different values of the threshold δ. Recall that the thresholding is applied post hoc.

Figure 3: Results for § 5.3. Estimate of V-information between the protected attribute and Y_a with various δ.