The Impact of Differential Privacy on Group Disparity Mitigation

The performance cost of differential privacy has, for some applications, been shown to be higher for minority groups; fairness, conversely, has been shown to disproportionally compromise the privacy of members of such groups. Most work in this area has been restricted to computer vision and risk assessment. In this paper, we evaluate the impact of differential privacy on fairness across four tasks, focusing on how attempts to mitigate privacy violations and between-group performance differences interact: Does privacy inhibit attempts to ensure fairness? To this end, we train (ε, δ)-differentially private models with empirical risk minimization and group distributionally robust training objectives. Consistent with previous findings, we find that differential privacy increases between-group performance differences in the baseline setting; more interestingly, however, differential privacy reduces between-group performance differences in the robust setting. We explain this by reinterpreting differential privacy as regularization.


Introduction
Classification tasks in computer vision and natural language processing face the challenge of balancing performance with the need to prevent discrimination against protected demographic subgroups, thereby satisfying fairness principles. In some tasks, we train our classifiers on private data and therefore also need our models to satisfy privacy guarantees.
Privacy-preserving algorithms, however, may tend to disproportionally affect members of minority classes (Farrand et al., 2020). E.g., Bagdasaryan et al. (2019) show that the performance cost of differential privacy (Dwork et al., 2006) in face recognition is higher for minority groups, suggesting that privacy and fairness are fundamentally at odds (Chang and Shokri, 2021; Agarwal, 2021). In this paper, we evaluate two hypotheses at scale: (a) that the performance cost of differential privacy is unevenly distributed across demographic groups (Ekstrand et al., 2018; Cummings et al., 2019; Bagdasaryan et al., 2019; Farrand et al., 2020), and (b) that such effects can be mitigated by more robust learning objectives (Sagawa et al., 2020a; Pezeshki et al., 2020).
Contributions We build upon previous work suggesting that differential privacy and fairness are at odds: differential privacy may often hurt minority groups the most, and reducing the fairness gap by focusing on minority groups during training typically puts their privacy at risk. We evaluate this hypothesis at scale by measuring the impact of differential privacy on fairness under (1) a baseline empirical risk minimization objective and (2) a group distributionally robust optimization objective. We conduct our experiments across four tasks of different modalities, assuming that group membership information is available at training time, but not at test time: face recognition (CelebA), topic classification, volatility forecasting based on earnings calls, and sentiment analysis of product reviews. Our results confirm that differential privacy compromises fairness in the baseline setting; however, we demonstrate that in the distributionally robust setting, differential privacy not only mitigates the decrease in fairness but, for 4/5 datasets, improves fairness compared to non-private experiments. We explain this by reinterpreting differential privacy as an approximation of Gaussian noise injection, which is equivalent to strategies previously shown to determine the efficacy of group-robust learning.

Fairness and Privacy
Fair machine learning aims to ensure that induced models do not discriminate against individuals with specific values in their protected attributes (e.g., race, gender). We represent each data point as z = (x, g, y) ∈ X × G × Y, with g ∈ G encoding its protected attribute(s). Let D_{g,y} denote the distribution of data with protected attribute g and label y.
Several definitions of group fairness exist in the literature (Williamson and Menon, 2019), but here we focus on a generalization of approximately constant conditional (equalized) risk (Donini et al., 2018):

Definition 2.1 (∆-Fairness). Let ℓ_{g_i}(θ) = E[ℓ(θ(x), y) | g = g_i] be the risk of the samples in the group defined by g_i, and ∆ ∈ [0, 1]. We say that a model θ is ∆-fair if for any two values of g, say g_i and g_j, |ℓ_{g_i}(θ) − ℓ_{g_j}(θ)| ≤ ∆.

Note that if ℓ coincides with the performance metric of a task, and ∆ = 0, this is identical to performance or classification parity (Yuan et al., 2021). Such a notion of fairness can be derived from John Rawls' theory on distributive justice and stability, treating model performance as a resource to be allocated. Rawls' difference principle, maximizing the welfare of the worst-off group, is argued to lead to stability and mobility in society at large (Rawls, 1971). ∆ directly measures what is sometimes called Rawlsian min-max fairness (Bertsimas et al., 2011). In our experiments, we measure ∆-fairness as the absolute difference between the performance of the worst-off and best-off subgroups.
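In code, this measurement is straightforward; below is a minimal sketch (the per-example losses and group labels are hypothetical):

```python
import statistics

def group_risks(losses, groups):
    """Mean loss per protected group, i.e. an estimate of E[loss | g]."""
    by_group = {}
    for loss, g in zip(losses, groups):
        by_group.setdefault(g, []).append(loss)
    return {g: statistics.mean(v) for g, v in by_group.items()}

def fairness_gap(losses, groups):
    """Delta: absolute difference between worst-off and best-off group risk."""
    risks = group_risks(losses, groups)
    return max(risks.values()) - min(risks.values())

# Hypothetical per-example losses for two groups "a" and "b":
print(fairness_gap([0.25, 0.75, 1.0, 0.5], ["a", "a", "b", "b"]))  # 0.25
```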
Recall the standard definition of (ε, δ)-differential privacy: a randomized algorithm M is (ε, δ)-differentially private if, for any set of outputs S and any two datasets X and X′ that differ in at most one row, Pr[M(X) ∈ S] ≤ exp(ε) · Pr[M(X′) ∈ S] + δ.
Differential privacy thereby ensures that an algorithm will generate similar outputs on similar data sets. Note that the multiplicative bound exp(ε) and the additive bound δ serve different roles: the δ term represents the possibility that a few data points are not governed by the multiplicative bound, which controls the level of privacy (rather than its scope).
It also follows directly that if ε = 0 and δ = 0, absolute privacy is required, forcing θ to be independent of the data.
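For intuition about how ε and δ trade off against noise, recall that the classic Gaussian mechanism achieves (ε, δ)-privacy for ε ∈ (0, 1) with noise scale σ = sqrt(2 ln(1.25/δ)) · S / ε, where S is the ℓ2 sensitivity of the released statistic. A minimal sketch follows; the clamped-mean release and its parameters are illustrative, not part of our experimental setup:

```python
import math
import random

def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale for the classic Gaussian mechanism, valid for 0 < epsilon < 1."""
    assert 0 < epsilon < 1
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon

def private_mean(values, lo, hi, epsilon, delta):
    """Release a clamped mean with Gaussian noise.
    Replacing one of n values shifts the clamped mean by at most (hi - lo) / n."""
    n = len(values)
    clamped = [min(max(v, lo), hi) for v in values]
    sensitivity = (hi - lo) / n
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return sum(clamped) / n + random.gauss(0.0, sigma)

random.seed(0)
print(private_mean([0.2, 0.4, 0.9], lo=0.0, hi=1.0, epsilon=0.5, delta=1e-5))
```

Stricter privacy (smaller ε or δ) requires a larger σ, which is the mechanism behind the performance costs discussed below.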
Several authors have shown that differential privacy comes at different costs for minority subgroups (Ekstrand et al., 2018; Cummings et al., 2019; Bagdasaryan et al., 2019; Farrand et al., 2020). The more private the model is required to be, the larger the group disparities it may exhibit. This happens because differential privacy distributes noise where it is needed to reduce the influence of individual examples. Since outlier examples are likely to have a disproportional influence on output distributions (Campbell, 1978; Chernick and Murthy, 1983), they are also disproportionally affected by noise injection in differential privacy. Agarwal (2021) shows that, in fact, an (ε, 0)-private and fully fair model (using equalized odds as the definition of fairness) will be unable to learn anything. To see this, remember that a fully private model is independent of the data and unable to learn from correlations between input and output. If θ is, in addition, required to be fair, it is thereby required to be fair for all distributions, which prevents θ from encoding any prior beliefs about the output distribution. Note that this finding generalizes straightforwardly to equalized risk, and even to approximate fairness (since even for finite distributions, we can define a ∆ > 0 such that preserving absolute privacy would lead to a constant θ).
Theorem 1. For sufficiently small values of ∆, a fully (ε, 0)-private model θ that is also ∆-fair will have trivial performance.
Proof. This follows directly from the above.
While we do not strictly require absolute privacy in our experiments (setting δ = 10⁻⁵), intuitively, privacy potentially compromises fairness by adding more noise to data points of minority group members than to those of majority groups. Fairness, on the other hand, leads to over-sampling or over-attending to data points of minority group members, making it more likely that their privacy is compromised. Pannekoek and Spigler (2021) show, however, that it is possible to learn somewhat private and somewhat fair classifiers by combining differential privacy with reject option classification. Their results nevertheless suggest that privacy and fairness objectives are fundamentally at odds, as fairness decreases with the introduction of differential privacy.

Experiments
This section describes the algorithms and datasets involved in our experiments and presents their results.

Algorithms
Empirical Risk Minimization For a model parameterized by θ, in our baseline Empirical Risk Minimization (ERM) setting, we minimize the expected loss over data (x, g, y) ∈ X × G × Y drawn from a dataset D:

θ̂_ERM = argmin_θ E_{(x,g,y)∼D̂}[ℓ(θ(x), y)]    (1)

Here D̂ denotes the empirical training distribution. Note that we disregard any group information in our data. In an overparameterized setting, ERM is prone to overfitting spurious correlations, which are more likely to hurt performance on minority groups (Sagawa et al., 2020b).
Distributionally Robust Optimization Several authors have suggested mitigating the effects of such overfitting by explicitly optimizing for out-of-distribution mixtures of sub-populations (Hu et al., 2018; Oren et al., 2019; Sagawa et al., 2020a). In this work we focus on Group-aware Distributionally Robust Optimization (Group DRO) (Sagawa et al., 2020a). Under the assumption that the training distribution D is a mixture of a discrete number of groups, D_g for g ∈ G, we define the worst-case loss as the maximum of the group-specific expected losses:

ℓ_wc(θ) = max_{g∈G} E_{(x,y)∼D̂_g}[ℓ(θ(x), y)]    (2)

In Group DRO, in contrast with ERM, we exploit our knowledge of the group membership of data points (x, g, y). The overall objective is therefore to minimize the empirical worst-case loss:

θ̂_DRO = argmin_θ max_{g∈G} E_{(x,y)∼D̂_g}[ℓ(θ(x), y)]    (3)

Note, again, that knowledge of group membership g is only available at training time, not at test time. Unlike Sagawa et al. (2020a), we do not employ heavy ℓ2 regularization in our experiments, but rather use the same parameters as proposed in Koh et al. (2021).
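The contrast between the two objectives can be sketched in a few lines; the per-example losses and group labels below are hypothetical, and a real implementation computes these quantities over minibatches and updates θ by gradient descent:

```python
def erm_loss(losses, groups):
    """ERM: average loss, ignoring group membership."""
    return sum(losses) / len(losses)

def group_dro_loss(losses, groups):
    """Group DRO: worst expected loss over groups, i.e. max_g E[loss | g]."""
    by_group = {}
    for loss, g in zip(losses, groups):
        by_group.setdefault(g, []).append(loss)
    return max(sum(v) / len(v) for v in by_group.values())

# A majority group with low loss and a minority group with high loss:
losses = [0.25, 0.25, 0.25, 1.0]
groups = ["maj", "maj", "maj", "min"]
print(erm_loss(losses, groups))        # 0.4375
print(group_dro_loss(losses, groups))  # 1.0
```

Note that Sagawa et al. (2020a) optimize a smoothed version of this maximum via exponentiated-gradient updates over group weights; the hard max above is the idealized objective.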
Differentially Private Stochastic Gradient Descent (DP-SGD) We implement differential privacy (Dwork et al., 2006) using DP-SGD, as presented in Abadi et al. (2016). DP-SGD limits the influence of individual training samples by (i) clipping each per-sample gradient whose norm exceeds a predetermined clipping bound C, and (ii) adding Gaussian noise N, characterized by a noise scale σ, to the aggregated per-sample gradients. We control this influence with a privacy budget ε, where lower values of ε indicate a stricter level of privacy. DP-SGD has remained popular, among other reasons because it generalizes to iterative training procedures (McMahan et al., 2018) and supports tighter bounds using the Rényi method (Mironov, 2017).
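A minimal sketch of the DP-SGD aggregation step, with gradients as plain Python lists; libraries such as Opacus implement this per parameter tensor, together with the privacy accounting:

```python
import math
import random

def dp_sgd_aggregate(per_sample_grads, clip_bound, noise_scale, rng=random):
    """Clip each per-sample gradient to L2 norm <= clip_bound (C), sum them,
    add Gaussian noise with std noise_scale * C, and average over the batch."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_bound / norm) if norm > 0 else 1.0
        for i in range(dim):
            total[i] += grad[i] * scale
    return [(t + rng.gauss(0.0, noise_scale * clip_bound)) / n for t in total]

random.seed(0)
grads = [[3.0, 4.0], [0.1, -0.2]]  # hypothetical per-sample gradients; first has norm 5
print(dp_sgd_aggregate(grads, clip_bound=1.0, noise_scale=1.1))
```

With noise_scale set to 0 the step reduces to plain gradient clipping, which makes the clipping behavior easy to verify in isolation.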
Differential privacy generally comes at a performance cost, leading privacy-preserving models to perform worse than their non-private counterparts (Alvim et al., 2011). However, we follow Kerrigan et al. (2020) and finetune the private models, which are first pretrained (without differential privacy) on a large public dataset. This protocol generally seems to provide a better tradeoff between accuracy and privacy (Kerrigan et al., 2020), leading to better-performing, yet private models. The only exception to this setup is the volatility forecasting task, where our models are trained from scratch, as they rely on PRAAT audio features.

Tasks and architectures
To study the impact of differential privacy on fairness under ERM and Group DRO, we evaluate increasing levels of differential privacy across five datasets that span four tasks and three modalities: speech, text, and vision.

Facial Attribute Detection
We study facial attribute recognition with the CelebFaces Attributes Dataset (CelebA) (Liu et al., 2015). It contains faces of celebrities annotated with attributes such as hair color, gender, and other facial features. Following Sagawa et al. (2020a), we use hair color as our target variable, with gender as the demographic attribute (see Figure 1, left). The dataset contains ∼163K datapoints, where the smallest group (blond males) counts only 1,387 examples. We finetune a publicly available pretrained ResNet50, a standard model for image classification tasks, on the CelebA dataset and report model performance as accuracy over 3 individual seeds.
Topic Classification For topic classification, we use the Blog Authorship Corpus (Schler et al., 2006). The Blog Authorship Corpus contains weblogs written on 19 different topics, collected from the Internet before August 2004. The dataset contains self-reported demographic information about the gender and age of the authors; gender is only available in binary form. We binarize age, distinguishing between young (≤ 35) and older (> 35) authors, resulting in four different group combinations (see Figure 1, right). We chose two topics of roughly equal size (Technology and Arts), reducing the topic classification task to a binary classification task. For our experiments, we finetune a pretrained English DistilBERT model (Sanh et al., 2019). To reduce the overall added computational cost of DP-SGD, we freeze our model, except for the outermost Transformer encoder layer and the classification layer. We report model performances as F1 scores over 3 individual seeds.
Volatility Forecasting For the stock volatility forecasting task, we use the Earnings Conference Calls dataset by Qin and Yang (2019). It consists of 559 public earnings call audio recordings for 277 companies in the S&P 500 index, spanning over a year of earnings calls. The self-reported genders of the CEOs were scraped by Sawhney et al. (2021) from Reuters, Crunchbase, and the WikiData API. The extracted genders were found to be in binary form (Sawhney et al., 2021), with 12.3% of speakers being female and 87.7% male, a highly skewed distribution.
Since our primary focus with this task is to explore the impact of differential privacy on speech, we use only audio features, without the call transcripts. For each audio recording A of a given earnings call E, the goal is to predict the company's stock volatility as a regression task. Following Qin and Yang (2019), we calculate the average log volatility τ days (the temporal window) following the day of the earnings call. For each audio clip belonging to a given call, we extract 26-dimensional features with PRAAT (Boersma and Van Heuven, 2001). Each audio embedding of the call is fed sequentially to a BiLSTM, followed by an attention layer and two fully connected layers. The model is trained by minimizing the Mean Squared Error (MSE) between the predicted and true stock volatility. For all results, we report MSE on the test set for a 70:10:20 temporal split of the data, averaged over 5 seeds.
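A minimal sketch of the volatility target computation, following the Kogan et al. (2009) definition used by Qin and Yang (2019): the target is the log of the standard deviation of daily returns r_t = P_t / P_{t−1} − 1 over the window (the closing prices below are hypothetical):

```python
import math

def returns(prices):
    """Daily return r_t = P_t / P_{t-1} - 1 from consecutive closing prices."""
    return [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]

def log_volatility(prices):
    """Log volatility over the window covered by `prices`:
    ln of the (population) standard deviation of the daily returns."""
    r = returns(prices)
    mean = sum(r) / len(r)
    var = sum((x - mean) ** 2 for x in r) / len(r)
    return math.log(math.sqrt(var))

prices = [100.0, 102.0, 99.0, 101.0, 100.0]  # hypothetical closing prices
print(log_volatility(prices))
```

Calmer price series yield smaller return variance and thus a more negative log volatility, which is why longer temporal windows τ produce more stable targets.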
Sentiment Analysis For our sentiment analysis task, we use the Trustpilot Corpus (Hovy et al., 2015). It consists of text-based user reviews from the Trustpilot website, rating companies and services on a 1 to 5 star scale. The reviews span five countries: Germany, Denmark, France, the United States, and the United Kingdom; we focus on the US and UK portions (Trustpilot-US and Trustpilot-UK). We use the DP-SGD implementation provided by the Opacus differential privacy framework. For further details about data and training, see §6.2 in the Appendix. We release the code for our experiments at: https://github.com/vpetren/fair_dp.

Results
Our results are presented in Table 1. The top half of the table presents standard (average) performance numbers across multiple runs of ERM and Group DRO at different privacy levels. Recall that performance for sentiment analysis and topic classification is measured in F1, volatility forecasting in MSE, and face recognition in accuracy. The accuracy of our ERM face attribute detection classifier, for example, is 0.954 in the non-private setting. Our first observation is that, as hypothesized, differential privacy hurts model performance. For our smallest text-based dataset (T-US), performance becomes very poor at the strictest privacy level; this is, however, associated with a high amount of variance between seeds (see Figure 5 in the Appendix). The above face attribute detection classifier, which had an accuracy of 0.954 in the non-private setting, drops to 0.932 at this level.

Differential privacy hurts fairness in ERM
The effect of differential privacy on fairness (bottom half of Table 1) is also quite consistent. The gap between the majority group and the minority group (or, more precisely, between the best-performing and the worst-performing demographic subgroup) widens with increased privacy. In face recognition, for example, the accuracy gap between the two groups is 0.556 without differential privacy, but 0.770 at the strictest privacy level.

Differential privacy increases fairness in Group DRO

For Group DRO, we see the opposite effect. For 4/5 datasets, differential privacy leads to an increase in fairness. For face recognition, for example, the gap goes from 0.514 in the non-private setting to 0.056 in the strictest, basically disappearing. This is also illustrated in the bar plots in Figure 2; see Figure 3 for similar bar plots of the topic classification results, and the Appendix for the other tasks. We do also observe that this increase in privacy can be expensive in terms of overall performance (e.g., Trustpilot-US). Note that the increase in fairness at higher privacy levels is seemingly at odds with previous results suggesting that privacy and fairness conflict, e.g., Agarwal (2021). We return to this question in §3.4.

Figure 4: Volatility Forecasting: A comparison of group disparity between subgroups for increasing temporal volatility windows (τ) and privacy budgets (ε), over 5 independent runs.
Note also that the only exception to the latter trend is volatility forecasting, where differential privacy hurts fairness both in ERM and Group DRO (though Group DRO mitigates the disparity). This speech-based prediction is the only regression task, and the only task for which we do not rely on pretrained models trained on public data.
For this task, we further analyze group disparity for varying temporal windows (τ), used to calculate target volatility values, along with increasingly strict privacy budgets (ε), in Figure 4. The disparity between subgroups widens with stricter privacy guarantees (Bagdasaryan et al., 2019). This gap is significant for lower values of τ, strengthening the hypothesis that short-term volatility forecasting is much harder than long-term (Qin and Yang, 2019), especially for minority classes, due to the disproportionate impact of noise. Comparing ERM and Group DRO, we find that Group DRO mitigates this disparity gap. We observe that disparity reduces with an increasing temporal window, since stock prices over a larger time frame are comparatively more stable (Qin and Yang, 2019). As a consequence, the influence of Group DRO for higher τ (6, 7) is reduced, despite facilitating faster convergence. Most importantly, we observe the power of Group DRO in mitigating the disparity caused by strict privacy safeguards (ε = 0.96) for crucial short-term prediction (τ = 3) tasks.

Discussion
It is well known that differential privacy comes with a performance cost (Shokri and Shmatikov, 2015). However, recent work has additionally shown that differential privacy is at odds with most, if not all, definitions of fairness, including equalized risk (Ekstrand et al., 2018; Cummings et al., 2019; Bagdasaryan et al., 2019; Farrand et al., 2020). Our work makes two important contributions: (a) we evaluate and confirm this hypothesis at a larger scale than previous studies for standard empirical risk minimization; and (b) we point out that the opposite holds true in the context of Group Distributionally Robust Optimization: here, adding differential privacy improves fairness (equalized risk).
While (b) at first seems to contradict the very hypothesis that (a) confirms, namely that privacy is at odds with fairness, we believe the explanation is quite simple: we are observing two opposite trends at the same time. On the one hand, differential privacy adds disproportionate noise to minority group examples; on the other hand, it adds Gaussian noise, which acts as a regularizer and thereby improves robust optimization.
In their evaluation of Group Distributionally Robust Optimization, Sagawa et al. (2020a) observe that robustness is only achieved in the context of heavy regularization; specifically, they show fairness improvements when they add ℓ2 regularization or early stopping. The ℓ2 regularization and early stopping did not increase fairness under ERM, but seemed to 'activate' Group DRO. This makes intuitive sense: since regularized models cannot perfectly fit the training data, heavily regularized Group DRO sacrifices average performance for worst-case performance and obtains better generalization. In the absence of regularization, however, Group DRO is less effective.
In our experiments (§3), we add minimal regularization to Group DRO, following the implementation in Koh et al. (2021); differential privacy, we argue, provides that additional regularization. To see this, remember that DP-SGD works by Gaussian noise injection, and Gaussian noise injection is known to be near-equivalent to ℓ2 regularization and early stopping (Bishop, 1995). DP-SGD simply makes the trade-off more urgent.

Related Work
Fair machine learning Early work on mitigating group-level disparities included oversampling (Shen et al., 2016; Guo and Viktor, 2004) and undersampling (Drumnond, 2003; Barandela et al., 2003), as well as instance weighting (Shimodaira, 2000). Other proposals modify existing training algorithms or cost functions to obtain fairness (Khan et al., 2017; Chung et al., 2015). In the context of large-scale deep neural networks, Group DRO is a particularly interesting approach to mitigating group-level disparities (Creager et al., 2021). See Williamson and Menon (2019) and Corbett-Davies and Goel (2018) for interesting discussions of how fairness has been measured. More recent alternatives to Group DRO include Invariant Risk Minimization (Arjovsky et al., 2020), Spectral Decoupling (Pezeshki et al., 2020) and Adaptive Risk Minimization (Zhang et al., 2021). We ran experiments with both Invariant Risk Minimization and Spectral Decoupling, but they performed much worse than Group DRO.
Fairness and privacy Recent studies suggest that privacy-preserving methods such as differential privacy tend to disproportionately affect minority class samples (Ekstrand et al., 2018; Cummings et al., 2019; Bagdasaryan et al., 2019; Farrand et al., 2020). Pannekoek and Spigler (2021) show that it is possible to learn somewhat private and somewhat fair classifiers, in their case by combining differential privacy and reject option classification. Jagielski et al. (2019) introduced the so-called DP-oracle-learner, derived from an oracle-efficient algorithm (Agarwal et al., 2018), which satisfies equalized odds, an alternative notion of fairness (Williamson and Menon, 2019). Lyu et al. (2020) introduced Differentially Private GANs (DPGANs), while Tran et al. (2020) utilize Lagrangian duality to integrate fairness constraints on protected attributes. Group DRO has, to the best of our knowledge, not been studied under differential privacy before.

Conclusions
In §2, we summarized previous work suggesting that differential privacy and fairness are at odds. In §3, we then confirmed this hypothesis at scale, across five datasets spanning four tasks and three modalities, showing that for Empirical Risk Minimization, stricter levels of privacy consistently hurt fairness. This holds true even after pretraining on large-scale public datasets (Kerrigan et al., 2020). In the context of Group-aware Distributionally Robust Optimization (Group DRO) (Sagawa et al., 2020a), however, which is designed to mitigate group-level performance disparities (optimizing for equalized risk), we saw the opposite effect: strict levels of differential privacy were associated with an increase in fairness. In §3.4, we discuss how this aligns well with the observation that Group DRO works best in the context of heavy ℓ2 regularization, keeping in mind that Gaussian noise injection is near-equivalent to ℓ2 regularization (Bishop, 1995).

Appendix 6.1 Additional Figures
This section contains group-specific bar plots for the performance on individual groups in the Trustpilot Corpus. For bar plots on CelebA and Blog Authorship, see Figures 2 and 3.

Experimental Details
This section contains additional details surrounding the experiments described in §3. The target stock volatility variable is calculated following Kogan et al. (2009) and Qin and Yang (2019), defined by:

v_{[t−τ, t]} = ln( sqrt( (1/τ) Σ_{i=t−τ}^{t} (r_i − r̄)² ) )

Here r_t is the return price at day t, and r̄ the mean of the return prices over the period t − τ to t. We refer to τ as the temporal volatility window in our experiments. The return price r_t is defined as r_t = P_t / P_{t−1} − 1, where P_t is the closing price on day t.

For the Trustpilot tasks, we withhold 200 samples from each demographic for validation as well as testing, with an even distribution among the ratings (1 to 5). The review scores are then binarized by grouping positive (4 and 5 stars) and negative (1 and 2 stars) reviews and discarding neutral ones (3 stars). For a similar use of this binarization scheme, see Gupta et al. (2020) and Desai et al. (2019). See the group distributions for the training data in Tables 5 and 6 for the US and UK tasks, respectively.

DistilBERT DistilBERT is a small Transformer model trained by distilling BERT (Devlin et al., 2019) (bert-base-uncased). It has 3/5 of the parameters of bert-base-uncased and runs 60% faster, while preserving over 95% of the performance of bert-base-uncased, as measured on the GLUE language understanding benchmark (Wang et al., 2018). We finetune DistilBERT on the Trustpilot corpus and the Blog Authorship corpus for 20 epochs each, using a batch size of 8 and accumulating gradients for a total virtual batch size of 16 using the built-in Opacus functionality. We limit the number of tokens in a sequence to 256 and use a learning rate of 5e-4 with the AdamW optimizer, in addition to a weight decay of 0.01. Otherwise we use the default parameters defined in the Huggingface Transformers python package (version 4.4.2). The models are trained using a single Nvidia TitanRTX GPU, and each configuration takes between 5 and 14 hours to run, depending on the size of the dataset and whether DP is used. We run 3 individual seeds for each configuration.
In our differentially private experiments with DistilBERT (i.e., Blog Authorship and Trustpilot), we fix the gradient clipping bound C to 1.2; by specifying target levels of ε ∈ {1, 5, 10}, a corresponding noise multiplier σ is computed with the Opacus framework, based on the batch size and number of training epochs.
Resnet50 ResNet50 is a variant of the ResNet model (He et al., 2015), with 48 convolutional layers along with 1 max pooling and 1 average pooling layer. It requires 3.8 × 10⁹ floating point operations.
We finetune our ResNet50 model on the CelebA dataset for 20 epochs using a batch size of 64. We optimize the model using standard stochastic gradient descent (SGD) with a learning rate of 1e-3, momentum of 0.9, and no weight decay. We train our models using a single Nvidia TitanRTX GPU, and each configuration takes between 6 and 8 hours to run, depending on whether DP is used. We run 3 individual seeds for each configuration.
As with the differentially private DistilBERT experiments, we fix the gradient clipping bound C to 1.2 and, by specifying target levels of ε ∈ {1, 5, 10}, compute a corresponding noise multiplier σ with the Opacus framework, based on the batch size and number of training epochs.

Figure 1:
Figure 1: Examples of the different subgroups that appear in a subset of the datasets we train on. CelebA (left) contains images of celebrities, using hair color as our target variable and gender as our protected attribute. The Blog Authorship Corpus (right) contains text-based blog posts on two topics, {Technology, Arts}, our targets, using G : {Man, Woman} × {Young, Old} as our protected subgroups.

Figure 2:
Figure 2: Face Attribute Detection: Performance of individual groups at increasing levels of ε. Comparing baseline ERM to Group DRO, we find that Group DRO performs much better on the minority group (blond males) under privacy constraints; we return to this in §3.4.

Figure 6:
Figure 6: Performance of individual groups at increasing levels of ε for the Trustpilot-UK corpus. Error bars show standard deviation over 3 individual seeds.
CelebA We use the same preprocessed version of the CelebA dataset as Sagawa et al. (2020a) and Koh et al. (2021); that is, we use the same train/val/test splits as Liu et al. (2015), with the Blond Hair attribute as the target and the Male attribute as the spuriously correlated variable. See the group distribution in the training data in Table 2.

Table 2 :
Group distribution in the training set of CelebA

Table 3 :
Group distribution in the training set of the Blog Authorship Corpus

Table 4 :
Group distribution in the training set of Earnings Conference Calls

Trustpilot We only include datapoints that contain complete demographic attributes, i.e., gender, age, and location; but as with our topic classification experiments, we only study the groups that we can define based on age and gender. All attributes are self-reported. For training, we divide the reviews into the four resulting groups (Old-Man, Young-Woman, etc.) and downsample the largest groups to match the size of the smallest group.