Is the Lottery Fair? Evaluating Winning Tickets Across Demographics

Recent studies have suggested that weight pruning, e.g. using lottery ticket extraction techniques (Frankle and Carbin, 2018), comes at the risk of compromising the group fairness of machine learning models (Paganini, 2020; Hooker et al., 2020), but to the best of our knowledge, no one has empirically evaluated this hypothesis at scale in the context of natural language processing. We present experiments with two text classification datasets annotated with demographic information: the Trustpilot Corpus (sentiment) and CivilComments (toxicity). We evaluate the fairness of lottery ticket extraction through layer-wise and global weight pruning across three languages and two tasks. Our results suggest that there is a small increase in group disparity, which is most pronounced at high pruning rates and correlates with instability. The fairness of models trained with distributionally robust optimization objectives is sometimes less sensitive to pruning, but results are not consistent. The code for our experiments is available at https://github.com/vpetren/fairness_lottery.


Introduction
Heavily pruning deep neural network models is a way of reducing inference cost for resource-constrained environments, but does weight pruning of deep neural networks increase their unfairness? Several recent papers suggest this (Paganini, 2020; Hooker et al., 2020), based on experiments from face and digit recognition, but does this also hold for natural language processing (NLP) models? Systematic biases may easily be exacerbated by pruning interventions in high-dimensional problems because of feature swamping effects (Sutton et al., 2006). Overparameterized deep neural networks generalize well, in part because they can hedge their bets and rely on multitudes of weak evidence rather than the most prominent independent variables. Sparse models do not have that luxury and are therefore more sensitive to shifts (Globerson and Roweis, 2006; Søgaard, 2013).

Figure 1: Fairness Sensitivity to Pruning (FSP): the gradient of the linear fit of (the logarithm of) the pruning ratio to min-max group-level disparity. We use this to quantify the sensitivity of Rawlsian min-max fairness to weight pruning across architectures, pruning strategies and datasets.
We introduce a fairness sensitivity to pruning metric that measures how Rawlsian min-max fairness across demographic groups changes with weight pruning. We estimate this sensitivity by taking the gradient of the linear fit of the logarithm of the pruning ratio to min-max group-level disparity. We show that across four datasets, fairness sensitivity to pruning is similar for layer-wise and global pruning strategies (Frankle and Carbin, 2018), as well as for text classifiers based on feed-forward and recurrent neural networks. Subsequently, we consider the impact of a popular robust optimization strategy designed to improve the fairness of classification models (Hashimoto et al., 2018; Sagawa et al., 2020b) on the fairness sensitivity of feed-forward networks.

Contributions
We are, to the best of our knowledge, the first to study the impact of weight pruning on fairness in NLP at scale. We introduce a fairness sensitivity to pruning (FSP) metric that measures how Rawlsian min-max fairness across demographic groups decreases with weight pruning. We evaluate FSP across two architectures, two pruning strategies and two datasets, including multilingual sentiment classification and English toxicity classification. Our results suggest that pruning increases group-level performance disparities, but mostly at high pruning rates and with some variance across architectures and pruning strategies. Group-level disparities seem to be in part a result of the instability of weight pruning. We compare FSP between our baseline empirical risk minimization models and robust models induced with Distributionally Robust Optimization (DRO) (Hashimoto et al., 2018; Sagawa et al., 2020b). Our results show that weight pruning in combination with DRO can sometimes (8/16 cases here) be used to induce fairer, sparse classifiers, but the effect is not significant (p ∼ 0.18) across our experiments.
Related work

Fairness in pruned models Measuring fairness in pruned models is an unexplored area. However, Paganini (2020) evaluates the fairness, i.e., the difference between the best- and worst-case groups, of lottery ticket-style weight pruning for digit recognition problems: specifically, they retrain models for a fixed number of iterations using global unstructured pruning. In addition, they present a meta-regression study suggesting that underrepresented and more complex classes are most severely affected by pruning procedures. See Hooker et al. (2020) for related findings on the effects of compression on group fairness.

Improving fairness Fairness of overparameterized models can be improved by distributionally robust optimization (DRO) (Hashimoto et al., 2018; Levy et al., 2020), or to some extent by simpler post-hoc correction methods such as classifier retraining or group-specific classification thresholds (Menon et al., 2021). DRO minimizes the worst-case expected loss over an uncertainty set of distributions. The uncertainty set represents the distributions we want our model to perform well on. In Sagawa et al. (2020a), the uncertainty set is all possible mixtures of a known set of groups, a variant referred to as Group DRO. Sagawa et al. (2020b) find that subsampling the majority groups can be a way for overparameterized models to achieve both low minority test error as well as low average test error.

Pruning methodology
We extract winning lottery tickets from our network according to the iterative procedure outlined in Frankle and Carbin (2018): Given a model f(x; θ) with initial network parameters θ_0 and mask m_0, for each pruning iteration i we start by initializing a model with the initial parameters θ_0 and train it for N epochs, resulting in f(x; θ_N). After training, we prune a fixed fraction p ∈ [0, 1] of the remaining parameters in θ_N to obtain the mask m_i. The pruned weights are chosen by L1 norm, meaning the weights with the lowest magnitude are masked out. Pruning can be done either with respect to individual layers or to all of them combined, referred to as layer-wise and global pruning, respectively. The mask m_i is then carried over to the subsequent pruning iteration i + 1, where the model f(x; m_i ⊙ θ_0) is retrained once again. After iteration i, the fraction of weights pruned is therefore 1 − (1 − p)^i.
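The following is a minimal, hedged sketch of this procedure for global, unstructured pruning in PyTorch; the train_fn hook, the parameter selection and the mask handling are illustrative assumptions and not the authors' released code.

```python
import copy
import torch

def iterative_magnitude_pruning(model, train_fn, p=0.35, iterations=20):
    """Lottery-ticket-style iterative magnitude pruning (global, unstructured).
    `train_fn(model)` is assumed to train the model in place for N epochs."""
    theta_0 = copy.deepcopy(model.state_dict())   # initial parameters theta_0
    masks = {name: torch.ones_like(w) for name, w in model.named_parameters()
             if w.dim() > 1}                      # prune weight matrices only

    for i in range(iterations):
        train_fn(model)                           # train f(x; theta) for N epochs
        # Gather magnitudes of the still-unpruned weights across all layers.
        remaining = torch.cat([w.detach().abs()[masks[n].bool()]
                               for n, w in model.named_parameters() if n in masks])
        threshold = torch.quantile(remaining, p)  # prune fraction p of what remains
        for n, w in model.named_parameters():
            if n in masks:
                masks[n] = masks[n] * (w.detach().abs() > threshold).float()
        # Rewind to theta_0 and apply the mask: f(x; m_i * theta_0).
        model.load_state_dict(theta_0)
        with torch.no_grad():
            for n, w in model.named_parameters():
                if n in masks:
                    w.mul_(masks[n])
        # NOTE: during the next call to train_fn the mask must be re-applied
        # (e.g. after every optimizer step) so that pruned weights stay at zero.
    return model, masks
```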

Data
Datasets We examine fairness among heavily pruned models using two text classification datasets: i) the multilingual Trustpilot Corpus (Hovy et al., 2015), which contains user reviews from the Trustpilot website of various companies and services in five different countries (Germany, Denmark, France, the United Kingdom and the United States). Reviews are given on a one-to-five star rating scale, and some are accompanied by demographic attributes about the author, such as gender, age and location. ii) The CivilComments dataset (Borkan et al., 2019), which contains comments annotated for toxicity, for the purpose of hate speech detection. A subset of the comments are also annotated for the protected attributes they address, including gender, race, and religion.
Preprocessing For the Trustpilot Corpus, we divide the data into demographics based on a combination of gender (male/female), age (young/old) and location (NUTS regions). For age, young is defined as being 35 or younger. We exclude the French and American parts of the dataset as they do not have properly annotated NUTS regions. For the UK and Germany, we use NUTS-1 regions, and for Denmark, where more data is available, we use NUTS-2 regions. We convert the 5-star ratings to binary sentiment labels, grouping 4 and 5 stars as positive, and 1 and 2 as negative. Neutral reviews (three stars) are discarded. Likewise for CivilComments, we label comments with a toxicity rating > 0.5 as toxic, and otherwise label them as non-toxic. This is similar to the binarization performed in Koh et al. (2020). For each annotated demographic attribute, a comment can contain multiple partial sub-attribute values (e.g. asian = 0.3, black = 0.4 for the race attribute), so we assign each attribute the sub-attribute with the largest value. In our experiments we consider demographics based on combinations of the race and gender attributes. For each language and dataset we randomly sample 100, 200 or 500 examples of each demographic as test sets, based on the amount of annotated datapoints in the dataset, and use an 80-20 split of the remaining data for training and validation. If a demographic contains fewer than the specified number of datapoints, we disregard it. Due to high class imbalance, the majority class in our train-val data is downsampled to match the minority class.
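As a concrete illustration, the label binarization and demographic assignment described above can be sketched as follows; the field names and dictionary format are assumptions, not the actual preprocessing code.

```python
def binarize_sentiment(stars):
    """Trustpilot: 4-5 stars -> positive (1), 1-2 stars -> negative (0), 3 -> discarded."""
    if stars >= 4:
        return 1
    if stars <= 2:
        return 0
    return None  # neutral reviews are dropped

def binarize_toxicity(toxicity_score, threshold=0.5):
    """CivilComments: label a comment as toxic if its annotated toxicity exceeds 0.5."""
    return int(toxicity_score > threshold)

def dominant_subattribute(scores):
    """Pick the sub-attribute with the largest partial value,
    e.g. {"asian": 0.3, "black": 0.4} -> "black"."""
    return max(scores, key=scores.get)
```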
Models

We experiment with two neural networks for text classification: a feed-forward network (FFNN) and a recurrent network (LSTM).

FFNN The FFNN consists of the following: an embedding layer, which maps every token id in the text to a fixed-size vector as a bag-of-embeddings and sums them together, resulting in a single representation e ∈ R^{E_dim}, followed by 3 fully connected layers of dimensions E_dim × h, h × h and h × 2, respectively. We use the hyperbolic tangent activation between layers, and each linear layer is initialized using He initialization (He et al., 2015).
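A minimal sketch of this feed-forward classifier is given below; the embedding and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BagOfEmbeddingsFFNN(nn.Module):
    """Bag-of-embeddings followed by three fully connected layers with tanh
    activations and He initialization, as described above (sizes assumed)."""

    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_classes=2):
        super().__init__()
        # EmbeddingBag sums the token embeddings into a single representation e.
        self.embedding = nn.EmbeddingBag(vocab_size, emb_dim, mode="sum")
        self.fc1 = nn.Linear(emb_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, num_classes)
        for layer in (self.fc1, self.fc2, self.fc3):
            nn.init.kaiming_normal_(layer.weight)  # He initialization

    def forward(self, token_ids, offsets=None):
        e = self.embedding(token_ids, offsets)     # (batch, emb_dim)
        h = torch.tanh(self.fc1(e))
        h = torch.tanh(self.fc2(h))
        return self.fc3(h)
```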
LSTM The LSTM network is a 2-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997) which encodes our input text, followed by a fully connected layer for classification. The weights are initialized using U(−√k, √k) where k = 1 / hidden_size, and the final fully connected layer uses He initialization. All model hyperparameters used are listed in Table 2.
Both the FFNN and LSTM models are trained using the Adam optimizer (Kingma and Ba, 2017) with a learning rate of 1e-3 and a weight decay of 1e-4.
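A minimal sketch of the recurrent classifier and the optimizer setup follows, under assumed vocabulary and hidden sizes; note that PyTorch's default LSTM initialization already draws from U(−√k, √k) with k = 1 / hidden_size, matching the description above.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """2-layer bidirectional LSTM encoder followed by a fully connected layer
    for classification (sizes are illustrative assumptions)."""

    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)
        nn.init.kaiming_normal_(self.fc.weight)  # He init for the output layer

    def forward(self, token_ids):
        emb = self.embedding(token_ids)          # (batch, seq, emb_dim)
        output, _ = self.lstm(emb)               # (batch, seq, 2 * hidden)
        # Classify from the final time step of the concatenated hidden states.
        return self.fc(output[:, -1, :])

model = BiLSTMClassifier(vocab_size=20000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```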

Distributionally Robust Optimization Loss
We additionally train our models with a DRO loss (Levy et al., 2020), using the implementation provided by Levy et al. (2020) (https://github.com/daniellevy/fastdro/). For our experiments, a χ² uncertainty set of size 1 is used.
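For intuition, the sketch below shows one way a χ²-ball DRO objective can be computed from per-example losses via its dual formulation; the constant c(ρ) and the grid-based minimization over η are illustrative assumptions, and this is not the fastdro implementation used in our experiments.

```python
import torch

def chi2_dro_loss(per_example_losses, rho=1.0):
    """Hedged sketch of a chi-squared-ball DRO objective via its dual form,
    min_eta  c(rho) * sqrt(mean((l - eta)_+^2)) + eta  (cf. Levy et al., 2020).
    The constant c(rho) = sqrt(1 + 2 * rho) is one common convention and is an
    assumption here, as is the crude grid search over eta."""
    losses = per_example_losses
    c = (1.0 + 2.0 * rho) ** 0.5
    # Scan eta over a grid spanning the batch losses (simple 1-D minimization).
    etas = torch.linspace(losses.min().item() - 1.0, losses.max().item(),
                          steps=200).to(losses.device)
    excess = torch.clamp(losses.unsqueeze(0) - etas.unsqueeze(1), min=0.0)
    objectives = c * torch.sqrt((excess ** 2).mean(dim=1) + 1e-12) + etas
    return objectives.min()
```

In training, such a loss would replace the mean cross-entropy, e.g. chi2_dro_loss(F.cross_entropy(logits, labels, reduction="none")), upweighting high-loss examples relative to empirical risk minimization.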
For all of our experiments, we extract our winning tickets over 20 pruning iterations and use a pruning rate of p = 0.35. We run a total of 5 independent runs for each model-dataset combination.

Measuring group disparity
At each pruning step we measure the group disparity D over a set of demographics 𝒟 and repeated runs R, by computing the maximum difference of F1 scores as follows:

D = max_{d ∈ 𝒟, r ∈ R} F1_{d,r} − min_{d ∈ 𝒟, r ∈ R} F1_{d,r}    (1)

Intuitively, this corresponds to the difference between the highest scoring run for the highest scoring demographic and the lowest counterpart. We compute FSP as the gradient of the linear fit of D against the logarithm of the pruning ratio over the P pruning steps, multiplied by 100.
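A hedged sketch of how the disparity in Eq. (1) and the FSP metric can be computed from per-group F1 scores; the data layout (a dictionary of F1 scores per demographic and run) is an assumption.

```python
import numpy as np

def group_disparity(f1_scores):
    """f1_scores[d] is a list of F1 scores for demographic d over repeated runs.
    Returns the max-min gap over all (demographic, run) pairs, as in Eq. (1)."""
    flat = [score for runs in f1_scores.values() for score in runs]
    return max(flat) - min(flat)

def fairness_sensitivity_to_pruning(pruning_ratios, disparities):
    """FSP: 100 times the slope of the linear fit of disparity D against the
    logarithm of the pruning ratio over the P pruning steps."""
    slope, _ = np.polyfit(np.log(np.asarray(pruning_ratios)),
                          np.asarray(disparities), 1)
    return 100.0 * slope
```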
Distributionally Robust Optimization We ran comparable experiments using the DRO loss (Hashimoto et al., 2018) to see whether the adverse effects of weight pruning on min-max fairness could be reduced by training with a more robust objective. This seems to hold true in some instances. We present a single plot for DRO in Figure 3, for feed-forward networks with layer-wise pruning on CivilComments; see the Appendix for more plots. Compared with Figure 2 (left), the FSP metric is considerably lower than for baseline empirical risk minimization (0.089 vs. 0.497), while maintaining equal, or even better, performance at high pruning rates; but note from the red numbers in Table 3 that we only see this type of reduction in FSP in 3/8 cases for FFNNs, although DRO does reduce the average FSP for global pruning. In 5/8 cases for the LSTM, however, DRO does improve fairness, reducing the average FSP with both layer-wise and global pruning. While fairness correlates with stability, the difference between FFNNs and LSTMs is not explained by stability differences (see plots in the Appendix), but should probably be attributed to the general performance differences between FFNNs and LSTMs, as well as the relative overparameterization of FFNNs (see Footnote 1).

Conclusion
In this work, we take a first step toward examining group disparity among heavily pruned NLP models obtained via lottery ticket extraction. We measure group disparity, using fairness sensitivity to pruning, on the Trustpilot Corpus, a sentiment classification dataset covering three languages, as well as on CivilComments, an English toxicity classification dataset, for both feed-forward and recurrent neural networks. We find that heavily pruned models are susceptible to higher levels of group disparity, but that this effect can to some degree be mitigated using distributionally robust optimization objectives.