Evaluating Debiasing Techniques for Intersectional Biases

Bias is pervasive in NLP models, motivating the development of automatic debiasing techniques. Evaluation of NLP debiasing methods has largely been limited to binary attributes in isolation, e.g., debiasing with respect to binary gender or race; however, many corpora involve multiple such attributes, possibly with higher cardinality. In this paper we argue that a truly fair model must consider ‘gerrymandering’ groups, which comprise not only single attributes but also intersectional groups. We evaluate a form of bias-constrained model which is new to NLP, as well as an extension of the iterative nullspace projection technique which can handle multiple protected identities.


Introduction
Text data reflects the social and cultural biases in the world, and NLP models and applications trained on such data have been shown to reproduce and amplify those biases. Discrimination has been identified across diverse sensitive attributes including gender, disability, race, and religion (Caliskan et al., 2017;May et al., 2019;Garimella et al., 2019;Nangia et al., 2020;Li et al., 2020). While early work focused on debiasing typically binarized protected attributes in isolation (e.g., age, gender, or race; Caliskan et al. (2017)), more recent work has adopted a more realistic scenario with multiple sensitive attributes (Li et al., 2018) or attributes covering several classes (Manzini et al., 2019).
In the context of multiple protected attributes, gerrymandering refers to the phenomenon where an attempt to make a model fairer towards some group results in increased unfairness towards another group (Buolamwini and Gebru, 2018;Kearns et al., 2018;Yang et al., 2020). Notably, algorithms can be fair towards independent groups, but not towards all intersectional groups. Despite this, debiasing approaches within NLP have so far been evaluated only using independent group fairness when modelling datasets with multiple attributes, disregarding intersectional subgroups defined by combinations of sensitive attributes (see Figure 1). The primary goal of this work is to evaluate independent and intersectional identity debiasing approaches in relation to fairness gerrymandering for text classification tasks. To this end, we evaluate bias-constrained models (Cotter et al., 2019b) and iterative nullspace projection (INLP; Ravfogel et al. (2020)), a post-hoc debiasing method which we extend to handle intersectional groups. The constrained model jointly optimizes model performance and model fairness, while INLP seeks to learn a hidden representation which is independent of the protected attributes. INLP does not consider the trade-off between accuracy and fairness, but rather it iteratively maximizes fairness in an unconstrained fashion.
In this work, we address the following questions:
• Are debiasing approaches based on independent groups more prone to fairness gerrymandering than methods using intersectional groups?
• How do INLP and bias-constrained approaches, and their extensions to handle intersectional groups, compare in terms of both predictive accuracy and fairness?

Background
Debiasing with respect to more than a single protected attribute (|Z| > 1) requires grouping data points according to their associated protected values. In this work, debiasing approaches are trained wrt the three settings detailed below (Kearns et al., 2018; Yang et al., 2020), and evaluated based on aggregated violations over G_GERRY, which is a commonly used evaluation strategy for intersectional identities (Yang et al., 2020).

Independent groups (G_INDEP): Each private attribute is treated as an independent group. In the case of all binary attributes, this gives 2|Z| groups. As illustrated in Figure 1, for binary gender and race, F, M, W, and ¬W are four independent groups, which are overlapping, e.g., with F referring to both white females and female persons of colour (C and D).

Intersectional groups (G_INTER): Non-overlapping, exhaustive combinations of groups, which in the binary attribute case gives 2^|Z| combinations. For our example in Figure 1, the intersectional groups are A, B, C and D.

Gerrymandering intersectional groups (G_GERRY): Overlapping subgroups defined by group intersections of any subset of private attributes (3^|Z| for binary attributes, where the 3rd value denotes a wildcard). For the example in Figure 1, the G_GERRY groups are F, M, W, ¬W, A, B, C, and D.
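To make the three group definitions concrete, the following sketch (ours, not from the paper) enumerates them for k binary attributes. A wildcard value `*` means the attribute is unconstrained; the all-wildcard combination (the entire population) is excluded from G_GERRY, which is why the two-attribute example in Figure 1 yields 8 rather than 9 groups.

```python
from itertools import product

def enumerate_groups(k):
    """Enumerate group definitions over k binary protected attributes.

    Each group is a tuple over {0, 1, '*'}, where '*' means the
    attribute is unconstrained (a wildcard).
    """
    combos = list(product([0, 1, '*'], repeat=k))
    # Independent groups: exactly one attribute is constrained -> 2k groups.
    g_indep = [c for c in combos if sum(v != '*' for v in c) == 1]
    # Intersectional groups: every attribute is constrained -> 2^k groups.
    g_inter = [c for c in combos if all(v != '*' for v in c)]
    # Gerrymandering groups: any non-empty subset constrained -> 3^k - 1 groups.
    g_gerry = [c for c in combos if any(v != '*' for v in c)]
    return g_indep, g_inter, g_gerry

indep, inter, gerry = enumerate_groups(2)   # e.g., binary gender and race
print(len(indep), len(inter), len(gerry))   # 4 4 8
```

For four binary attributes (as in the hate speech dataset evaluated later), the same count gives 3^4 − 1 = 80 gerrymandering groups.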

Proposed Approaches
We aim to learn a model to predict main task labels (e.g., sentiment) from input data X (e.g., reviews) by means of a learnt hidden representation X h . X h will generally encode implicit biases, and our goal is to predict the output without the influence of social identities encoded in X h . We assume access to the protected attributes for each instance. Below, we detail the two primary models targeted in this work: INLP, and the bias-constrained model.

INLP
INLP is applied to frozen hidden representations: we first learn a linear classifier W using X_h as the independent variables to predict a protected attribute (binary- or multi-valued; higher-cardinality attributes are likewise supported). Next, X_h is projected onto the nullspace of W (denoted P_N(W) X_h) to remove the protected information, and the projected representation forms the input for learning a classifier for the main task label. In real-world scenarios, bias may arise due to a combination of multiple protected attributes, hence we extend the single-attribute setting of Ravfogel et al. (2020) to this situation. Assuming k protected attributes, with classifiers {W_i}_{i=1}^k trained to predict them, we project X_h onto the intersection of the individual nullspaces (Strang, 2014), which equals the nullspace of the stacked classifier matrix:

N(W_1) ∩ N(W_2) ∩ ... ∩ N(W_k) = N([W_1; W_2; ...; W_k])   (1)

No special properties are imposed over the W_i (e.g., orthogonality), which allows for parallel training.
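A minimal numpy sketch of the multi-attribute projection in Eq (1): stacking the classifier weight matrices and projecting onto the nullspace of the stack removes all k attributes at once. Dimensions and classifier construction are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 20, 50, 3                    # hidden dim, examples, protected attrs
X_h = rng.normal(size=(n, d))          # frozen hidden representations
Ws = [rng.normal(size=(1, d)) for _ in range(k)]  # one linear classifier per attribute

# Nullspace of the stacked matrix = intersection of individual nullspaces.
A = np.vstack(Ws)                      # shape (k, d)
P = np.eye(d) - np.linalg.pinv(A) @ A  # projection onto N(A)
X_proj = X_h @ P.T

# Every classifier now outputs (near) zero on the projected representations.
assert np.allclose(X_proj @ A.T, 0.0, atol=1e-8)
```

Because the W_i need not be orthogonal, they can be trained independently (in parallel) before stacking.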
Even after debiasing by projecting X_h onto the nullspace of the classifiers, hidden representations can still retain protected information (Ravfogel et al., 2020). Hence we follow Ravfogel et al. (2020) and iteratively remove bias, rather than doing it in a single step. However, nullspace projection reduces the rank of the data rapidly with intersectional identities (see Figure 2). To circumvent this issue, we project onto the nullspace of the principal component of the bias subspace (N_p). This is equivalent to projecting onto the nullspace of some of the intersectional identities in each iteration; similar strategies have been employed for multi-class supervised principal component analysis (Piironen and Vehtari, 2018).
Following Ravfogel et al. (2020), while incorporating our principal component-based nullspace projection as described above, the intersection of nullspaces across iterations is computed as

N_p(W_1) ∩ N_p(W_2) ∩ ... ∩ N_p(W_t) = N([R_1; R_2; ...; R_t])   (2)

where N_p denotes the nullspace based on the principal component of the bias subspace, W_i is the set of classifiers trained in iteration i to predict the protected attributes, and R_i is the principal component of the rowspace computed by stacking the classifier weight vectors across the multiple protected attributes. Unlike Eq (1), the formulation in Eq (2) requires orthogonality of subspaces across each iteration.

Bias-constrained Model
A bias-constrained model combines the main task objective with group-wise fairness constraints:

min_θ ℓ(F(x, θ), y)   s.t.   |φ_g(θ) − φ(θ)| ≤ ν / γ_g   ∀g ∈ G   (3)

where ℓ(F(x, θ), y) denotes the primary objective (e.g., error rate); G are the groups defined in Section 2; φ_g denotes the group-wise performance; and φ denotes the overall performance of the model (e.g., TPR). ν is a slack variable, which controls the maximum deviation allowed. γ_g is the inverse proportion of positive examples in each group, |g| / |g_y=1|, meaning that the more underrepresented a group is wrt some target class (say toxicity), the smaller its accepted deviation from the overall performance.
Each group-wise constraint is denoted as ψ_g. The constraints involve a linear combination of indicator variables, which is not differentiable wrt θ. A common approach to handling such constrained optimization problems is the Lagrangian, L(θ, λ) = ℓ(F(x, θ), y) + Σ_{g∈G} λ_g ψ_g, which is minimized over θ and maximized over λ. Similar formulations have been used for learning fair models with structured data (Cotter et al., 2019b; Yang et al., 2020; Zafar et al., 2019). In this work, we apply this method to NLP tasks and use the two-player zero-sum game approach for optimization, where the first player chooses θ to minimize L(θ, λ), and the second player enforces fairness constraints by maximizing over λ (Kearns et al., 2018; Cotter et al., 2019b; Yang et al., 2020). Specifically, we use the implementations available in TensorFlow constrained optimization (Cotter et al., 2019a,b). The approach runs for T iterations, where the primal player updates θ with λ fixed, and the dual player updates λ with θ fixed. This results in multiple models, {θ_1, θ_2, ..., θ_T} and {λ_1, λ_2, ..., λ_T}, which trade off performance and fairness. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 10^−3.
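The primal-dual alternation can be illustrated on a one-dimensional toy problem (ours, not the paper's model): minimize (θ − 2)² subject to θ ≤ 1, via L(θ, λ) = (θ − 2)² + λ(θ − 1). The θ-player descends and the λ-player ascends (with λ kept non-negative), and the iterates approach the constrained optimum θ = 1.

```python
# Two-player zero-sum sketch: primal gradient descent on theta,
# dual (projected) gradient ascent on lambda.
theta, lam, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    grad_theta = 2 * (theta - 2) + lam   # dL/dtheta
    theta -= lr * grad_theta
    psi = theta - 1                      # constraint violation psi_g
    lam = max(0.0, lam + lr * psi)       # dual ascent, keeping lambda >= 0
print(round(theta, 3), round(lam, 3))    # converges close to theta = 1, lambda = 2
```

In the actual models, intermediate iterates (θ_t, λ_t) are kept rather than only the final pair, yielding the family of models trading off performance and fairness described above.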

Experimental Results
We evaluate using two binary classification datasets: (1) hate speech detection over Twitter data, and (2) occupation classification using biography text. For INLP we use logistic regression to construct W. All the approaches have hyperparameters controlling the performance-fairness trade-off, and as no choice of hyperparameter will optimize both objectives simultaneously, we adopt the concept of Pareto optimality (Godfrey et al., 2007).
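As a concrete illustration of Pareto optimality over the two objectives, the sketch below keeps only the (F-score, violation) pairs that no other model dominates, i.e., no other model is at least as good on both objectives and strictly better on one. The numbers are made up.

```python
def pareto_frontier(models):
    """models: list of (f_score, violation); higher F and lower violation are better."""
    def dominated(p, q):
        # q dominates p: q is no worse on both objectives, and q is a different point.
        return (q[0] >= p[0] and q[1] <= p[1]) and q != p
    return [p for p in models if not any(dominated(p, q) for q in models)]

models = [(0.90, 0.30), (0.85, 0.10), (0.80, 0.20), (0.88, 0.05)]
print(pareto_frontier(models))   # [(0.9, 0.3), (0.88, 0.05)]
```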

Settings
For INLP we consider different definitions of the bias subspace, corresponding to the three group settings of Section 2; we refer to these as INLP_INDEP, INLP_INTER, and INLP_GERRY. Similarly, for the constrained models we instantiate the groups G in Eq (3) with G_INDEP, G_INTER, or G_GERRY, which we refer to as CON_INDEP, CON_INTER, and CON_GERRY, respectively.
Metrics: We assess a model's predictive performance using F-score, and fairness based on average violations (Yang et al., 2020) of true positive rate (equality of opportunity) across all the G_GERRY groups: (1 / |G_GERRY|) Σ_{g∈G_GERRY} |TPR_g − TPR|.
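A small sketch of the fairness metric (function names are ours). For brevity it takes one group id per instance; for overlapping G_GERRY groups one would instead pass a boolean membership mask per group.

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate: fraction of positive instances predicted positive."""
    pos = y_true == 1
    return (y_pred[pos] == 1).mean()

def avg_violation(y_true, y_pred, group_ids):
    """Mean absolute TPR gap between each group and the overall TPR."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    group_ids = np.asarray(group_ids)
    overall = tpr(y_true, y_pred)
    gaps = [abs(tpr(y_true[group_ids == g], y_pred[group_ids == g]) - overall)
            for g in np.unique(group_ids)]
    return float(np.mean(gaps))

# Toy check: overall TPR = 3/4; group 0 TPR = 1/2, group 1 TPR = 1.
print(avg_violation([1, 1, 1, 1], [1, 0, 1, 1], [0, 0, 1, 1]))  # 0.25
```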

Hate speech detection
We use the English Twitter hate speech detection dataset of Huang et al. (2020), where each tweet is labeled with a binary hate speech label, as well as binary demographic group identity indicators for the tweet authors: gender (f or m), age (≤ or > median), location (US or not US), and race (white or other). We use the data splits from Huang et al. (2020), but further filter each partition to the subset of tweets for which all demographic author labels are available, resulting in 23K (train), 5K (dev), and 5K (test) tweets. We use a biGRU as our base model, following Huang et al. (2020). In Figure 2, we show the benefits of our proposed INLP approach for debiasing gerrymandering intersectional identities, compared to the naive extension of INLP for multiple protected attributes (which projects onto the entire bias subspace). With 80 G_GERRY groups, the naive extension leads to a drastic drop in performance after 4 iterations (the dashed blue line), as in each iteration the rank is reduced by 80. On the other hand, the proposed approach achieves a much more nuanced trade-off between accuracy and fairness (the dotted blue and yellow lines with stars, respectively).
We present the Pareto frontier for the different models on the test set in Figure 3a, by varying the hyper-parameters of the approaches, each under the three different debiasing settings, i.e., G_INDEP, G_INTER, and G_GERRY. From the results, for F-scores close to the base biGRU (= biased model), the independent model provides the best trade-off, while debiasing for intersectional identities provides a better trade-off overall, and constrained models perform better than INLP.

Occupation classification

This dataset comprises short web biographies, annotated by gender and profession. We augment the dataset with economic status (wealthy vs. rest, using World Bank data) based on the country the individual is based in, which we label from the first sentence of each bio: we perform entity linking using AIDA-light (Nguyen et al., 2014), and map locations and institutions to countries based on their Wikipedia articles. We computed the accuracy of the economic status mapping by manually analyzing the output for over 200 biographies, resulting in 93% accuracy and 90% agreement (Jaccard coefficient) between three annotators. We use a subset of two highly stereotyped classes of the dataset - nurse vs. surgeon - resulting in 13K/1.7K/5.4K instances in train/dev/test. We use the BERT (base, uncased) (Devlin et al., 2019) [CLS] encoding followed by an MLP (300d with ReLU activations) for classification. The Pareto results on the test set are shown in Figure 3b. INLP_INTER and INLP_GERRY perform better than INLP_INDEP, and the constrained models once again perform best overall, with intersectional identities providing no significant gains over CON_INDEP. Even in this highly stereotyped setting, constrained models achieve high predictive performance, in contrast to INLP, which improves fairness but greatly reduces predictive performance, consistent with the single protected attribute results of Ravfogel et al. (2020).

Constrained analysis
In addition to Pareto curves (Figure 3), we compare the fairest models under a minimum performance constraint: the most fair models (based on average violations) which exceed a performance threshold on the development set are chosen, and then evaluated on the test set. This allows us to measure the generalizability of models occupying the optimal Pareto region for the development set to an unseen test set. We provide results under two constrained scenarios: models with the least bias on the development set, trading off 5% and 10% of predictive performance. In addition to average violation, we also report the maximum violation, max_{g∈G_GERRY} |TPR_g − TPR|, which measures the upper bound of unfairness towards any subgroup (Yang et al., 2020). For hate speech classification, results are provided in Table 1a. With a slack of up to 10% trade-off in predictive performance, INLP_GERRY shows better predictive performance and fewer violations than both INLP_INTER and INLP_INDEP. For the tighter 5% trade-off, INLP_INTER has a higher F-score, with a slightly worse average violation and a better maximum violation. The constrained models (run for T = 1 to 500 iterations) debiasing for G_GERRY are fairest (lowest max and average violation). Similarly, for occupation classification (Table 1b), where the classes are highly skewed (90% of surgeons are male, 87% of nurses are female, and 97.5% of nurses are from wealthy economies), INLP_GERRY achieves the lowest maximum violation among INLP models, while overall the constrained models, specifically CON_INDEP, perform best. This confirms the conclusion from the Pareto plots that adding intersectional identities to INLP-based debiasing is beneficial, whereas for constrained models, which directly regulate the expected performance wrt different subgroups, they provide only mild gains.
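The selection rule can be sketched as follows (the numbers are hypothetical, and the threshold definition, relative to the best dev F-score, is our assumption): among dev-set models whose F-score is within the given slack of the best, pick the one with the lowest average violation.

```python
def select_fairest(models, slack):
    """models: list of (f_score, avg_violation) pairs on the dev set.
    Keep models within `slack` (e.g. 0.05) of the best F-score,
    then return the one with the lowest average violation."""
    best_f = max(f for f, _ in models)
    eligible = [m for m in models if m[0] >= best_f * (1 - slack)]
    return min(eligible, key=lambda m: m[1])

models = [(0.90, 0.30), (0.87, 0.12), (0.84, 0.04), (0.70, 0.01)]
print(select_fairest(models, 0.05))  # (0.87, 0.12)
print(select_fairest(models, 0.10))  # (0.84, 0.04)
```

The looser the slack, the fairer the model that can be selected, mirroring the 5% vs. 10% scenarios in Table 1.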

Conclusion and Future Work
We examined the impact of independent vs. intersectional groups on classifier fairness. We proposed an extended version of INLP, which we compared against bias-constrained models. INLP is a post-hoc debiasing method, while the constrained model requires full joint training. Debiasing in INLP happens in an unconstrained way: sensitive information is removed from the latent representations independent of predictive performance, leading to fairer models at the cost of overall accuracy (Han et al., 2021). From a practical perspective, post-hoc debiasing may be the only option if it is impossible or undesirable to re-train a model from scratch with a regularized objective. Our paper shows that lighter-weight post-hoc debiasing is indeed possible in the case of overlapping groups, and quantifies the additional advantage of joint training and debiasing in the context of two datasets. The constrained models are especially robust in stereotyped settings, while using intersectional identities particularly benefits INLP.