Fairness-aware Class Imbalanced Learning

Class imbalance is a common challenge in many NLP tasks, and has clear connections to bias, in that bias in training data often leads to higher accuracy for majority groups at the expense of minority groups. However, there has traditionally been a disconnect between research on class-imbalanced learning and bias mitigation, and only recently have the two been examined through a common lens. In this work we evaluate long-tail learning methods for tweet sentiment and occupation classification, and extend a margin-loss based approach with methods to enforce fairness. We empirically show through controlled experiments that the proposed approaches help mitigate both class imbalance and demographic biases.

Bias in data often also manifests as skewed distributions, especially when considered in combination with class labels. This is often referred to as "stereotyping", whereby one or more private attributes are associated more frequently with certain target labels, for instance more men than women being employed as surgeons. Prior work has identified several classes of bias, including bias towards demographic groups based on gender, disability, race or religion (Caliskan et al., 2017; Garimella et al., 2019; Nangia et al., 2020), and bias towards individuals (Prabhakaran et al., 2019). Methods to mitigate these biases include data augmentation (Badjatiya et al., 2019), adversarial learning (Li et al., 2018), instance weighting based on group membership (Kamiran and Calders, 2011), regularization (Wick et al., 2019; Kennedy et al., 2020), and explicit subspace removal (Bolukbasi et al., 2016; Ravfogel et al., 2020).

Footnote 1: Code available at: https://github.com/shivashankarrs/classimb_fairness

Figure 1: Example of a two-class problem where grey and blue points denote majority and minority classes, respectively, and circles and squares denote two sub-groups. Imbalanced learning methods such as LDAM (Cao et al., 2019) maximise the (soft-)margin for minority classes and do not consider sub-groups within each class.

This paper draws a connection between class-imbalanced learning and stereotyping bias. Most work has focused on class-imbalanced learning and bias mitigation as separate problems, but the unfairness caused by social biases is often aggravated by the presence of class imbalance (Yan et al., 2020). Class-imbalanced learning approaches improve the performance of minority classes at some cost to the performance of majority classes. A common approach re-weights instances in the training objective in proportion to the inverse frequency of their class. Approaches such as FOCAL (Lin et al., 2017) and DICE (Li et al., 2020) extend this by down-weighting "easy" instances. Label-Distribution-Aware Margin Loss ("LDAM": Cao et al. (2019)) is an alternative approach, which encourages a larger margin for the minority class, but it does not consider sub-group proportions (see Figure 1). On the other hand, debiasing approaches do not typically address class imbalance explicitly. For instance, in toxicity classification, certain sub-groups are often predicted as toxic more confidently (encouraging false negatives for the majority sub-group), and tend to be close to the margin for the non-toxic class (encouraging false positives; Borkan et al. (2019)).
In this work, we modify the training objective of LDAM, a state-of-the-art approach to imbalanced learning, so that margins depend not just on class imbalance but also on the sub-group distribution within each class. Specifically, we extend LDAM with popular debiasing strategies. We show the effectiveness of our approach through several controlled experiments on two text classification data sets.

Proposed Approaches
Let (x, y, g) denote a training instance, comprising an input, label, and group identifier, respectively. The LDAM loss enforces a class-dependent margin:

$\mathcal{L}_{\text{LDAM}}((x, y); f) = -\log \frac{e^{z_y - \Delta_y}}{e^{z_y - \Delta_y} + \sum_{j \neq y} e^{z_j}}, \qquad \Delta_j = \frac{C}{n_j^{1/4}},$

where z = f(x) are the model outputs, k is the number of classes, n_j is the number of instances in class j, and C is a hyperparameter. Smaller classes are associated with a larger ∆_y, which is subtracted from the model output z_y, thus enforcing a larger margin. We propose three extensions to LDAM, each of which takes into account imbalance in the distribution of private attributes across classes:

LDAM iw adds instance re-weighting, based on groups within each class:

$\mathcal{L}_{\text{LDAM}_{iw}}((x, y, g); f) = \omega_{y,g} \, \mathcal{L}_{\text{LDAM}}((x, y); f),$

where g is the group of instance x, and $\omega_{y,g} = \frac{1-\beta}{1-\beta^{N_{y,g}}}$ weights each class-group combination based on its smoothed inverse frequency. β is a constant set to 0.9999 (Cui et al., 2019) and N_{y,g} is the number of instances belonging to class y and group g. Mistakes on minority groups within minority classes are penalised most.
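To make the objective concrete, below is a minimal PyTorch-style sketch of the LDAM loss and its group re-weighted variant. The function names, tensor layouts, and the default value of C are our own assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def ldam_loss(z, y, n_class, C=0.5):
    """LDAM (Cao et al., 2019): cross-entropy on margin-adjusted logits.
    z: (batch, k) logits; y: (batch,) gold classes; n_class: (k,) class counts."""
    delta = C / n_class.float() ** 0.25          # larger margin for smaller classes
    z_m = z.clone()
    z_m[torch.arange(len(y)), y] -= delta[y]     # subtract the margin from the gold-class logit
    return F.cross_entropy(z_m, y, reduction="none")

def ldam_iw_loss(z, y, g, n_class, n_class_group, C=0.5, beta=0.9999):
    """LDAM iw: per-(class, group) re-weighted LDAM.
    g: (batch,) group ids; n_class_group: (k, n_groups) counts per class-group pair."""
    # Smoothed inverse-frequency weights (Cui et al., 2019); plain LDAM is the uniform-weight case.
    omega = (1.0 - beta) / (1.0 - beta ** n_class_group.float())
    return (omega[y, g] * ldam_loss(z, y, n_class, C)).mean()
```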
LDAM adv adds an adversarial term, so that the learnt representations are a poor predictor of group membership (Li et al., 2018):

$\mathcal{L}_{\text{LDAM}_{adv}}((x, y, g); f, l) = \mathcal{L}_{\text{LDAM}}((x, y); f) - \lambda_{adv} \, \text{CE}(l(x), g),$

where f shares the lower layers of the network with the adversary l, CE denotes cross-entropy loss, and λ_adv is a hyperparameter. This objective is jointly minimised wrt f and maximised wrt l. The penalty results in hidden representations that are informative for the main classification task (f), but uninformative for the adversarial group membership prediction task (l). The adversarial loss is implemented using gradient reversal (Ganin and Lempitsky, 2015).
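A sketch of how this adversarial variant could be implemented with gradient reversal, reusing the hypothetical ldam_loss from the sketch above; the adversary head and the placement of lambda_adv inside the reversal layer are assumptions.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def ldam_adv_loss(h, z, y, g, adv_head, n_class, C=0.5, lambda_adv=1.0):
    """h: shared hidden representations; z: main-task logits; adv_head: group classifier l."""
    main = ldam_loss(z, y, n_class, C).mean()               # LDAM term (previous sketch)
    g_logits = adv_head(GradReverse.apply(h, lambda_adv))   # adversary sees reversed gradients
    adv = F.cross_entropy(g_logits, g)
    # Minimising (main + adv) trains the adversary l to predict g, while the reversed
    # gradient pushes the shared encoder towards maximising the adversary's loss,
    # i.e. the encoder effectively minimises L_LDAM - lambda_adv * CE.
    return main + adv
```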
LDAM reg adds a soft regularization term which encourages fairness as equalised odds by reducing the maximum mean discrepancy (Gretton et al., 2012) across groups: the probability of predicting some class k for any individual group g should be close to k's probability over the whole data set:

$\mathcal{L}_{\text{LDAM}_{reg}} = \sum_{i=1}^{N} \mathcal{L}_{\text{LDAM}}((x_i, y_i); f) + \rho \sum_{g} \Big\| \frac{1}{N_g} \sum_{i: g_i = g} \hat{p}(x_i) - \frac{1}{N} \sum_{i=1}^{N} \hat{p}(x_i) \Big\|^2,$

where we have moved from the single instance loss to the loss over the full training set of size N, $\hat{p}(x_i)$ is the vector of predicted class probabilities for instance $x_i$, N_g denotes the number of training instances in group g, and hyper-parameter ρ controls the trade-off between performance and fairness.
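A possible batch-level implementation of the regularised objective, using a difference of mean predicted class distributions as a linear-kernel stand-in for MMD; the kernel choice and batching are assumptions.

```python
import torch
import torch.nn.functional as F

def ldam_reg_loss(z, y, g, n_class, C=0.5, rho=1.0):
    """LDAM plus a term pulling each group's mean predicted class distribution
    towards the overall mean distribution (ldam_loss from the earlier sketch)."""
    main = ldam_loss(z, y, n_class, C).mean()
    p = F.softmax(z, dim=-1)                  # predicted class probabilities
    overall = p.mean(dim=0)                   # class distribution over all instances
    reg = z.new_zeros(())
    for grp in torch.unique(g):
        group_mean = p[g == grp].mean(dim=0)  # class distribution within group grp
        reg = reg + torch.sum((group_mean - overall) ** 2)
    return main + rho * reg
```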

Experimental Results
We perform experiments on the two tasks of emoji prediction and occupation classification, both of which are binary classification tasks with binary protected attributes.
Emoji prediction: We use the Twitter dataset of Blodgett et al. (2016), where tweets are associated with the private attribute race (black/white), and sentiment labels are derived from emoji usage (happy/sad) (Elazar and Goldberg, 2018). We experiment with different levels of class and stereotyping imbalance in the Emoji dataset, including its original distribution (see Section 3.1).
Occupation classification: This data set consists of short biographies scraped from the web, annotated for the private attribute gender (male/female) and target occupation labels (De-Arteaga et al., 2019). We focus on two occupations with well-documented gender stereotypes: surgeon and nurse. The resulting dataset is mildly class-imbalanced (59% surgeon, 41% nurse), with roughly symmetric natural gender splits (90% male for surgeon and 90% female for nurse).
For emoji prediction we follow Ravfogel et al. (2020) and use the DeepMoji encoder, which was trained on millions of tweets and is known to encode demographic information (Elazar and Goldberg, 2018). For occupation classification, we use the BERT-base uncased model and classify via the last hidden state of the CLS token (Devlin et al., 2019). Both encoders are followed by a single-hidden-layer MLP.

We evaluate classification performance based on macro-averaged F-score (to account for class imbalance), and evaluate fairness using performance GAP: the average of the true positive rate (TPR) and true negative rate (TNR) differences between the two subgroups (De-Arteaga et al., 2019; Ravfogel et al., 2020). Note that a wide variety of fairness measures (at both the group and individual levels) have been proposed, which are impossible to satisfy simultaneously (Garg et al., 2020). Often, a suitable measure is chosen based on the target application. Here we use the popular equalised odds measure, considering both the TPR and TNR of classifiers, in order to address scenarios where certain subgroups are predicted more often with some classes (see Section 1). We report fairness as 1−GAP, such that higher numbers are better, and a perfectly fair model achieves 1−GAP = 1.

We compare our methods against the following benchmarks:
vanilla: unweighted cross-entropy loss.
CW: instance re-weighting based on the inverse class proportion and cross-entropy.
IW: instance re-weighting based on the combination of inverse class and group proportions, and cross-entropy (Kamiran and Calders, 2011).
INLP: Iterative null-space projection (Ravfogel et al., 2020): in each iteration, an SVM classifier W is learnt over the hidden representations X_h to predict the protected attribute, and X_h is then projected onto the null-space of W to remove the protected information.
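For concreteness, a simplified sketch of one possible INLP loop; scikit-learn's LinearSVC stands in for the SVM classifier W, the number of iterations is a placeholder, and the released implementation composes the projections and trains the classifiers more carefully.

```python
import numpy as np
from sklearn.svm import LinearSVC

def inlp(X_h, g, n_iters=20):
    """X_h: (n, d) hidden representations; g: (n,) protected attribute labels.
    Returns the accumulated projection and the projected representations."""
    d = X_h.shape[1]
    P = np.eye(d)                                  # accumulated null-space projection
    X = X_h.copy()
    for _ in range(n_iters):
        clf = LinearSVC(dual=False).fit(X, g)      # learn W to predict the protected attribute
        W = clf.coef_                              # (1, d) for a binary attribute
        P_null = np.eye(d) - np.linalg.pinv(W) @ W # projection onto the null-space of W
        P = P_null @ P
        X = X_h @ P.T                              # remove the guarded direction(s)
    return P, X
```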

Model Comparison
We include simulated experimental settings with the emoji dataset, following Ravfogel et al. (2020), who keep the class proportions balanced but vary the group proportions (stereotyping). In our work, we systematically vary both class imbalance and stereotyping, in order to assess the robustness of the models wrt class imbalance and fairness individually. We explore three settings: varying both dimensions at the same time (Figure 2), controlling for class imbalance while varying stereotyping (Table 1), and controlling for stereotyping while varying class imbalance (Table 2). When simultaneously varying stereotyping and class imbalance in the emoji dataset, we explore several settings:
• Original: the dataset is sampled based on the natural class distribution (70% positive; Blodgett et al. (2016)), and within each class the black:white ratio is set to 18:82, based on US census estimates.
• 95/95: as per the above, but with class skew and stereotyping ratios set to 95:5.
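As an illustration of how such controlled splits can be built, the sketch below subsamples a pool to a target class skew and a mirrored stereotyping ratio; the function name, the group coding (which group is the within-class majority), and the defaults are assumptions rather than the paper's exact sampling script.

```python
import numpy as np

def subsample(labels, groups, class_ratio=0.95, group_ratio=0.95, n_total=10000, seed=0):
    """Return indices such that `class_ratio` of instances are positive and, within each
    class, `group_ratio` of instances come from that class's (assumed) majority group."""
    rng = np.random.default_rng(seed)
    sizes = {
        (1, 0): int(n_total * class_ratio * group_ratio),            # positive, majority group
        (1, 1): int(n_total * class_ratio * (1 - group_ratio)),      # positive, minority group
        (0, 1): int(n_total * (1 - class_ratio) * group_ratio),      # negative: mirrored majority
        (0, 0): int(n_total * (1 - class_ratio) * (1 - group_ratio)),
    }
    idx = []
    for (y, grp), n in sizes.items():
        pool = np.where((labels == y) & (groups == grp))[0]          # assumes the pool is large enough
        idx.append(rng.choice(pool, size=n, replace=False))
    return np.concatenate(idx)
```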
For the occupation classification task, the original data is used as is (Figure 3).

Model Selection. For models with hyperparameters which trade off performance and fairness, the optimal balance of F-score and fairness is not clear, so we adopt the concept of Pareto optimality (Godfrey et al., 2007) and present the Pareto frontier in the graphs. In particular, for LDAM reg and LDAM adv, we perform a hyperparameter search over C (10^-2 to 30), ρ (10^-4 to 10^2), and λ (10^-4 to 10^2). In general, a higher C prioritises F-score over fairness, while higher ρ and λ prioritise fairness. For INLP, we tune the number of iterations as a hyper-parameter. The remaining models do not have trade-off hyper-parameters, so we report a single-point best model: for LDAM and LDAM cw we tune C by choosing the best-performing model on the dev set. For LDAM iw, we set β to 0.9999 following Cui et al. (2019), and tune C to identify the fairest model on the dev set.
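The Pareto frontier over (F-score, 1−GAP) points from the hyper-parameter sweep can be obtained with a simple dominance check; a minimal sketch follows, with names and input format assumed for illustration.

```python
def pareto_frontier(points):
    """points: list of (f_score, fairness) tuples, one per hyper-parameter setting.
    Returns the points that are not dominated by any other point."""
    frontier = []
    for i, (f1, g1) in enumerate(points):
        dominated = any(
            f2 >= f1 and g2 >= g1 and (f2 > f1 or g2 > g1)
            for j, (f2, g2) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((f1, g1))
    return sorted(frontier)
```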
The results in Figure 2(a)-(c) show that LDAM reg is overall superior to the other approaches, especially for higher F-scores. For increasingly extreme levels of class imbalance and stereotyping (as we move to the right in the figure), the advantage of LDAM reg over LDAM adv and INLP decreases substantially. Across all the settings, LDAM cw has the highest bias (is least fair). With higher class imbalance and stereotyping, most class-imbalanced learning methods (FOCAL, CW and LDAM cw) exhibit high bias. In the stereotyping settings, LDAM iw reduces bias compared to IW.
Analogous results on the occupation data (original class proportions) are in Figure 3. Once again, the proposed LDAM extensions perform the best overall, with LDAM reg achieving the best performance-fairness trade-off. Class-imbalanced learning approaches (FOCAL, LDAM, and LDAM cw) are most biased on this dataset, with IW improving fairness over vanilla cross-entropy training, and LDAM iw providing large improvements in fairness.

Stereotyping-Class balance Trade-off
In addition to comparing models based on the trade-off between performance and fairness in class-imbalanced learning, we wish to disentangle the effect of stereotyping from class imbalance. We do so by: (a) fixing class balance to 50:50 and varying stereotyping (Table 1); and (b) fixing stereotyping to a symmetric 0.8 while varying class imbalance (Table 2). A stereotyping level of symmetric 0.8 means 80:20 black:white for positive and 20:80 black:white for negative tweets. We perform model selection by choosing the INLP model with the best harmonic mean of performance and fairness for Table 2, and use the results from the original paper for Table 1. We select all other models by first restricting to models with F-score at least as high as INLP's, and then choosing the one with the lowest GAP. We also include a recent adversarial model (ADV) in the varying-stereotyping experiments, which performed strongly on the class-balanced emoji data.
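A small sketch of the two single-point selection rules just described (harmonic mean of performance and fairness, and "at least INLP's F-score, then lowest GAP"); the names and the fallback behaviour are illustrative assumptions.

```python
def harmonic_mean(f_score, fairness):
    """Harmonic mean of F-score and fairness (1 - GAP)."""
    return 2 * f_score * fairness / (f_score + fairness)

def select_model(candidates, inlp_f_score):
    """candidates: list of (f_score, gap) per dev-set run. Keep those whose F-score is at
    least INLP's, then pick the lowest GAP; fall back to all candidates if none qualify."""
    eligible = [c for c in candidates if c[0] >= inlp_f_score]
    return min(eligible or candidates, key=lambda c: c[1])
```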
Our results on varying stereotyping levels in Table 1 show that the vanilla baseline drops in performance more sharply than most proposed models, and results in the most unfair predictions by a large margin. LDAM iw , LDAM adv , and ADV retain high F-scores but drop in fairness with increasing stereotyping, while INLP exhibits the opposite pattern. LDAM reg achieves the best balance of F-score and fairness. Table 2 presents results for fixed stereotyping and varying class imbalance (0.7-0.9 positive). We include LDAM cw for handling class imbalance but exclude ADV, which does not address class-imbalance directly. We observe that LDAM cw has the highest F-score, but scores poorly for fairness. LDAM iw achieves the best trade-off with high class-imbalance, but shows large variation across settings. LDAM reg appears more stable, exhibiting a good performance-fairness trade-off.

Conclusion and Future Work
We explored the interplay of class imbalance and stereotyping in two language classification data sets. We showed that vanilla class-imbalanced learning (IW, CW, FOCAL, LDAM and LDAM cw) can exacerbate unfairness. We extended class-imbalanced learning approaches to handle fairness under stereotyping, and showed that our models provide consistent gains in fairness without sacrificing accuracy. LDAM reg, which uses a maximum mean discrepancy regularizer (Tzeng et al., 2014), and LDAM adv, which uses an adversarial loss (Ganin and Lempitsky, 2015), are two different ways of making the text representation independent of demographic attributes. Consistent with previous work (Louizos et al., 2016), we find that LDAM reg is robust and performs best across several test scenarios, except in extremely skewed (or stereotyped) settings, where its gains over its adversarial counterpart (LDAM adv) diminish. In addition, LDAM adv introduces more parameters into the model and is in general harder to train, hence LDAM reg is preferable overall. In future work, we plan to extend our methods to more complex tasks and multiple private attributes (Subramanian et al., 2021).