Counterfactual Inference for Text Classification Debiasing

Today’s text classifiers inevitably suffer from unintended dataset biases, especially document-level label bias and word-level keyword bias, which may hurt models’ generalization. Many previous studies employed data-level manipulations or model-level balancing mechanisms to recover unbiased distributions and thus prevent models from capturing the two types of biases. Unfortunately, they either suffer from the extra cost of data collection/selection/annotation or need an elaborate design of balancing strategies. Different from traditional factual inference, in which debiasing occurs before or during training, counterfactual inference mitigates the influence of unintended confounders after training, making it possible to draw unbiased decisions from biased observations. Inspired by this, we propose a model-agnostic text classification debiasing framework, CORSAIR, which avoids both data manipulations and the design of balancing mechanisms. Concretely, CORSAIR first trains a base model on a training set directly, allowing the dataset biases to ‘poison’ the trained model. In inference, given a factual input document, CORSAIR imagines its two counterfactual counterparts to distill and mitigate the two biases captured by the poisonous model. Extensive experiments demonstrate CORSAIR’s effectiveness, generalizability and fairness.


Introduction
Text classification, mapping text documents to a set of predefined categories, is a fundamental and important technique serving many applications such as sentiment analysis (Qian et al., 2020b), partisanship recognition (Kiesel et al., 2019) and spam detection (Castillo et al., 2007). Machine learning models have become the default choice for solving text classification, owing to their ability to recognize textual patterns from labeled documents (Kim, 2014; Howard and Ruder, 2018). Nevertheless, they are at risk of inadvertently capturing and even amplifying unintended dataset biases (Zhao et al., 2017; Feder et al., 2020; Blodgett et al., 2020), which can be at the document level (i.e., label bias) and the word level (i.e., keyword bias).

* This work was partly done during Chen Qian's internship at Alibaba DAMO Academy. Fuli Feng and Lijie Wen are the co-corresponding authors.
1 The code is available at https://github.com/qianc62/Corsair.
The label bias issue occurs in scenarios where a portion of the categories possesses far more training examples than others. For example, the label distribution of a binary sentiment analysis dataset could be 95%:5% (Dixon et al., 2018). Many previous studies found that models trained on such data are potentially at risk of simply predicting the majority answers (Dixon et al., 2018). The keyword bias issue occurs in situations where trained models exhibit excessive correlations between certain words and categories, e.g., some sentiment-irrelevant words, such as "black" or "islam", are always connected to the negative category. As such, models tend to unfairly assign any document containing those keywords to a specific category according to the biased statistical information instead of the intrinsic textual semantics (Waseem and Hovy, 2016; Liu and Avci, 2019). These serious disadvantages limit models' generalization, especially in scenarios where the training data is distributed differently from the testing data (Niu et al., 2021; Goyal et al., 2017).
To resolve the issues, an effective solution is to perform data-level manipulations (e.g., resampling (Qian et al., 2020b)), which transform a training set into a relatively balanced one before training. Another line of debiasing work designs model-level balancing mechanisms (e.g., reweighting), aiming to adaptively decrease the influence of majority categories while increasing that of minority categories during training. The core of the two types of solutions is to explicitly or implicitly recover unbiased distributions and prevent models from capturing the unintended biases. Unfortunately, the data-level strategy typically suffers from the extra manual cost of data collection, selection and annotation, requires much longer training time, and normally enlarges the gap between training and testing data distributions. The model-level strategy typically needs elaborate selection or definition of balancing strategies and needs relearning from scratch once certain balancing mechanisms (e.g., an unbiased training objective) are redesigned.
Must machine learning models perform debiasing before or during training? Consider the difference in decision-making processes between machines and humans. Machine learning systems are forced to imitate behavior from observations by maximizing the prior probability, from which the decision is directly drawn during inference. By contrast, we humans, although born and raised in a biased nature, have the ability of counterfactual inference to make unbiased decisions with biased observations (Niu et al., 2021). To illustrate, we briefly compare traditional factual inference and counterfactual inference in text classification:

• Factual Inference: What will the prediction be if seeing an input document?

• Counterfactual Inference: What will the prediction be if seeing the main content of an input document only and had not seen the confounding dataset biases?

Counterfactual inference essentially grants humans the imagination ability (i.e., "had not done") to make decisions that account for both the main content and the confounding biases (Tang et al., 2020), as well as to introspect whether a decision has been deceived (Niu et al., 2021); i.e., counterfactual inference leads to debiased predictions.
Inspired by this, we propose a novel model-agnostic paradigm (CORSAIR), which adopts factual learning and then mitigates the negative influence of the dataset biases in inference (i.e., after training), without the need to employ data manipulations or design balancing mechanisms. Concretely, in training, CORSAIR directly trains a base model on an original training set, allowing the unintended dataset biases to "poison" the model. To "rescue" the testing documents from the poisonous model, in testing, for each factual input document, CORSAIR imagines its two types of counterfactual counterparts to produce two counterfactual outputs as the distilled label bias and keyword bias. Lastly, CORSAIR performs a bias-removal operation to produce a counterfactual prediction that corresponds to a debiased decision. To verify this, we perform extensive experiments on multiple public benchmark datasets. The results demonstrate the proposed framework's effectiveness, generalizability and fairness, showing that CORSAIR, when employed on four different types of base models, significantly helps mitigate the two types of dataset biases.

Methodology
Problem Formalization Let X and Y denote the input (text document) and output (category) spaces, respectively. Given a labeled training set D_train = {(x_i, y_i)} ⊆ X × Y (i.e., the observed data), the goal is to learn a text classifier M on D_train, which serves as a mapping function f(·): X → Y to accurately classify testing examples in D_test = {x | x ∈ X}.
Considering that the dataset biases would not be completely eliminated via data manipulations, employing data manipulations (e.g., resampling) or designing balancing mechanisms (e.g., reweighting) may not be an ideal solution. Inspired by the success of counterfactual inference in mitigating biases in computer vision (Niu et al., 2021; Wang et al., 2020; Tang et al., 2020; Yang et al., 2020; Goyal et al., 2017), we propose a counterfactual-inference-based text-classification debiasing framework (CORSAIR), which is able to make unbiased decisions with biased observations. The core idea of CORSAIR is to train a "poisonous" text classifier regardless of the dataset biases and to post-adjust the biased predictions in inference according to the causes of the biases. It is worth mentioning that CORSAIR can be applied to almost any parameterized base model, including traditional one-stage classifiers (e.g., TEXTCNN (Kim, 2014), RCNN (Lai et al., 2015) and LECO (Qian et al., 2020b)) and currently prevalent two-stage classifiers (e.g., ULMFIT (Howard and Ruder, 2018), BERT (Devlin et al., 2019) and RoBERTa). For brevity, we will elaborate CORSAIR by taking RoBERTa (a robustly optimized BERT-style language model) as the example base model, and binary sentiment analysis as the example application. The high-level architecture of CORSAIR is illustrated in Figure 1, which consists of three main components: biased learning, bias distillation and bias removal.

Figure 1: The architecture of our proposed model-agnostic framework (CORSAIR). Specifically, CORSAIR first trains a base model on the training data directly so as to preserve the dataset biases in the trained model. In the inference phase, given a factual input document, CORSAIR first imagines its two types of counterfactual documents to produce two counterfactual outputs as the distilled label bias and keyword bias. Finally, CORSAIR searches two adaptive parameters to perform bias removal to produce a counterfactual prediction for a debiased answer.

Biased Learning
In the learning phase (i.e., training), CORSAIR first trains the base model RoBERTa to learn a mapping relation based on the training data. As in traditional training, CORSAIR uses feedforward passes to predict batch examples and backward passes to update the learnable parameters in an end-to-end fashion. In practice, we adopt the standard cross entropy as the training objective (i.e., loss function):

L(θ) = −(1/n) Σ_{i=1}^{n} π_i · log π̂_i

where θ denotes the learnable parameters of the base model f(·), n is the number of batch examples, π_i is the ground-truth label distribution (over Y) and π̂_i is the predicted probability distribution (over Y) for a given training example x_i.
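The objective above can be sketched in a few lines. This is a minimal NumPy illustration (not the paper's PyTorch implementation); `pi_true` and `pi_pred` are hypothetical names standing for the ground-truth distributions π_i and the predicted distributions π̂_i:

```python
import numpy as np

def cross_entropy_loss(pi_true, pi_pred, eps=1e-12):
    """L(theta) = -(1/n) * sum_i pi_i . log(pi_hat_i) over a batch.

    pi_true: (n, |Y|) ground-truth label distributions (usually one-hot).
    pi_pred: (n, |Y|) predicted probability distributions pi_hat.
    """
    pi_pred = np.clip(pi_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(pi_true * np.log(pi_pred), axis=1)))
```

For one-hot labels this reduces to the usual negative log-likelihood of the gold category.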

Bias Distillation
In the inference phase (i.e., testing), traditional methods make a prediction for each testing document via the conventional feedforward operation on the trained base model, obtaining the probability distribution over Y (i.e., the factual prediction) from which the most probable answer is drawn. However, in addition to the textual content of the document, the prediction is also affected by unintended confounders (Pearl and Mackenzie, 2018), which may produce the label bias and the keyword bias. To obtain unbiased predictions, the key is to debias during inference by blocking the spread of the biases from learning to inference. To achieve this, inspired by counterfactual studies in causal reasoning (Niu et al., 2021; Tang et al., 2020), we design an effective strategy based on causal intervention (Pearl, 2013; Pearl and Mackenzie, 2018) to distill the potentially harmful biases captured by the trained model, and then mitigate them via bias removal.

Causal Graph
Aiming to conduct proper causal intervention, we first formulate the causal graph (Pearl, 2013; Pearl and Mackenzie, 2018; Tang et al., 2020) for text classification models (see the left-bottom part of Figure 1), which sheds light on how the document contents and the dataset biases affect the prediction. Formally, a causal graph is a directed acyclic graph G = (N, E), indicating how a set of variables N causally interact with each other through the causal links E. It provides a sketch of the causal relations behind the data and how variables obtain their values (Tang et al., 2020), e.g., (X, M) → Y. In this causal graph, X, Y and M denote a text document's embedding, its corresponding prediction and the trained model (which inevitably captures unintended confounders existing in the training data), respectively.

Label Bias Distillation
According to the causal graph, we diagnose how the dataset biases existing in the training data mislead inference. Concretely, by using the Bayes rule (Wang et al., 2020), we can view the inference as:

P(Y | X) = Σ_c P(Y | X, c) · P(c | X)

where c could be any confounder captured by the model trained on a biased training set (e.g., the overwhelming majority of training documents fall in POSITIVE). Under such circumstances, once the training documents of the POSITIVE category dominate those of NEGATIVE, the trained model tends to build strong spurious connections between testing documents and POSITIVE, achieving high accuracy even without knowing the testing documents' main contents. As such, the model is inadvertently contaminated by the spurious causal correlation X ← M → Y, a.k.a. a back-door path in causal theory (Pearl and Mackenzie, 2018; Pearl, 2013). To decouple this spurious causal correlation, the back-door adjustment (Pearl and Mackenzie, 2018; Pearl, 2013; Pearl et al., 2016) predicts an actively intervened answer via the do(·) operation:

P(Y | do(X = x̄)) = Σ_c P(Y | X = x̄, c) · P(c)

where x̄ could be any counterfactual embedding as long as it is no longer dependent on M, so as to detach the connection between X and M. As illustrated in the fully-blindfolded counterfactual world in Figure 1, the causal intervention wipes out all incoming links of the cause variable X, which encourages the model M to perform inference without seeing any testing document, i.e., RoBERTa should be fully blind in order to detach the connection between M and X. To achieve this, we use x̄ to denote the imagined fully-blindfolded counterfactual document in which all words of the test document x are consistently masked (to create a counterfactual embedding), and f(x̄) as the corresponding counterfactual output via a feedforward pass through the trained model. Since the model cannot see any word of the factual input x after full blindfolding, f(x̄) actually reflects the pure influence of the trained base model M.

Furthermore, f(x̄) refers to the output (e.g., a probability distribution or a logit vector) produced when no textual information is given. Thus, the fully-blindfolded counterfactual output

f(x̄),  x̄ = ([MASK], [MASK], …, [MASK])

naturally serves as the label bias captured by M, where [MASK] is a special token used to mask a single word. Since x̄ is fully blindfolded and independent of the trained model M, in implementation we follow Wang et al. (2020) and use the average document feature over the whole training set as the embedding of the counterfactual document.
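A minimal sketch of the label bias distillation follows, under simplifying assumptions: `train_features` and `classifier_head` are hypothetical stand-ins for the training-set document features and the trained model's output layer, following the average-feature realization of the fully-blindfolded counterfactual described above:

```python
import numpy as np

def distill_label_bias(train_features, classifier_head):
    """Distill the label bias f(x_bar) captured by a trained model.

    train_features: (N, d) document features of the whole training set.
    classifier_head: maps a (d,) feature to a (|Y|,) output vector; it
                     stands in for the trained model's output layer.
    """
    # x_bar: the fully-blindfolded counterfactual document, realized as the
    # average document feature on the training set (following Wang et al., 2020).
    x_bar = train_features.mean(axis=0)
    return classifier_head(x_bar)  # f(x_bar): the distilled label bias
```

Because x̄ does not depend on any individual test document, the distilled label bias only needs to be computed once per trained model.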

Keyword Bias Distillation
Inspired by factual inference, where all textual information in a test document is exposed to the base model, and the fully-blindfolded case, where none of it is exposed, we make the first attempt to utilize a partially-blindfolded counterfactual document, in which some words of the test document x are masked, to distill the keyword bias from the trained base model. Specifically, we deliberately expose the words that may potentially cause spurious correlations (e.g., the spurious "black"-to-NEGATIVE mapping) to the trained model, so as to exhibit their potentially negative influence. Some "evil" words may serve as unintended confounders (Tang et al., 2020), splitting a document into two pieces: the main content and the relatively unimportant context. In the following, we use x̂ to denote another counterfactual document in which the main-content words of a test document x are masked while the other context words are not, and f(x̂) as the corresponding counterfactual output. To achieve this, an effective masking strategy is to use discriminative text summarization to extract the main content of the document, before masking the content words (important classification clues) and exposing the others as potentially harmful biasing factors. Since the model is forced to see only the non-masked context words of x, f(x̂) actually reflects the influence of both the potentially harmful contexts and the trained model. Thus, the partially-blindfolded counterfactual output

f(x̂),  x̂ = MASK(x_content) ⊕ x_context

naturally serves as the keyword bias captured by M for a specific text document x, where x_content and x_context denote the main content and the context of x, respectively, and MASK(·) replaces every word with the [MASK] token. Inspired by a recent counterfactual word-embedding study of Feder et al. (2020), to realize discriminative text summarization we use the Jieba³ tool, whose TextRank-based interface can effectively extract the words that may influence the semantics of a sentence as content, leaving potentially discriminative/unfair keywords (e.g., stop words, some adjectives, and semantically unimportant particles) as contexts. Empirically, the average ratio of contents to contexts produced by Jieba on all datasets is approximately 62.03%:37.97%.
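A possible sketch of building the two counterfactual documents at the token level is given below; the content-word set is assumed to come from an external extractor (the paper uses Jieba's TextRank-based interface), so here it is simply an input:

```python
def partially_blindfold(document, content_words, mask_token="[MASK]"):
    """Build x_hat: mask the main-content words of a tokenized document
    while exposing its (potentially biasing) context words."""
    return [mask_token if w in content_words else w for w in document]

def fully_blindfold(document, mask_token="[MASK]"):
    """Build x_bar: mask every word of the document."""
    return [mask_token] * len(document)
```

Feeding x̂ through the trained model then yields the document-specific keyword bias f(x̂), just as feeding x̄ yields the label bias f(x̄).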

Bias Removal
Our final goal is to use the direct effect from X to Y for debiased prediction, removing (\) the label bias and the keyword bias existing in the training data (i.e., blocking the spread of the biases from the training data to inference): f(x) \ f(x̄) \ f(x̂). The debiased prediction via bias removal can be formalized via a conceptually simple and empirically powerful element-wise subtraction:

c(x) = f(x) − λ̄·f(x̄) − λ̂·f(x̂)   (6)

where f(x) and c(x) correspond to the traditional factual prediction and our counterfactual prediction, respectively; f(x̄) and f(x̂) correspond to the label bias and the keyword bias distilled from the trained base model, respectively; λ̄ and λ̂ are two independent parameters balancing the two types of biases. Note that the two distilled biases could be probability distributions over all categories or logit vectors (i.e., without normalization), and they typically do not contribute completely equally to the final classification. As such, in Equation 6, directly subtracting without adaptive parameters (i.e., λ̄ = λ̂ = 1/2) could mitigate a certain bias too much or too little for a specific testing set. Therefore, we propose the elastic scaling mechanism, which searches two adaptive parameters (scaling factors), λ̄* and λ̂*, on the validation set to amplify or penalize the two biases; they dynamically adapt to different datasets according to the extent to which the biases in the training set "poison" the validation set. In practice, elastic scaling can be implemented using grid beam search (Hokamp and Liu, 2017) in a scoped two-dimensional space:

λ̄*, λ̂* = argmax_{λ̄, λ̂} ψ(D_dev, c(x; λ̄, λ̂)),  λ̄, λ̂ ∈ [a, b]   (7)

where ψ is a metric function (e.g., recall, precision or F1-score) evaluating the performance on the validation set D_dev = (X_dev, Y_dev); a and b are the boundaries of the search range. The two factors are dataset-level and thus searched only once per validation set, then used in inference for all testing documents.

³ https://github.com/fxsjy/jieba
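The subtraction of Equation 6 and the search of Equation 7 can be sketched as follows. This is an illustrative simplification: outputs are assumed to be logit vectors, the keyword bias is given per document, the metric takes predicted and gold labels, and an exhaustive grid stands in for the grid beam search of the paper:

```python
import itertools
import numpy as np

def counterfactual_prediction(f_x, f_xbar, f_xhat, lam_bar, lam_hat):
    # Equation-6-style bias removal: c(x) = f(x) - lam_bar*f(x_bar) - lam_hat*f(x_hat)
    return f_x - lam_bar * f_xbar - lam_hat * f_xhat

def elastic_scaling(dev_logits, dev_labels, f_xbar, dev_f_xhat, metric,
                    lo=-2.0, hi=2.0, steps=9):
    """Grid-search lam_bar*, lam_hat* on the validation set (Equation 7)."""
    grid = np.linspace(lo, hi, steps)
    best_lam_bar, best_lam_hat, best_score = grid[0], grid[0], -np.inf
    for lam_bar, lam_hat in itertools.product(grid, grid):
        preds = [int(np.argmax(counterfactual_prediction(f, f_xbar, fh,
                                                         lam_bar, lam_hat)))
                 for f, fh in zip(dev_logits, dev_f_xhat)]
        score = metric(preds, dev_labels)
        if score > best_score:
            best_lam_bar, best_lam_hat, best_score = lam_bar, lam_hat, score
    return best_lam_bar, best_lam_hat, best_score
```

Since the label bias f(x̄) is shared by all documents while f(x̂) varies per document, only the subtraction itself needs to be recomputed at test time once λ̄* and λ̂* are fixed.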

Evaluation
Baselines We choose four types of representative text classifiers as the base models of our proposed framework, covering classical, data-manipulation-based, model-balancing-based, as well as large-scale two-stage methods. TEXTCNN (Kim, 2014) is a classical classifier that uses convolutional neural networks (CNNs) with convolution filters of varying sizes to capture local textual features, and may potentially capture spurious correlations between certain keywords and categories. LECO (Qian et al., 2020b) utilizes a combination of the implicit encoding of deep linguistic information and the explicit encoding of morphological features, which can also capture the keyword bias inadvertently. Besides, it uses a sentence-level over-sampling mechanism (He and Garcia, 2009) to mitigate the label bias, and we further enhance it with a powerful word-level augmentation technique (EDA) (Wei and Zou, 2019) to mitigate the keyword bias, denoted as LECOEDA. WEIGHT is a recent debiasing text classifier that uses a specially designed reweighting technique under an unbiased objective for fair (i.e., non-discriminative) learning, which has proven effective in mitigating the unfairness or discrimination caused by unintended dataset biases. RoBERTa is an improved version of BERT whose effective modifications allow it to generalize better and match or exceed the performance of many post-BERT methods, serving as a very strong baseline in recent work (Gururangan et al., 2020).
Metric We use the widely used macro-F1 metric, the balanced harmonic mean of precision and recall. Macro-F1 is more suitable than micro-F1 for reflecting the extent of the dataset biases, especially in highly skewed cases, since macro-F1 is strongly influenced by the performance in each category (i.e., category-sensitive) whereas micro-F1 gives equal weight to all documents (i.e., category-agnostic) (Kim et al., 2019).
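To illustrate why macro-F1 is preferable here, the following sketch computes it by hand for a hypothetical majority-class predictor on a 95%:5% split; micro-F1 (accuracy in the single-label case) would report 0.95, while macro-F1 exposes the ignored minority category:

```python
def f1_per_class(gold, pred, label):
    """Precision/recall/F1 of one category against all others."""
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(gold, pred):
    """Unweighted mean of per-category F1 scores (category-sensitive)."""
    labels = sorted(set(gold))
    return sum(f1_per_class(gold, pred, l) for l in labels) / len(labels)
```

On a 19:1 split with all-majority predictions, macro-F1 drops below 0.5 because the minority class contributes an F1 of zero.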

Implementation Details The search range in Equation 7 is set to [−2.0, 2.0]. Each training run lasts 10 epochs with the Adam optimizer (Kingma and Ba, 2015), a mini-batch size of 16, a learning rate of 2e−5, and a dropout rate of 0.1. We implement CORSAIR in Python 3.7.3 and PyTorch 1.0.1. All experiments are run on a machine equipped with seven standard NVIDIA TITAN RTX GPUs.

Overall Performance
We report the average results over five different initializations in Table 2. We observe that CORSAIR consistently improves the four types of representative baselines on almost all datasets with statistical significance, regardless of the languages, domains, volumes and applications of the datasets, which validates the effectiveness and generalizability of the proposed framework. Furthermore, since CORSAIR performs debiasing between the traditional factual predictions and the two counterfactual outputs to produce counterfactual predictions, the comparison between each baseline and its CORSAIR-equipped counterpart highlights the importance of counterfactual inference, which is largely ignored by most previous text classification methods. Particularly, CORSAIR can even consistently benefit the data-manipulation-based method (i.e., LECOEDA) and the model-balancing-based method (i.e., WEIGHT), which in turn verifies our initial intuition that the dataset biases cannot be completely eliminated via data manipulations alone, and further illuminates our key insight: preserving biases in models before debiasing in inference.
We also notice that CORSAIR sometimes hurts performance (e.g., RoBERTa+CORSAIR on HYP and ARC); we conjecture this phenomenon comes from the small-scale data, which makes the giant model RoBERTa overfit and thus "fail" to distill two potential biases that are identically distributed with the ideal distributions of the factual biases. Moreover, finetuning a RoBERTa model on large datasets (e.g., SUN) takes about 36 hours, nearly 50 times the training time of a WEIGHT model (about 44 minutes); we thus suggest using lightweight base models in practice when systems' robustness and efficiency matter. Besides, the proposed framework works only in inference and can thus be employed on previously trained models. Therefore, by leveraging counterfactual inference, our approach can serve as a powerful, "data-manipulation-free" and "model-balancing-free" weapon to enhance different types of text classification methods.

Bias Analysis
According to Sweeney and Najafian (2020), the more imbalanced/skewed a prediction produced by a trained model is, the more unfair opportunities it gives over the predefined categories, and the more unfairly discriminative the trained model is. We thus follow previous work (Xiang and Ding, 2020; Sweeney and Najafian, 2020) and use the imbalance divergence metric to evaluate whether a prediction (normally a probability distribution) P is imbalanced/skewed/unfair:

Imbalance(P) = D(P, U)

where D(·) is a distance between P and the uniform distribution U (with |P| elements). Concretely, we use the JS divergence as the distance metric, since it is symmetric (i.e., JS(P||U) = JS(U||P)) and strictly scoped (in [0.0, 1.0]), as compared with the KL divergence. Based on this, to evaluate the label bias and the keyword bias of a trained model M, we average its relative label imbalance (RLI) over the predicted distributions of all the testing documents, and its relative keyword imbalance (RKI) over all the testing documents containing each context word, respectively:

RLI = (1 / |D_test|) Σ_{x ∈ D_test} JS(P(x) || U)
RKI = (1 / |V|) Σ_{w ∈ V} (1 / |D_w|) Σ_{x ∈ D_w} JS(P(x) || U)

where a prediction P(x) could be a factual prediction f(x) or a counterfactual one c(x); V denotes the vocabulary of context words and D_w the set of testing documents containing context word w. The two metrics implicitly capture the distance between all predictions and the fair uniform distribution U. Table 3 shows the average results of the bias analysis over five different initializations. The results show that our framework reduces the imbalance metrics (lower is better) significantly and consistently when employed on non-data-balanced baselines, indicating that it is indeed helpful in mitigating the two dataset bias issues. As expected, the data-balanced LECOEDA perfectly mitigates the label bias issue via data balancing, thus achieving the lowest RLI. Owing to its powerful debiasing via strictly balanced data, it serves as the skyline of RLI. This finding is in line with previous evidence from Morik et al. (2020).
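The imbalance divergence can be sketched with a base-2 JS divergence so that scores fall in [0, 1]; this is an illustrative NumPy implementation, not the authors' code:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Symmetric Jensen-Shannon divergence (log base 2, so JS is in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        a, b = np.clip(a, eps, 1.0), np.clip(b, eps, 1.0)
        return float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def imbalance(prediction):
    """Imbalance divergence of a prediction w.r.t. the uniform distribution."""
    u = np.full(len(prediction), 1.0 / len(prediction))
    return js_divergence(prediction, u)
```

A perfectly uniform prediction scores 0, while a prediction concentrated on one category approaches 1, so averaging this quantity over test documents yields an RLI-style score.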
Moreover, we can also see that LECOEDA reduces the RKI, validating that the data-manipulation methodology is indeed helpful for mitigating the keyword bias issue but fails to eliminate it completely; our framework can further reduce RKI (1.73↓). Note that WEIGHT exhibits a more severe keyword bias than label bias (34.85 vs. 8.88). The key reason is that WEIGHT explicitly balances each category according to a theoretically fair objective but does not consider label distributions conditioned on finer-grained words. Moreover, RoBERTa exhibits the most imbalanced predictions among all baselines, across both small- and large-scale datasets (e.g., ARC and TAO), indicating that its answers excessively concentrate on certain categories due to overfitting rooted in its large number of parameters (about 110M). Luckily, when equipped with our framework, RoBERTa remarkably reduces the imbalance caused by dataset biases (9.50↓ and 4.41↓).
Another finding is that the keyword bias issue is typically more severe than the label bias, meaning that trained models typically rely on word-level information for inference, which can catch "angel" keywords as good clues but also inevitably exploits "evil" keywords that are potential biases. Additionally, the keyword bias, compared with the label bias, is much harder to eliminate completely via data manipulations, which is a caution for relevant studies to keep a watchful eye on detrimental causal correlations.

Ablation Study
We conduct ablation studies on CORSAIR to empirically examine the contribution of its main mechanisms/components, including the label bias removal operation (\LBR), the keyword bias removal operation (\KBR) and the elastic scaling mechanism (\ES).
The average results of the ablation study are shown in Table 4. We can see that removing the proposed CORSAIR causes serious performance degradation, dropping the F1-score by 7.55 points in the WEIGHT case. Additionally, this provides evidence that using the counterfactual framework for text classification can explicitly mitigate the two types of dataset biases and generalize better to unseen examples. Moreover, we observe that mitigating the two types of biases is consistently helpful for classification tasks. The key reason is that the distilled label bias provides a global (i.e., document-agnostic) offset and the distilled keyword bias provides a local (i.e., document-specific) offset to "move" predictions in the prediction space, which makes the trained models "blind" to the potentially harmful biases existing in the observed data so as to focus only on the main content of each document during inference. Meanwhile, elastic scaling effectively finds two dynamic scaling factors to amplify or shrink the two biases, so that they are mitigated properly and adaptively.

Further Investigation on Counterfactual Learning
Recall that our proposed framework first trains a base model on a training set directly (factual learning) so as to preserve the dataset biases in the trained model; in the inference phase, given a factual input document, CORSAIR imagines two types of counterfactual documents to produce two counterfactual outputs as the distilled label bias and keyword bias for bias removal. That is, the framework deliberately causes a discrepancy between learning and inference, leading to an operational gap between the two phases. In this section, we investigate what happens if this operational gap is bridged.
• Factual Learning. Learn with L(θ; f(x_i), y_i) as the objective, i.e., minimize the loss between factual predictions and ground-truth labels. Then, perform inference via counterfactual predictions.
• Counterfactual Learning. Learn with L(θ; c(x_i), y_i) as the objective, i.e., minimize the loss between counterfactual predictions and ground-truth labels. Then, perform inference directly.
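The two objectives can be contrasted in a small sketch; `nll` is an assumed softmax cross-entropy helper, and the λ̄, λ̂ arguments mirror Equation 6:

```python
import numpy as np

def nll(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def factual_objective(f_x, label):
    """Factual learning: fit the factual prediction f(x) to the label;
    bias removal is applied only later, in inference."""
    return nll(f_x, label)

def counterfactual_objective(f_x, f_xbar, f_xhat, lam_bar, lam_hat, label):
    """'Counterfactual learning': fit the already-debiased prediction
    c(x) = f(x) - lam_bar*f(x_bar) - lam_hat*f(x_hat) to the label, so the
    biases spread through the whole architecture during training."""
    return nll(f_x - lam_bar * f_xbar - lam_hat * f_xhat, label)
```

In the second case the subtraction sits inside the training loss, so gradients flow through it and the model can simply relearn around it, which is exactly the degradation observed below.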
The average results of TEXTCNN on ECO (|Y|=2) and CHE (|Y|=13) are reported in Figure 2. We observe that these configurations converge to different F1 scores as the number of epochs increases. For each dataset, the configuration of a factual model with counterfactual inference (i.e., CORSAIR) achieves the best performance with even relatively faster convergence. More interestingly, in the early phases of model training (e.g., epoch=0), CORSAIR usually provides a higher starting point than traditional factual inference. We conjecture that this superiority may come from the use of the average embedding, which usually produces a stable distribution similar to the ideal biases, making a base model happen to "see" the label bias as soon as initialization is done. This phenomenon holds empirically, especially for small-scale classification tasks.

Figure 2: The average results of three different learning paradigms on two datasets: factual learning with factual inference, factual learning with counterfactual inference (i.e., CORSAIR), and counterfactual learning with direct inference.
Surprisingly, counterfactual learning converges to the factual learning case. This finding consistently holds for all other baselines across datasets, which means that the so-called counterfactual learning actually degrades to factual inference. This indicates that if a model explicitly mitigates the two types of dataset biases in an end-to-end fashion, i.e., without the operational gap, it actually loses the ability to perform debiased inference. The important reason is that, under such circumstances, the potential biases "spread" throughout the whole model architecture, rather than only the part before bias removal is operated, which makes bias removal merely look like debiasing while actually being a factual feedforward operation that is unable to capture, distill, let alone mitigate, the biases. Therefore, counterfactual inference works only when the operational gap between learning and inference exists. This beneficial gap instead confines the biases to the part before the bias-removal module, and thus enables them to be distilled via counterfactual inference.

Conclusion
We have designed a counterfactual framework for text classification debiasing. Extensive experiments demonstrated the framework's effectiveness, generalizability and fairness. Future work will design a joint-learning technique to dynamically decide each document's main content. We hope this paradigm can illuminate a promising technical direction for causal inference in natural language processing.