Counterfactual Augmentation for Multimodal Learning Under Presentation Bias

In real-world machine learning systems, labels are often derived from user behaviors that the system wishes to encourage. Over time, new models must be trained as new training examples and features become available. However, feedback loops between users and models can bias future user behavior, inducing a presentation bias in the labels that compromises the ability to train new models. In this paper, we propose counterfactual augmentation, a novel causal method for correcting presentation bias using generated counterfactual labels. Our empirical evaluations demonstrate that counterfactual augmentation yields better downstream performance compared to both uncorrected models and existing bias-correction methods. Model analyses further indicate that the generated counterfactuals align closely with true counterfactuals in an oracle setting.


Introduction
Deployment of machine learning models is ubiquitous in the real world, ranging from web search ranking to movie recommendation. To ensure good performance, new models must be trained periodically, since new training examples may become available, and the types of features that are collected can evolve over time (e.g., from tabular to multimodal data). For user-facing models like recommenders, labels are often derived from user behaviors that the model wishes to encourage, and user-model interactions continuously produce new data that can be used for training models. Modern NLP, for instance, relies heavily on models that learn from user feedback, not the least of which are ChatGPT and other large language models that comprise the current state-of-the-art.
In practice, however, feedback loops between the user and the model can influence future user behavior, inducing presentation bias over the labels (Joachims et al., 2017; Pan et al., 2021). By shifting the label distribution away from users' true preferences, presentation bias compromises the ability to train new models (Schmit and Riquelme, 2018; Krauth et al., 2020). For example, an algorithm may present future content based on a user's interactions with prior content. As the user engages with the algorithm's recommendations or outputs, it will recommend more of that same type of content, even if there are other types of content the user might also enjoy (Figure 1).

Figure 1: An illustration of how presentation bias may arise from feedback loops (e.g., in a movie recommendation system). The top sequence depicts uncorrected presentation bias, while the bottom sequence demonstrates how our method, counterfactual augmentation, can correct presentation bias.
Presentation bias negatively affects the data distribution in two major ways. (1) Bias amplification. New labels are dependent on the prior behavior of the model, so they may not reflect the user's true preferences. This bias will amplify as more training loops are completed on biased data. (2) Label homogenization. As the model learns user behaviors, most users' responses to its recommendations will be positive, so variation in user feedback decreases.
In this paper, we aim to correct the presentation bias resulting from feedback loops. We first propose that presentation bias arises due to the causal relationship between a model's recommendations and a user's behavior, which affects which labels are observed. Users tend to interact with recommended items, so under presentation bias, we are more likely to observe labels for recommended items, while without presentation bias, users would interact with all items (or a random subset), so labels would be observed for the full distribution. We conclude that we can break the causal link behind presentation bias with a counterfactual question: how would users have reacted had they interacted with all items, contrary to reality?
With this idea as our foundation, we introduce counterfactual augmentation (Figure 1), a causal approach for reducing presentation bias using generated counterfactual labels. Because "true" counterfactuals are by definition unknown, counterfactual augmentation leverages the causal relationship between the model's behavior and the user's behavior to generate realistic counterfactual labels. We generate counterfactuals for the labels that are unobserved due to presentation bias, then augment the observed labels with the generated ones. Intuitively, this supplies labels over the full data distribution, yielding a bias-corrected dataset.
We evaluate our method on predictive tasks in language and multimodal settings that reflect real-world presentation bias. We consider data with evolving feature spaces, where over time the features transition from simpler features to richer language or multimodal ones. In our experiments, we demonstrate that counterfactual augmentation effectively corrects presentation bias when training predictive models, outperforming both uncorrected models and existing bias correction methods. We conduct model analyses that examine why counterfactual augmentation is effective for reducing presentation bias and discover that our generated counterfactuals align closely with true counterfactuals in an oracle setting.

Problem Statement
We formalize the problem of presentation bias in machine learning systems in causal terms (Figure 2). These systems usually consume both simple features, such as metadata, and rich features, such as text or images, training on user interactions with different items to produce recommendations. Let t be a time index, and let X_t denote the simple features defined over the feature space X. Similarly, let W_t denote rich features defined over the feature space W. We denote true user item preferences as Y_t ∈ Y and predicted user item preferences as R (for simplicity, we can think of these as binary recommendations). Finally, let A_t be an indicator of which items the user interacts with. Due to feature evolution, only X_t is observed at earlier time points, while later both X_t and W_t are observed. Without loss of generality, assume that the feature evolution occurs in the first two time points t = 0 and t = 1. For ease of notation, we do not include time index subscripts when t = 1.
At t = 0, a predictive model R_0 is trained on an observed feature set X_0 and labels Y_0. This model makes predictions about user preferences for unseen items and recommends items to the user. These recommendations influence A, which of those items the user subsequently interacts with, because users are much more likely to interact with recommended items, such that P(A = 1 | R_0 = 0) ≪ P(A = 1 | R_0 = 1). In turn, this induces a presentation bias in the distribution of observed Y, the user's measured preferences at t = 1. Due to the presentation bias, there is a very high probability of observing Y when R_0 = 1 and a very low probability of observing Y when R_0 = 0.
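As a concrete illustration of this mechanism, the following simulation sketches how a recommender R_0 skews which labels are observed. It is illustrative only (the interaction probabilities and the recommender's accuracy are assumed values, not drawn from our experiments):

```python
import random

random.seed(0)

def simulate(n=100_000, p_interact_rec=0.9, p_interact_unrec=0.1):
    """Simulate how a recommender R0 biases which labels Y are observed.

    Each item has a true preference y ~ Bernoulli(0.3). A hypothetical
    recommender R0 is weakly correlated with y; users interact (A = 1)
    far more often with recommended items, so labels are observed mostly
    where R0 = 1. Returns the positive rate among *observed* labels.
    """
    observed = []
    for _ in range(n):
        y = 1 if random.random() < 0.3 else 0        # true preference
        # R0 agrees with y 70% of the time (imperfect recommender)
        r0 = y if random.random() < 0.7 else 1 - y
        p_a = p_interact_rec if r0 == 1 else p_interact_unrec
        if random.random() < p_a:                     # A = 1: label observed
            observed.append(y)
    return sum(observed) / len(observed)

# The observed positive rate (~0.45) is inflated well above the true
# rate of 0.3, because labels are preferentially observed where R0 = 1.
print(simulate())
```

Even with a mediocre recommender, the observed label distribution departs sharply from the true one, which is exactly the gap a model trained on observed labels inherits.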
At t = 1, a full set of simple and rich features (X, W) is observed due to feature evolution. However, because the distribution of Y has been influenced by R_0, a second model R trained on X, W, and Y will not correctly learn user preferences.
Example. Consider a system that must categorize a user's emails as important and unimportant. X is email metadata, and W is the text of the email. Y is an indicator of whether the user interacted with the email positively (e.g., replied) or negatively (e.g., reported spam). R_0 is a classifier trained on X_0 and Y_0 to label emails as important or unimportant. Users preferentially interact with important emails, so emails with R_0 = 1 have a much higher chance of user interaction (i.e., A = 1) and therefore of having an observed label Y, inducing a bias in Y that depends on R_0. After R_0 is trained, the system's administrators want to train a new, improved model R using both X and W. However, the bias in Y will affect the ability to train R.

Methods
To eliminate presentation bias, we notice that we must block the causal path between R_0 and Y so that R_0 no longer influences which Y are observed. Because these two variables are linked by the mediator A, we can block the path by controlling for A. To do so, we define the counterfactual Y_{A=a}, the value Y would have taken had A = a.

Counterfactual augmentation
Using Y_{A=a}, we block the path between the recommender R_0 and the label Y with the following intuition. A indicates which items users interact with and thus which labels are observed. We can therefore eliminate the influence of A by generating a synthetic data distribution in which all items receive user interaction and all Y are "observed."

Formally, let P(Y) denote the marginal distribution of the labels and P(X, Y) denote the joint marginal distribution of the features and the labels. In an unbiased setting, a model f is optimized over data (x, y) ~ P(X, Y). Under presentation bias, however, only a portion of P(Y) is observed: the conditional distribution P(Y | A = 1). Consequently, the model f is trained over data (x, y) ~ P(X, Y | A = 1), which may lead to convergence to a non-optimal solution.

Definition 3.1 (Counterfactual augmentation). To correct presentation bias in the data distribution, counterfactual augmentation creates an approximation of the marginal label distribution P(Y) using the estimated distribution of counterfactual labels Y_{A=1}, or what Y would have been had A = 1. This allows us to define P_CA(Y), a counterfactually augmented marginal label distribution:

    P_CA(Y) = P(Y | A = 1) P(A = 1) + P(Y_{A=1} | A = 0) P(A = 0)

Combining labels from P_CA(Y) with the known features, we have P_CA(X, Y), a counterfactually augmented marginal data distribution:

    P_CA(X, Y) = P(X, Y | A = 1) P(A = 1) + P(X, Y_{A=1} | A = 0) P(A = 0)

From P_CA(X, Y), bias-corrected data can be sampled, such that the model f is now optimized over (x, y) ~ P_CA(X, Y). Supposing P_CA(X, Y) is a good approximation of P(X, Y), f should converge to a near-optimal solution.
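Operationally, the augmentation reduces to a simple substitution once a counterfactual generator is available: keep observed labels where A = 1 and fill in generated counterfactuals where A = 0. A minimal sketch (the function names are hypothetical, not our released code):

```python
def counterfactually_augment(items, generate_cf):
    """Build a bias-corrected label set in the spirit of Definition 3.1.

    items: list of (x, y, a) triples, where the label y is observed only
           when a == 1 (otherwise y may be None).
    generate_cf: a model producing an estimated counterfactual label
           Y_{A=1} for items the user never interacted with.
    Returns (x, y) pairs approximating samples from P_CA(X, Y).
    """
    augmented = []
    for x, y, a in items:
        if a == 1:
            augmented.append((x, y))                # observed label kept as-is
        else:
            augmented.append((x, generate_cf(x)))   # generated counterfactual
    return augmented
```

In practice `generate_cf` is our multimodal counterfactual GAN, but any estimator of Y_{A=1} slots into the same interface.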

Multimodal counterfactual GAN
We implement counterfactual augmentation with a generative adversarial network (GAN) capable of generating realistic counterfactual labels given multimodal input data. Inspired by the work of Yoon et al. (2018), who propose a GAN (GANITE) specifically for estimating individual causal effects, we generate labels, both factual and counterfactual, with a generator G, then train a discriminator D to distinguish factual from counterfactual labels. Our architecture (Figure 3) extends their work in several core aspects:

Mediators. Rather than estimating the direct effect of an intervention A on an outcome Y, we seek to model the indirect effect of a variable R on an outcome Y through the mediator A. We account for both of these dependencies, allowing us to later block the effect of R on Y by intervening on A.
Language and multimodal data. Where GANITE was designed for tabular data only, our implementation handles richer features like text and images. We integrate language and image encoders into the architecture that can be simultaneously fine-tuned as the counterfactual GAN is trained.
Correcting the discriminator constraint. Let Y_cf denote a counterfactual label and Y_f denote a factual label. The discriminator of GANITE encourages P(Y_cf | X) → P(Y_f | X), i.e., the estimated distribution of counterfactual labels should converge to the true distribution of factual labels. However, in feedback loops, the label is much more likely to be observed (i.e., factual) when R = 1 than when R = 0. Then the discriminator may enforce P(Y_cf | X, R = 0) → P(Y_f | X, R = 1), which would mean that labels follow the same distribution regardless of whether R = 1 or R = 0.
We address this problem by defining two separate discriminators, one for each recommendation condition. Each discriminator is arbitrarily passed a factual or counterfactual label from its recommendation condition, and it must identify whether the label is factual or counterfactual. The separate discriminators encourage the realistic constraints P(Y_cf | X, R = 0) → P(Y_f | X, R = 0) and P(Y_cf | X, R = 1) → P(Y_f | X, R = 1).
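The data flow of the counterfactual block can be sketched as follows. This is a structural illustration only: toy scalar callables stand in for the neural generator and the two discriminators, and the losses are returned rather than backpropagated. It is not our implementation:

```python
import math
import random

random.seed(1)

def bce(p, target):
    """Binary cross-entropy for one predicted probability p and target in {0, 1}."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def gan_step(x, y_f, r, G, D0, D1):
    """One conceptual training step of the counterfactual GAN.

    G maps features x to (y_f_tilde, y_cf_tilde); D0 and D1 are the two
    per-recommendation-condition discriminators, each mapping (x, label)
    to the probability that the label is factual. Returns the supervised
    loss on the factual label and the discriminator loss; a real
    implementation would backpropagate these to G and D.
    """
    y_f_tilde, y_cf_tilde = G(x)
    sup_loss = bce(y_f_tilde, y_f)          # supervised loss: y_f vs y_f_tilde

    D = D1 if r == 1 else D0                # route by recommendation condition
    if random.random() < 0.5:               # pass the true factual label ...
        d_loss = bce(D(x, y_f), 1)
    else:                                   # ... or the generated counterfactual
        d_loss = bce(D(x, y_cf_tilde), 0)
    return sup_loss, d_loss
```

The routing step (`D1 if r == 1 else D0`) is what enforces the per-condition constraints above: each discriminator only ever compares factual and counterfactual labels drawn from the same recommendation condition.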

Experiments
We conduct empirical evaluations to assess how well counterfactual augmentation corrects presentation bias, with the aim of improving downstream performance. We evaluate on predictive machine learning tasks, which reflect real-world models' goals of predicting user behavior, and in multiple data settings, both synthetic and real-world. To facilitate detailed analysis of our models, we introduce a procedure for inducing realistic presentation bias in unbiased datasets. All data and code for our experiments will be publicly released.

Datasets
To recreate feature evolution in our experiments, we evaluate on datasets that contain both tabular features and rich features like text or images. We select two datasets from the Multimodal AutoML Benchmark (Shi et al., 2021): Airbnb and Clothing. Airbnb consists of 22,895 Airbnb listings for the city of Melbourne, including metadata, text descriptions, and images of the property. The nightly price of the listing is the label. Clothing comprises 23,486 reviews of women's clothing from an online retailer, with metadata, the title, and the text of the review. The review score is treated as the label. Both labels can be predicted directly via regression, but they can also be discretized to be used in classification tasks (as we do in our evaluation). For our binary classification tasks, we binarize both datasets in a 0-1 proportion of approximately 0.25 to 0.75 to reflect real-world data, in which the majority of feedback received from users is positive.
We further create a synthetic version of the Airbnb dataset (Synthetic) in which the features are taken from the real dataset, but the label is synthesized as a noisy function of the tabular features and the multimodal features. The purpose of this dataset is to evaluate the efficacy of counterfactual augmentation in a "best-case scenario" in which we know that there is some signal about the label that can be gained independently from both the simple features and the rich features. We use a binary label, again to reflect a "best-case scenario" in which the downstream task is relatively easy (compared to multi-class classification or regression), with a 0-1 proportion of approximately 0.25 to 0.75.
Additional details about the datasets are provided in the appendix (Section A.1).

Method for inducing presentation bias
To induce presentation bias in these datasets in a way that will allow for post-hoc model analysis, we use a procedure that mimics feedback loops in real-world systems. We first create three splits of the data, which correspond to the three data batches in Figure 1. We refer to these splits as D_original, D_train, and D_eval. On D_original, which has no presentation bias, we fit a model M_tab on the labels Y_original, using only tabular features.
We use M_tab to predict labels R_train for D_train using tabular features, where R_train = M_tab(X_train). R_train corresponds to R_0 in our causal structure. Next, we drop 90% of the labels from samples in D_train where R_train = 0 (where Y is multi-class or binary, we use a threshold value instead). This induces presentation bias by creating the causal dependency R_0 → A → Y, where labels are observed with high probability when R_0 = 1 and with low probability when R_0 = 0. We also randomly drop ~35% of samples from D_train with equal probability (reflecting the remaining items that users do not interact with).
Finally, for D_eval, we create an unbiased version in which we leave D_eval as it is, and a biased version D_biased. For D_biased, we again use M_tab to predict labels R_eval using only tabular features, where R_eval = M_tab(X_eval). We then drop 90% of the samples in D_biased where R_eval = 0.
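The bias-induction step on D_train can be summarized in a short sketch (function names and data layout are hypothetical, not our released code; `None` marks an unobserved label):

```python
import random

random.seed(2)

def induce_presentation_bias(data, m_tab, p_keep_unrec=0.10, p_drop_random=0.35):
    """Mimic the bias-induction procedure described above (sketch).

    data: list of (x_tab, y) pairs with full, unbiased labels.
    m_tab: model fit on D_original predicting R from tabular features.
    Labels where R = 0 are kept with probability ~0.10 (i.e., 90% dropped),
    and ~35% of samples are additionally dropped uniformly at random.
    """
    biased = []
    for x_tab, y in data:
        if random.random() < p_drop_random:      # items the user never sees
            continue
        r = m_tab(x_tab)                         # R_train = M_tab(X_train)
        if r == 0 and random.random() > p_keep_unrec:
            biased.append((x_tab, None))         # label unobserved (A = 0)
        else:
            biased.append((x_tab, y))            # label observed (A = 1)
    return biased
```

The result reproduces the causal dependency R_0 → A → Y: labels survive almost always when R = 1 and only rarely when R = 0.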

Models
Baselines. We compare counterfactual augmentation against several baselines. First, we include a model without bias correction (Uncorrected). To provide the best chance of achieving good performance, we use pre-trained transformer architectures fine-tuned on the respective task datasets: DistilBERT (Sanh et al., 2020) for language and ViT (Dosovitskiy et al., 2021) for images. These models are used as encoders for the text and images of the datasets. Once embeddings are obtained, they are concatenated with the tabular data and passed to a final layer fine-tuned for the predictive task.
Our remaining baselines are implementations of existing methods for correcting presentation bias, both of which we describe further in Section 6. In our experiments, the IPW baseline is implemented identically to the uncorrected baseline; however, when fine-tuning the final task layer, an inverse propensity weighted loss (Wang et al., 2016) is used. The Dragonnet baseline is an adaptation of a method proposed by Shi et al. (2019) for jointly estimating causal treatments and outcomes with a single neural network. To make this method compatible with our data setting, we pre-embed the text and images before passing them to Dragonnet, and we also modify the final layer to output estimated counterfactuals rather than estimated causal effects.
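For reference, the core of the IPW correction can be sketched as a reweighted empirical loss. This is a simplified scalar version of the idea, not the baseline's exact implementation:

```python
def ipw_loss(losses, propensities):
    """Inverse-propensity-weighted empirical loss (sketch).

    losses: per-sample losses, computed only on observed (A = 1) samples.
    propensities: per-sample P(A = 1 | R), e.g. high for recommended items
    and low otherwise. Rarely-observed samples are up-weighted by 1/p, so
    every condition contributes as if equally likely under the full
    data distribution.
    """
    assert len(losses) == len(propensities)
    return sum(l / p for l, p in zip(losses, propensities)) / len(losses)

# Two samples with identical raw loss: the one observed with probability
# 0.25 counts four times as much as one observed with probability 1.0 would.
print(ipw_loss([1.0, 1.0], [0.5, 0.25]))
```

This makes concrete the contrast with counterfactual augmentation: IPW reweights the labels that were observed, whereas our method synthesizes the labels that were not.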
Counterfactual augmentation. In our proposed method, counterfactual augmentation (CA), we train our multimodal counterfactual GAN on a biased dataset, then use the GAN to generate the counterfactual labels for all samples for which labels are not observed. Combining the generated labels with the observed labels, we have a bias-corrected dataset. With this bias-corrected data, we encode text and images using fine-tuned DistilBERT and ViT, combine them with the tabular data, and train a final layer for the specific task.
Additional details about the training procedures are provided in the appendix (Section A.2).

Evaluation
In our evaluation, our models are fit on D_train (which contains presentation bias) and evaluated on both D_eval and D_biased. Evaluation on D_eval indicates how well our model predicts the label in a setting where presentation bias is not a factor (i.e., if we knew all labels). Evaluation on D_biased indicates how well our model predicts the label given the data that is available to us in reality. We note that as a consequence of presentation bias, any class or distribution imbalance in the labels Y will be amplified, since there is a positive relationship between the predicted labels R and the true labels Y. This imbalance reflects the real-world tendency for users to like their recommendations and for positive labels to dominate. Therefore, in classification tasks, overall accuracy and F1 score will be artificially high for a model that simply predicts the most common class. Important measures of success for a method will instead be F1_mac, or macro F1 score (F1 score uniformly weighted across all classes), and for binary classification, F1_min, or F1 score on the minority class.
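To make the metric choice concrete, a minimal macro F1 computation (a pure-Python sketch, not our evaluation code) shows how a majority-class predictor inflates accuracy while collapsing on the minority class:

```python
def f1_per_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (F1_mac)."""
    classes = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

# With 75% positives, predicting all 1s gives accuracy 0.75, but the
# minority-class F1 is 0, dragging macro F1 down to ~0.43.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
print(macro_f1(y_true, y_pred))
```

Under the ~0.25/0.75 label split described above, macro and minority-class F1 therefore expose exactly the failure mode that overall accuracy hides.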

Prediction task results
We evaluate counterfactual augmentation against our baselines in the Synthetic, Airbnb, and Clothing data settings on binary classification tasks (Tables 1 and 2) and multi-class classification and regression tasks (Tables 3 and 4). "Improvement" is computed by taking the difference between the CA score and the score of the next-best method for that metric, which is generally IPW.
We observe that when evaluating in an unbiased setting (which reflects "true" preferences), counterfactual augmentation offers the best performance across all metrics for all tasks on all datasets, often by a significant margin. It outperforms not only the uncorrected baseline but also both bias-correction baselines, IPW and Dragonnet. When evaluating in a biased setting (which reflects the evaluation data we have available in reality), counterfactual augmentation also improves performance across metrics, tasks, and datasets. In general, it outperforms competing bias-correction methods, with the single exception being the binary classification task for the Clothing Review dataset, where it does not achieve as much improvement as IPW but still offers substantial gains over the uncorrected model.
Importantly, the biggest improvements resulting from counterfactual augmentation are in the minority classes. As we mention previously, due to the imbalance in the distribution of Y, macro and minority class F1 score are the best measures of performance. Furthermore, since the generated counterfactual labels correspond largely to the minority classes, the relatively high minority class F1 scores suggest that the generated counterfactuals are sufficiently realistic to allow the model to learn.
Taken together, these results suggest that counterfactual augmentation is indeed successful in correcting presentation bias, and that it does a better job than existing bias-correction methods. From an empirical standpoint, counterfactual augmentation is both useful and stable across settings and tasks, yielding consistently good performance.

Model analysis: Why does counterfactual augmentation work?
Counterfactual augmentation produces clear empirical gains in downstream performance over both uncorrected models and existing bias-correction methods. In this section, we analyze our generated counterfactuals to better understand these improvements. Although true counterfactuals are never known in the real world, in these experiments we do have access to the true counterfactuals of D_train, which we withheld in the process of inducing presentation bias. We use these as a basis for comparison with our generated counterfactuals.
Comparing counterfactual distributions. To assess how well our generated counterfactuals correspond to real counterfactuals, we plot their distributions together for each combination of dataset and task (Figure 4). We observe that for the easier binary classification task, the distribution of generated counterfactuals closely reflects that of the true counterfactuals across all data settings. On these tasks, it appears that the generated counterfactuals are a good approximation of the true counterfactuals. For the more difficult multi-class classification and regression tasks, the difference between the generated and true distributions is greater.
Reducing presentation bias helps correct the label imbalance that exists in the overall label distribution (Figure 5). However, the generated counterfactual distribution also tends to be more uniform compared to the true counterfactual distribution (seen in 4 out of 6 plots in Figure 4). Therefore, even aside from the reduced presentation bias, the greater uniformity of the generated counterfactuals may further correct label imbalance. In general, we observe that the bias-corrected label distribution P_CA(Y) is more balanced than the uncorrected label distribution P(Y | A = 1). This reduction in label imbalance better enables a model to learn on the bias-corrected set.
Performance of an oracle. Because we have access to the true counterfactuals, we can train an oracle model over an unbiased version of D_train. By comparing the oracle to counterfactual augmentation, we can determine how well counterfactual augmentation recovers performance compared to the original unbiased data. We report the results of the oracle in Tables 1 through 4.
For most tasks and data settings, we observe that, as expected, counterfactual augmentation still results in some loss of performance compared to the oracle. However, the performance gap between CA and the oracle is generally substantially less than the performance gap between CA and the next-best bias-correction method. These results suggest that although CA constitutes a significant improvement over existing methods, further refinement of the counterfactual generation method may be able to yield even better results.

Related Work
Presentation bias may be considered a type of selection bias, in which the sampling distribution differs from the population distribution. Selection bias is a core challenge of observational causal inference, where the causal effect of a treatment A on an outcome Y is estimated not from a randomized trial but from observed data. Since the treatment assignment mechanism is not random, it must be accounted for during estimation.
One common method for addressing selection bias in causal inference is inverse propensity weighting (IPW) (Robins et al., 1994; Hernán and Robins, 2006). At a high level, IPW up-weights samples corresponding to treatment conditions that are unlikely to be observed and down-weights samples corresponding to treatment conditions that are likely to be observed, such that all treatment conditions appear to be equally likely over the data distribution. This blocks the causal path between the treatment assignment and the outcome.
IPW for presentation bias correction. Using this principle, a number of works propose an inverse propensity weighted empirical loss function that can be used to reduce the effects of presentation bias when training a model on biased data (Wang et al., 2016; Schnabel et al., 2016; Joachims et al., 2017). Several works also engage with IPW in more complex ways. Krauth et al. (2022) address a longitudinal bias setting and propose an algorithm that maximizes the desired outcome at each time step using an IPW-based estimator. Shi et al. (2019) introduce Dragonnet, a fully-connected multi-head neural network that jointly predicts the treatment and the outcome, simultaneously yielding both a propensity score estimate and a predicted outcome.
Task-based presentation bias correction. Because presentation bias can appear in many task settings, there exist a number of task-specific approaches for reducing presentation bias. In information retrieval, for example, unbiased learners of click data (Ai et al., 2018) and propensity-weighted rank metrics (Agarwal et al., 2019) have been proposed, while in the recommender literature, methods have been developed for the matrix factorization setting (Bonner and Vasile, 2018; Wang et al., 2020). However, the task-specific nature of these methods limits their generalizability compared to counterfactual augmentation.
Estimating counterfactuals. The inability to know an individual's counterfactual is a central challenge of causal inference. However, recent works in the deep learning literature have made large inroads toward estimating individual treatment effects (Shalit et al., 2017; Louizos et al., 2017; Yoon et al., 2018), which is an adjacent task to estimating individual counterfactuals. We draw upon this body of work as a basis for obtaining high-quality counterfactuals.
Counterfactuals in NLP. Our work is contextualized within a recent body of research that has shown that counterfactuals are an effective supplement to training data when learning language models (Wang and Culotta, 2021; Qian et al., 2021; Yang et al., 2021; Howard et al., 2022). Existing works largely rely on manually created counterfactuals or programmatically generated counterfactuals. Our method advances beyond prior works by leveraging the causal mechanism behind the missing portions of the data distribution to efficiently generate targeted, high-quality counterfactuals.

Conclusion
In this paper, we introduced counterfactual augmentation, a causal method for correcting presentation bias using generated counterfactuals. We described the causal mechanism behind presentation bias in real-world machine learning systems that rely on user feedback, and we explained the causal reasoning behind counterfactual augmentation. We presented empirical evaluations using counterfactual augmentation to reduce presentation bias, and we found that our approach significantly outperforms existing methods. Finally, we conducted a model analysis to explore why counterfactual augmentation is effective in addressing presentation bias. Given the prevalence of presentation bias in real-world deployments of machine learning models, our findings suggest that counterfactual augmentation has the potential to improve the quality of user-facing machine learning models across many types of applications.

Acknowledgements
This material is based upon work partially supported by Microsoft, the National Science Foundation (awards 1722822 and 1750439), and the National Institutes of Health (awards R01MH125740, R01MH132225, R01MH096951, and R21MH130767). Victoria Lin is partially supported by a Meta Research PhD Fellowship. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors, and no official endorsement should be inferred. We are grateful to Alan Thomas and Sebastian de la Chica for many helpful discussions and feedback.

Limitations
The most significant limitation of counterfactual augmentation is its requirement that the generated counterfactuals be sufficiently close to true counterfactuals; otherwise, the counterfactually augmented data distribution P_CA(X, Y) is not a good approximation of the true data distribution P(X, Y). If poor-quality counterfactuals are produced and P_CA(X, Y) is very different from P(X, Y), counterfactual augmentation could instead hurt models that are trained on the augmented data. Although our multimodal counterfactual GAN generates high-quality counterfactuals for the tasks and data settings that we evaluate, we do not know if this will be the case across every task and data setting. A different counterfactual estimation method may be required depending on the particular problem.
Based on failure modes of causal effect estimation in statistical causal inference, we hypothesize that lower-quality counterfactuals may be produced if:
• The causal mechanism of presentation bias is misspecified.
• The feature data is very noisy or sparse, making it difficult to learn counterfactuals.
• The counterfactual generation model does not have enough capacity to model the data (could be more of a problem for "traditional" statistical linear models).

Ethics Statement
Broader impact. Deep learning models have been shown to perpetuate and even amplify the biases in their training data (Bolukbasi et al., 2016; Swinger et al., 2019; Caliskan et al., 2017). Often, these biases manifest in a similar way to presentation bias: that is, only a portion of the theoretical data distribution is contained in the model's training dataset, which impacts what the model learns.
Therefore, we believe that counterfactual augmentation may be helpful not only in correcting presentation bias but also in reducing social biases in data. In principle, counterfactual augmentation can be used to correct any type of bias for which the causal mechanism is known. The causal mechanism is used to generate counterfactuals, which augment the unobserved portion of the data distribution. Consequently, counterfactual augmentation may also be helpful in correcting social biases and helping make data more fair.
Ethical considerations. When used in conjunction with multimodal data, as it is in this paper, counterfactual augmentation relies in part on large pre-trained models to generate counterfactuals. As a result, it is also possible that the generated counterfactuals themselves may encode the biases contained in large pre-trained models. Users should be cautious when employing counterfactual augmentation in sensitive settings or when using it to reduce biases on protected attributes.
Additionally, we acknowledge the environmental impact of large language and image models, which are used in this work.

Figure 3: Diagram of our multimodal counterfactual GAN architecture. In the counterfactual block, a generator G takes multimodal data as input and generates a factual label ỹ_f and a counterfactual label ỹ_cf. As the true factual label y_f is known, it is used to learn a supervised loss between y_f and ỹ_f that helps to train G. At random, either the true factual label y_f or the generated counterfactual label ỹ_cf is passed to a discriminator D, conditional on the recommendation r corresponding to the label. The discriminator must determine whether the label it has received is factual or counterfactual, and its loss D_loss is used to further train both G and D. After the GAN has been trained, its counterfactuals are used to augment data that is used for predictive tasks (e.g., prediction block).

Figure 4: Comparison of the distributions of the generated counterfactuals and the true counterfactuals.

Figure 5: Comparison of uncorrected label distributions and label distributions after bias-correction with CA.

Figure 2: Proposed mechanism of presentation bias. X_t, W_t, and Y_t denote simple features, rich features, and labels at time t (no subscript for t = 1), while R_t is a model (e.g., a recommender) trained over the input features and labels. A indicates which items a user interacts with.

Table 2: Results on binary classification tasks (biased evaluation dataset).