Types of Out-of-Distribution Texts and How to Detect Them

Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on the formal definition of the distribution shifts of OOD examples and how to best detect them. We categorize these examples as exhibiting a background shift or semantic shift, and find that the two major approaches to OOD detection, calibration and density estimation (language modeling for text), have distinct behavior on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings and perform worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, indicating that these examples constitute a different type of OOD data. Overall, while the categorization we apply explains many of the differences between the two methods, our results call for a more explicit definition of OOD to create better benchmarks and build detectors that can target the type of OOD data expected at test time.


Introduction
Current NLP models work well when the training and test distributions are the same (e.g. from the same benchmark dataset). However, it is common to encounter out-of-distribution (OOD) examples that diverge from the training data once the model is deployed to real settings. When training and test distributions differ, current models tend to produce unreliable or even catastrophic predictions that hurt user trust (Ribeiro et al., 2020). Therefore, it is important to identify OOD inputs so that we can modify models' inference-time behavior by abstaining, asking for human feedback, or gathering additional information (Amodei et al., 2016).
Current work in NLP either focuses on specific tasks like intent classification in task-oriented dialogue (Zheng et al., 2020), or arbitrary in-distribution (ID) and OOD dataset pairs (Hendrycks et al., 2019, 2020b; Zhou and Chen, 2021), e.g. taking a sentiment classification dataset as ID and a natural language inference dataset as OOD. However, getting inputs intended for a different task is rare in realistic settings as users typically know the intended task. In practice, an example is considered OOD for various reasons, e.g. being rare (Sagawa et al., 2020), out-of-domain (Daumé III, 2007), or adversarial (Carlini and Wagner, 2017). This broad range of distribution shifts makes it unreasonable to expect a detection algorithm to work well for arbitrary OOD examples without assumptions on the test distribution (Ahmed and Courville, 2020). In this paper, we categorize OOD examples by common types of distribution shifts in NLP problems, inspired by Ren et al. (2019) and Hsu et al. (2020). Specifically, we assume an input (e.g. a movie review) can be represented as background features (e.g. genre) that are invariant across different labels, and semantic features (e.g. sentiment words) that are discriminative for the prediction task. Correspondingly, at test time we consider two types of OOD examples characterized by a major shift in the distribution of background and semantic features, respectively. While the two types of shifts often happen simultaneously, there are realistic settings where distribution shift is dominated by one or the other. For example, background shift dominates when the domain or the style of the text changes (Pavlick and Tetreault, 2016), e.g. from news to tweets, and semantic shift dominates when unseen classes occur at test time, as in open-set classification (Scheirer et al., 2013).

* Work done while at New York University.
We use this categorization to evaluate two major approaches to OOD detection, namely calibration methods that use the model's prediction confidence (Hendrycks and Gimpel, 2017) and density estimation methods that fit a distribution of the training inputs (Nalisnick et al., 2019a; Winkens et al., 2020; Kirichenko et al., 2020). We show that the two approaches make implicit assumptions about the type of distribution shift, which result in behavioral differences under each type of shift. By studying ID/OOD pairs constructed from both simulations and real datasets, we find that the density estimation method better accounts for shifts in background features, consistently outperforming the calibration method on background shift pairs. We see the opposite on semantic shift pairs, where the calibration method consistently yields higher performance.
In addition, we analyze the detection performance on challenge datasets (McCoy et al., 2019a; Naik et al., 2018b) through the lens of background/semantic shift. We find that these challenge datasets provide interesting failure cases for both methods. Calibration methods fail completely when the model is over-confident due to spurious semantic features. While density estimation methods are slightly more robust, language models are easily fooled by repetitions that significantly increase the probability of a piece of text. Together, our findings suggest that better definitions of OOD and corresponding evaluation datasets are needed both for model development and for fair comparison of OOD detection methods.

1 We exclude task shift, where the OOD examples are from a different task (e.g. textual entailment inputs for a text classification model), because it is unlikely to happen in realistic settings where users are aware of the intended use of the model.

Problem Statement
Consider classification tasks where each example consists of an input x ∈ X and its label y ∈ Y. In OOD detection, we are given a training dataset D_train of (x, y) pairs sampled from the training data distribution p(x, y). At inference time, given an input x ∈ X, the goal of OOD detection is to identify whether x is a sample drawn from p(x, y).

Types of Distribution Shifts
As in Ren et al. (2019), we assume that any representation of the input x, φ(x), can be decomposed into two independent and disjoint components: the background features φ_b(x) ∈ R^m and the semantic features φ_s(x) ∈ R^n. Formally, we have

φ(x) = [φ_b(x), φ_s(x)],  p(φ(x)) = p(φ_b(x)) p(φ_s(x)).

Note that p refers to the ground truth distribution, as opposed to one learned by a model. Intuitively, the background features consist of population-level statistics that do not depend on the label, whereas the semantic features are strongly correlated with the label. A similar decomposition is used in previous work on style transfer (Fu et al., 2018), where a sentence is decomposed into content (semantic) and style (background) representations in the embedding space.
Based on this decomposition, we classify the types of OOD data as either semantic or background shift based on whether the distribution shift is driven by changes in φ s (x) or φ b (x), respectively. An example of background shift is a sentiment classification corpus with reviews from IMDB versus GoodReads where phrases indicating positive reviews (e.g. "best", "beautifully") are roughly the same while the background phrases change significantly (e.g. "movie" vs "book"). On the other hand, semantic shift happens when we encounter unseen classes at test time, e.g. a dialogue system for booking flight tickets receiving a request for meal vouchers (Zheng et al., 2020), or a question-answering system handling unanswerable questions (Rajpurkar et al., 2018). We note that the two types of shifts may happen simultaneously in the real world, and our categorization is based on the most prominent type of shift.

OOD Detection Methods
To classify an input x ∈ X as ID or OOD, we produce a score s(x) and classify it as OOD if s(x) < γ, where γ is a pre-defined threshold. Most methods differ by how they define s(x). Below we describe two types of methods commonly used for OOD detection.
Calibration methods. These methods use the model's prediction confidence as the score. A well-calibrated model's confidence score reflects the likelihood that the predicted label is correct. Since performance on OOD data is usually lower than on ID data, lower confidence suggests that the input is more likely to be OOD. The simplest way to obtain the confidence score is to directly use the conditional probability produced by a probabilistic classifier p_model, referred to as the maximum softmax probability (MSP; Hendrycks and Gimpel, 2017). Formally,

s(x) = max_{y ∈ Y} p_model(y | x).  (5)

While there exist more sophisticated methods that take additional calibration steps, MSP proves to be a strong baseline, especially when p_model is fine-tuned from pretrained transformers (Hendrycks et al., 2020b; Desai and Durrett, 2020).
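As a concrete illustration, MSP can be computed directly from a classifier's output logits. The sketch below is a minimal stand-in for a real classifier (the helper names and the threshold value are ours, not from the paper):

```python
import math

def softmax(logits):
    """Convert raw classifier logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Maximum softmax probability: confidence in the predicted label."""
    return max(softmax(logits))

def is_ood(logits, gamma=0.6):
    """Flag the input as OOD when confidence falls below the threshold gamma."""
    return msp_score(logits) < gamma
```

A confident binary prediction such as logits [2.0, 0.0] yields a score near 0.88 and is kept as ID, while near-uniform logits yield a score near 0.5 and are flagged as OOD.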
Density estimation methods. These methods use the likelihood of the input under a density estimator as the score. For text or sequence data, a language model p_LM is typically used to estimate p(x) (Ren et al., 2019). To avoid bias due to the length of the sequence (see analysis in Appendix A), we use the token perplexity (PPL). Formally, given a sequence x = (x_1, . . . , x_T),

PPL(x) = exp(−(1/T) Σ_{t=1}^T log p_LM(x_t | x_{<t})),  (6)

and we take s(x) = −PPL(x) so that, as with MSP, lower scores indicate OOD inputs. While there are many works on density estimation methods using flow-based models in computer vision (e.g. Nalisnick et al., 2019a; Zhang et al., 2020a), there is limited work on density estimation methods for OOD detection on text (Lee et al., 2020).
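Given per-token log-probabilities (in practice produced by a fine-tuned language model; here passed in as plain numbers), the PPL score reduces to simple arithmetic. A minimal sketch, with names of our own choosing:

```python
import math

def token_perplexity(token_logprobs):
    """PPL(x) = exp(-(1/T) * sum_t log p_LM(x_t | x_<t))."""
    T = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / T)

def ppl_score(token_logprobs):
    """Negate PPL so that, as with MSP, lower scores indicate OOD inputs."""
    return -token_perplexity(token_logprobs)
```

For example, a sequence whose tokens each have probability 0.25 has PPL exactly 4 regardless of its length, which is precisely the length-invariance that motivates using PPL over log p(x).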
Implicit assumptions on OOD.
One key question in OOD detection is how the distribution shifts at test time, i.e. what characterizes the difference between ID and OOD examples. Without access to OOD data during training, this knowledge must be incorporated into the detector through some inductive bias. Calibration methods rely on p(y | x) estimated by a classifier, so they are more influenced by the semantic features, which are correlated with the label. We can see this formally:

p(y | x) = p(y | φ_b(x), φ_s(x)) = p(y | φ_s(x)),

since the label is independent of the background features given the semantic features. In contrast, density estimation methods are sensitive to all components of the input, including both background and semantic features, even when the distribution shift is predominantly driven by one particular type. In the following sections, we examine how these implicit assumptions impact performance on different ID/OOD pairs.

Simulation of Distribution Shifts
As an illustrative example, we construct a toy OOD detection problem using a binary classification setting similar to the one depicted in Figure 1. This allows us to remove estimation errors and study optimal calibration and density estimation detectors under controlled semantic and background shifts.

Data Generation
We generate the ID examples from a Gaussian Mixture Model (GMM):

y ~ Bernoulli(0.5),  x | y ~ N(μ_y, Σ).

The centroids concatenate the semantic and background features, μ_1 = [μ_s, μ_b] and μ_0 = [−μ_s, μ_b], where μ_s ∈ R^n and μ_b ∈ R^m. In the 2D case in Figure 1, this corresponds to two Gaussian clusters where the first component is the semantic feature and the second is the background feature.
In this case, we know the true calibrated score p(y | x) and the true density p(x) given any inputs.
Specifically, the optimal classifier is given by the Linear Discriminant Analysis (LDA) predictor. Setting Σ to the identity matrix, it corresponds to a linear classifier with weights [2μ_s, 0_b], where 0_b ∈ R^m is a vector of all 0s. For simplicity, we set μ_s = 1_s and μ_b = 0_b, where 1_s ∈ R^n is a vector of all 1s and 0_b ∈ R^m is a vector of all 0s.
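Under this GMM, both the optimal calibration score and the true density are available in closed form, which is what makes the simulation informative. A sketch of how we understand these two oracles (function names are ours; we assume equal class priors and Σ = I):

```python
import math

def gaussian_logpdf(x, mu):
    """Log-density of an isotropic unit-variance Gaussian at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * sq - 0.5 * d * math.log(2 * math.pi)

def calibration_score(x, mu_s):
    """Optimal MSP: max(p(y=1|x), p(y=0|x)) from the LDA posterior.

    With Sigma = I the LDA weights are [2*mu_s, 0_b], so only the first n
    (semantic) components of x can influence this score."""
    n = len(mu_s)
    logit = 2 * sum(m * xi for m, xi in zip(mu_s, x[:n]))
    p1 = 1.0 / (1.0 + math.exp(-logit))
    return max(p1, 1.0 - p1)

def density_score(x, mu1, mu0):
    """True log-density of the equal-weight two-component mixture."""
    a, b = gaussian_logpdf(x, mu1), gaussian_logpdf(x, mu0)
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))
```

Translating an input along a background dimension leaves `calibration_score` unchanged while `density_score` drops, which is exactly the behavioral gap the simulation probes.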

Semantic Shift
We generate sets of OOD examples under a semantic shift by varying the overlap of ID and OOD semantic features. Formally, we vary the overlap rate r such that

r = |μ_s ∩ μ_s^Shift| / |μ_s|,

where μ_s, μ_s^Shift ∈ R^n are the sets of semantic features for ID and OOD, respectively, μ_s ∩ μ_s^Shift denotes the features common to both, and |·| denotes the number of elements.
We fix the total dimensionality at n + m = 200 and set n = 40 (semantic features) and m = 160 (background features). We vary r in increments of 10%; a smaller overlap rate r indicates a stronger semantic shift. For each r, we randomly sample ID and OOD semantic features and report the mean over 20 trials with 95% confidence bands in Figure 2.
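One way to realize a given overlap rate is to sample the OOD semantic feature set directly from the ID set and a pool of unseen features. A small sketch of this sampling step (the function name and the feature-pool argument are our own framing, not from the paper):

```python
import random

def sample_ood_semantic_features(id_features, r, feature_pool):
    """Sample an OOD semantic feature set sharing a fraction r with the ID set.

    id_features: indices of the ID semantic features.
    feature_pool: all candidate feature indices; unseen OOD features are
    drawn from the pool members outside the ID set.
    """
    id_features = list(id_features)
    n = len(id_features)
    k = round(r * n)  # number of features shared between ID and OOD
    shared = random.sample(id_features, k)
    unseen = [f for f in feature_pool if f not in id_features]
    return shared + random.sample(unseen, n - k)
```

With n = 40 and r = 0.5, for instance, the sampled OOD set has 40 features, exactly 20 of which also appear in the ID set.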

Background Shift
We generate sets of OOD examples under a background shift by applying a displacement vector z = [0_s, z_b] to the two means. Formally,

μ_1^Shift = μ_1 + z,  μ_0^Shift = μ_0 + z,

where 0_s ∈ R^n is a vector of all 0s.
Note that this shift corresponds to a translation of the ID distribution along the background dimensions z_b. We set the total dimensionality at n + m = 200 while varying the split between semantic (n) and background (m) components in increments of 20. Figure 2 shows the OOD detection performance of our simulated experiment. We use the Area Under the Receiver Operating Characteristic curve (AUROC) as our performance metric.

Simulation Results
We see that the calibration method generally outperforms density estimation. Further, the performance gap between the two methods decreases as both methods approach near-perfect performance under large semantic shifts with no overlap in semantic features, and approach chance under no semantic shift with completely overlapping semantic features. However, the calibration method is unable to improve performance under background shifts in either regime because the background features do not contribute to p(y | x) as the LDA weights are 0 for these components (Section 4.1). We find these results in line with our expectations and use them to drive our intuition when evaluating both types of OOD detection methods for real text data.

Experiments and Analysis
We perform head-to-head comparisons of calibration and density estimation methods on 14 ID/OOD pairs categorized as either background shift or semantic shift, as well as 8 pairs from challenge datasets.

Setup
OOD detectors. Recall that the calibration method MSP relies on a classifier trained on the ID data. We fine-tune RoBERTa (Liu et al., 2019) on the ID data and compute its prediction probabilities (see Equation (5)). For the density estimation method PPL, we fine-tune GPT-2 (Radford et al., 2019) on the ID data and use perplexity as the OOD score (see Equation (6)). To control for the model size of the two methods, we choose RoBERTa-Base and GPT-2-Small, which have 110M and 117M parameters, respectively. We also experiment with two larger models, RoBERTa-Large and GPT-2-Medium, with 355M and 345M parameters, respectively. We evaluate the OOD detectors by AUROC and the False Alarm Rate at 95% Recall (FAR95), which measures the misclassification rate of ID examples at 95% OOD recall. Both metrics show similar trends (see Appendix B for FAR95 results).
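Both metrics can be computed directly from the score lists of ID and OOD examples (recall that lower s(x) means "more OOD"). A minimal, quadratic-time sketch of how we understand these metrics (implementation details are ours):

```python
import math

def auroc(id_scores, ood_scores):
    """Probability that a random ID example scores above a random OOD one.

    0.5 is chance, 1.0 is perfect separation; ties count as half a win."""
    wins = 0.0
    for si in id_scores:
        for so in ood_scores:
            if si > so:
                wins += 1.0
            elif si == so:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

def far95(id_scores, ood_scores):
    """Fraction of ID examples misclassified as OOD at 95% OOD recall."""
    srt = sorted(ood_scores)
    k = math.ceil(0.95 * len(srt))
    gamma = srt[k - 1] + 1e-12  # threshold just above the k-th lowest OOD score
    return sum(s < gamma for s in id_scores) / len(id_scores)
```

For example, with ID scores [0.9, 0.8, 0.7, 0.2] and OOD scores [0.1, 0.15, 0.3], AUROC is 11/12 (one ID example falls below one OOD example) and FAR95 is 0.25.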
Training details. For RoBERTa, we fine-tune the model for 3 epochs on the training split of the ID data with a learning rate of 1e-5 and a batch size of 16. For GPT-2, we fine-tune the model for 1 epoch on the training split of the ID data for the language modeling task, using a learning rate of 5e-5 and a batch size of 8.

Oracle detectors. To estimate an upper bound on OOD detection performance, we consider the situation where we have access to the OOD data and can directly learn an OOD classifier. Specifically, we train a logistic regression model with bag-of-words features on 80% of the test data and report results on the remaining 20%.

Semantic Shift
Recall that the distribution of discriminative features changes in the semantic shift setting, i.e. p_train(φ_s(x)) ≠ p_test(φ_s(x)) (Section 2). We create semantic shift pairs by including test examples from classes unseen during training. Thus, semantic features useful for classifying the training data are not representative of the test set.
We use the News Category (Misra, 2018) and DBPedia Ontology Classification (Zhang et al., 2015) multiclass classification datasets to create two ID/OOD pairs. The News Category dataset consists of HuffPost news data. We use the examples from the five most frequent classes as ID (News Top-5) and the data from the remaining 36 classes as OOD (News Rest). The DBPedia Ontology Classification dataset consists of data from Wikipedia extracted from 14 non-overlapping classes of DBPedia 2014 (Lehmann et al., 2015).
We use examples from the first four classes by class number as ID (DBPedia Top-4) and the rest as OOD (DBPedia Rest).

Results. Table 1 shows the results for our semantic shift pairs. The calibration method consistently outperforms the density estimation method, indicating that calibration methods are better suited for scenarios with large semantic shifts, which is in line with our simulation results (Section 4).

Background Shift
Recall that background features (e.g. formality) do not depend on the label. We therefore consider domain shift in sentiment classification and natural language inference (NLI) datasets. For our analysis, we use the SST-2 (Socher et al., 2013), IMDB (Maas et al., 2011), and Yelp Polarity (Zhang et al., 2015) binary sentiment classification datasets. The SST-2 and IMDB datasets consist of movie reviews of different lengths, while the Yelp Polarity dataset contains reviews of different businesses, representing a domain shift from SST-2 and IMDB. Each of these datasets is used as ID or OOD, using the validation split of SST-2 and the test splits of IMDB and Yelp Polarity for evaluation.
We also use the SNLI (Bowman et al., 2015), MNLI (Williams et al., 2018), and RTE (from GLUE; Wang et al., 2018a) datasets. SNLI and MNLI consist of NLI examples sourced from different genres, while RTE comprises examples sourced from a different domain. Although there is some change in semantic information, since RTE has two labels (entailment and non-entailment) as opposed to three (entailment, neutral, and contradiction) in SNLI and MNLI, domain/background shift is more prominent because the semantic features for the NLI task are similar. Each of these datasets is used as either ID or OOD, and we use the validation set of the OOD data for evaluation.

Results. The density estimation method consistently outperforms the calibration method (for all pairs except MNLI vs RTE), indicating that PPL is more sensitive to changes in background features. Further, in cases where the discriminative model generalizes well (as evidenced by the small difference between ID and OOD accuracy), the calibration method's performance is close to random (50) because a well-calibrated model also has high confidence on its correct OOD predictions. We note that the discriminative models tend to generalize well here, so it might be better to focus on domain adaptation instead of OOD detection when the shift is predominantly a background shift. We discuss this further in Section 6.

Analysis
Controlled distribution shifts. We use two controlled distribution shift experiments on real text data to further study the framework of semantic and background shifts. For background shift, we append varying amounts of text (25, 50, 100, 150, and 200 words) from Wikitext (Merity et al., 2017) to the ID examples. For semantic shift, we use the News Category dataset and move classes from ID to OOD. We start with the top 40 ID classes by frequency and move classes in increments of 10. The ID coverage of semantic information decreases as more classes move to the OOD subset, resulting in a larger semantic shift.

Results. Figure 3 shows the AUROC scores obtained by both methods in our controlled distribution shift experiments. We see that the density estimation method is more sensitive to the amount of synthetic background text than the calibration method, and that the calibration method is more sensitive to the number of ID/OOD classes. This is in line with our intuition about the shifts and the results we obtain from simulated data (Section 4).
Larger models. Table 3 shows the results using larger models for OOD detection. We observe that the larger discriminative model achieves a much higher score for the background shift pair, closing the gap with the language model performance.
We speculate that the larger model is able to learn some of the background features in its representation. The performance for the semantic shift pair is largely unchanged when using the larger models.

Challenge Data
Challenge datasets are designed to target either superficial heuristics adopted by a model (e.g. premise-hypothesis overlap) or model deficiencies (e.g. numerical reasoning in NLI), which creates significant challenges for deployed models (Ribeiro et al., 2020). It is therefore desirable to abstain on detected OOD examples. We consider the following challenge datasets.
Human-generated challenge data. Kaushik et al. (2020) crowdsourced a set of counterfactually-augmented IMDB examples (c-IMDB) by instructing annotators to minimally edit examples to yield counterfactual labels. This changes the distribution of semantic features with high correlation to labels such that p_train(φ_s(x)) ≠ p_test(φ_s(x)), creating a semantic shift. We consider IMDB as ID and c-IMDB as OOD, combining the training, validation, and test splits of c-IMDB for evaluation.
Rule-based challenge data. HANS (McCoy et al., 2019b) consists of template-based examples that have high premise-hypothesis overlap but are non-entailment, which mainly results in background shift due to the specific templates/syntax. Similarly, the Stress Test dataset (Naik et al., 2018a) is a set of automatically generated examples designed to evaluate common errors from NLI models. We categorize the type of distribution shifts from these test categories with respect to MNLI (ID) depending on whether they append "background" phrases to the ID examples or replace discriminative phrases (Table 4). Antonym (changing premise to obtain an antonymous hypothesis resulting in contradiction despite high overlap) and Numerical Reasoning (different semantic information than MNLI training set) constitute semantic shifts, as the set of semantic features now focus on specific types of entailment reasoning (e.g. antonymy and numerical representation). Negation (appending "and false is not true" to hypothesis), Spelling Errors (randomly introducing spelling errors in one premise word), Word Overlap (appending "and true is true" to each hypothesis), and Length Mismatch (appending a repetitive phrase "and true is true" five times to the premise) constitute background shifts because they introduce population level changes (e.g. appending "and true is true" to each hypothesis) that are unrelated to the entailment conditions of each example.

Failure case 1: spurious semantic features.
Challenge data is often constructed to target spurious features (e.g. premise-hypothesis overlap for NLI) that are useful on the training set but do not correlate with the label in general, e.g. on the test set. Therefore, a discriminative model would be over-confident on the OOD examples because the spurious semantic features that were discriminative during training, while still prominent, are no longer predictive of the label. As a result, in Table 4, MSP struggles with most challenge data, achieving an AUROC score close to random (50). On the other hand, the density estimation method achieves almost perfect performance on HANS.
Failure case 2: small shifts. While density estimation methods perform better in background shift settings, our simulation results show that they still struggle to detect small shifts when the ID and OOD distributions largely overlap. Table 4 shows similar findings for Negation and Word Overlap Stress Test categories that append short phrases (e.g. "and true is true") to each ID hypothesis.
Failure case 3: repetition. For Antonym, Numerical Reasoning, and Length Mismatch, PPL performance is significantly worse than random, indicating that our language model assigns higher likelihoods to OOD than to ID examples. These challenge examples contain highly repetitive phrases (e.g. "and true is true" appended five times in Length Mismatch, or high overlap between premise and hypothesis in Numerical Reasoning and Antonym), which are known to receive high likelihood under autoregressive language models (Holtzman et al., 2020). Thus, repetition may be used as an attack against language model-based OOD detectors.
Overall, the performance of both methods drops significantly on the challenge datasets. Among these, human-generated counterfactual data is the most difficult to detect, and rule-based challenge data can contain unnatural patterns that cause unexpected behavior.

Discussion
The performance of calibration and density estimation methods on OOD examples, categorized along the lines of semantic and background shift, provides insights that can be useful in improving OOD detection. This framework can be used to build better evaluation benchmarks that focus on different challenges in OOD detection. A choice between the two methods can also be made based on the anticipated distribution shift at test time, i.e., using calibration methods when detecting semantic shift is more important, and density estimation methods when detecting background shift is. However, we observe failure cases on challenge examples, with density estimation methods failing to detect OOD examples with repetition and small shifts, and calibration methods failing to detect most challenge examples. This indicates that these challenge examples constitute a type of OOD that targets the weaknesses of both approaches, and highlights the need for a more explicit definition of OOD to progress the development of OOD detection methods and create benchmarks that reflect realistic distribution shifts.

Related Work
Distribution shift in the wild. Most early work on OOD detection makes no distinction among the types of distribution shift observed at test time, creating synthetic ID/OOD pairs from different datasets based on the setup in Hendrycks and Gimpel (2017). Recently, there has been increasing interest in studying real-world distribution shifts (Ahmed and Courville, 2020; Hsu et al., 2020; Hendrycks et al., 2020a; Koh et al., 2020a). On these benchmarks, which cover a diverse set of distribution shifts, no single detection method wins across the board. We characterize distribution shifts along the two axes of semantic shift and background (or non-semantic) shift, shedding light on the performance of current methods.
OOD detection in NLP. Even though OOD detection is crucial in production (e.g. dialogue systems; Ryu et al., 2018) and high-stakes applications (e.g. healthcare; Borjali et al., 2020), it has received relatively little attention in NLP until recently. Recent work has evaluated and improved the calibration of pretrained transformer models (Hendrycks et al., 2020b; Goyal and Durrett, 2020; Kong et al., 2020; Zhou and Chen, 2021), showing that while pretrained transformers are better calibrated and thus better at detecting OOD data than previous models, there is room for improvement. Our analysis reveals one limitation of calibration-based detection when faced with a background shift. Other work focuses on specific tasks, including prototypical networks for low-resource text classification (Tan et al., 2019) and data augmentation for intent classification (Zheng et al., 2020).
Inductive bias in OOD detection. Our work shows that the effectiveness of a method largely depends on whether its assumptions about the distribution shift match the test data. One straightforward way to incorporate prior knowledge about the type of distribution shift is to augment training with similar OOD data, i.e., the outlier exposure method (Hendrycks et al., 2019), which has been shown to be effective on question answering (Kamath et al., 2020). Given that the right type of OOD data can be difficult to obtain, another line of work uses a hybrid of calibration and density estimation methods to balance capturing semantic and background features. These models are usually trained with both a discriminative loss and a generative (or self-supervised) loss (Winkens et al., 2020; Zhang et al., 2020a; Nalisnick et al., 2019b).

Domain adaptation versus OOD detection.
There are two ways of handling OOD data: 1) build models that perform well across domains (i.e., under background shifts), as in domain adaptation (Chu and Wang, 2018; Kashyap et al., 2021), or 2) allow models to detect a shift in the data distribution and potentially abstain from making a prediction. In our setting (2), we want to guard against all types of OOD data without any access to it, unlike domain adaptation, which usually relies on access to OOD data. This setting can be more important than (1) for safety-critical applications, such as those in healthcare, because the potential cost of an incorrect prediction is greater, motivating a more conservative approach of abstaining on OOD data. It can also help improve performance in selective prediction (Kamath et al., 2020; Xin et al., 2021).

Conclusion
Despite the extensive literature on outlier and OOD detection, previous work in NLP tends to lack consensus on a rigorous definition of OOD examples, instead relying on arbitrary dataset pairs from different tasks. In our work, we approach this problem in natural text and simulated data by categorizing OOD examples as either background or semantic shifts and study the performance of two common OOD detection methods-calibration and density estimation. For both types of data, we find that density estimation methods outperform calibration methods under background shifts while the opposite is true under semantic shifts. However, we find several failure cases from challenge examples that target model shortcomings.
As explained in Section 2, we assume that φ s and φ b map x to two disjoint sets of components for simplicity. This assumption helps us simplify the framework and compare the two types of detection methods in relation to the two types of shifts. While this simplified framework explains much of the differences between the two methods, failure cases from challenge examples highlight the room for better frameworks and a more explicit definition of OOD to progress the development of OOD detection methods. Such a definition can inform the creation of benchmarks on OOD detection that reflect realistic distribution shifts.
Defining (or at least explicitly stating) the types of OOD examples that predictors are designed to target can also guide future modeling decisions between using calibration and density estimation methods, and help improve detection. Some promising directions include test-time fine-tuning (Sun et al., 2020) and data augmentation (Chen et al., 2020), which can be guided towards a specific type of distribution shift for improved detection performance against it. Finally, the methods we studied work well for one type of shift, which motivates the use of hybrid models (Zhang et al., 2020b;Liu and Abbeel, 2020) that use both calibration and density estimation when both types of shift occur at the same time.

Ethical Considerations
As society continues to rely on automated machine learning systems to make important decisions that affect human lives, OOD detection becomes increasingly vital to ensure that these systems can detect natural shifts in domain and semantics. If medical chatbots cannot recognize that new disease variants or rare co-morbidities are OOD while diagnosing patients, they will likely provide faulty and potentially harmful recommendations if they do not contextualize their uncertainty. We believe that implementing OOD detection, especially for the more challenging but commonly occurring semantic shifts, should be part of any long-lasting production model.
In addition, OOD detection can be used to identify and alter model behavior when encountering data related to minority groups. For example, Koh et al. (2020b) present a modified version of the CivilComments dataset (Borkan et al., 2019b), with the task of identifying toxic user comments on online platforms. They consider domain annotations for each comment based on whether it mentions each of 8 demographic identities: male, female, LGBTQ, Christian, Muslim, other religions, Black, and White. They note that a standard BERT-based model trained using ERM performs poorly on the worst group, with a 34.2% drop in accuracy compared to the average. Such models may lead to unintended consequences like flagging a comment as toxic just because it mentions certain demographic identities, or in other words, belongs to certain domains. Our work can be useful in altering the inference-time behavior of such models upon detection of such domains, which constitute a larger degree of background shift. Of course, nefarious agents could use the same pipeline to alter model behavior to identify and discriminate against demographics that display such background shifts.

Acknowledgments
We also want to thank Diksha Meghwal, Vaibhav Gadodia and Ambuj Ojha for their help with an initial version of the project and experimentation setup.

A Example Probability
We additionally evaluate our density estimation methods using log p(x) as the detection score. For text, log p(x) is defined as

log p(x) = Σ_{t=1}^T log p_LM(x_t | x_{<t}).

While PPL accounts for varying sequence lengths by averaging token log-likelihoods over the input sequence, log p(x) does not. Figure 4 shows that this difference significantly impacts performance. With IMDB as the ID data, using log p(x) fails on SST-2, achieving close to 100 FAR95 and near 0 AUROC. We suspect this is because IMDB examples are full paragraphs while SST-2 examples are one to two sentences. log p(x) is naturally smaller for IMDB examples than for these shorter OOD examples, resulting in complete failure for simple thresholding methods as measured by AUROC.
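The length bias is easy to see with synthetic per-token log-probabilities (the numbers below are illustrative, not taken from our models):

```python
import math

def log_prob(token_logprobs):
    """log p(x): sums per-token log-likelihoods, so it shrinks with length."""
    return sum(token_logprobs)

def ppl(token_logprobs):
    """Token perplexity: length-normalized, so it is stable across lengths."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

short = [math.log(0.2)] * 20   # e.g. a short SST-2-like sentence
long_ = [math.log(0.2)] * 200  # a paragraph-length IMDB-like input
```

Both sequences have identical perplexity, yet the longer one has a log p(x) ten times lower, so thresholding on log p(x) marks every long ID paragraph as more "OOD" than a short OOD sentence.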

B FAR95 Results
We additionally evaluate the performance for all experiments using FAR95, which measures the false positive rate at 95% recall. In the context of OOD detection, this measure gives the misclassification rate of ID data at 95% recall of OOD classification, hence a lower value indicates better performance.
Tables 5, 6, 7 and 8 show the results using FAR95 as the metric for the corresponding ID/OOD pairs used earlier. We observe that FAR95 results are in line with AUROC results except for DBPedia, where density estimation methods yield a better result. The difference may stem from the cumulative nature of AUROC in contrast to FAR95, which is a point measurement.

C Background shift in MNLI Genres
MNLI is a crowd-sourced collection of sentence pairs for textual entailment sourced from 10 genres, including Fiction, Government, Slate, Telephone, and Travel. We use examples from these five MNLI genres, separately considering each genre as ID and OOD, and use the validation splits for evaluation. Table 9 shows the results for MNLI genres. The discriminative model generalizes well to other genres, and we find that the OOD detection performance of the calibration method is close to random (50) because a well-calibrated model has high confidence on its correct OOD predictions.

Table 9: Performance on background shifts caused by a shift in MNLI genre. For each pair, the higher score obtained (by PPL or MSP) is in bold. The density estimation method using PPL significantly outperforms the calibration method.