Pseudo Outlier Exposure for Out-of-Distribution Detection using Pretrained Transformers

For real-world language applications, detecting an out-of-distribution (OOD) sample is helpful for alerting users or rejecting such unreliable samples. However, modern over-parameterized language models often produce overconfident predictions for both in-distribution (ID) and OOD samples. In particular, language models suffer from OOD samples whose semantic representations are similar to those of ID samples, since such OOD samples lie near the ID manifold. A rejection network can be trained with ID and diverse outlier samples to detect test OOD samples, but explicitly collecting auxiliary OOD datasets imposes an additional data-collection burden. In this paper, we propose a simple but effective method called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes. The surrogate OOD samples introduced by POE have representations similar to ID data, which is most effective for training a rejection network. Our method requires no external OOD data and can be easily implemented within off-the-shelf Transformers. A comprehensive comparison with state-of-the-art algorithms demonstrates POE's competitiveness on several text classification benchmarks.


Introduction
Pre-trained language models (PLMs) have achieved remarkable success in various natural language processing (NLP) tasks such as question answering (Yuan et al., 2019; Brown et al., 2020), sentiment analysis (Clark et al., 2020), and text categorization (Devlin et al., 2019; Yang et al., 2019). While PLMs have become a de facto standard for promoting classification accuracy, recent studies have found that over-parameterized PLMs often produce overconfident predictions for out-of-distribution (OOD) samples (Jiang et al., 2020; Kong et al., 2020). For real-world language applications, these unreliable predictions can confuse users when interpreting the model's decisions. Therefore, language models require the ability to detect OOD samples to instill reliability in NLP applications.
The task of detecting OOD samples can be formulated as a binary hypothesis test of whether an input sample comes from in-distribution (ID) or OOD. The OOD detection task has been studied in the machine learning community for many years (Hendrycks and Gimpel, 2017; Lakshminarayanan et al., 2017; Andersen et al., 2020). Prior works have proposed effective methods, including post-hoc algorithms (Lee et al., 2018b; Sun and Li, 2022) and training a rejection network by exposing the model to external OOD datasets (Hendrycks et al., 2019).
However, existing post-hoc methods usually require a subset of actual OOD samples to tune their hyperparameters (Liang et al., 2018; Sun et al., 2021); in particular, Hsu et al. (2020) find that hyperparameters tuned on a limited OOD dataset do not generalize to others. Thus, these methods are not feasible in real-world applications; moreover, we often cannot know the entire distribution of OOD samples. Similarly, training a rejection network not only brings an additional burden of OOD data collection but may also result in sub-par OOD detection performance depending on which subset of external data is used. Intuitively, OOD examples that are excessively distant from training samples may not help with OOD detection, because easy-to-learn outlier features can be discriminated rather trivially. Therefore, a desirable trait for OOD samples used to train rejection networks is that they do not belong to ID but are sufficiently close to the distribution of ID samples (Lee et al., 2018a).
In this paper, we primarily focus on detecting OOD samples by constructing a surrogate OOD dataset from training samples, rather than using external OOD data, to train a rejection network. To this end, we propose Pseudo Outlier Exposure (POE), a procedure that constructs a near-OOD set by erasing tokens with high attention scores in training sentences. A rejection network can then be trained on the training (ID) and constructed OOD datasets to detect test OOD samples. Numerical experiments confirm that our procedure indeed generates surrogate OOD data close to ID examples. Accordingly, a rejection network trained on this construction outperforms state-of-the-art OOD detection algorithms on several benchmarks. Our main contributions are:
• Our novel method easily constructs a surrogate OOD dataset in an offline manner and can be applied to any ID training data without access to any real OOD sample.
• We demonstrate that the resultant surrogate OOD dataset introduced by POE is sufficiently close to the distribution of ID samples, which results in improvement of OOD detection performance for the rejection network.
• Through comprehensive comparison with stateof-the-art algorithms, we demonstrate POE's competitiveness on several text classification benchmarks.
Related Work

Post-hoc Methods
Post-hoc methods detect an OOD sample by manipulating the features or logits of a pre-trained network without changing its weights. They have the advantage of not requiring a pre-trained classifier to be re-trained for OOD detection and can simply be applied at inference time. Therefore, post-hoc methods preserve the classifier's classification accuracy.
To detect OOD data, Hendrycks and Gimpel (2017) propose a simple post-hoc algorithm that thresholds the classifier's maximum softmax probability (MSP). ODIN (Liang et al., 2018) adds two strategies to MSP, temperature scaling and input preprocessing (adding a perturbation to the test input), to better separate confidence scores between ID and OOD samples. Treating the distribution of feature vectors of pre-trained models as class-conditional Gaussian distributions, Lee et al. (2018b) suggest a Mahalanobis distance-based confidence scoring rule built from statistics of data samples in feature space. Energy (Liu et al., 2020) proposes an OOD scoring rule using an energy score that is aligned with the probability density of the logits of a pre-trained network; they demonstrate that the energy-based scoring rule is less susceptible to the overconfidence issue of a softmax classifier. ReAct (Sun et al., 2021) suggests truncating the high activations of the penultimate layer to distinguish the distinctive patterns that arise when OOD data is fed into the model. DICE (Sun and Li, 2022) is a sparsification technique that ranks weights by contribution and then uses only the most significant weights to reduce noisy signals in OOD data.
Except for MSP and Energy described above, the other methods require parameter(s) that must be tuned on a reserved OOD subset. However, in many real-world deployment settings, the distribution of OOD samples is usually unknown.

Training a Rejection Network
Outlier Exposure (OE; Hendrycks et al., 2019) uses auxiliary datasets completely disjoint from the test-time data to teach the model a representation for ID/OOD distinctions. However, in real-world applications, OE is limited in that collecting all possible OOD samples is not feasible, and OOD samples may not be known a priori. K-Folden (Li et al., 2021) is an ensemble method that trains K individual classification models. Each model is trained on a subset with K − 1 classes, with the remaining class masked as unknown (OOD) to the model. Each model is trained with a cross-entropy loss over the visible K − 1 labels and an additional Kullback-Leibler (KL) divergence loss enforcing uniform predictions on the left-out label. At test time, the probability distributions produced by these K models are simply averaged and treated as the final probability estimate for a test sample. However, K-Folden lacks scalability to tasks with many classes and requires excessive computational cost because it needs K network instances. Moreover, the approach cannot be applied to a binary classification task (i.e., K = 2).
Compared to these studies, our method requires no real-world OOD dataset and trains only a single additional rejection network.

Feature Representation Learning
Contrastive representation learning has shown remarkable performance for both ID classification and OOD detection (Khosla et al., 2020; Zhou et al., 2022). In contrast to the contrastive loss used in self-supervised representation learning (Chen et al., 2020), where a model learns the general features of a dataset without labels, Khosla et al. (2020) suggest a supervised contrastive loss (SCL), in which instances of the same class form a dense cluster in the model's feature space, whereas instances of different classes are encouraged to be distant from each other. Motivated by Khosla et al. (2020), Zhou et al. (2021) propose a margin-based contrastive loss (MCL) to further increase the discrepancy between the representations of ID instances from different classes. MCL enforces the L2 distances of samples from the same class to be as small as possible, whereas the L2 distances of samples from different classes should be larger than a margin. They show that a model that learns this intra-class compactness achieves advanced OOD detection performance. Whereas MCL (Zhou et al., 2021) uses only the K ID classes, we modify MCL by assigning a pseudo OOD set to the (K + 1)-th class in the contrastive loss. Thus, our variant of MCL not only shrinks the manifold of the OOD samples in the feature space but also further maximizes the discrepancy between the representations of ID instances and the surrogate OOD class.

Method
Given Transformer-based PLMs with a softmax classifier, we propose a simple but effective method for detecting OOD samples. We first introduce the proposed method for generating surrogate OOD data and then present a rejection network trained on the ID and surrogate OOD data.
Notation. Let x ∈ X_ID be a training sample, and let y ∈ Y = {1, ..., K} be its label. For multi-class classification tasks, a BERT-style Transformer f can be decomposed into its attention blocks and last dense layer, denoted f_att and f_out, respectively. Unless otherwise mentioned, the output of f_att(·) denotes the [CLS] feature vector of the last attention block. In the text boxes, blue words denote tokens with high attention scores, and darker words represent higher attention scores than others.

Out-of-Distribution Set Construction
High-level idea. Following Lee et al. (2018b), we assume that the class-conditional features at the PLM's penultimate layer (i.e., the last attention layer) follow a multivariate Gaussian distribution over the training set. We first calculate the empirical class mean and covariance of the training set. The former is defined as μ̂_k = (1/N_k) Σ_{i: y_i=k} f_att(x_i) (1), where N_k is the number of samples with class k. The latter can be calculated by Σ̂ = (1/N) Σ_k Σ_{i: y_i=k} (f_att(x_i) − μ̂_k)(f_att(x_i) − μ̂_k)^⊤ (2). Because our aim is to create surrogate OOD samples that are sufficiently close to the manifold of ID samples, a surrogate OOD sample x̃ should satisfy max_{i: y_i=k} M(x_i^ID) < M(x̃) < M(x′) (3), where x′ ∈ X_OOD is an explicit OOD sample (e.g., an OOD sample from a completely different ID task), and M(·) is the Mahalanobis distance between x and the closest class-conditional Gaussian distribution, i.e., M(x) = min_k (f_att(x) − μ̂_k)^⊤ Σ̂^{−1} (f_att(x) − μ̂_k) (4). Considering Eq. 3, we construct the surrogate OOD sample from the training sample, i.e., x_train → x̃. To obtain OOD data with a semantic representation similar to ID samples, we gradually erase tokens with high attention scores until x̃ has a larger Mahalanobis distance than the maximum ID Mahalanobis distance. The surrogate sample can thus be interpreted as starting as ID and gradually turning into OOD as distinctive tokens are erased. Data construction pipeline. Let x = {x_1, ..., x_S} be a training sample, where S is its sequence length and x_s is the s-th token. In the PLM's architecture, we can identify key tokens that mainly affect the model's predictions by leveraging the attention scores at the position of the [CLS] token. Using the attention score of each token, we can easily remove such tokens from any training set; we thus construct x̃ by excluding tokens that are correlated with ID classes.
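As a concrete illustration, the class statistics and the distance M(·) above can be sketched in a few lines of NumPy. The feature matrix stands in for the PLM's [CLS] vectors, and the small ridge added before inversion is our own numerical-stability assumption rather than part of the method.

```python
import numpy as np

def fit_class_gaussians(feats, labels, ridge=1e-6):
    """Empirical per-class means and a single shared covariance over
    penultimate-layer features; feats: [N, d], labels: [N]."""
    classes = np.unique(labels)
    means = {k: feats[labels == k].mean(axis=0) for k in classes}
    centered = np.vstack([feats[labels == k] - means[k] for k in classes])
    cov = centered.T @ centered / len(feats)
    # ridge keeps the covariance invertible on toy data (our assumption)
    cov_inv = np.linalg.inv(cov + ridge * np.eye(feats.shape[1]))
    return means, cov_inv

def mahalanobis(x, means, cov_inv):
    """M(x): squared Mahalanobis distance to the closest class-conditional
    Gaussian."""
    return min(float((x - mu) @ cov_inv @ (x - mu)) for mu in means.values())
```

The per-class maximum of `mahalanobis` over the training set then gives the threshold that a surrogate sample must exceed.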
We gradually replace the highest-attention tokens with the [MASK] token for T (≤ S) steps using the attention scores: x̃^t = A(x̃^{t−1}, s*), t = 1, ..., T, with x̃^0 = x (5), where A(·) is the token replacement function and s* is the index of the token with the t-th highest attention score.
For each step, we calculate M(x̃^t) and select x̃^{t*} at the step t* at which M(x̃^{t*}) first becomes greater than max_{i: y_i=k} M(x_i^ID). For all training samples, we collect the surrogate OOD samples generated by this process (see Fig. 1).
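The stopping loop can be sketched as follows; `attn_scores` and `distance_fn` are hypothetical stand-ins for the PLM's [CLS] attention scores and for M(·), so this shows only the control flow of the construction, not the full implementation.

```python
import numpy as np

def build_surrogate(tokens, attn_scores, distance_fn, id_threshold, max_steps=None):
    """Greedily mask the highest-attention token per step (the replacement
    function A in the text) until the sample's distance exceeds the maximum
    ID Mahalanobis distance, i.e., until it leaves the ID manifold."""
    x = list(tokens)                            # keep the input intact
    scores = np.asarray(attn_scores, dtype=float)
    for _ in range(max_steps or len(x)):
        s_star = int(np.argmax(scores))         # index of the highest remaining score
        x[s_star] = "[MASK]"
        scores[s_star] = -np.inf                # never pick the same position twice
        if distance_fn(x) > id_threshold:       # stopping rule of the pipeline
            break
    return x
```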

Rejection Network
The task of detecting OOD samples is a binary hypothesis test in which f′ is a decision model. In order for f′ to learn the distinctive patterns between ID and OOD samples, we re-train the PLM f′ on both the ID and surrogate OOD sets with MCL (Zhou et al., 2021). Different from MCL, which uses only ID classes, our variant of MCL contrasts OOD instances against those from different ID classes. Let {(x̃_i, ỹ_i)}_{i=1}^{B} be a batch of training instances, where ỹ_i is assigned to the OOD class K + 1 for surrogate samples. B_I denotes the number of ID samples in a batch, and B_O denotes the number of our synthesized OOD samples. We denote A(i) = {1, ..., B} \ {i} as the set of all anchor instances for instance i.
The MCL with K + 1 classes can be formulated as L_margin = (1/d)(L_p + L_n) (7), where d is the feature dimension of f_att(x), L_p is the positive loss term that enforces the L2 distances of instances from the same class to be small, and L_n is the negative loss term that encourages the L2 distances of instances from different classes to be larger than a margin ξ. L_p is calculated by L_p = (1/B) Σ_i (1/|P(i)|) Σ_{p∈P(i)} ||f_att(x̃_i) − f_att(x̃_p)||²₂ (8), where P(i) ⊂ A(i) is the set of indices of the instances from the same class as ỹ_i. The negative loss term is defined as L_n = (1/B) Σ_i (1/|N(i)|) Σ_{n∈N(i)} φ(ξ − ||f_att(x̃_i) − f_att(x̃_n)||²₂) (9). In Eq. 9, N(i) ⊂ A(i) is the set of indices of the instances from classes different from ỹ_i, and φ(·) is the ReLU function. The margin ξ is defined as the maximum distance between positive pairs, ξ = max_i max_{p∈P(i)} ||f_att(x̃_i) − f_att(x̃_p)||²₂ (10). In conclusion, we re-train f′ with the objective L_total = L_ce + L_margin, where L_ce is the cross-entropy loss. We use the same L_ce as for ID class classification in order to (1) avoid changing the output nodes of f′_out and (2) apply existing post-hoc methods without modification.
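Under our reading of the loss above, a minimal NumPy sketch of the (K+1)-class margin loss is given below. The per-anchor averaging and the 1/(B·d) normalization are our assumptions; the positive/negative structure, the ReLU φ, and the data-driven margin ξ follow the text.

```python
import numpy as np

def mcl_loss(feats, labels):
    """Margin-based contrastive loss over K ID classes plus the (K+1)-th
    surrogate-OOD class.  feats: [B, d] penultimate features, labels: [B]."""
    B, d = feats.shape
    sq = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)  # pairwise ||.||_2^2
    same = labels[:, None] == labels[None, :]
    eye = np.eye(B, dtype=bool)
    pos, neg = same & ~eye, ~same
    xi = sq[pos].max() if pos.any() else 0.0     # margin = max positive-pair distance
    l_p = sum(sq[i, pos[i]].mean() for i in range(B) if pos[i].any()) / (B * d)
    l_n = sum(np.maximum(xi - sq[i, neg[i]], 0.0).mean()  # phi = ReLU
              for i in range(B) if neg[i].any()) / (B * d)
    return l_p + l_n
```

In the full objective, this term is simply added to the cross-entropy loss over the K ID classes.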
In addition, during re-training, the [MASK] tokens of x̃ are randomly replaced with words from the PLM's vocabulary so that the model learns diverse OOD representations.
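This replacement step can be sketched as follows, with a toy vocabulary standing in for the PLM's:

```python
import random

def randomize_masks(tokens, vocab, mask_token="[MASK]", seed=None):
    """Replace every [MASK] in a surrogate sample with a random vocabulary
    word, exposing the rejection network to varied OOD surface forms."""
    rng = random.Random(seed)
    return [rng.choice(vocab) if t == mask_token else t for t in tokens]
```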

Out-of-Distribution Scoring Rules
We use existing OOD scoring algorithms, which map the model's outputs for test samples to OOD detection scores; a lower score indicates a higher likelihood of being OOD. Our rejection network is compatible with existing post-hoc methods, and we combine three parameter-free methods with our method in this work.
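The three parameter-free scoring rules we combine with POE (in our reading, MSP, Energy, and the parameter-free Mahalanobis score) can be sketched as below; under the convention used here, higher scores indicate more ID-like inputs.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability (Hendrycks and Gimpel, 2017)."""
    z = logits - logits.max(-1, keepdims=True)        # stabilized softmax
    p = np.exp(z) / np.exp(z).sum(-1, keepdims=True)
    return p.max(-1)

def energy_score(logits):
    """log-sum-exp of the logits (Liu et al., 2020)."""
    m = logits.max(-1)
    return m + np.log(np.exp(logits - m[..., None]).sum(-1))

def maha_score(feat, means, cov_inv):
    """Parameter-free Mahalanobis score: max_k -(f - mu_k)^T Sigma^-1 (f - mu_k);
    means is a list of class-mean vectors."""
    return max(-float((feat - mu) @ cov_inv @ (feat - mu)) for mu in means)
```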

Dataset
In order to demonstrate the effectiveness of our method, we conduct experiments on common benchmarks for the OOD detection task:
• CLINC FULL is a user intent classification dataset designed for OOD detection, which consists of 150 intent classes from 10 domains. This dataset includes 22.5k ID utterances and 1.2k OOD utterances (CLINC OOD).
• CLINC SMALL is a variant of the CLINC FULL dataset with only 50 training utterances per ID class. This dataset includes 15k ID utterances and 1.2k OOD utterances.
Recently, in the field of NLP, Arora et al. (2021) categorized OOD samples by two types of distribution shift: semantic and background shifts. Because shifted benchmarks share a common ID text style (background) or content (semantic), the distribution shifts in such near-OOD detection problems are more subtle than in arbitrary ID/OOD dataset pairs (e.g., training and OOD sets from completely different tasks), and are thus harder to detect. We therefore also conduct experiments on semantic shift and background shift benchmarks to verify that POE is effective even on challenging ID/OOD pairs.
The semantic shift benchmarks we use are as follows:
• NEWS TOP5 is a rebuilt version of the News Category dataset (Misra, 2018) for OOD detection. NEWS TOP5 contains instances from the five most common classes of the News Category dataset, and the data from the remaining 36 classes are used as OOD (NEWS REST).
• IMDB (Maas et al., 2011) is a binary sentiment classification dataset consisting of movie reviews. Kaushik et al. (2020) construct a set of augmented IMDB samples (c-IMDB) by editing IMDB examples to yield counterfactual labels. As a result, this changes the distribution of semantic features with a high correlation to ID labels. We use IMDB as ID and c-IMDB as OOD.
For evaluating POE on background shift, we use the SST2 (Socher et al., 2013) and Yelp Polarity (Zhang et al., 2015) binary sentiment analysis datasets. SST2 consists of movie reviews, whereas the Yelp Polarity dataset contains reviews of various businesses, representing a domain shift from SST2. These datasets are used as ID/OOD pairs (i.e., SST2/Yelp and Yelp/SST2) in our experiments. The data statistics are described in Tab. 1.

Evaluation Metrics
The OOD detection performance is measured with respect to the following standard criteria.
• AUROC is the area under the receiver operating characteristic curve, obtained by varying the operating point. Higher is better.
• FPR@95TPR (FPR) is the probability that an OOD (negative) example is classified as positive when the true positive rate (TPR) is as high as 95%. Lower is better.
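Both criteria can be computed directly from per-sample detection scores; the sketch below assumes the convention that higher scores indicate ID, with ID as the positive class.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Area under the ROC curve via the rank-sum formulation:
    P(score_ID > score_OOD), with ties counted half."""
    id_s, ood_s = np.asarray(id_scores), np.asarray(ood_scores)
    greater = (id_s[:, None] > ood_s[None, :]).mean()
    ties = (id_s[:, None] == ood_s[None, :]).mean()
    return greater + 0.5 * ties

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples scored above the threshold at which 95% of
    ID samples are (correctly) accepted."""
    thresh = np.percentile(id_scores, 5)   # 95% of ID scores lie above this
    return float((np.asarray(ood_scores) >= thresh).mean())
```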

Training Details
Two PLMs are used to compare a wide variety of algorithms: BERT-base-uncased (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019). The PLMs are optimized with AdamW (Loshchilov and Hutter, 2018), a weight decay of 0.01, and a learning rate of 2e-5. We use a batch size of 16 and fine-tune the PLM for 10 epochs on the downstream task. We compare our method with six post-hoc methods: MSP, ODIN, Mahalanobis (Maha), Energy, ReAct, and DICE. As orthogonal research, contrastive learning methods that efficiently learn informative feature representations are also well-suited for OOD detection. In our work, the recently proposed KNN-Contrastive Learning (KNN; Zhou et al., 2022), Supervised Contrastive Learning (SCL; Khosla et al., 2020), and Margin-based Contrastive Learning (MCL; Zhou et al., 2021) are also compared. The hyperparameters of the compared contrastive learning methods follow the original work as closely as possible for a fair comparison. For the post-hoc methods, excluding the parameter-free ones, we report the best OOD detection performance by varying their hyperparameters and adopting their best settings on the test ID/OOD pairs. Detailed hyperparameter settings are reported in Tab. 2.

Result
In this section, we present comprehensive experimental evaluations of POE. We compare POE with baselines on the CLINC datasets (Sec. 5.1), followed by empirical results for semantic and background shift benchmarks (Sec. 5.2) and a detailed analysis (Sec. 5.4). Due to space limitations, we evaluate all methods based on RoBERTa in this section; the experimental results based on BERT are reported in the Appendix.

Result for CLINC datasets
The results on CLINC FULL and CLINC SMALL are presented in Tab. 3, where the best results in each block are highlighted in bold. Specifically, KNN (Zhou et al., 2022) uses the LOF algorithm (Breunig et al., 2000) as an OOD scoring rule, with two basic distances for calculating the LOF score; we denote KNN using Euclidean distance as KNN-euclidean and KNN using cosine distance as KNN-cosine. As shown in Tab. 3, POE outperforms all considered baselines on most ID/OOD distribution pairs of the CLINC datasets, even though our method never requires access to real OOD data, unlike ODIN, ReAct, and DICE. Moreover, POE generally performs much better than the other contrastive learning methods, especially on CLINC SMALL, which has a small number of training samples (50 instances per class). This empirical result shows that the rejection network remains robust even when trained with a surrogate OOD set constructed from a small number of training samples.

Result for Distribution Shift Benchmarks
We also conduct distribution shift experiments using the two types of shifted OOD benchmarks to verify that our method can successfully detect challenging OOD samples. Tab. 4 shows OOD detection results for the background and semantic shifts, with the best results highlighted in bold.
As shown in Tab. 4, interestingly, we observe that not only MSP but also SCL and MCL struggle with these challenging OOD data. For example, on at least one ID/OOD pair (underlined entries), the naive MSP outperforms SCL+MSP and MCL+MSP, but not POE+MSP. In contrast, POE detects distributionally shifted instances more accurately than the baselines. In particular, POE performs best with the Mahalanobis distance for both background and semantic shifts.

Ablation Study
Recall that the [MASK] tokens of x̃ are randomly replaced with words from the PLM's vocabulary when training the rejection network. We assess how this replacement technique affects OOD detection performance (see Tab. 5). We observe that the replacement technique brings an additional performance gain by exposing diverse OOD representations to the rejection network.
To investigate promising design choices for the training objective, we conduct an ablation study applying each objective to the rejection network, as shown in Tab. 6. CE+KL is another possible choice for training the rejection network: an additional KL penalty enforcing uniform predictions on the surrogate samples generated by POE, i.e., L_KL = KL(f′(x̃), U), where U is the uniform distribution over the K classes. Overall, the rejection network is well-suited to a contrastive loss, and CE+MCL shows the best performance on all datasets. Different from the KL loss, which cannot impose any constraint on the distribution of the rejection network's inner representation of the given data, the rejection network with the contrastive loss learns intra-class compactness for both ID and OOD classes and further separates inter-class distances. We believe this discriminative feature space introduced by the contrastive loss leads to better OOD detection performance. Classification Performance. When a post-hoc method is applied to the PLM trained on the downstream task, classification accuracy is maintained because the weights do not change. However, accuracy may not be preserved when the weights of the PLM are fine-tuned with a contrastive loss.
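For completeness, the KL penalty of the CE+KL baseline, L_KL = KL(f′(x̃), U), can be sketched as follows; the clipping constant is our own numerical safeguard.

```python
import numpy as np

def kl_to_uniform(probs):
    """KL(p || U) for predicted class probabilities p over K ID classes;
    zero when the prediction on a surrogate sample is perfectly uniform."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    K = p.shape[-1]
    return float((p * np.log(p * K)).sum(-1))
```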

Analysis
We evaluate the PLMs trained with contrastive losses on the six ID datasets. The experimental results are shown in Tab. 7. We observe that contrastive losses do not significantly reduce or increase classification performance, which is similar to the observations of Zhou et al. (2021). Analysis of the Surrogate OOD Set. To examine how closely the surrogate OOD samples lie to the ID manifold, we measure the Mahalanobis distance between ID and the surrogate OOD sets introduced by POE (Tab. 8). RoBERTa is trained with the cross-entropy (CE) loss on the ID dataset, and we calculate the Mahalanobis distance (Eq. 4) at RoBERTa's penultimate layer. We observe that the surrogate OOD samples produced by POE indeed have representations similar to ID samples. For example, in the feature space of RoBERTa trained on CLINC SMALL, the surrogate OOD samples have the closest distance to the ID manifold of the class-conditional Gaussian distributions for CLINC SMALL. For the background (SST2) and semantic shift (IMDB) benchmarks, IMDB and c-IMDB have the most similar representations to their paired ID sets. However, the surrogate sets X̃_SST2 and X̃_IMDB are also sufficiently close to SST2 and IMDB, respectively.
We also assess whether surrogate OOD samples with representations similar to ID samples are indeed most effective for OOD detection. In our OOD construction, for each training sample, we collect x̃^{t*} ∈ X̃ when M(x̃^{t*}) becomes greater than max_{i: y_i=k} M(x_i^ID). Therefore, as T* increases, OOD samples that are semantically distant from the ID dataset can be generated. In Fig. 2, we report POE+Maha's OOD detection performance with varying levels of T*. We find that a larger T* generates surrogate OOD samples farther away from the ID samples (left in Fig. 2). This trait is desirable, as ID-discriminative tokens are increasingly erased from the surrogate sample. Moreover, we observe that POE+Maha with the surrogate OOD set introduced by T* achieves the best AUROC scores on all datasets, whereas OOD detection performance deteriorates when the rejection network is trained with a set of OOD samples far from the ID. This empirical result shows that (1) despite the simplicity of erasing attention-scored tokens, POE generates pseudo-OOD samples close to the ID distribution, and (2) these OOD samples are effective for training the rejection network.

Conclusion
In this paper, we propose a simple and intuitive OOD construction for training a rejection network. Motivated by the previous observation that OOD samples are most effective when semantically similar to ID samples, POE detects and erases tokens with high PLM attention scores. The resultant surrogate OOD dataset is close to the distribution of ID samples, which has been observed to improve the OOD detection performance of a rejection network. Extensive experiments on challenging ID/OOD pairs demonstrate POE's competitiveness.

Limitation
Although the proposed method achieves significantly improved OOD detection performance compared to the baselines, POE cannot be applied to naive LSTMs or RNNs, because our OOD construction is based on the attention scores of a PLM. We leave this issue for future work, but we believe that our proposed method can be used in various NLP tasks, as PLMs are now adopted in most fields of NLP. While we adopt a masking method based on attention scores in this paper, it is not clear that tokens with high attention scores have the most direct impact on the model's predictions (Wiegreffe and Pinter, 2019). To provide readers with more information, we include additional experimental results in the Appendix discussing the impact of different masking strategies on OOD detection performance.

Ethics Statement
The reliability of language models is crucial to the stable deployment of real-world NLP applications. For example, computer-aided resume recommendation systems and neural conversational AI should provide trustworthy predictions because they are intimately related to the issue of trust in new technologies. In this paper, we propose a simple but effective method called POE for OOD detection tasks. We introduce a novel OOD construction pipeline that trains a rejection network without any external OOD samples. We hope our work provides researchers with a new methodological perspective.

A Additional Result
Empirical Results for BERT. We report empirical results for BERT in Table 9 and Table 10.
Comparison with other masking strategies. To provide readers with more information, we compare attention score-based masking with the leave-one-out (LOO) method (Wiegreffe and Pinter, 2019).

Figure 1: An illustration of our surrogate data generation method. In the text boxes, the blue words denote tokens with high attention scores, and the darker words represent higher attention scores than others.

Figure 2: POE+Maha's performance with varying levels of T*. A low Mahalanobis distance implies low similarity between ID and OOD samples.

• Energy (Liu et al., 2020). The energy-based scoring rule is defined as log Σ_{k=1}^{K} exp(f_k(x)).
• Mahalanobis (Maha). Lee et al. (2018b) propose a Mahalanobis distance-based scoring rule, but their method requires several hyperparameters that must be tuned on a real OOD subset. Instead, following Zhou et al. (2021), we use the parameter-free Mahalanobis distance as a scoring rule: max_k −(f_att(x) − μ̂_k)^⊤ Σ̂^{−1} (f_att(x) − μ̂_k). Unless otherwise mentioned, we use this scoring rule in our experiments.

Table 1: Data statistics for the six text classification datasets used for our experiments.

Table 3: Comparison results for the CLINC datasets. We adopt RoBERTa as the baseline architecture for the experiments. Results are percentages.

Table 4: Comparison with state-of-the-art methods. All implementations use RoBERTa.

Table 5: Effect of the replacement technique, which augments the surrogate OOD sample by replacing masked tokens with randomly selected tokens. The OOD detection performance is based on POE+Maha.

Table 6: Ablation study assessing training objectives. We use the Mahalanobis distance as the OOD scoring rule.

Table 7: ID classification accuracies for contrastive learning methods.

Table 8: Averaged Mahalanobis distance between ID training samples and target datasets. We report the distance multiplied by 10^−3; a higher value indicates that the target dataset is closer to the ID samples. We underline values when the target dataset is an ID test set.

As shown in Table 11, both attention-based masking and LOO are effective for the OOD detection task. However, attention-based masking has the advantage of being computationally efficient, as masking priorities can be obtained in a single forward pass. In contrast, LOO is computationally inefficient because it must remove each token of the input sentence one by one to verify the model's predictions.

Table 9: Comparison results based on BERT. For all methods, we report AUROC (%) scores. The best results are highlighted in bold.

Table 10: OOD detection results based on BERT. Each value indicates the FPR (%) score.

Table 11: Comparison of different masking strategies using RoBERTa. Each value indicates the AUROC (%) score, and the best results are highlighted in bold.